High-quality data from the entire web is not enough to train GPT-5. How will the problem be solved?

April 2, 2024  18:23

Developers of advanced artificial intelligence (AI) models have run into an unexpected problem: there is not enough high-quality material to train their models. The situation is made worse by the fact that some websites block AI companies from accessing their data. According to researchers, attempts to train AI on material produced by other models and on other "synthetic content" can lead to "big problems".

Scientists and AI developers are concerned that within the next two years there may not be enough high-quality text to keep training large language models, which would slow the field's progress. OpenAI, the company behind the ChatGPT chatbot, is already considering training GPT-5 on transcriptions of public YouTube videos.

Internet data

AI language models collect text from the web, such as scientific research, news, and Wikipedia articles, and break it down into individual words or word parts (tokens), which they use to learn to respond like a human. The more data a model is given, the better the result. That principle is what OpenAI's approach is built on, and it has helped the company become one of the industry leaders.
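To make "words or word parts" concrete, here is a minimal toy sketch in Python. It is not the tokenizer used by OpenAI or any other vendor: the function name `toy_tokenize` and the fixed four-character piece size are assumptions chosen only for illustration, whereas production systems learn their word parts from data.

```python
import re


def toy_tokenize(text, max_piece=4):
    """Split text into words and punctuation, then cut long words into pieces.

    Hypothetical illustration only: real tokenizers learn their word parts
    from data instead of cutting at a fixed length.
    """
    tokens = []
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    for word in re.findall(r"\w+|[^\w\s]", text):
        if len(word) <= max_piece:
            tokens.append(word)
        else:
            # Break a long word into consecutive pieces of at most max_piece characters.
            tokens.extend(word[i:i + max_piece] for i in range(0, len(word), max_piece))
    return tokens


tokens = toy_tokenize("Language models learn from text.")
print(tokens)
print(len(tokens), "tokens")
```

Training-set sizes quoted in trillions of tokens count units of this kind, not whole words or documents.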

According to Pablo Villalobos, an AI researcher at the Epoch Research Institute, GPT-4 was trained on 12 trillion tokens of data, while an AI like GPT-5 would require 60 to 100 trillion tokens. Even if all the high-quality text and image data available on the web were collected, training GPT-5 would still fall 10 to 20 trillion tokens short, or perhaps more, and it is not yet clear where to get them. Two years ago, Villalobos and other researchers warned that there was a 50% chance that AI developers would run out of training data by mid-2024, and a 90% chance by 2026.
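The scale of these figures is easier to see with a little arithmetic. The Python sketch below uses only the numbers quoted above. The variable names are hypothetical, and it assumes that the ends of the two ranges can combine freely, which the article does not specify.

```python
# Figures quoted in the article, in trillions of tokens.
GPT4_TOKENS = 12
GPT5_NEED_LOW, GPT5_NEED_HIGH = 60, 100
SHORTFALL_LOW, SHORTFALL_HIGH = 10, 20

# How many times more data GPT-5 would need compared with GPT-4.
growth_low = GPT5_NEED_LOW / GPT4_TOKENS
growth_high = GPT5_NEED_HIGH / GPT4_TOKENS

# Shortfall as a share of the requirement. Assumption: the range ends may
# combine freely, so the smallest share pairs the low shortfall with the
# high requirement, and the largest share does the opposite.
share_min = SHORTFALL_LOW / GPT5_NEED_HIGH
share_max = SHORTFALL_HIGH / GPT5_NEED_LOW

print(f"GPT-5 needs {growth_low:.1f}x to {growth_high:.1f}x the data of GPT-4")
print(f"Shortfall is {share_min:.0%} to {share_max:.0%} of the requirement")
```

Under these assumptions the gap is a sizeable fraction of the total requirement, not a rounding error.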

According to scientists, most of the data on the web is not suitable for AI training because it contains irrelevant text or adds no new information to existing data. Only a small fraction, about one-tenth, of the material collected by the non-profit organization Common Crawl, whose web archive is widely used by AI developers, is suitable for that purpose.

At the same time, major platforms such as social networks and media outlets block access to their data, and the public is unwilling to hand over private correspondence to train language models. Mark Zuckerberg sees the company's access to the data available on Meta platforms, including text, images and videos, as a huge advantage in AI development, although it is difficult to say how much of this material can be considered high quality.

Curriculum techniques and the data marketplace

The startup DatologyAI is trying to combat the lack of content by using a "curriculum" technique, in which the AI is "fed" data in a specific order that helps it make connections between items. A 2022 paper by Ari Morcos, a former employee of Meta Platforms and Google DeepMind and now the founder of DatologyAI, states that this approach can achieve comparable results in AI training while cutting the input data in half. However, other studies have not confirmed these findings.

Sam Altman has also stated that OpenAI is developing new methods for AI training. According to reports, the company is discussing the possibility of creating a data market in which the value of specific materials to each model would be determined and a fair price paid for them. The same idea is being discussed at Google, but there is still no concrete progress in this direction, so AI development companies are trying to get everything they can, including audiovisual materials. According to OpenAI sources, the company plans to transcribe them with the Whisper speech recognition tool.

High-quality synthetic data

Researchers at OpenAI and Anthropic are experimenting with so-called "high-quality synthetic data." In a recent interview, Jared Kaplan, Anthropic's chief scientist, said that "this kind of data generated in-house" could be useful and used in the latest versions of Claude. An OpenAI spokesperson also confirmed that such work is under way.

Many researchers studying the data gap problem don't believe it can be solved, but Villalobos is optimistic and believes there are many more discoveries to come. "The biggest uncertainty is that we don't know what revolutionary discoveries are yet to come," he said.

According to Ari Morcos, "lack of data is one of the most important problems in the industry." However, it is not only the lack of data that is holding back the industry's development. There is also a shortage of the chips needed to run large language models, and industry leaders are concerned about shortages of data centers and power.
