Studies Show AI Will Consume All Human-Generated Data On The Internet By 2026

A recent estimate indicates that AI might deplete all of the Internet's text data within the next few years.

A new study has warned that artificial intelligence (AI) systems could consume all of the Internet's existing knowledge by 2026

Researchers have published a paper indicating that large language models (LLMs) such as ChatGPT could exhaust the freely available text on the Internet as early as 2026.

AI models like GPT-4, which powers ChatGPT, and Claude 3 Opus rely on the vast amounts of text available online to improve.

To develop better models, tech companies will need to find alternative data sources. This might involve creating synthetic data, using lower-quality sources, or, more concerningly, accessing private data such as messages and emails stored on servers.

Here's how the researchers reached their conclusion:

To estimate the amount of text available online, researchers turned to Google's web index and found about 250 billion web pages, each with around 7,000 bytes of text.
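As a rough back-of-the-envelope check, the two figures quoted above can be multiplied to get the total size of that text. The sketch below is only an illustration using the article's rounded numbers; the function and variable names are hypothetical, and it assumes a petabyte means 10^15 bytes.

```python
# Rounded figures quoted in the article (approximations, not exact counts).
PAGES_INDEXED = 250_000_000_000  # about 250 billion web pages
BYTES_PER_PAGE = 7_000           # around 7,000 bytes of text per page

# Assumption: decimal units, so 1 petabyte = 10**15 bytes.
BYTES_PER_PETABYTE = 10**15


def estimate_web_text_bytes(pages: int, bytes_per_page: int) -> int:
    """Return the estimated total bytes of text across all indexed pages."""
    return pages * bytes_per_page


total_bytes = estimate_web_text_bytes(PAGES_INDEXED, BYTES_PER_PAGE)
total_petabytes = total_bytes / BYTES_PER_PETABYTE

print(f"Estimated text on the indexed web: {total_petabytes:.2f} PB")
```

With these inputs, the estimate works out to 1.75 quadrillion bytes, or about 1.75 petabytes of text.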

They then analysed the flow of data across the web through IP traffic and online user activity to predict the growth of this data.

Their findings showed that high-quality information from reliable sources could run out by 2032 at the latest. Low-quality text data might be used up between 2030 and 2050. As for image data, it could be completely consumed between 2030 and 2060.

AI requires vast amounts of data to keep improving

Training data fuels AI systems' growth, allowing them to identify increasingly complex patterns within their neural networks. ChatGPT was trained on approximately 570 GB of text data, which equates to around 300 billion words sourced from e-books, online articles, Wikipedia, and other online sources.

AI algorithms trained on insufficient or low-quality data tend to produce unreliable outputs. This was common in the early days of LLMs, which often responded with utter nonsense. For example, a recent chat with an AI went viral for producing bad recipes that included inedible ingredients like glue.

Though there are ways around this conundrum, they are far from perfect solutions

One way to address the need for data to train AI models is by using synthetic, artificially generated data. This method has proven effective in training systems used in gaming, coding, and mathematical applications.

On the other hand, if companies try to gather intellectual property or private information without permission, legal issues could arise, according to experts.
