Data. The first constraint

Do you know why there are so many new models for English, Spanish, or Mandarin?

It's the data.

By now, the training recipes, model architectures, and everything else that works are open source and well understood. All that's left is to feed in the data.

So when my team and I started building the Telugu LLM, we thought: great, all we need to do is feed it data. Let's find the Telugu data online, train on it, and we're done.

Imagine our shock when we found out how little Telugu data there was online.

Telugu is one of the world's most significant and fastest-growing languages, holding a prestigious status as a "Classical Language" of India.

Global & National Standing (2025)

  • Global Rank: Approximately the 14th to 18th most spoken native language in the world.
  • India Rank: The 4th most spoken language in India, following Hindi, Bengali, and Marathi.
  • Total Speakers: Estimated at roughly 96 million (with nearly 82–83 million native speakers and ~13 million second-language speakers).

We have a great culture, a big film industry, and a long history of great poets and great written works. So we assumed we would find all the data we wanted online and just needed to train on it.

But in the world of AI and Big Data, Telugu is considered a low-resource language despite its massive population. While Spanish and French have trillions of tokens available, Telugu’s digital footprint is significantly smaller, though growing rapidly.

💡
Turns out, for LLMs, you don't just need great classics. You need a lot of content, even if it isn't of great quality, the way the Internet provides for languages like English.

As of late 2025, here is the breakdown of Telugu's presence in major online text datasets:

1. AI Training Datasets (Tokens)

In the context of Large Language Models (LLMs), "tokens" represent the chunks of text used for training.

  • The Gap: Global models like Llama 3 are trained on roughly 15 trillion tokens, but Telugu makes up less than 0.02% of the standard web-crawl data (Common Crawl), which works out to only about 3 billion tokens out of a 15-trillion-token corpus.
  • High-Quality Corpus: Projects like IndicCorp v2 (by AI4Bharat) have curated roughly 20 billion tokens across 22 Indian languages. For Telugu specifically, the high-quality, cleaned text available for researchers is estimated at around 1 to 2 billion tokens.
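To make the word-vs-token distinction concrete, here is a minimal sketch that counts characters, words, and tokens for a short Telugu sentence using the Hugging Face transformers tokenizer API. The `google/gemma-2b` checkpoint name is an assumption (Gemma tokenizers are gated on the Hub), and any SentencePiece-style tokenizer can be substituted. This is the same kind of counting behind the "Words" and "Gemma Tokens" columns in the table further down.

```python
# Minimal sketch: how raw Telugu text becomes "tokens" for LLM training.
# The tokenizer name below is an assumption; Gemma checkpoints on the
# Hugging Face Hub are gated, so any SentencePiece tokenizer works here.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

text = "తెలుగు భాష చాలా అందమైనది."  # "The Telugu language is very beautiful."
token_ids = tokenizer.encode(text, add_special_tokens=False)

print(f"Characters: {len(text)}")
print(f"Words:      {len(text.split())}")
print(f"Tokens:     {len(token_ids)}")
```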

There is practically no data to train an LLM with.

Biswajit and the team scoured the Internet for open-source data we could use for training. We finally ended up with this list:

| Dataset                           | Words (Million) | Gemma Tokens (Million) | Source          |
|-----------------------------------|-----------------|------------------------|-----------------|
| newsarticles-navatelangana-swecha | 89              | 85                     | Viswam          |
| newsarticles-prajasakti-swecha    | 63              | 60                     | Viswam          |
| tewiki-latest-pages-articles      | 183             | 233                    | Viswam          |
| sangraha-te                       | 5,340           | 5,409                  | sangraha        |
| fineweb-2-te                      | 2,178           | 2,136                  | fineweb-2       |
| HPLT2.0_cleaned-te                | 2,372           | 2,450                  | HPLT2.0_cleaned |
| book_data_hf                      | 41              | 57                     | Viswam          |
| CulturaX-te                       | 1,895           | 1,906                  | CulturaX        |
| Total                             | ~12,161         | ~12,336                | All             |

Note: Since we are building a Telugu-only model, only the Telugu portion of each multilingual dataset was extracted and used. A sketch of how that extraction can be done is shown below.
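As a rough illustration of that note, here is a hedged sketch of pulling out and measuring the Telugu slice of a multilingual corpus with the Hugging Face datasets library in streaming mode. The repository id, config name, and the `text` field are placeholders, since each source in the table exposes its Telugu split differently (a language config, a subdirectory, or a language column).

```python
# Hedged sketch: extract the Telugu portion of a multilingual corpus and
# measure it in words and Gemma tokens, matching the table's columns.
# Dataset id, config name, and field names are assumptions for illustration.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")  # assumed tokenizer

# Hypothetical repo id and Telugu config; substitute the actual values
# documented on the Hub for each corpus (sangraha, fineweb-2, HPLT, etc.).
ds = load_dataset("some-org/some-multilingual-corpus", "te",
                  split="train", streaming=True)

word_count, token_count = 0, 0
for row in ds:
    text = row["text"]  # assumed field name
    word_count += len(text.split())
    token_count += len(tokenizer.encode(text, add_special_tokens=False))

print(f"Words (M):        {word_count / 1e6:.0f}")
print(f"Gemma tokens (M): {token_count / 1e6:.0f}")
```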

That's it. That's all the Telugu tokens on the open web. If anyone knows of other sources, please comment or reach out so we can add them to this list.