From Scratch to Intelligence: Pre-training the Telugu Foundation Model
After perfecting our tokenizer, we reached the most compute-intensive part of the journey: Pre-training.
Pre-training is where the model "reads" the entire Telugu corpus and learns the statistical relationships between tokens: the logic, grammar, and nuances of the language. For this phase, we adjusted the standard recipe and leaned on cutting-edge optimization techniques and hardware to push the boundaries of what's possible on a single GPU.
The Architecture: A Transformer Optimized for Telugu
Our model, nanochat-telugu-d20, is a Transformer-based decoder-only architecture. We carefully balanced the depth and width to ensure stable gradients and efficient feature representation.
Model Blueprint:
- Depth: 20 Layers
- Model Dimension ($d_{model}$): 1,280 (using a 64x aspect ratio relative to depth)
- Attention Heads: 10 heads (Head dimension of 128)
- Sequence Length: 2,048 tokens
- Vocabulary Size: 65,536 tokens (perfectly aligned with our custom tokenizer)
We configured Grouped-Query Attention (GQA) with a 1:1 ratio of query to key-value heads, which keeps it equivalent to standard multi-head attention and preserves full attention fidelity at this foundational stage.
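To make the blueprint concrete, here is a minimal sketch of how these dimensions follow from the single depth knob. The class and field names (`ModelConfig`, `aspect_ratio`, `head_dim`) are ours for illustration and are not taken from the actual training code:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    depth: int = 20            # number of Transformer layers
    aspect_ratio: int = 64     # d_model = depth * aspect_ratio
    head_dim: int = 128        # fixed per-head dimension
    vocab_size: int = 65_536   # matches our custom Telugu tokenizer
    seq_len: int = 2_048       # maximum context length

    @property
    def d_model(self) -> int:
        return self.depth * self.aspect_ratio    # 20 * 64 = 1,280

    @property
    def n_heads(self) -> int:
        return self.d_model // self.head_dim     # 1,280 / 128 = 10

    @property
    def n_kv_heads(self) -> int:
        return self.n_heads                      # 1:1 GQA, i.e. plain MHA

cfg = ModelConfig()
print(cfg.d_model, cfg.n_heads, cfg.n_kv_heads)  # 1280 10 10
```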
Optimization: The Muon + AdamW Hybrid
One of the most exciting parts of this run was our optimizer strategy. Instead of sticking purely to AdamW, we implemented a Hybrid Optimizer approach to maximize learning efficiency.
- Muon Optimizer (for Linear Layers): We used the innovative Muon optimizer for all matrix parameters. Muon is a "geometry-aware" optimizer that uses Newton-Schulz iterations to approximately orthogonalize each weight update. In plain English: it makes the model learn faster and more efficiently from fewer samples.
- AdamW (for Embeddings): We kept the standard AdamW for the embedding and unembedding layers to handle the high-cardinality sparse updates.
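A rough sketch of how such a split can be wired up is shown below. The `Muon` import and its constructor signature are assumptions (substitute whatever Muon implementation your codebase bundles), the parameter-name matching (`embed`, `lm_head`) is illustrative, and any remaining 1D parameters are lumped into the AdamW group for simplicity. The learning rates mirror the hyperparameter snapshot that follows.

```python
import torch
from muon import Muon  # assumed import path; use your own Muon implementation

def build_optimizers(model: torch.nn.Module):
    """Route 2D weight matrices to Muon and (un)embedding params to AdamW."""
    matrix_params, embed_params, unembed_params = [], [], []
    for name, p in model.named_parameters():
        if "lm_head" in name:                      # unembedding projection
            unembed_params.append(p)
        elif p.ndim == 2 and "embed" not in name:  # attention / MLP matrices
            matrix_params.append(p)
        else:                                      # token embedding, etc.
            embed_params.append(p)

    muon = Muon(matrix_params, lr=0.02, momentum=0.95)
    adamw = torch.optim.AdamW(
        [
            {"params": embed_params, "lr": 0.2},      # embedding LR
            {"params": unembed_params, "lr": 0.004},  # unembedding LR
        ],
        weight_decay=0.0,
    )
    return muon, adamw
```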
Hyperparameter Snapshot:
- Matrix Learning Rate: 0.02 (Muon)
- Embedding Learning Rate: 0.2 (AdamW)
- Warmdown: We used a 20% warmdown period, decaying the learning rate to zero over the final 20% of the 21,400 training steps (sketched below).
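Concretely, the learning-rate multiplier over training looks roughly like this. We assume a linear warmdown, which is the usual choice; the actual script may differ in details:

```python
def lr_multiplier(step: int, total_steps: int = 21_400,
                  warmup_ratio: float = 0.0, warmdown_ratio: float = 0.2,
                  final_lr_frac: float = 0.0) -> float:
    """Return the factor applied to the base LR at a given step."""
    warmup_steps = int(total_steps * warmup_ratio)      # 0 -> no warmup
    warmdown_steps = int(total_steps * warmdown_ratio)  # 4,280 steps
    if step < warmup_steps:
        return (step + 1) / max(warmup_steps, 1)
    if step < total_steps - warmdown_steps:
        return 1.0                                      # constant plateau
    # linear decay from 1.0 down to final_lr_frac over the warmdown window
    progress = (total_steps - step) / warmdown_steps
    return final_lr_frac + (1.0 - final_lr_frac) * progress

print(lr_multiplier(0), lr_multiplier(17_120), lr_multiplier(21_399))
# 1.0 1.0 ~0.0
```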
Chinchilla Scaling & Hardware Performance
We followed the Chinchilla Scaling Law, which suggests an optimal ratio of 20 tokens per parameter. This ensures we aren't "over-training" a small model or "under-training" a large one—we are hitting the compute-optimal "sweet spot."
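A quick back-of-the-envelope check shows how this rule sets the token budget, assuming untied embedding/unembedding tables and a standard 4x MLP (about 12·d² parameters per Transformer layer):

```python
d_model, depth, vocab = 1_280, 20, 65_536

embed_params   = vocab * d_model              # ~83.9M (token embedding)
unembed_params = vocab * d_model              # ~83.9M (output projection)
block_params   = 12 * d_model**2 * depth      # ~393.2M (attention + MLP)
total_params   = embed_params + unembed_params + block_params
print(f"{total_params/1e6:.0f}M parameters")  # ~561M parameters

target_tokens = 20 * total_params             # Chinchilla: 20 tokens/param
print(f"{target_tokens/1e9:.2f}B tokens")     # ~11.22B tokens
```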
The Heavy Lifting (Hardware):
- GPU: 1 × NVIDIA H100 (SXM variant)
- System: A massive 2TB RAM setup with 112 CPU cores to feed the GPU without bottlenecks.
Training Efficiency:
| Metric | Result |
|---|---|
| Total Training Time | 22 Hours |
| Total Training FLOPs | ~39.18 ExaFLOPs |
| Throughput | 8,836 tokens/second |
| Model FLOP Utilization (MFU) | ~50% |
Achieving 50% MFU on a single H100 while using complex optimizers like Muon is a testament to the efficiency of the torch.compile stack we utilized.
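For reference, MFU is just the achieved FLOP rate divided by the GPU's peak. Plugging in the run totals above together with the H100 SXM's roughly 989 TFLOPS dense BF16 peak reproduces the ~50% figure:

```python
total_flops = 39.18e18           # ~39.18 ExaFLOPs over the full run
wall_time_s = 22 * 3600          # 22 hours
peak_flops  = 989e12             # H100 SXM dense BF16 peak, FLOP/s

achieved = total_flops / wall_time_s
print(f"{achieved/1e12:.0f} TFLOP/s, MFU = {achieved/peak_flops:.1%}")
# 495 TFLOP/s, MFU = 50.0%
```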
Summary:
Model Architecture:
The model architecture is derived from the depth parameter:
- Depth: 20 layers
- Model Dimension: 1,280 (depth × 64 aspect ratio)
- Number of Attention Heads: 10 (calculated from model dimension with head dimension of 128)
- Key-Value Heads: 10 (1:1 ratio; Grouped-Query Attention effectively disabled, i.e., standard multi-head attention)
- Vocabulary Size: 65,536 tokens
- Maximum Sequence Length: 2,048 tokens
Training Configuration:
- Run Name: nanochat-telugu-d20
- Device Batch Size: 16 sequences per device (16 × 2,048 = 32,768 tokens per micro-batch)
- Total Batch Size: 524,288 tokens (achieved through gradient accumulation)
- Target Param-Data Ratio: 20 (following Chinchilla scaling law)
- Number of Iterations: Automatically calculated to maintain the target data-to-parameter ratio (worked out below)
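Putting the batch arithmetic together (variable names are ours; the ~11.22B-token target comes from the Chinchilla estimate above):

```python
seq_len           = 2_048
device_batch_size = 16          # sequences per forward/backward micro-batch
total_batch_size  = 524_288     # tokens per optimizer step

tokens_per_micro_batch = device_batch_size * seq_len            # 32,768 tokens
grad_accum_steps = total_batch_size // tokens_per_micro_batch   # 16 micro-batches per step
num_iterations = round(11.22e9 / total_batch_size)              # ~21,400 optimizer steps
print(grad_accum_steps, num_iterations)                         # 16 21400
```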
Optimization Hyperparameters:
- Embedding Learning Rate: 0.2 (AdamW optimizer)
- Unembedding Learning Rate: 0.004 (AdamW optimizer)
- Matrix Learning Rate: 0.02 (Muon optimizer for linear layers)
- Weight Decay: 0.0
- Gradient Clipping: 1.0
- Learning Rate Schedule:
  - Warmup ratio: 0.0 (no warmup)
  - Warmdown ratio: 0.2 (20% of training for warmdown)
  - Final LR fraction: 0.0 (decay to zero)
- Muon Momentum Schedule: Linearly increases from 0.85 to 0.95 over first 300 steps
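The Muon momentum ramp is a simple linear interpolation over the first 300 steps; a minimal sketch:

```python
def muon_momentum(step: int, warmup_steps: int = 300,
                  start: float = 0.85, end: float = 0.95) -> float:
    """Linearly ramp Muon's momentum from 0.85 to 0.95 over the first 300 steps."""
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac

print(muon_momentum(0), muon_momentum(150), muon_momentum(300))  # 0.85 0.9 0.95
```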
Evaluation and Monitoring:
- Validation Evaluation: Every 250 steps
- Evaluation Tokens: 10,485,760 tokens (10M tokens) per validation
- Core Metric Evaluation: Every 2,000 steps
- Core Metric Max Examples: 500 examples per task
- Sampling: Every 2,000 steps (generates Telugu text samples)
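Inside the training loop, this monitoring amounts to a few cadence checks, roughly along these lines (the placeholders stand in for the actual evaluation calls):

```python
EVAL_EVERY, CORE_EVERY, SAMPLE_EVERY = 250, 2_000, 2_000
EVAL_TOKENS, CORE_MAX_EXAMPLES = 10_485_760, 500

for step in range(21_400):
    # ... forward / backward / optimizer step ...
    if step % EVAL_EVERY == 0:
        pass  # estimate validation bits-per-byte on ~10.5M held-out tokens
    if step % CORE_EVERY == 0:
        pass  # run the benchmark suite, up to 500 examples per task
    if step % SAMPLE_EVERY == 0:
        pass  # generate a few Telugu samples for qualitative inspection
```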
Hardware:
- GPU: NVIDIA H100 80GB HBM3
- CUDA Version: 12.8
- CPU: 112 cores (224 logical cores)
- System Memory: ~2TB RAM
Measuring Intelligence: Evaluation Suite
How do we know the model is actually learning? Every 2,000 steps, we ran the model through a comprehensive suite of Telugu-translated benchmarks. This isn't just a simple loss check; it's a "fitness test" for reasoning.
Pre-training Benchmarks:
We evaluated across 10 diverse categories including Commonsense QA, Mathematical Reasoning (GSM8K), and Reading Comprehension (SQuAD).
Pre-training Final Evaluation Results (Step 21,400):
| Task | Centered Score |
|---|---|
| ARC Challenge | 0.0133 |
| ARC Easy | 0.0800 |
| BoolQ | -0.1184 |
| Commonsense QA | 0.1437 |
| COPA | -0.0800 |
| HellaSwag | 0.2400 |
| HellaSwag Zero-shot | 0.2067 |
| Jeopardy | 0.0000 |
| PIQA | 0.0200 |
| SQuAD | 0.0400 |
| Winograd | 0.0300 |
| Core Metric | 0.0523 |
The core metric is the mean of the centered scores across all evaluation tasks, providing a single comprehensive measure of the model's capabilities across different reasoning and understanding domains.
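For context, "centered" scores of this kind are typically obtained by rescaling raw accuracy so that random guessing maps to 0 and a perfect score maps to 1 (we assume the standard CORE-style centering here):

$$\text{centered score} = \frac{\text{accuracy} - \text{accuracy}_{\text{chance}}}{1 - \text{accuracy}_{\text{chance}}}$$

For example, 31% raw accuracy on a 4-way multiple-choice task (25% chance level) gives a centered score of (0.31 - 0.25) / (1 - 0.25) = 0.08, which is also why performance slightly below chance shows up as a small negative number.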
By the end of the 22-hour run, the model reached a Final Training Loss of 1.38 and a Validation Bits-per-Byte (bpb) of 0.458. These numbers indicate the model has successfully captured the structural patterns of the Telugu language.
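Bits-per-byte normalizes the cross-entropy by the raw UTF-8 byte count rather than by the token count, which makes the number comparable across tokenizers; this matters for Telugu, where each character typically occupies 3 bytes in UTF-8:

$$\text{bpb} = \frac{1}{N_{\text{bytes}}} \sum_{i} \frac{\ell_i}{\ln 2}$$

Here $\ell_i$ is the model's cross-entropy loss in nats on token $i$ and $N_{\text{bytes}}$ is the total number of UTF-8 bytes in the evaluated text.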
Summary & Next Steps
In just 22 hours on a single GPU, we have created a foundational model that understands Telugu pretty well.
What’s next?
Pre-training is just the beginning. The model now knows "how to speak." Next, we move to Mid-training, which bridges the gap between pre-training and Supervised Fine-Tuning (SFT) and teaches the model "how to follow instructions."