From Scratch to Intelligence: Pre-training the Telugu Foundation Model
After perfecting our tokenizer, we reached the most compute-intensive part of the journey: Pre-training.
Pre-training is where the model "reads" the entire Telugu corpus and learns the statistical relationships between tokens: the logic, grammar, and nuances of the language. For this phase, we adjusted the standard recipe and leaned on cutting-edge optimization techniques and hardware to push the boundaries of what's possible on a single GPU.
The Architecture: A Transformer Optimized for Telugu
Our model, nanochat-telugu-d20, is a Transformer-based decoder-only architecture. We carefully balanced the depth and width to ensure stable gradients and efficient feature representation.
Model Blueprint:
- Depth: 20 Layers
- Model Dimension ($d_{model}$): 1,280 (using a 64x aspect ratio relative to depth)
- Attention Heads: 10 heads (Head dimension of 128)
- Sequence Length: 2,048 tokens
- Vocabulary Size: 65,536 tokens (perfectly aligned with our custom tokenizer)
We configured Grouped-Query Attention (GQA) with a 1:1 ratio of query to key-value heads, which keeps it equivalent to standard multi-head attention and preserves full attention fidelity at this foundational stage.
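To make the blueprint concrete, here is a minimal sketch of how these dimensions follow from the single depth knob. The class and field names (`ModelConfig`, `aspect_ratio`, `head_dim`) are ours for illustration and are not taken from the actual training code:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    depth: int = 20            # number of Transformer layers
    aspect_ratio: int = 64     # d_model = depth * aspect_ratio
    head_dim: int = 128        # fixed per-head dimension
    vocab_size: int = 65_536   # matches our custom Telugu tokenizer
    seq_len: int = 2_048       # maximum context length

    @property
    def d_model(self) -> int:
        return self.depth * self.aspect_ratio    # 20 * 64 = 1,280

    @property
    def n_heads(self) -> int:
        return self.d_model // self.head_dim     # 1,280 / 128 = 10

    @property
    def n_kv_heads(self) -> int:
        return self.n_heads                      # 1:1 GQA, i.e. plain MHA

cfg = ModelConfig()
print(cfg.d_model, cfg.n_heads, cfg.n_kv_heads)  # 1280 10 10
```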
Optimization: The Muon + AdamW Hybrid
One of the most exciting parts of this run was our optimizer strategy. Instead of sticking purely to AdamW, we implemented a Hybrid Optimizer approach to maximize learning efficiency.
- Muon Optimizer (for Linear Layers): We used the innovative Muon optimizer for all matrix parameters. Muon is a "geometry-aware" optimizer that uses Newton-Schulz iterations to approximately orthogonalize each weight update. In plain English: it makes the model learn faster and more efficiently from fewer samples.
- AdamW (for Embeddings): We kept the standard AdamW for the embedding and unembedding layers to handle the high-cardinality sparse updates.
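A rough sketch of how such a split can be wired up is shown below. The `Muon` import and its constructor signature are assumptions (substitute whatever Muon implementation your codebase bundles), the parameter-name matching (`embed`, `lm_head`) is illustrative, and any remaining 1D parameters are lumped into the AdamW group for simplicity. The learning rates mirror the hyperparameter snapshot that follows.

```python
import torch
from muon import Muon  # assumed import path; use your own Muon implementation

def build_optimizers(model: torch.nn.Module):
    """Route 2D weight matrices to Muon and (un)embedding params to AdamW."""
    matrix_params, embed_params, unembed_params = [], [], []
    for name, p in model.named_parameters():
        if "lm_head" in name:                      # unembedding projection
            unembed_params.append(p)
        elif p.ndim == 2 and "embed" not in name:  # attention / MLP matrices
            matrix_params.append(p)
        else:                                      # token embedding, etc.
            embed_params.append(p)

    muon = Muon(matrix_params, lr=0.02, momentum=0.95)
    adamw = torch.optim.AdamW(
        [
            {"params": embed_params, "lr": 0.2},      # embedding LR
            {"params": unembed_params, "lr": 0.004},  # unembedding LR
        ],
        weight_decay=0.0,
    )
    return muon, adamw
```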
Hyperparameter Snapshot:
- Matrix Learning Rate: 0.02 (Muon)
- Embedding Learning Rate: 0.2 (AdamW)
- Warmdown: We used a 20% warmdown period, decaying the learning rate to zero over the final 20% of the 21,400 training steps (sketched below).
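Concretely, the learning-rate multiplier over training looks roughly like this. We assume a linear warmdown, which is the usual choice; the actual script may differ in details:

```python
def lr_multiplier(step: int, total_steps: int = 21_400,
                  warmup_ratio: float = 0.0, warmdown_ratio: float = 0.2,
                  final_lr_frac: float = 0.0) -> float:
    """Return the factor applied to the base LR at a given step."""
    warmup_steps = int(total_steps * warmup_ratio)      # 0 -> no warmup
    warmdown_steps = int(total_steps * warmdown_ratio)  # 4,280 steps
    if step < warmup_steps:
        return (step + 1) / max(warmup_steps, 1)
    if step < total_steps - warmdown_steps:
        return 1.0                                      # constant plateau
    # linear decay from 1.0 down to final_lr_frac over the warmdown window
    progress = (total_steps - step) / warmdown_steps
    return final_lr_frac + (1.0 - final_lr_frac) * progress

print(lr_multiplier(0), lr_multiplier(17_120), lr_multiplier(21_399))
# 1.0 1.0 ~0.0
```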
Chinchilla Scaling & Hardware Performance
We followed the Chinchilla Scaling Law, which suggests an optimal ratio of 20 tokens per parameter. This ensures we aren't "over-training" a small model or "under-training" a large one—we are hitting the compute-optimal "sweet spot."
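A quick back-of-the-envelope check shows how this rule sets the token budget, assuming untied embedding/unembedding tables and a standard 4x MLP (about 12·d² parameters per Transformer layer):

```python
d_model, depth, vocab = 1_280, 20, 65_536

embed_params   = vocab * d_model              # ~83.9M (token embedding)
unembed_params = vocab * d_model              # ~83.9M (output projection)
block_params   = 12 * d_model**2 * depth      # ~393.2M (attention + MLP)
total_params   = embed_params + unembed_params + block_params
print(f"{total_params/1e6:.0f}M parameters")  # ~561M parameters

target_tokens = 20 * total_params             # Chinchilla: 20 tokens/param
print(f"{target_tokens/1e9:.2f}B tokens")     # ~11.22B tokens
```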
The Heavy Lifting (Hardware):
- GPU: 1 × NVIDIA H100 (SXM variant)
- System: A massive 2TB RAM setup with 112 CPU cores to feed the GPU without bottlenecks.
Training Efficiency:
| Metric | Result |
|---|---|
| Total Training Time | 22 Hours |
| Total Training FLOPs | ~39.18 ExaFLOPs |
| Throughput | 8,836 tokens/second |
| Model FLOP Utilization (MFU) | ~50% |
Achieving 50% MFU on a single H100 while using complex optimizers like Muon is a testament to the efficiency of the torch.compile stack we utilized.
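For reference, MFU is just the achieved FLOP rate divided by the GPU's peak. Plugging in the run totals above together with the H100 SXM's roughly 989 TFLOPS dense BF16 peak reproduces the ~50% figure:

```python
total_flops = 39.18e18           # ~39.18 ExaFLOPs over the full run
wall_time_s = 22 * 3600          # 22 hours
peak_flops  = 989e12             # H100 SXM dense BF16 peak, FLOP/s

achieved = total_flops / wall_time_s
print(f"{achieved/1e12:.0f} TFLOP/s, MFU = {achieved/peak_flops:.1%}")
# 495 TFLOP/s, MFU = 50.0%
```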
Summary:
Model Architecture:
The model architecture is derived from the depth parameter:
- Depth: 20 layers
- Model Dimension: 1,280 (depth × 64 aspect ratio)
- Number of Attention Heads: 10 (calculated from model dimension with head dimension of 128)
- Key-Value Heads: 10 (1:1 ratio; Grouped-Query Attention effectively disabled, i.e., standard multi-head attention)
- Vocabulary Size: 65,536 tokens
- Maximum Sequence Length: 2,048 tokens
Training Configuration:
- Run Name: nanochat-telugu-d20
- Device Batch Size: 16 sequences per device (16 × 2,048 = 32,768 tokens per micro-batch)
- Total Batch Size: 524,288 tokens (achieved through gradient accumulation)
- Target Param-Data Ratio: 20 (following Chinchilla scaling law)
- Number of Iterations: Automatically calculated to maintain the target data-to-parameter ratio (worked out below)
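Putting the batch arithmetic together (variable names are ours; the ~11.22B-token target comes from the Chinchilla estimate above):

```python
seq_len           = 2_048
device_batch_size = 16          # sequences per forward/backward micro-batch
total_batch_size  = 524_288     # tokens per optimizer step

tokens_per_micro_batch = device_batch_size * seq_len            # 32,768 tokens
grad_accum_steps = total_batch_size // tokens_per_micro_batch   # 16 micro-batches per step
num_iterations = round(11.22e9 / total_batch_size)              # ~21,400 optimizer steps
print(grad_accum_steps, num_iterations)                         # 16 21400
```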
Optimization Hyperparameters:
- Embedding Learning Rate: 0.2 (AdamW optimizer)
- Unembedding Learning Rate: 0.004 (AdamW optimizer)
- Matrix Learning Rate: 0.02 (Muon optimizer for linear layers)
- Weight Decay: 0.0
- Gradient Clipping: 1.0
- Learning Rate Schedule:
  - Warmup ratio: 0.0 (no warmup)
  - Warmdown ratio: 0.2 (20% of training for warmdown)
  - Final LR fraction: 0.0 (decay to zero)
- Muon Momentum Schedule: Linearly increases from 0.85 to 0.95 over first 300 steps
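The Muon momentum ramp is a simple linear interpolation over the first 300 steps; a minimal sketch:

```python
def muon_momentum(step: int, warmup_steps: int = 300,
                  start: float = 0.85, end: float = 0.95) -> float:
    """Linearly ramp Muon's momentum from 0.85 to 0.95 over the first 300 steps."""
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac

print(muon_momentum(0), muon_momentum(150), muon_momentum(300))  # 0.85 0.9 0.95
```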
Evaluation and Monitoring:
- Validation Evaluation: Every 250 steps
- Evaluation Tokens: 10,485,760 tokens (10M tokens) per validation
- Core Metric Evaluation: Every 2,000 steps
- Core Metric Max Examples: 500 examples per task
- Sampling: Every 2,000 steps (generates Telugu text samples)
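Inside the training loop, this monitoring amounts to a few cadence checks, roughly along these lines (the placeholders stand in for the actual evaluation calls):

```python
EVAL_EVERY, CORE_EVERY, SAMPLE_EVERY = 250, 2_000, 2_000
EVAL_TOKENS, CORE_MAX_EXAMPLES = 10_485_760, 500

for step in range(21_400):
    # ... forward / backward / optimizer step ...
    if step % EVAL_EVERY == 0:
        pass  # estimate validation bits-per-byte on ~10.5M held-out tokens
    if step % CORE_EVERY == 0:
        pass  # run the benchmark suite, up to 500 examples per task
    if step % SAMPLE_EVERY == 0:
        pass  # generate a few Telugu samples for qualitative inspection
```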
Hardware:
- GPU: NVIDIA H100 80GB HBM3
- CUDA Version: 12.8
- CPU: 112 cores (224 logical cores)
- System Memory: ~2TB RAM
Measuring Intelligence: Evaluation Suite
How do we know the model is actually learning? Every 2,000 steps, we ran the model through a comprehensive suite of Telugu-translated benchmarks. This isn't just a simple loss check; it's a "fitness test" for reasoning.
Pre-training Benchmarks:
We evaluated across 10 diverse categories including Commonsense QA, Mathematical Reasoning (GSM8K), and Reading Comprehension (SQuAD).
Pre-training Final Evaluation Results (Step 21,400):
| Task | Centered Score |
|---|---|
| ARC Challenge | 0.0133 |
| ARC Easy | 0.0800 |
| BoolQ | -0.1184 |
| Commonsense QA | 0.1437 |
| COPA | -0.0800 |
| HellaSwag | 0.2400 |
| HellaSwag Zero-shot | 0.2067 |
| Jeopardy | 0.0000 |
| PIQA | 0.0200 |
| SQuAD | 0.0400 |
| Winograd | 0.0300 |
| Core Metric | 0.0523 |
The core metric is the mean of the centered scores across all evaluation tasks, providing a single comprehensive measure of the model's capabilities across different reasoning and understanding domains.
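For context, "centered" scores of this kind are typically obtained by rescaling raw accuracy so that random guessing maps to 0 and a perfect score maps to 1 (we assume the standard CORE-style centering here):

$$\text{centered score} = \frac{\text{accuracy} - \text{accuracy}_{\text{chance}}}{1 - \text{accuracy}_{\text{chance}}}$$

For example, 31% raw accuracy on a 4-way multiple-choice task (25% chance level) gives a centered score of (0.31 - 0.25) / (1 - 0.25) = 0.08, which is also why performance slightly below chance shows up as a small negative number.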
By the end of the 22-hour run, the model reached a Final Training Loss of 1.38 and a Validation Bits-per-Byte (bpb) of 0.458. These numbers indicate the model has successfully captured the structural patterns of the Telugu language.
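Bits-per-byte normalizes the cross-entropy by the raw UTF-8 byte count rather than by the token count, which makes the number comparable across tokenizers; this matters for Telugu, where each character typically occupies 3 bytes in UTF-8:

$$\text{bpb} = \frac{1}{N_{\text{bytes}}} \sum_{i} \frac{\ell_i}{\ln 2}$$

Here $\ell_i$ is the model's cross-entropy loss in nats on token $i$ and $N_{\text{bytes}}$ is the total number of UTF-8 bytes in the evaluated text.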
Summary & Next Steps
In just 22 hours on a single GPU, we have created a foundational model that understands Telugu pretty well.
What’s next?
Pre-training is just the beginning. The model now knows "how to speak." Next, we move to Mid-training, which bridges the gap between pre-training and Supervised Fine-Tuning (SFT) and teaches the model "how to follow instructions."