The Final Polish: Supervised Fine-Tuning (SFT) of Akshara
If Pre-training gave Akshara its knowledge and Mid-training gave it its professional structure, then Supervised Fine-Tuning (SFT) is the final, high-stakes stage that turns the model into a truly reliable assistant.
In this final stage, we refined Akshara’s ability to handle specific, high-value tasks, from summarization and paraphrasing to creative storytelling, ensuring that every interaction feels natural, accurate, and culturally grounded in Telugu.
What is Supervised Fine-Tuning?
SFT is the process of training the model on a small, high-quality dataset of "demonstrations." Each data point is a pair consisting of a User Prompt and a Target Assistant Response.
Unlike the earlier stages, where the model learns the structure of language, SFT teaches the model the behavior it should exhibit. It learns that when a user asks for a summary, it should be concise; when a user asks for a story, it should be imaginative.
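To make the shape of a demonstration concrete, here is a minimal sketch of one chat-formatted example and the common practice of computing the loss only on the assistant's tokens. The role markers, the `tokenize` argument, and the toy byte-level tokenizer at the end are illustrative assumptions, not the exact format used by the nanochat training scripts.

```python
# A minimal sketch of one SFT "demonstration" and of supervising only the
# assistant's tokens. Role markers and the tokenizer are stand-ins, not the
# exact format used by the nanochat scripts.

demonstration = {
    "messages": [
        # "Summarize this article: ..."
        {"role": "user", "content": "ఈ వ్యాసాన్ని సంగ్రహంగా రాయండి: ..."},
        # The target response the model is trained to produce.
        {"role": "assistant", "content": "వ్యాసంలోని ముఖ్యాంశాలు: ..."},
    ]
}

def build_training_example(messages, tokenize):
    """Concatenate the turns into one token stream and mark which tokens
    contribute to the loss (assistant tokens only)."""
    input_ids, loss_mask = [], []
    for msg in messages:
        tokens = tokenize(f"<|{msg['role']}|>{msg['content']}<|end|>")
        input_ids.extend(tokens)
        # User tokens are context only; the loss is computed on the response.
        loss_mask.extend([msg["role"] == "assistant"] * len(tokens))
    return input_ids, loss_mask

# Toy usage with a byte-level stand-in for the real BPE tokenizer:
ids, mask = build_training_example(
    demonstration["messages"], tokenize=lambda s: list(s.encode("utf-8"))
)
```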
Phase 1: The Specialist Trials
Before building the final "all-in-one" model, we conducted a series of "Specialist Trials." We wanted to see how well our mid-trained checkpoint (nanochat-telugu-d20-mid-smoltalk-100k) could master individual domains.
1. The Summarization Specialist
Using the TeSum dataset from IIIT Hyderabad, we trained a version specifically for abstractive summarization.
- Duration: ~4.97 hours
- Achievement: Reached a final validation loss of 0.85, showing the model could pick out the "meat" of a Telugu article and condense it effectively.
2. The Creative Writer
We focused on a massive Telugu stories dataset to push the limits of creative prose.
- Duration: ~10 hours
- Achievement: Reached a final training loss of 0.80. This version became highly proficient in narrative pacing and "literary" Telugu.
3. The News Reporter
We trained a specialist on Telugu news articles, alternating between title generation and full article writing.
- Duration: ~13 hours
- Result: This trial was essential for ensuring the model maintained a neutral, journalistic tone for factual queries.
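The alternation between title generation and article writing is essentially a data-construction step: each raw news record can yield two demonstrations, one per direction. A minimal sketch, assuming each record carries `title` and `article` fields and using illustrative Telugu prompt templates:

```python
# Sketch: turn one (title, article) news record into two SFT examples,
# one per direction. Field names and prompt wording are illustrative.

def news_record_to_examples(record):
    title, article = record["title"], record["article"]
    return [
        {   # direction 1: title -> full article
            "messages": [
                {"role": "user", "content": f"ఈ శీర్షికకు వార్తా కథనం రాయండి: {title}"},
                {"role": "assistant", "content": article},
            ]
        },
        {   # direction 2: article -> headline
            "messages": [
                {"role": "user", "content": f"ఈ వార్తా కథనానికి శీర్షిక ఇవ్వండి:\n{article}"},
                {"role": "assistant", "content": title},
            ]
        },
    ]
```

Deriving both directions from the same record doubles the supervision extracted from each article while keeping the two tasks naturally balanced.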
Phase 2: The Final SFT (Akshara-v1)
After analyzing the specialist runs, we combined all these high-quality datasets into one master training mixture for the final checkpoint: nanochat-telugu-d20-sft-v1.
The Master Mixture:
- Instruction Following: SmolTalk (10k high-quality pairs).
- Identity: Custom-crafted conversations to lock in Akshara's personality.
- Summarization: TeSum and XL-Sum (Telugu portion).
- Paraphrase: IndicParaphrase for linguistic versatility.
- Creative: Telugu Tiny Stories (200k examples).
- Information: AYA Telugu news articles.
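As a rough illustration of how this blend can be assembled, here is a minimal sketch that flattens the per-task example lists into one shuffled mixture. Whether the real pipeline uses a flat concatenation or weighted sampling is not specified above, so the flat shuffle here is an assumption.

```python
import random

# Sketch: combine per-task example lists into one shuffled SFT mixture.
# Each argument is assumed to be a list of chat-formatted examples, as
# produced by per-task conversion steps like the ones sketched earlier.
def build_master_mixture(*task_datasets, seed=42):
    mixture = [example for dataset in task_datasets for example in dataset]
    random.Random(seed).shuffle(mixture)  # interleave tasks across batches
    return mixture

# Usage (variable names stand for the seven sources listed above):
# mixture = build_master_mixture(smoltalk, identity, tesum, xlsum_te,
#                                paraphrase, tiny_stories, aya_news)
```

Shuffling across sources keeps each optimizer step a mix of tasks rather than a long single-domain stretch.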
Training Performance on NVIDIA A30:
The final SFT was a marathon, requiring roughly 28.35 hours of continuous training.
| Metric | Result |
| --- | --- |
| Total Steps | 45,102 |
| Total Time | ~28.35 hours |
| Data Type | bfloat16 |
| Final Training Loss | 0.99 |
| Final Validation Loss | 0.89 |
By using the Muon optimizer for the matrix layers and AdamW for the embedding and unembedding layers, we kept the learning-rate schedule efficient and allowed the model to converge on the nuances of these combined tasks.
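A minimal sketch of that split, assuming a nanochat-style model where 2-D weight matrices go to Muon and the embedding/unembedding tables go to AdamW; the import path, the parameter-name matching, and the momentum value are assumptions for illustration.

```python
from torch.optim import AdamW
# Placeholder import: substitute whichever Muon implementation your training
# code bundles; the module path here is an assumption.
from muon import Muon

def setup_optimizers(model, matrix_lr=0.02, embedding_lr=0.2,
                     unembedding_lr=0.004, weight_decay=0.0):
    """Route 2-D weight matrices to Muon and the embedding/unembedding
    parameters to AdamW, mirroring the learning rates in the appendix."""
    matrix_params, embedding_params, unembedding_params = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lm_head" in name:                      # unembedding / output head
            unembedding_params.append(param)
        elif "wte" in name or "embed" in name:     # token embedding table
            embedding_params.append(param)
        elif param.ndim >= 2:                      # attention and MLP matrices
            matrix_params.append(param)
        else:                                      # any leftover 1-D params
            embedding_params.append(param)         # (simplification)

    muon = Muon(matrix_params, lr=matrix_lr, momentum=0.95)
    adamw = AdamW(
        [
            {"params": embedding_params, "lr": embedding_lr},
            {"params": unembedding_params, "lr": unembedding_lr},
        ],
        weight_decay=weight_decay,
    )
    return muon, adamw
```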
The Result: A Unified Telugu Assistant
The final model, Akshara-v1, is more than the sum of its parts. It doesn't just "complete" text; it answers.
Whether you need a summary of a complex Telugu news article, a creative story for a child, or just a friendly conversation about local culture, Akshara-v1 responds with a level of fluency and safety that represents a new standard for Telugu LLMs.
Key Capabilities of Akshara-v1:
- Multi-tasking: Switches seamlessly between creative writing and factual summarizing.
- Identity-Aware: Knows its name and purpose without hallucinating corporate origins.
- Linguistic Nuance: Handles complex Telugu grammar (like Sandhi and Samasa) more naturally than English-centric models.
Series Conclusion: The Akshara Journey
From training a Rust-based BPE tokenizer on 60 billion characters to the 22-hour H100 pre-training run, and finally this 28-hour SFT marathon, building Akshara has been a masterclass in efficiency and localization.
We have shown that by focusing on data quality, utilizing cutting-edge optimizers like Muon, and respecting the specific structure of the Telugu language, we can build a world-class foundational model that is accessible and affordable.
The future of Indic AI isn't just "scaling up"—it's "tuning in."
Appendix:
First Set of Trials (Task-Specific Fine-Tuning):
These initial trials focused on fine-tuning the model on specific tasks individually, using the nanochat-telugu-d20-mid-smoltalk-100k checkpoint as the base.
1. SmolTalk Fine-Tuning (nanochat-telugu-d20-sft-smoltalk-10k):
- Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
- Dataset: SmolTalk (10,000 examples)
- Training Configuration:
- Device Batch Size: 2
- Data Type: bfloat16
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Unembedding Learning Rate: 0.004
- Initial Learning Rate Fraction: 0.02
- Weight Decay: 0
- Number of Epochs: 1
- Target Examples Per Step: 32
- Evaluation Frequency: Every 100 steps
- Evaluation Metrics Frequency: Every 200 steps
- Training Script: scripts.chat_sft_telugu
- Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Total Training Steps: 311 steps
- Total Training Time: 1,566 seconds (~26 minutes)
- Final Training Loss: 0.82
- Final Validation Loss: 0.84
- Final Learning Rate Multiplier: 0.0064
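The step count for this trial follows from the batching configuration: with a device batch of 2 and a target of 32 examples per optimizer step, the trainer presumably accumulates 16 micro-batches per step, and one epoch over 10,000 examples then takes about 312 steps. A quick back-of-the-envelope check (the script's exact accounting of the validation split and the final partial batch may differ by a step or two):

```python
# Back-of-the-envelope check of the SmolTalk trial's step count, derived from
# the configuration listed above (illustrative, not the script's exact logic).
examples          = 10_000   # SmolTalk subset size
device_batch_size = 2        # sequences per forward/backward pass on the A30
target_per_step   = 32       # examples consumed per optimizer step

grad_accum_steps = target_per_step // device_batch_size   # 16 micro-batches per step
optimizer_steps  = examples // target_per_step             # 312 steps for one epoch

print(grad_accum_steps, optimizer_steps)   # 16, 312 -- close to the 311 steps logged
```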
2. Summarization Fine-Tuning (nanochat-telugu-d20-sft-summary):
- Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
- Dataset: TeSum (IIIT Hyderabad summarization dataset)
- Training Configuration:
- Device Batch Size: 2
- Data Type: bfloat16
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Unembedding Learning Rate: 0.004
- Initial Learning Rate Fraction: 0.02
- Weight Decay: 0
- Number of Epochs: 1
- Target Examples Per Step: 32
- Evaluation Frequency: Every 100 steps
- Evaluation Metrics Frequency: Every 200 steps
- Training Script: scripts.chat_sft_telugu_summary
- Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Total Training Steps: 4,054 steps
- Total Training Time: 17,880 seconds (~4.97 hours)
- Final Training Loss: 0.71
- Final Validation Loss: 0.85
- Final Learning Rate Multiplier: 0.00049
3. Paraphrase Fine-Tuning (nanochat-telugu-d20-sft-paraphrase):
- Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
- Dataset: IndicParaphrase (Telugu portion)
- Training Configuration:
- Device Batch Size: 2
- Data Type: bfloat16
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Unembedding Learning Rate: 0.004
- Initial Learning Rate Fraction: 0.02
- Weight Decay: 0
- Number of Epochs: 1
- Target Examples Per Step: 32
- Evaluation Frequency: Every 100 steps
- Evaluation Metrics Frequency: Every 200 steps
- Training Script: scripts.chat_sft_telugu_paraphrase
- Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Total Training Steps: 21,760 steps
- Total Training Time: 49,289 seconds (~13.69 hours)
- Final Training Loss: 0.93
- Final Validation Loss: 0.88
- Final Learning Rate Multiplier: 0.000092
4. News Articles Fine-Tuning (nanochat-telugu-d20-sft-smoltalk-news-articles):
- Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
- Dataset: Telugu news articles (article generation from a given title, and title generation from a given article)
- Training Configuration:
- Device Batch Size: 2
- Data Type: bfloat16
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Unembedding Learning Rate: 0.004
- Initial Learning Rate Fraction: 0.02
- Weight Decay: 0
- Number of Epochs: 1
- Target Examples Per Step: 32
- Evaluation Frequency: Every 100 steps
- Evaluation Metrics Frequency: Every 200 steps
- Training Script: scripts.chat_sft_telugu_news
- Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Total Training Steps: 16,902 steps
- Total Training Time: 46,983 seconds (~13.05 hours)
- Final Training Loss: 1.13
- Final Validation Loss: 0.84
- Final Learning Rate Multiplier: 0.00012
5. Stories Fine-Tuning (nanochat-telugu-d20-sft-smoltalk-stories):
- Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
- Dataset: Telugu stories dataset
- Training Configuration:
- Device Batch Size: 2
- Data Type: bfloat16
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Unembedding Learning Rate: 0.004
- Initial Learning Rate Fraction: 0.02
- Weight Decay: 0
- Number of Epochs: 1
- Target Examples Per Step: 32
- Evaluation Frequency: Every 100 steps
- Evaluation Metrics Frequency: Every 200 steps
- Training Script: scripts.chat_sft_telugu_stories
- Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Total Training Steps: 11,704 steps
- Total Training Time: 35,952 seconds (~9.99 hours)
- Final Training Loss: 0.80
- Final Validation Loss: 0.85
- Final Learning Rate Multiplier: 0.00017
Second Trial (Combined Dataset Fine-Tuning):
Final SFT Checkpoint (nanochat-telugu-d20-sft-v1):
- Base Checkpoint: nanochat-telugu-d20-mid-v1-akshara
- Combined Datasets:
- SmolTalk: Instruction-following conversations
- Chat Personalised: Telugu identity conversations for bot personality
- IndicParaphrase: Telugu paraphrase generation
- Custom News: AYA Telugu news articles (article and title generation)
- Custom Summary (TeSum): IIIT Hyderabad summarization dataset
- Custom Summary (XL-Sum): Multilingual summarization dataset (Telugu portion)
- Custom Story: Telugu tiny stories (200k examples)
- Training Configuration:
- Device Batch Size: 2
- Data Type: bfloat16
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Unembedding Learning Rate: 0.004
- Initial Learning Rate Fraction: 0.02
- Weight Decay: 0
- Number of Epochs: 1
- Target Examples Per Step: 32
- Evaluation Frequency: Every 100 steps
- Evaluation Metrics Frequency: Every 200 steps
- Training Script: scripts.chat_sft_telugu_v1
- Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Total Training Steps: 45,102 steps
- Total Training Time: 102,047 seconds (~28.35 hours)
- Final Training Loss: 0.99
- Final Validation Loss: 0.89
- Final Learning Rate Multiplier: 0.000044
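For context on throughput, the logged totals above imply the following derived figures; these are back-of-the-envelope numbers computed from the table, not separately measured metrics.

```python
# Derived throughput for the final SFT run, computed from the logged totals
# above (back-of-the-envelope, not a separately measured metric).
total_steps       = 45_102
total_seconds     = 102_047
examples_per_step = 32

seconds_per_step = total_seconds / total_steps        # ~2.26 s per optimizer step
examples_seen    = total_steps * examples_per_step    # 1,443,264 training examples
examples_per_sec = examples_seen / total_seconds      # ~14 examples/second on the A30

print(f"{seconds_per_step:.2f} s/step, {examples_seen:,} examples, "
      f"{examples_per_sec:.1f} examples/s")
```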