The Final Polish: Supervised Fine-Tuning (SFT) of Akshara

If Pre-training gave Akshara its knowledge and Mid-training gave it its professional structure, then Supervised Fine-Tuning (SFT) is the final, high-stakes stage that turns a capable model into a truly reliable assistant.

In this final stage, we refined Akshara’s ability to handle specific, high-value tasks, from summarization and paraphrasing to creative storytelling, ensuring that every interaction feels natural, accurate, and culturally grounded in Telugu.


What is Supervised Fine-Tuning?

SFT is the process of training the model on a small, high-quality dataset of "demonstrations." Each data point is a pair consisting of a User Prompt and a Target Assistant Response.

Unlike the earlier stages, where the model learns the structure of language, SFT teaches the model the behavior it should exhibit. It learns that when a user asks for a summary, it should be concise; when a user asks for a story, it should be imaginative.
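
To make this concrete, here is a minimal sketch in Python of a single demonstration and of the loss masking that is typically applied so that only the assistant's tokens are trained on. The field names, instruction text, and masking scheme are illustrative assumptions, not the exact data format used by the Akshara training scripts.

```python
# One SFT "demonstration": a user prompt paired with the target assistant response.
# Field names and the masking scheme are illustrative assumptions.
example = {
    "user": "ఈ వ్యాసాన్ని క్లుప్తంగా సంగ్రహించండి: ...",   # "Summarize this article briefly: ..."
    "assistant": "వ్యాసంలోని ముఖ్యాంశాలు ఇవి: ...",        # "These are the key points of the article: ..."
}

def build_training_row(tokenizer, example):
    """Tokenize prompt + response, masking the prompt so that only the
    assistant's tokens contribute to the cross-entropy loss."""
    prompt_ids = tokenizer.encode(example["user"])
    target_ids = tokenizer.encode(example["assistant"])
    input_ids = prompt_ids + target_ids
    # -100 is the conventional "ignore" index for PyTorch's cross-entropy loss.
    labels = [-100] * len(prompt_ids) + target_ids
    return input_ids, labels
```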


Phase 1: The Specialist Trials

Before building the final "all-in-one" model, we conducted a series of "Specialist Trials." We wanted to see how well our mid-trained checkpoint (nanochat-telugu-d20-mid-smoltalk-100k) could master individual domains.

1. The Summarization Specialist

Using the TeSum dataset from IIIT Hyderabad, we trained a version specifically for abstractive summarization.

  • Duration: ~4.97 hours
  • Achievement: Reached a final validation loss of 0.85, showing that the model could identify the "meat" of a Telugu article and condense it effectively.
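
As an illustration of what a summarization demonstration might look like, here is a minimal sketch that converts one article-summary pair into the prompt/response format above; the field names and instruction wording are assumptions, not the actual TeSum schema or the exact prompts used for Akshara.

```python
def tesum_to_chat(record):
    """Turn one (article, summary) pair into a chat-style SFT demonstration.
    'article' and 'summary' are assumed field names, used here for illustration."""
    instruction = "కింది వ్యాసాన్ని క్లుప్తంగా సంగ్రహించండి:\n\n"   # "Summarize the following article briefly:"
    return {"user": instruction + record["article"], "assistant": record["summary"]}
```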

2. The Creative Writer

We focused on a massive Telugu stories dataset to push the limits of creative prose.

  • Duration: ~10 hours
  • Achievement: Reached a final training loss of 0.80. This version became highly proficient in narrative pacing and "literary" Telugu.

3. The News Reporter

We trained a specialist on Telugu news articles, alternating between title generation and full article writing.

  • Duration: ~13 hours
  • Result: This trial was essential for ensuring the model maintained a neutral, journalistic tone for factual queries.
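
Because this specialist alternated between the two directions, a single news record can supply two demonstrations. A minimal sketch under assumed field names ('title', 'body') and illustrative Telugu instructions:

```python
def news_to_examples(record):
    """Produce both directions from one news record:
    (1) write an article for a given title, (2) write a title for a given article.
    Field names and instruction wording are illustrative assumptions."""
    title, body = record["title"], record["body"]
    return [
        {"user": "ఈ శీర్షికకు వార్తా కథనం రాయండి: " + title, "assistant": body},   # article from title
        {"user": "ఈ కథనానికి తగిన శీర్షిక ఇవ్వండి:\n\n" + body, "assistant": title},  # title from article
    ]
```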

Phase 2: The Final SFT (Akshara-v1)

After analyzing the specialist runs, we combined all these high-quality datasets into one master training mixture for the final checkpoint: nanochat-telugu-d20-sft-v1.

The Master Mixture:

  • Instruction Following: SmolTalk (10k high-quality pairs).
  • Identity: Custom-crafted conversations to lock in Akshara's personality.
  • Summarization: TeSum and XL-Sum (Telugu portion).
  • Paraphrase: IndicParaphrase for linguistic versatility.
  • Creative: Telugu Tiny Stories (200k examples).
  • Information: AYA Telugu news articles.
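
A minimal sketch of how such a mixture can be assembled once every source has been converted to the prompt/response format sketched earlier; the loader variables, relative proportions, and shuffling seed here are illustrative assumptions rather than the exact mixing logic in the training scripts.

```python
import random

def build_master_mixture(sources, seed=42):
    """Concatenate all per-task example lists into one SFT mixture and shuffle,
    so that summarization, paraphrase, news, story and identity examples are
    interleaved throughout the single training epoch."""
    mixture = []
    for name, examples in sources.items():
        print(f"{name}: {len(examples)} examples")
        mixture.extend(examples)
    random.Random(seed).shuffle(mixture)
    return mixture

# Usage sketch (each list holds examples in the prompt/response format above):
# mixture = build_master_mixture({
#     "smoltalk": smoltalk_examples,        # ~10k instruction-following pairs
#     "identity": identity_examples,        # custom Akshara persona conversations
#     "tesum": tesum_examples,
#     "xlsum_te": xlsum_examples,
#     "indic_paraphrase": paraphrase_examples,
#     "tiny_stories_te": story_examples,    # ~200k creative examples
#     "aya_news_te": news_examples,
# })
```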

Training Performance on NVIDIA A30:

The final SFT was a marathon, requiring about 28.35 hours (102,047 seconds) of continuous training.

  • Total Steps: 45,102
  • Total Time: ~28.35 hours
  • Data Type: bfloat16
  • Final Training Loss: 0.99
  • Final Validation Loss: 0.89

We used the Muon optimizer for the matrix (hidden linear) layers and AdamW for the embedding and unembedding parameters, each with its own learning rate and a decaying schedule, which let the model converge smoothly on the nuances of these combined tasks.
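
In code, that split amounts to routing parameters into two optimizers. A minimal sketch, assuming a PyTorch model and the training repo's bundled Muon implementation (the import path, parameter-name checks, and constructor arguments are assumptions; the actual grouping in the Akshara scripts may differ):

```python
import torch

# Assumed import path for the repo's bundled Muon optimizer; adjust to the repo.
from nanochat.muon import Muon

def build_optimizers(model, matrix_lr=0.02, embedding_lr=0.2, unembedding_lr=0.004):
    """Group parameters as described in the run configs (see appendix): 2-D weight
    matrices go to Muon; embedding and unembedding (lm_head) parameters go to
    AdamW with their own learning rates. The name checks below are illustrative."""
    matrix_params, embedding_params, unembedding_params = [], [], []
    for name, p in model.named_parameters():
        if "lm_head" in name:                    # unembedding projection (assumed name)
            unembedding_params.append(p)
        elif "embed" in name or p.ndim < 2:      # token embeddings, norms, biases
            embedding_params.append(p)
        else:                                    # hidden linear / attention matrices
            matrix_params.append(p)

    muon_opt = Muon(matrix_params, lr=matrix_lr)
    adamw_opt = torch.optim.AdamW(
        [
            {"params": embedding_params, "lr": embedding_lr},
            {"params": unembedding_params, "lr": unembedding_lr},
        ],
        weight_decay=0.0,                        # weight decay 0, as in the run configs
    )
    return muon_opt, adamw_opt
```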


The Result: A Unified Telugu Assistant

The final model, Akshara-v1, is more than the sum of its parts. It doesn't just "complete" text; it answers.

Whether you need a summary of a complex Telugu news article, a creative story for a child, or just a friendly conversation about local culture, Akshara-v1 responds with a level of fluency and safety that represents a new standard for Telugu LLMs.

Key Capabilities of Akshara-v1:

  • Multi-tasking: Switches seamlessly between creative writing and factual summarizing.
  • Identity-Aware: Knows its name and purpose without hallucinating corporate origins.
  • Linguistic Nuance: Handles complex Telugu grammar (like Sandhi and Samasa) more naturally than English-centric models.

Series Conclusion: The Akshara Journey

From training a Rust-based BPE tokenizer on 60 billion characters to the 22-hour H100 pre-training run, and finally this 28-hour SFT marathon, building Akshara has been a masterclass in efficiency and localization.

We have proven that by focusing on data quality, utilizing cutting-edge optimizers like Muon, and respecting the specific structure of the Telugu language, we can build a world-class foundational model that is accessible and affordable.

The future of Indic AI isn't just "scaling up"—it's "tuning in."


Appendix:
First Set of Trials (Task-Specific Fine-Tuning):

These initial trials focused on fine-tuning the model on specific tasks individually, using the nanochat-telugu-d20-mid-smoltalk-100k checkpoint as the base. A short sketch of the batching arithmetic these settings imply follows the first entry below.

1. SmolTalk Fine-Tuning (nanochat-telugu-d20-sft-smoltalk-10k):

  • Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
  • Dataset: SmolTalk (10,000 examples)
  • Training Configuration:
    • Device Batch Size: 2
    • Data Type: bfloat16
    • Matrix Learning Rate: 0.02
    • Embedding Learning Rate: 0.2
    • Unembedding Learning Rate: 0.004
    • Initial Learning Rate Fraction: 0.02
    • Weight Decay: 0
    • Number of Epochs: 1
    • Target Examples Per Step: 32
    • Evaluation Frequency: Every 100 steps
    • Evaluation Metrics Frequency: Every 200 steps
    • Training Script: scripts.chat_sft_telugu
  • Training Details:
    • Hardware: 1 × NVIDIA A30 GPU
    • Total Training Steps: 311 steps
    • Total Training Time: 1,566 seconds (~26 minutes)
    • Final Training Loss: 0.82
    • Final Validation Loss: 0.84
    • Final Learning Rate Multiplier: 0.0064
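
A note on the batching arithmetic implied by these settings (a sketch of the inference, not the training loop itself):

```python
device_batch_size = 2           # sequences per forward/backward pass on the A30
target_examples_per_step = 32   # examples consumed per optimizer update

# Gradient accumulation bridges the gap between the two:
grad_accum_steps = target_examples_per_step // device_batch_size   # 32 // 2 = 16

# Rough step count for one epoch over the 10,000 SmolTalk examples:
approx_steps = 10_000 // target_examples_per_step                  # = 312
# This is close to the reported 311 steps; the small gap presumably comes from
# examples dropped at batching or filtering time.
```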

2. Summarization Fine-Tuning (nanochat-telugu-d20-sft-summary):

  • Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
  • Dataset: TeSum (IIIT Hyderabad summarization dataset)
  • Training Configuration:
    • Device Batch Size: 2
    • Data Type: bfloat16
    • Matrix Learning Rate: 0.02
    • Embedding Learning Rate: 0.2
    • Unembedding Learning Rate: 0.004
    • Initial Learning Rate Fraction: 0.02
    • Weight Decay: 0
    • Number of Epochs: 1
    • Target Examples Per Step: 32
    • Evaluation Frequency: Every 100 steps
    • Evaluation Metrics Frequency: Every 200 steps
    • Training Script: scripts.chat_sft_telugu_summary
  • Training Details:
    • Hardware: 1 × NVIDIA A30 GPU
    • Total Training Steps: 4,054 steps
    • Total Training Time: 17,880 seconds (~4.97 hours)
    • Final Training Loss: 0.71
    • Final Validation Loss: 0.85
    • Final Learning Rate Multiplier: 0.00049

3. Paraphrase Fine-Tuning (nanochat-telugu-d20-sft-paraphrase):

  • Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
  • Dataset: IndicParaphrase (Telugu portion)
  • Training Configuration:
    • Device Batch Size: 2
    • Data Type: bfloat16
    • Matrix Learning Rate: 0.02
    • Embedding Learning Rate: 0.2
    • Unembedding Learning Rate: 0.004
    • Initial Learning Rate Fraction: 0.02
    • Weight Decay: 0
    • Number of Epochs: 1
    • Target Examples Per Step: 32
    • Evaluation Frequency: Every 100 steps
    • Evaluation Metrics Frequency: Every 200 steps
    • Training Script: scripts.chat_sft_telugu_paraphrase
  • Training Details:
    • Hardware: 1 × NVIDIA A30 GPU
    • Total Training Steps: 21,760 steps
    • Total Training Time: 49,289 seconds (~13.69 hours)
    • Final Training Loss: 0.93
    • Final Validation Loss: 0.88
    • Final Learning Rate Multiplier: 0.000092

4. News Articles Fine-Tuning (nanochat-telugu-d20-sft-smoltalk-news-articles):

  • Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
  • Dataset: News articles (generating full articles from given titles and titles from given articles)
  • Training Configuration:
    • Device Batch Size: 2
    • Data Type: bfloat16
    • Matrix Learning Rate: 0.02
    • Embedding Learning Rate: 0.2
    • Unembedding Learning Rate: 0.004
    • Initial Learning Rate Fraction: 0.02
    • Weight Decay: 0
    • Number of Epochs: 1
    • Target Examples Per Step: 32
    • Evaluation Frequency: Every 100 steps
    • Evaluation Metrics Frequency: Every 200 steps
    • Training Script: scripts.chat_sft_telugu_news
  • Training Details:
    • Hardware: 1 × NVIDIA A30 GPU
    • Total Training Steps: 16,902 steps
    • Total Training Time: 46,983 seconds (~13.05 hours)
    • Final Training Loss: 1.13
    • Final Validation Loss: 0.84
    • Final Learning Rate Multiplier: 0.00012

5. Stories Fine-Tuning (nanochat-telugu-d20-sft-smoltalk-stories):

  • Base Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
  • Dataset: Telugu stories dataset
  • Training Configuration:
    • Device Batch Size: 2
    • Data Type: bfloat16
    • Matrix Learning Rate: 0.02
    • Embedding Learning Rate: 0.2
    • Unembedding Learning Rate: 0.004
    • Initial Learning Rate Fraction: 0.02
    • Weight Decay: 0
    • Number of Epochs: 1
    • Target Examples Per Step: 32
    • Evaluation Frequency: Every 100 steps
    • Evaluation Metrics Frequency: Every 200 steps
    • Training Script: scripts.chat_sft_telugu_stories
  • Training Details:
    • Hardware: 1 × NVIDIA A30 GPU
    • Total Training Steps: 11,704 steps
    • Total Training Time: 35,952 seconds (~9.99 hours)
    • Final Training Loss: 0.80
    • Final Validation Loss: 0.85
    • Final Learning Rate Multiplier: 0.00017

Second Trial (Combined Dataset Fine-Tuning):

Final SFT Checkpoint (nanochat-telugu-d20-sft-v1):

  • Base Checkpoint: nanochat-telugu-d20-mid-v1-akshara
  • Combined Datasets:
    • SmolTalk: Instruction-following conversations
    • Chat Personalised: Telugu identity conversations for bot personality
    • IndicParaphrase: Telugu paraphrase generation
    • Custom News: AYA Telugu news articles (article and title generation)
    • Custom Summary (TeSum): IIIT Hyderabad summarization dataset
    • Custom Summary (XL-Sum): Multilingual summarization dataset (Telugu portion)
    • Custom Story: Telugu tiny stories (200k examples)
  • Training Configuration:
    • Device Batch Size: 2
    • Data Type: bfloat16
    • Matrix Learning Rate: 0.02
    • Embedding Learning Rate: 0.2
    • Unembedding Learning Rate: 0.004
    • Initial Learning Rate Fraction: 0.02
    • Weight Decay: 0
    • Number of Epochs: 1
    • Target Examples Per Step: 32
    • Evaluation Frequency: Every 100 steps
    • Evaluation Metrics Frequency: Every 200 steps
    • Training Script: scripts.chat_sft_telugu_v1
  • Training Details:
    • Hardware: 1 × NVIDIA A30 GPU
    • Total Training Steps: 45,102 steps
    • Total Training Time: 102,047 seconds (~28.35 hours)
    • Final Training Loss: 0.99
    • Final Validation Loss: 0.89
    • Final Learning Rate Multiplier: 0.000044
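
One detail worth noting across these runs: the reported final learning-rate multipliers are consistent with a schedule that decays linearly to zero over the run, leaving roughly 2/total_steps at the last logged step (2/311 ≈ 0.0064, 2/45,102 ≈ 0.000044, and so on). A minimal sketch of that reading, offered as an inference from the logged numbers rather than the exact implementation:

```python
def lr_multiplier(step, total_steps):
    """Linear decay of the learning-rate multiplier from 1.0 to 0.0 over the run.
    This is an inferred schedule: it reproduces the logged final multipliers,
    but the actual warmup/decay details in the training scripts may differ."""
    return max(0.0, 1.0 - step / total_steps)

# The final multipliers logged for each run match this decay evaluated
# two steps before the end (i.e. 2 / total_steps):
for total in (311, 4054, 21760, 16902, 11704, 45102):
    print(total, round(lr_multiplier(total - 2, total), 6))
# -> 0.006431, 0.000493, 0.000092, 0.000118, 0.000171, 0.000044
```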