Bridging the Gap: The Mid-Training Phase of Akshara

Mid-training is the critical bridge that transforms a raw, next-token-predicting engine into a structured, instruction-following assistant.

For our Telugu model, nanochat-telugu-d20, mid-training was about more than just data—it was about shaping a personality, mastering summarization, and learning to follow complex prompts without losing its foundational Telugu roots.


What is Mid-Training?

Mid-training is a relatively new stage in the LLM pipeline, popularized by models like Llama 3 and OLMo. It sits between broad pre-training and task-specific Supervised Fine-Tuning (SFT).

The goal is to:

  1. Reduce the Syntactic Gap: Transition from predicting "what comes next in a book" to "how to answer a user's question."
  2. Enhance Specialized Skills: Deepen capabilities in specific domains like news reporting, creative writing, and summarization.
  3. Preserve General Knowledge: Unlike direct fine-tuning, which can cause "catastrophic forgetting," mid-training uses a diverse mixture of data to keep the model’s broad knowledge intact.

Trial 1: The First Spark of Instruction

We started with a "mini-run" using SmolTalk: 100,000 conversational examples translated into Telugu from the base dataset HuggingFaceTB/smol-smoltalk.

  • Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
  • The Result: In just 1.57 hours on an NVIDIA A30, the model's training loss dropped to 0.95. More importantly, its validation Bits-per-Byte (bpb) dropped to 0.276.
Technical Note: Bits-per-Byte (bpb) is a tokenizer-agnostic metric: it measures how many bits the model needs, on average, to encode each byte of the underlying text, so scores stay comparable even if the tokenizer changes. A lower bpb means the model has developed a much stronger "internal model" of the language structure.
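
As a rough illustration, here is how bpb can be computed from a model's summed cross-entropy loss. This is a minimal sketch, assuming the loss is measured in nats and that the UTF-8 byte length of the evaluation text is known; it is not the exact evaluation code used in nanochat.

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats, over all predicted tokens)
    into bits per byte of the underlying UTF-8 text. Dividing by bytes rather
    than tokens is what makes the metric comparable across tokenizers."""
    total_bits = total_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Toy example: 1,000 tokens at an average loss of 0.89 nats, spanning 4,000
# bytes of Telugu text (most Telugu characters are 3 bytes in UTF-8).
print(bits_per_byte(total_loss_nats=0.89 * 1000, total_bytes=4000))  # ≈ 0.32
```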

Trial 2: The Akshara Final Mixture

For the final mid-training phase, we created a comprehensive data "cocktail." This wasn't just raw text: we mapped diverse datasets into a standardized User-Assistant conversation format (a short sketch of this mapping follows the ingredient list below).

The Ingredients of Akshara:

  • SmolTalk (Conversational): General instruction following.
  • Custom News (Domain Knowledge): Writing articles based on titles and vice-versa.
  • Summarization (TeSum & XL-Sum): We used five distinct Telugu prompts (e.g., "ఈ వ్యాసం యొక్క సారాంశం రాయండి", "Write a summary of this article") to teach the model how to condense long-form text.
  • IndicParaphrase (Linguistic Flexibility): Using 10 different prompts to teach the model how to rewrite sentences while keeping the same meaning.
  • Stories (Creative Writing): 200,000 "tiny stories" in Telugu to boost creativity.
  • Identity & Personality (The Soul): A custom dataset designed to teach the bot who it is, what its name is, and its basic safety guardrails. We included this dataset twice in the mixture to ensure the model’s identity remained "sticky."
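
To make the mapping concrete, here is a minimal sketch of how one summarization example becomes a User-Assistant conversation. The five Telugu prompts are the ones listed in the appendix; the function name, record fields, and seed value are illustrative, not the project's actual code.

```python
import random

# The five summarization prompts listed in the appendix.
SUMMARY_PROMPTS = [
    "కింది వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెప్పండి",
    "ఈ వ్యాసం యొక్క సారాంశం రాయండి",
    "కింది వ్యాసాన్ని సంగ్రహించండి",
    "వ్యాసం యొక్క ముఖ్యాంశాలను సంక్షిప్తంగా రాయండి",
    "ఈ వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెయ్యండి",
]

def to_conversation(article: str, summary: str, rng: random.Random) -> list[dict]:
    """Wrap one (article, summary) pair in the standard user-assistant format,
    choosing a prompt with a seeded RNG so the mixture is reproducible."""
    prompt = rng.choice(SUMMARY_PROMPTS)
    return [
        {"role": "user", "content": f"{prompt}:\n\n{article}"},
        {"role": "assistant", "content": summary},
    ]

rng = random.Random(42)  # illustrative seed
example = to_conversation("<Telugu article text>", "<Telugu summary>", rng)
```

The news, paraphrase, and story datasets go through the same kind of wrapper, differing only in which fields feed the user and assistant turns.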

Hardware and Optimization

We continued using the Muon optimizer for matrix parameters. Muon’s ability to stabilize updates and improve "tail-end" learning—where the model learns rarer words and complex instructions—was vital during this stage.
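
Concretely, the appendix lists separate learning rates for matrix parameters (0.02, handled by Muon) and embeddings (0.2). The sketch below shows one way such a split might be wired up, with AdamW standing in for the embedding optimizer; the import path, the momentum value, the choice of AdamW, and the name-based filter are assumptions for illustration, not nanochat's exact code.

```python
import torch
from torch import nn

from nanochat.muon import Muon  # assumed import path; nanochat bundles its own Muon

def build_optimizers(model: nn.Module):
    """Split parameters: 2-D weight matrices go to Muon, everything else
    (embeddings, unembedding, and any remaining parameters) goes to AdamW."""
    matrix_params, other_params = [], []
    for name, param in model.named_parameters():
        # Illustrative filter; the real split in the training script may differ.
        if param.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(param)
        else:
            other_params.append(param)
    muon = Muon(matrix_params, lr=0.02, momentum=0.95)  # matrix LR from the appendix
    adamw = torch.optim.AdamW(other_params, lr=0.2)     # embedding LR from the appendix
    return muon, adamw
```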

Mid-Training Performance (Trial 2):

Metric               | Result
---------------------|--------------------
Hardware             | 1 × NVIDIA A30 GPU
Total Training Time  | ~6.83 hours
Total Training FLOPs | ~1.92 exaFLOPs
Final Training Loss  | 0.89
Final Validation bpb | 0.268

Despite the complex data mixture, the training throughput remained a steady 172 tokens/second, with the model reaching its target loss in roughly 1,000 steps.
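
These numbers are internally consistent. Assuming a device batch size of 2 and a sequence length of 2,048 tokens (both from the appendix) with no gradient accumulation, the step count and throughput reproduce the reported wall-clock time:

```python
tokens_per_step = 2 * 2048                 # device batch size × max sequence length
total_tokens = tokens_per_step * 1047      # total training steps
hours = total_tokens / 172 / 3600          # at ~172 tokens/second
print(f"{total_tokens:,} tokens, ~{hours:.1f} hours")  # 4,288,512 tokens, ~6.9 hours
```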


The Outcome: The Akshara Foundation

The result of this phase is the Akshara mid-training checkpoint (nanochat-telugu-d20-mid-v1-akshara).

The model is no longer just a "Telugu predictor." It can now:

  • Identify itself and its purpose.
  • Summarize a news report.
  • Paraphrase a complex sentence.
  • Write a creative story.

It has successfully transitioned from a language model to a task-capable agent, setting the stage for the final phase: Supervised Fine-Tuning (SFT) and RLHF.


Appendix:
First Trial:

  • Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
  • Dataset: SmolTalk
    • Dataset: 100,000 data points translated into Telugu from the base dataset HuggingFaceTB/smol-smoltalk
    • Format Conversion: Uses the dataset's native messages format directly. Each conversation contains alternating user and assistant messages, with an optional system message at the beginning. Messages are validated to ensure proper alternation (user, assistant, user, assistant, ...) and that each conversation has at least 2 messages after any system message (a sketch of this check follows this list).
    • Purpose: Initial experimentation with instruction-following tasks
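
A minimal sketch of the alternation check described above; the function name and return convention are illustrative, not the project's actual validation code.

```python
def is_valid_conversation(messages: list[dict]) -> bool:
    """Validate one SmolTalk-style conversation: an optional leading system
    message, followed by at least two messages that strictly alternate
    user / assistant (starting with user)."""
    if messages and messages[0].get("role") == "system":
        messages = messages[1:]  # drop the optional system turn
    if len(messages) < 2:
        return False
    expected = ("user", "assistant")
    return all(msg.get("role") == expected[i % 2] for i, msg in enumerate(messages))
```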

Training Configuration:

  • Model: nanochat_telugu-560M
  • Data Type: bfloat16
  • Max Sequence Length: 2,048 tokens
  • Matrix Learning Rate: 0.02
  • Embedding Learning Rate: 0.2
  • Initial Learning Rate Fraction: 1.0
  • Evaluation Frequency: Every 150 steps
  • Evaluation Tokens: 10,485,760 tokens
  • Training Script: scripts.mid_train_telugu
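
Collected into code form, this configuration might look like the sketch below; the dataclass and its field names are illustrative, not the actual structure used by scripts.mid_train_telugu.

```python
from dataclasses import dataclass

@dataclass
class MidTrainConfig:
    # Values copied from the list above; field names are illustrative.
    model: str = "nanochat_telugu-560M"
    dtype: str = "bfloat16"
    max_seq_len: int = 2048
    matrix_lr: float = 0.02       # Muon learning rate for weight matrices
    embedding_lr: float = 0.2     # learning rate for embedding parameters
    init_lr_frac: float = 1.0
    eval_every: int = 150         # evaluate every 150 steps
    eval_tokens: int = 10_485_760
```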

Training Details:

  • Hardware: 1 × NVIDIA A30 GPU
  • Device Batch Size: 4
  • Total Training Steps: 262 steps
  • Total Training Time: 5,655.47 seconds (~1.57 hours)
  • Total Training FLOPs: 479,639,957,384,724,500 (~479.64 petaFLOPs)
  • Training Throughput: 365 tokens/second
  • Average Step Time: 22.43 seconds
  • Final Training Loss: 0.95
  • Final Validation Bits Per Byte (bpb): 0.276
  • Final Learning Rate Multiplier: 0.025

Second Trial (Final Mid-Training):

  • Checkpoint: nanochat-telugu-d20-mid-v1-akshara

The final mid-training checkpoint combined multiple datasets to create a comprehensive training mixture. Each dataset was converted to a standardized user-assistant conversation format as follows:

  • SmolTalk:

    • Dataset: SmolTalk dataset translated into Telugu from HuggingFaceTB/smol-smoltalk
    • Format Conversion: Uses the dataset's native messages format directly. Each conversation contains alternating user and assistant messages, with an optional system message at the beginning. Messages are validated to ensure proper alternation (user, assistant, user, assistant, ...).
    • Purpose: General instruction-following and conversational capabilities
  • News Articles (CustomNewsDataset):

    • Dataset: Publicly available news-article datasets, plus additional news article data provided by Viswam
    • Format Conversion:
      • Input format: Dataset contains inputs (user message) and targets (assistant message) fields
      • Conversion: Direct mapping where inputs → user message, targets → assistant message
      • No prompts added; the dataset already contains formatted user-assistant pairs
    • Purpose: News article generation tasks (writing articles for given titles and writing titles for given articles)
  • Summarization Task:

    • TeSum Dataset (CustomSummaryDataset):

      • Dataset: Provided by IIIT Hyderabad
      • Format Conversion:
        • Input format: JSONL files with cleaned_text (array of sentences) and summary (array of sentences)
        • Processing: Arrays are joined into strings
        • Prompts Used (randomly selected per example, seeded for reproducibility):
          1. "కింది వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెప్పండి"
          2. "ఈ వ్యాసం యొక్క సారాంశం రాయండి"
          3. "కింది వ్యాసాన్ని సంగ్రహించండి"
          4. "వ్యాసం యొక్క ముఖ్యాంశాలను సంక్షిప్తంగా రాయండి"
          5. "ఈ వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెయ్యండి"
        • Conversion: User message = {selected_prompt}:\n\n{cleaned_text}, Assistant message = {summary}
      • Purpose: Abstractive summarization of Telugu articles
    • XL-Sum Dataset (CustomSummaryXlsumDataset):

      • Dataset: csebuetnlp/xlsum - A large-scale multilingual abstractive summarization dataset covering 45 languages, including Telugu
      • Format Conversion:
        • Input format: Parquet or JSONL files with text (string) and summary (string) fields
        • Prompts Used: the same five prompts as TeSum, randomly selected per example and seeded for reproducibility
        • Conversion: User message = {selected_prompt}:\n\n{text}, Assistant message = {summary}
      • Purpose: Multilingual summarization capabilities with Telugu focus
  • Paraphrase (IndicParaphraseParser):

    • Dataset: IndicParaphrase dataset (Telugu portion)
    • Format Conversion:
      • Input format: Dataset with input/inputs and target/targets fields
      • Prompts Used (randomly selected per example, seeded for reproducibility):
        1. "ఈ వాక్యాన్ని వేరే పదాలతో మళ్లీ రాయండి"
        2. "ఈ వాక్యాన్ని మరొక విధంగా వ్యక్తపరచండి"
        3. "ఈ వాక్యానికి సమానమైన అర్థం కలిగిన వేరే వాక్యం రాయండి"
        4. "ఈ వాక్యాన్ని పునర్వ్యాఖ్యానించండి."
        5. "ఈ వాక్యాన్ని వేరే పదాలతో అదే అర్థంతో రాయండి"
        6. "ఈ వాక్యాన్ని మరొక రూపంలో వ్యక్తపరచండి"
        7. "ఈ వాక్యానికి సమానార్థక వాక్యం రాయండి"
        8. "ఈ వాక్యాన్ని వేరే విధంగా రూపొందించండి"
        9. "ఈ వాక్యాన్ని మరొక శైలిలో రాయండి"
        10. "ఈ వాక్యాన్ని పునర్వివరణ చేయండి"
      • Conversion: User message = {selected_prompt}:\n\n{input}, Assistant message = {target}
    • Purpose: Paraphrase generation and text variation
  • Stories (CustomStoryDataset):

    • Dataset: Telugu tiny stories dataset (200k examples)
    • Format Conversion:
      • Input format: DatasetDict with messages field containing list of message objects
      • Conversion: Uses the dataset's native messages format directly. Messages are validated to ensure proper structure with role and content fields.
      • No prompts added; the dataset already contains formatted conversations
    • Purpose: Story generation and creative writing capabilities
  • Personality and Identity Dataset (CustomJSON):

    • Dataset: Custom dataset generated specifically for establishing bot personality, identity, and basic guardrails
    • Format Conversion:
      • Input format: JSON file with messages field containing conversation arrays
      • Conversion: Uses the dataset's native messages format directly. Each conversation must have at least 2 messages with alternating user and assistant roles.
      • No prompts added; the dataset already contains formatted identity and personality conversations
    • Purpose: Establishing bot personality, identity, and basic guardrails for safe and consistent responses
    • Note: This dataset is included twice in the training mixture to emphasize the importance of personality and identity consistency in the model's responses (see the mixture-assembly sketch after this list)
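
A minimal sketch of how the mixture might be assembled once each dataset has been converted: the already-formatted conversation lists are concatenated, the identity data is added a second time, and the result is shuffled with a fixed seed. The function and the seed value are hypothetical, not the actual classes used by the training script.

```python
import random

def build_mixture(smoltalk, news, tesum, xlsum, paraphrase, stories, identity,
                  seed: int = 42) -> list[list[dict]]:
    """Concatenate converted conversation lists into one training mixture.
    Each argument is a list of conversations in the user-assistant format;
    the identity/personality data is included twice so it stays 'sticky'."""
    mixture = []
    for dataset in (smoltalk, news, tesum, xlsum, paraphrase, stories,
                    identity, identity):  # identity dataset included twice on purpose
        mixture.extend(dataset)
    random.Random(seed).shuffle(mixture)  # seeded shuffle for reproducibility
    return mixture
```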

Training Configuration:

  • Model: nanochat_telugu-560M
  • Data Type: bfloat16
  • Max Sequence Length: 2,048 tokens
  • Matrix Learning Rate: 0.02
  • Embedding Learning Rate: 0.2
  • Initial Learning Rate Fraction: 1.0
  • Evaluation Frequency: Every 150 steps
  • Evaluation Tokens: 10,485,760 tokens
  • Training Script: scripts.mid_train_telugu_v1

Training Details:

  • Hardware: 1 × NVIDIA A30 GPU
  • Device Batch Size: 2
  • Total Training Steps: 1,047 steps
  • Total Training Time: 24,586.11 seconds (~6.83 hours)
  • Total Training FLOPs: 1,916,729,142,678,651,000 (~1.92 exaFLOPs)
  • Training Throughput: 172 tokens/second
  • Average Step Time: 23.72 seconds
  • Final Training Loss: 0.89
  • Final Validation Bits Per Byte (bpb): 0.268
  • Final Learning Rate Multiplier: 0.031