Bridging the Gap: The Mid-Training Phase of Akshara
Mid-training is the critical bridge that transforms a raw, next-token-predicting engine into a structured, instruction-following assistant.
For our Telugu model, nanochat-telugu-d20, mid-training was about more than just data—it was about shaping a personality, mastering summarization, and learning to follow complex prompts without losing its foundational Telugu roots.
What is Mid-Training?
Mid-training is a relatively new stage in the LLM pipeline, popularized by models like Llama 3 and OLMo. It sits between broad pre-training and task-specific Supervised Fine-Tuning (SFT).
The goal is to:
- Reduce the Syntactic Gap: Transition from predicting "what comes next in a book" to "how to answer a user's question."
- Enhance Specialized Skills: Deepen capabilities in specific domains like news reporting, creative writing, and summarization.
- Preserve General Knowledge: Unlike direct fine-tuning, which can cause "catastrophic forgetting," mid-training uses a diverse mixture of data to keep the model’s broad knowledge intact.
Trial 1: The First Spark of Instruction
We started with a "mini-run" using SmolTalk: 100,000 conversational data points translated into Telugu.
- Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
- The Result: In just 1.57 hours on a single NVIDIA A30, the model’s training loss dropped to 0.95. More importantly, its validation Bits-per-Byte (bpb) dropped to 0.276.
Technical Note: Bits-per-Byte (bpb) is a tokenizer-agnostic metric. It tells us how efficiently the model is compressing the information. A lower bpb means the model has developed a much stronger "internal model" of the language structure.
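As a rough illustration, bpb is just the summed next-token cross-entropy converted to bits and divided by the UTF-8 byte count of the evaluated text. This is a minimal sketch; the example values below (such as ~5 UTF-8 bytes per Telugu token) are assumptions, not measurements from the run.

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Convert summed next-token cross-entropy (in nats) into bits per byte.

    Dividing by UTF-8 bytes instead of tokens makes the metric comparable
    across tokenizers with different vocabularies.
    """
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / total_utf8_bytes

# Hypothetical: a loss of 0.95 nats/token at ~5 UTF-8 bytes/token
# gives roughly 0.27 bpb, in the same ballpark as the run above.
print(bits_per_byte(total_nll_nats=0.95 * 1_000_000, total_utf8_bytes=5 * 1_000_000))
```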
Trial 2: The Akshara Final Mixture
For the final mid-training phase, we created a comprehensive data "cocktail." This wasn't just raw text; we mapped diverse datasets into a standardized User-Assistant conversation format (a minimal sketch of this mapping follows the ingredient list below).
The Ingredients of Akshara:
- SmolTalk (Conversational): General instruction following.
- Custom News (Domain Knowledge): Writing articles based on titles and vice-versa.
- Summarization (TeSum & XL-Sum): We used five distinct Telugu prompts (e.g., "ఈ వ్యాసం యొక్క సారాంశం రాయండి") to teach the model how to condense long-form text.
- IndicParaphrase (Linguistic Flexibility): Using 10 different prompts to teach the model how to rewrite sentences while keeping the same meaning.
- Stories (Creative Writing): 200,000 "tiny stories" in Telugu to boost creativity.
- Identity & Personality (The Soul): A custom dataset designed to teach the bot who it is, what its name is, and its basic safety guardrails. We included this dataset twice in the mixture to ensure the model’s identity remained "sticky."
Hardware and Optimization
We continued using the Muon optimizer for matrix parameters. Muon’s ability to stabilize updates and improve "tail-end" learning—where the model learns rarer words and complex instructions—was vital during this stage.
Mid-Training Performance (Trial 2):
| Metric | Result |
| --- | --- |
| Hardware | 1 × NVIDIA A30 GPU |
| Total Training Time | ~6.83 hours |
| Total Training FLOPs | ~1.92 exaFLOPs |
| Final Training Loss | 0.89 |
| Final Validation bpb | 0.268 |
Despite the complex data mixture, the training throughput remained a steady 172 tokens/second, with the model reaching its target loss in roughly 1,000 steps.
The Outcome: The Akshara Foundation
The result of this phase is the Akshara mid-training checkpoint (nanochat-telugu-d20-mid-v1-akshara).
The model is no longer just a "Telugu predictor." It can now:
- Identify itself and its purpose.
- Summarize a news report.
- Paraphrase a complex sentence.
- Write a creative story.
It has successfully transitioned from a language model to a task-capable agent, setting the stage for the final phase: Supervised Fine-Tuning (SFT) and RLHF.
Appendix:
First Trial:
- Checkpoint: nanochat-telugu-d20-mid-smoltalk-100k
- Dataset: SmolTalk (100,000 data points translated from the base dataset HuggingFaceTB/smol-smoltalk)
- Format Conversion: Uses the dataset's native `messages` format directly. Each conversation contains alternating user and assistant messages, with an optional system message at the beginning. Messages are validated to ensure proper alternation (user, assistant, user, assistant, ...) and that each conversation has at least 2 messages after any system message (see the validation sketch below).
- Purpose: Initial experimentation with instruction-following tasks
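A minimal sketch of this validation rule, assuming plain dicts with a `role` key (illustrative only, not the repository's actual helper):

```python
def is_valid_conversation(messages: list[dict]) -> bool:
    """Allow an optional leading system message, then require strictly
    alternating user/assistant turns starting with the user, with at
    least two messages after any system message."""
    if messages and messages[0].get("role") == "system":
        messages = messages[1:]
    if len(messages) < 2:
        return False
    expected = "user"
    for msg in messages:
        if msg.get("role") != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True

# Example: a well-formed two-turn conversation passes the check.
print(is_valid_conversation([
    {"role": "user", "content": "…"},
    {"role": "assistant", "content": "…"},
]))  # True
```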
Training Configuration:
- Model: nanochat_telugu-560M
- Data Type: bfloat16
- Max Sequence Length: 2,048 tokens
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Initial Learning Rate Fraction: 1.0
- Evaluation Frequency: Every 150 steps
- Evaluation Tokens: 10,485,760 tokens
- Training Script: scripts.mid_train_telugu
Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Device Batch Size: 4
- Total Training Steps: 262 steps
- Total Training Time: 5,655.47 seconds (~1.57 hours)
- Total Training FLOPs: 479,639,957,384,724,500 (~479.64 petaFLOPs)
- Training Throughput: 365 tokens/second
- Average Step Time: 22.43 seconds
- Final Training Loss: 0.95
- Final Validation Bits Per Byte (bpb): 0.276
- Final Learning Rate Multiplier: 0.025
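As a quick consistency check (assuming no gradient accumulation, which the logs above do not state), the tokens processed per step implied by the batch configuration line up with throughput × average step time:

```python
# Numbers taken from the first-trial run above.
device_batch_size = 4
max_seq_len = 2048
throughput_tok_per_s = 365
avg_step_time_s = 22.43

tokens_per_step_config = device_batch_size * max_seq_len            # 8,192
tokens_per_step_measured = throughput_tok_per_s * avg_step_time_s   # ~8,187

print(tokens_per_step_config, round(tokens_per_step_measured))
```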
Second Trial (Final Mid-Training):
- Checkpoint: nanochat-telugu-d20-mid-v1-akshara
The final mid-training checkpoint combined multiple datasets to create a comprehensive training mixture. Each dataset was converted to a standardized user-assistant conversation format as follows:
- SmolTalk:
  - Dataset: Translated Telugu SmolTalk dataset from HuggingFaceTB/smol-smoltalk
  - Format Conversion: Uses the dataset's native `messages` format directly. Each conversation contains alternating user and assistant messages, with an optional system message at the beginning. Messages are validated to ensure proper alternation (user, assistant, user, assistant, ...).
  - Purpose: General instruction-following and conversational capabilities
- News Articles (CustomNewsDataset):
  - Dataset: Publicly available news article datasets and additional news article datasets provided by Viswam
  - Format Conversion:
    - Input format: Dataset contains `inputs` (user message) and `targets` (assistant message) fields
    - Conversion: Direct mapping where `inputs` → user message, `targets` → assistant message
    - No prompts added; the dataset already contains formatted user-assistant pairs
  - Purpose: News article generation tasks (writing articles for given titles and writing titles for given articles)
- Summarization Task:
  - TeSum Dataset (CustomSummaryDataset):
    - Dataset: Provided by IIIT Hyderabad
    - Format Conversion:
      - Input format: JSONL files with `cleaned_text` (array of sentences) and `summary` (array of sentences)
      - Processing: Arrays are joined into strings
      - Prompts Used (randomly selected per example, seeded for reproducibility; see the sketch after this dataset list):
        - "కింది వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెప్పండి" ("Briefly summarize the following article")
        - "ఈ వ్యాసం యొక్క సారాంశం రాయండి" ("Write a summary of this article")
        - "కింది వ్యాసాన్ని సంగ్రహించండి" ("Summarize the following article")
        - "వ్యాసం యొక్క ముఖ్యాంశాలను సంక్షిప్తంగా రాయండి" ("Briefly write the key points of the article")
        - "ఈ వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెయ్యండి" ("Summarize this article briefly")
      - Conversion: User message = `{selected_prompt}:\n\n{cleaned_text}`, Assistant message = `{summary}`
    - Purpose: Abstractive summarization of Telugu articles
  - XL-Sum Dataset (CustomSummaryXlsumDataset):
    - Dataset: csebuetnlp/xlsum, a large-scale multilingual abstractive summarization dataset covering 45 languages, including Telugu
    - Format Conversion:
      - Input format: Parquet or JSONL files with `text` (string) and `summary` (string) fields
      - Prompts Used: The same five prompts as TeSum, randomly selected per example, seeded for reproducibility
      - Conversion: User message = `{selected_prompt}:\n\n{text}`, Assistant message = `{summary}`
    - Purpose: Multilingual summarization capabilities with Telugu focus
- Paraphrase (IndicParaphraseParser):
  - Dataset: IndicParaphrase dataset (Telugu portion)
  - Format Conversion:
    - Input format: Dataset with `input`/`inputs` and `target`/`targets` fields
    - Prompts Used (randomly selected per example, seeded for reproducibility):
      - "ఈ వాక్యాన్ని వేరే పదాలతో మళ్లీ రాయండి" ("Rewrite this sentence in different words")
      - "ఈ వాక్యాన్ని మరొక విధంగా వ్యక్తపరచండి" ("Express this sentence in another way")
      - "ఈ వాక్యానికి సమానమైన అర్థం కలిగిన వేరే వాక్యం రాయండి" ("Write a different sentence with the same meaning as this one")
      - "ఈ వాక్యాన్ని పునర్వ్యాఖ్యానించండి." ("Paraphrase this sentence.")
      - "ఈ వాక్యాన్ని వేరే పదాలతో అదే అర్థంతో రాయండి" ("Write this sentence in different words with the same meaning")
      - "ఈ వాక్యాన్ని మరొక రూపంలో వ్యక్తపరచండి" ("Express this sentence in another form")
      - "ఈ వాక్యానికి సమానార్థక వాక్యం రాయండి" ("Write a synonymous sentence for this one")
      - "ఈ వాక్యాన్ని వేరే విధంగా రూపొందించండి" ("Formulate this sentence differently")
      - "ఈ వాక్యాన్ని మరొక శైలిలో రాయండి" ("Write this sentence in another style")
      - "ఈ వాక్యాన్ని పునర్వివరణ చేయండి" ("Restate this sentence")
    - Conversion: User message = `{selected_prompt}:\n\n{input}`, Assistant message = `{target}`
  - Purpose: Paraphrase generation and text variation
- Stories (CustomStoryDataset):
  - Dataset: Telugu tiny stories dataset (200k examples)
  - Format Conversion:
    - Input format: DatasetDict with a `messages` field containing a list of message objects
    - Conversion: Uses the dataset's native `messages` format directly. Messages are validated to ensure proper structure with `role` and `content` fields.
    - No prompts added; the dataset already contains formatted conversations
  - Purpose: Story generation and creative writing capabilities
- Personality and Identity Dataset (CustomJSON):
  - Dataset: Custom dataset generated specifically for establishing bot personality, identity, and basic guardrails
  - Format Conversion:
    - Input format: JSON file with a `messages` field containing conversation arrays
    - Conversion: Uses the dataset's native `messages` format directly. Each conversation must have at least 2 messages with alternating user and assistant roles.
    - No prompts added; the dataset already contains formatted identity and personality conversations
  - Purpose: Establishing bot personality, identity, and basic guardrails for safe and consistent responses
  - Note: This dataset is included twice in the training mixture to emphasize the importance of personality and identity consistency in the model's responses
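The seeded per-example prompt selection and the doubled identity dataset can be approximated as below. This is a sketch under assumed names (the seed value, dictionary keys, and function names are illustrative); the actual logic lives in the parser classes such as CustomSummaryDataset.

```python
import random

# Telugu summarization prompts listed above.
SUMMARY_PROMPTS = [
    "కింది వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెప్పండి",
    "ఈ వ్యాసం యొక్క సారాంశం రాయండి",
    "కింది వ్యాసాన్ని సంగ్రహించండి",
    "వ్యాసం యొక్క ముఖ్యాంశాలను సంక్షిప్తంగా రాయండి",
    "ఈ వ్యాసాన్ని సంక్షిప్తంగా సారాంశం చెయ్యండి",
]

def pick_prompt(example_index: int, prompts: list[str], seed: int = 42) -> str:
    # A per-example RNG derived from a fixed seed makes the prompt choice
    # reproducible across runs (the seed value here is an assumption).
    rng = random.Random(seed + example_index)
    return rng.choice(prompts)

def build_mixture(converted: dict[str, list]) -> list:
    # Concatenate all converted datasets, then append the identity/personality
    # data a second time, as noted above, to keep the model's identity "sticky".
    mixture = []
    for rows in converted.values():
        mixture.extend(rows)
    mixture.extend(converted["identity"])  # hypothetical key for the identity dataset
    return mixture
```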
Training Configuration:
- Model: nanochat_telugu-560M
- Data Type: bfloat16
- Max Sequence Length: 2,048 tokens
- Matrix Learning Rate: 0.02
- Embedding Learning Rate: 0.2
- Initial Learning Rate Fraction: 1.0
- Evaluation Frequency: Every 150 steps
- Evaluation Tokens: 10,485,760 tokens
- Training Script: scripts.mid_train_telugu_v1
Training Details:
- Hardware: 1 × NVIDIA A30 GPU
- Device Batch Size: 2
- Total Training Steps: 1,047 steps
- Total Training Time: 24,586.11 seconds (~6.83 hours)
- Total Training FLOPs: 1,916,729,142,678,651,000 (~1.92 exaFLOPs)
- Training Throughput: 172 tokens/second
- Average Step Time: 23.72 seconds
- Final Training Loss: 0.89
- Final Validation Bits Per Byte (bpb): 0.268
- Final Learning Rate Multiplier: 0.031