Sarvam 1: The First Indian Language Large Language Model (LLM)
Overview of Indian Language LLM Landscape
The development of large language models (LLMs) has shown impressive capabilities across various tasks, but the majority of these advancements have focused on English and other widely spoken languages. This English-centric approach leaves a significant technological gap for billions of Indian language speakers. Although some models have been adapted to Indic languages through additional pretraining, fully multilingual models like BLOOM remain rare, and their effectiveness is limited by poor token efficiency on Indic scripts and a scarcity of high-quality training data.
What is Sarvam-1?
Sarvam-1 is a groundbreaking 2-billion parameter language model specifically optimized for ten major Indian languages and English. Built from the ground up with careful data curation, Sarvam-1 achieves exceptional performance despite its relatively compact size, addressing critical challenges in Indic language modeling:
1. Token Efficiency: Multilingual models often have high token requirements for Indic scripts, sometimes needing 4-8 tokens per word compared to English’s average of 1.4 tokens per word. Sarvam-1’s tokenizer provides significantly improved efficiency, achieving a token fertility rate of 1.4-2.1 across all supported languages.
2. Data Quality: While some web-crawled Indic language data exists, it often lacks quality and depth. Sarvam-1 addresses this with a high-quality, synthetic training corpus of 2 trillion tokens across ten major Indic languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
Benchmark Performance
Despite its compact size, Sarvam-1 performs exceptionally well across standard benchmarks, excelling in Indic language tasks and achieving state-of-the-art performance for models in its class. It outperforms larger models such as Gemma-2-2B and Llama-3.2-3B on multiple benchmarks, including MMLU, Arc-Challenge, and IndicGenBench, and even compares favorably to Llama 3.1 8B. Sarvam-1 is particularly notable for its 4-6x faster inference speed compared to larger models, making it ideal for practical applications, including deployment on edge devices. The model is available for download on the Hugging Face Hub.
Sarvam 2T: An Indic Pretraining Corpus
Addressing the Data Gap in Indian Language Modeling
A significant challenge in developing high-quality language models for Indian languages is the lack of training data. Existing datasets like Sangraha are often limited in depth, diversity, and quality. Sarvam’s solution to this problem is Sarvam-2T, a massive 2-trillion-token corpus tailored for Indic languages.
Data Quality and Composition
Sarvam-2T was meticulously designed to support both monolingual and multilingual tasks with balanced representation across domains. Notable characteristics include:
Document Quality: Sarvam-2T’s documents are on average twice as long as typical web data and contain three times more high-quality samples, with significantly reduced repetition rates and improved coherence.
Content Distribution: Scientific and technical content is represented at roughly eight times the proportion found in typical web datasets. This diverse distribution enhances the model’s abilities in complex reasoning and specialized knowledge tasks.
Sensitive Content Control: Sensitive topics are reduced by half to create a balanced yet responsible training corpus.
Unique Composition for Better Language Representation
Hindi comprises around 20% of the corpus, while the remaining 80% is distributed almost equally among the other supported languages. For robustness, Sarvam-2T includes English tokens and a substantial amount of code across various programming languages, balancing language tasks and coding capabilities.
Model Design and Architecture
Tokenizer Optimization
Sarvam-1’s custom tokenizer, featuring 68,096 tokens (with 4,096 reserved for future use), achieves low fertility rates across all supported languages, making it highly efficient. Because each token carries more information, Sarvam-2T’s 2 trillion tokens provide a training signal roughly equivalent to 6-8 trillion tokens under a conventional tokenizer.
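As a rough illustration of how token fertility can be measured, the sketch below counts tokens per whitespace-separated word for a few sample sentences. The repository name sarvamai/sarvam-1 is an assumption about where the tokenizer is published; any Hugging Face tokenizer can be dropped in.

```python
# Rough sketch: estimating token fertility (tokens per word) for a tokenizer.
# The repo id "sarvamai/sarvam-1" is an assumed Hub location; substitute any
# tokenizer checkpoint you have access to.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")

samples = {
    "English": "The weather is very pleasant today.",
    "Hindi": "आज मौसम बहुत सुहावना है।",
    "Tamil": "இன்று வானிலை மிகவும் இனிமையாக உள்ளது.",
}

for language, text in samples.items():
    words = text.split()
    tokens = tokenizer.encode(text, add_special_tokens=False)
    print(f"{language}: {len(tokens) / len(words):.2f} tokens per word")
```

A fertility close to 1-2 tokens per word on Indic text is the property the custom tokenizer is designed for; generic multilingual tokenizers often land in the 4-8 range on the same sentences.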
Architecture Overview
Following established best practices, Sarvam-1’s architecture is both deep and thin, a design choice backed by recent research to enhance model effectiveness. Key hyperparameters include:
• Hidden size: 2048
• Intermediate size: 11,008
• Attention heads: 16
• Hidden layers: 28
• Key-value heads: 8
• Maximum position embeddings: 8,192
Sarvam-1 employs SwiGLU as its hidden activation function and rotary positional embeddings (RoPE) with a theta value of 10,000, utilizing grouped-query attention and bfloat16 mixed-precision for improved inference efficiency.
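To see these settings in one place, here is a minimal sketch of an equivalent Llama-style configuration built with Hugging Face transformers; it approximates the listed hyperparameters and is not the official model definition.

```python
# Minimal sketch of a Llama-style config mirroring the hyperparameters above.
# Illustrative only; this is not Sarvam-1's official model definition.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=68_096,               # custom tokenizer (4,096 tokens reserved)
    hidden_size=2048,
    intermediate_size=11_008,
    num_hidden_layers=28,
    num_attention_heads=16,
    num_key_value_heads=8,           # grouped-query attention
    max_position_embeddings=8192,
    hidden_act="silu",               # SwiGLU: SiLU-gated feed-forward blocks
    rope_theta=10_000.0,             # rotary positional embeddings (RoPE)
    torch_dtype="bfloat16",
)
print(config)
```

The deep-and-thin shape shows up in the ratio of layers (28) to hidden size (2048), which recent small-model work favors over wider, shallower stacks at this scale.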
Training Infrastructure
Sarvam-1 was trained on Yotta’s Shakti cluster, with a total of 1,024 GPUs over a five-day period using NVIDIA’s NeMo framework, which provided critical optimizations for large-scale model training.
Evaluation Metrics and Results
Academic Benchmarks Adapted for Indic Languages
Evaluating large language models for Indic languages is challenging due to limited standardized benchmarks. Sarvam-1’s evaluation combines adapted academic benchmarks and Indic-relevant tasks to gauge its effectiveness. Benchmarks include:
MMLU (Massive Multitask Language Understanding): Tests broad knowledge across domains through multiple-choice questions.
ARC-Challenge (AI2 Reasoning Challenge): Grade-school level questions for assessing reasoning capabilities.
BoolQ: A binary question-answering task that tests general knowledge and reasoning.
TriviaQA: Measures factual retrieval through a multiple-choice adaptation.
Sarvam-1 performs well in Indic languages according to these metrics, frequently surpassing Llama 3.1 8B. It achieves an average score of 86.11 on TriviaQA for Indic languages, well above Llama 3.1 8B’s 61.47. It also achieves competitive scores on MMLU and ARC-Challenge (38.22 and 46.71, respectively), with BoolQ scores hovering around 62 across all languages.
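Multiple-choice benchmarks such as MMLU and ARC-Challenge are typically scored for base models by comparing the log-likelihood the model assigns to each answer option. The sketch below shows that idea on a single toy question; the model id, prompt format, and the question itself are assumptions for illustration, not the exact evaluation harness behind the numbers above.

```python
# Sketch: log-likelihood scoring of answer options, the usual way base models
# are evaluated on MMLU/ARC-style multiple-choice benchmarks. The model id,
# prompt format, and sample question are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-1"  # assumed Hub location
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

question = "प्रकाश की गति लगभग कितनी है?"  # "Roughly what is the speed of light?"
choices = [
    "3 लाख किलोमीटर प्रति सेकंड",
    "3 हज़ार किलोमीटर प्रति सेकंड",
    "300 किलोमीटर प्रति सेकंड",
]

def choice_logprob(question: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the question."""
    prompt_ids = tokenizer(question + "\nउत्तर: ", return_tensors="pt").input_ids
    choice_ids = tokenizer(choice, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # prediction for each next token
    targets = input_ids[0, 1:]
    start = prompt_ids.shape[1] - 1  # positions whose targets are the choice tokens
    return sum(log_probs[i, targets[i]].item() for i in range(start, targets.shape[0]))

scores = [choice_logprob(question, c) for c in choices]
print("Predicted answer:", choices[scores.index(max(scores))])
```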
IndicGenBench Performance
Sarvam-1 was also evaluated using Google’s IndicGenBench, which includes the following tasks:
CrossSum: Cross-lingual summarization from English to Indic languages.
Flores: English-to-Indic language translation.
XORQA: Cross-lingual question answering.
XQUAD: A question-answering dataset entirely in Indic languages.
In Flores English-to-Indic translation, Sarvam-1 achieves an impressive chrF++ score of 46.81, outperforming baseline models, including Llama 3.1 8B, which scored 34.23. The model also demonstrates strong performance in cross-lingual question answering (XORQA) and summarization (CrossSum), underscoring its versatility and robustness in multilingual applications.
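The chrF++ metric reported above can be computed with sacrebleu, where it corresponds to the CHRF metric with word n-grams enabled (word_order=2). The sentences below are placeholders, not Flores data.

```python
# Sketch: computing chrF++ (the Flores metric reported above) with sacrebleu.
# chrF++ is sacrebleu's CHRF metric with word n-grams enabled (word_order=2).
# The hypothesis/reference sentences are placeholders, not Flores data.
from sacrebleu.metrics import CHRF

hypotheses = ["आज मौसम बहुत अच्छा है।"]
references = [["आज का मौसम बहुत सुहावना है।"]]  # one reference stream, parallel to hypotheses

chrf_pp = CHRF(word_order=2)
print(chrf_pp.corpus_score(hypotheses, references))  # prints a line like "chrF2++ = ..."
```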
Use Case Example: Translation
To illustrate Sarvam-1’s practical utility, the model was fine-tuned on English-to-Indic translation tasks using the BPCC dataset and evaluated on IN22-Gen. Sarvam-1’s BLEU scores (~20) are comparable to significantly larger models, with 4-6x faster inference speeds, making it cost-effective for production environments and ideal for deployment in real-world applications, including on edge devices.
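A minimal sketch of how such a fine-tuned checkpoint might be prompted for translation and scored with BLEU follows; the checkpoint path, prompt format, and placeholder reference are assumptions, not the published fine-tuning recipe.

```python
# Sketch: greedy English-to-Hindi generation with a fine-tuned causal LM,
# followed by a BLEU check with sacrebleu. The checkpoint path, prompt format,
# and reference sentence are assumptions, not the published recipe.
import torch
from sacrebleu.metrics import BLEU
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/sarvam-1-translation-finetune"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

prompt = "Translate from English to Hindi: The weather is pleasant today.\nTranslation:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
translation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
).strip()
print("Model output:", translation)

# Corpus-level BLEU against a single placeholder reference.
print(BLEU().corpus_score([translation], [["आज मौसम सुहावना है।"]]))
```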
Future Prospects and Community Engagement
Sarvam-1 represents a pivotal step toward developing advanced Indic language models, demonstrating strong performance, high efficiency, and extensive applicability. With superior inference efficiency and robust handling of Indic languages, Sarvam-1 is particularly suitable for practical applications, empowering developers to build diverse solutions across sectors.
Acknowledgments
Sarvam AI extends sincere gratitude to its partners and supporters:
NVIDIA: For providing expertise with the NeMo codebase, which was essential in optimizing large-scale model training.
Yotta: For offering access to the Shakti GPU cluster, instrumental in training Sarvam-1 at scale.
AI4Bharat: For its contributions to open-source language resources, which greatly supported the development of Sarvam-1.
Sarvam-1’s development represents a collaborative effort to bridge the linguistic gap for Indian language speakers, empowering diverse applications and encouraging community innovation. We’re excited to see how the community harnesses Sarvam-1’s capabilities!
Sarvam AI and Yotta Partnership: Unveiling Generative AI Solutions
Strategic Collaboration with Yotta
Sarvam AI, a Lightspeed-backed AI startup, has partnered with AI infrastructure provider Yotta to introduce its first suite of generative AI tools and models, targeting both enterprises and developers. The debut follows Sarvam’s $41 million funding round raised the previous year.
Key Offerings and Innovations
Sarvam’s product lineup includes:
- Sarvam Agents: Voice-enabled, multilingual, action-oriented business agents that integrate seamlessly through telephone, WhatsApp, or in-app, available in ten Indian languages.
- Sarvam Models and Developer API: A suite of models accessible to developers and enterprises, complemented by open-source solutions.
- Shuka 1.0: India’s first open-source AudioLM, extending the Llama 8B model to support Indian language voice-in and text-out functionalities, outperforming frontier models.
- A1 for Legal Firms: A generative AI tool tailored for legal professionals.
The Sarvam 2B model, exclusively trained on Yotta’s infrastructure, marks a milestone for AI developed and deployed within India. Sarvam’s models operate on high-performance GPUs from Yotta, an arrangement made possible through Yotta’s collaboration with NVIDIA to power AI services in India.
Yotta’s Data Center Capabilities
Supported by the Hiranandani Group, Yotta operates Tier III and Tier IV data centers in India, providing colocation and hyperscale services along with cloud and managed service offerings. Yotta’s facilities have a current capacity of 33 megawatts, with plans for expansion to 890 megawatts. This infrastructure underpins high-performance AI solutions like Sarvam’s, which are currently hosted on Yotta and expected soon on GCP and Azure.
Unique Features and Market Impact
Sarvam AI’s solutions, including the cost-effective Sarvam Agents (offered at Rs. 1 per minute compared to Rs. 10 per minute for human agents), cater to industries such as BFSI, healthcare, legal tech, and other enterprises. Sarvam also launched India’s first open-source small Indic language model, trained on an extensive 4-trillion-token internal dataset and tailored for efficient language representation across 10 Indian languages.
Partnership with Meta: Enhanced WhatsApp Enterprise Solutions
In collaboration with Meta, Sarvam introduced a customized WhatsApp stack for enterprises. This offering provides conversation history, business instructions, targeted marketing, consumer feedback, and payment acceptance capabilities. WhatsApp, with a reach of 500 million users in India and 20 million monthly active transactors, offers a powerful platform for businesses.
Founders and Vision
Founded in July 2023 by Vivek Raghavan and Pratyush Kumar (formerly of AI4Bharat), Sarvam aims to develop specialized, smaller AI models for enterprise-specific applications. As larger models like GPT-4 and GPT-5 focus on scale, Sarvam’s smaller models prioritize efficiency, cost-effectiveness, and lower latency for high-frequency use cases, contributing to sustainable AI practices.
Future Prospects and Industry Applications
Sarvam’s commercial offerings are set to redefine AI deployment in Indian enterprises, with applications across sectors and a commitment to advancing generative AI technology that aligns with regional needs and infrastructure.