Sarvam-30B W4A16 (AutoRound INT4)
Model Description
This is a W4A16 INT4 quantized version of Sarvam-30B, a Mixture-of-Experts (MoE) model with 128 routed experts (6 active per token) plus 1 shared expert. Weights are compressed with AutoRound via llm-compressor and stored in the compressed-tensors pack-quantized format for efficient serving on NVIDIA A100 (Marlin INT4 kernels).
| Property | Value |
|---|---|
| Base Model | sarvamai/sarvam-30b |
| Architecture | SarvamMoEForCausalLM |
| Parameters (total) | ~30B |
| Layers | 19 (layer 0 dense; layers 1–18 MoE) |
| Hidden Size | 4096 |
| Attention Heads | 64 (4 KV heads, GQA) |
| Experts | 128 routed + 1 shared |
| Active Experts/Token | 6 |
| Native Max Context | 131,072 tokens |
| Quantized Size | ~25 GB (6 shards) |
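The architecture figures in the table give a rough idea of how much KV cache each token costs, which matters when choosing a context length. The sketch below is back-of-the-envelope arithmetic only. It assumes a head dimension of hidden size divided by attention heads (64), a 16-bit KV cache, and attention in all 19 layers; none of these is stated in this card, so check `config.json` before relying on the numbers.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one token: one key and one value vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# From the table: 19 layers, 4 KV heads, hidden size 4096, 64 attention heads.
# ASSUMPTION: head_dim = hidden_size / num_attention_heads; verify against config.json.
HEAD_DIM = 4096 // 64

per_token = kv_cache_bytes_per_token(num_layers=19, num_kv_heads=4, head_dim=HEAD_DIM)
full_context_gib = per_token * 131_072 / 2**30

print(f"{per_token} bytes per token, {full_context_gib:.3f} GiB at 131,072 tokens")
```

Under these assumptions a single sequence at the native maximum context needs about 2.4 GiB of KV cache, because GQA keeps only 4 KV heads per layer.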
Compression Technique
Method: AutoRound W4A16 (uniform INT4)
AutoRound is a block-wise weight-rounding optimisation method. Instead of a one-shot quantisation pass (as in AWQ or GPTQ), AutoRound iteratively adjusts rounding decisions using signed-gradient optimisation over calibration activations, minimising layer output error at 4-bit precision.
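The idea of tuning rounding decisions with signed gradients can be illustrated on a toy problem. The sketch below is not the AutoRound or llm-compressor implementation: it uses a single weight vector instead of a transformer block, learns only a per-weight rounding offset in [-0.5, 0.5], and uses synthetic data. All names (`tune_rounding`, `dequantise`, the learning rate) are illustrative assumptions.

```python
import random

QMIN, QMAX = -8, 7  # signed 4-bit integer range


def dequantise(weights, offsets, scale):
    """Round (w / scale + offset) onto the INT4 grid and map back to floats."""
    return [scale * max(QMIN, min(QMAX, round(w / scale + v)))
            for w, v in zip(weights, offsets)]


def errors(inputs, weights, w_hat):
    """Per-sample difference between quantised and original layer outputs."""
    return [sum(x_i * (q - w) for x_i, q, w in zip(x, w_hat, weights))
            for x in inputs]


def tune_rounding(weights, inputs, scale, iters=200, lr=0.005):
    offsets = [0.0] * len(weights)  # all-zero offsets are plain round-to-nearest
    best_offsets, best_loss = offsets[:], float("inf")
    for _ in range(iters):
        err = errors(inputs, weights, dequantise(weights, offsets, scale))
        loss = sum(e * e for e in err)
        if loss < best_loss:
            best_offsets, best_loss = offsets[:], loss
        for i in range(len(weights)):
            # Straight-through estimate: d(w_hat_i) / d(offset_i) is treated as scale.
            grad = 2 * scale * sum(e * x[i] for e, x in zip(err, inputs))
            step = lr if grad > 0 else -lr if grad < 0 else 0.0  # only the sign is used
            offsets[i] = max(-0.5, min(0.5, offsets[i] - step))
    return best_offsets, best_loss


rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(16)]
inputs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(32)]
scale = max(abs(w) for w in weights) / QMAX

rtn_err = errors(inputs, weights, dequantise(weights, [0.0] * 16, scale))
rtn_loss = sum(e * e for e in rtn_err)
_, tuned_loss = tune_rounding(weights, inputs, scale)
print(f"round-to-nearest loss {rtn_loss:.4f}, tuned loss {tuned_loss:.4f}")
```

Because the first iteration evaluates the all-zero offsets, the tuned loss can never be worse than round-to-nearest in this sketch; how much it improves depends on the data.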
Quantization Configuration
| Component | Precision | Strategy | Details |
|---|---|---|---|
| Routed expert MLP weights (gate_proj, up_proj, down_proj) | INT4 (4-bit) | Per-group, symmetric | Group size 128, static scales (memoryless_minmax observer) |
| Activations | FP16/BF16 (16-bit) | n/a | W4A16: weights quantised, activations at full precision |
| Format | compressed-tensors / pack-quantized | n/a | vLLM Marlin INT4 kernel path on Ampere+ |
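To make "per-group, symmetric, group size 128" concrete, here is a minimal sketch of group-wise INT4 quantisation of one weight row. It assumes the scale is the group's absolute maximum divided by 7; the actual observer in compressed-tensors may derive the scale differently, and the bit-packing step of the pack-quantized format is not shown.

```python
GROUP_SIZE = 128
QMIN, QMAX = -8, 7  # signed INT4


def quantise_groups(row, group_size=GROUP_SIZE):
    """Return (INT4 codes, per-group scales) for one weight row."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        # ASSUMPTION: scale = absmax / 7; falls back to 1.0 for an all-zero group.
        scale = max(abs(w) for w in group) / QMAX or 1.0
        scales.append(scale)
        codes.extend(max(QMIN, min(QMAX, round(w / scale))) for w in group)
    return codes, scales


def dequantise_groups(codes, scales, group_size=GROUP_SIZE):
    """Multiply each code by the scale of the group it belongs to."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


row = [((i * 37) % 101 - 50) / 50 for i in range(256)]  # synthetic weights in [-1, 1]
codes, scales = quantise_groups(row)
restored = dequantise_groups(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(scales), "scales,", f"max abs error {max_err:.4f}")
```

Each group of 128 weights shares one scale and uses a zero point of zero (symmetric), so the reconstruction error of any weight is bounded by half of its group's scale.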
AutoRound Hyperparameters
| Parameter | Value |
|---|---|
| Scheme | W4A16 |
| Iterations | 200 per quantised block |
| Batch Size | 1 (MoE memory constraint) |
| Targets | All Linear layers not in ignore list |
| Torch Compile | Disabled |
| Tooling | llm-compressor AutoRoundModifier + oneshot() |
| MoE calibration | moe_calibrate_all_experts=True (all 128 experts see calibration data) |
Layers/Modules Kept at Full Precision
The following modules are not quantised to preserve quality on reasoning, attention, and always-on paths:
- `lm_head` (output projection)
- Layer 0 (the only dense, non-MoE decoder layer)
- All attention layers (`query_key_value`, `dense`, layernorms)
- All shared expert layers (`shared_experts.gate_proj`, `up_proj`, `down_proj`)
Why AutoRound?
We chose AutoRound W4A16 over alternatives for two reasons:
- Accuracy at 4-bit: AutoRound's iterative block-wise rounding consistently outperforms one-shot methods (AWQ, GPTQ) on MMLU, math, and Indic benchmarks at INT4 — critical for a model evaluated on reasoning and multilingual tasks.
- Energy on A100: W4A16 uses the native Marlin INT4 kernel on Ampere (A100), halving weight memory bandwidth vs FP8. FP8 W8A8 on A100 is emulated (no native FP8 tensor cores), so INT4 delivers real compute and memory savings during decode.
Calibration Datasets
Calibration uses 512 samples at 2048 tokens each, rendered with the native Sarvam chat template (thinking enabled). Samples are packed to uniform length to satisfy AutoRound's batch stacking requirements.
| Source | Share | Domain |
|---|---|---|
| sarvamai/indivibe (chat, code, math, stem) | 50% | Indic multilingual: chat, code, math, STEM |
| HuggingFaceH4/ultrachat_200k | ~37.5% | English general instruction / knowledge / writing |
| openai/gsm8k | ~12.5% | English math word problems |
| Seed prompts (multilingual fallback) | ≤6 | Hindi, Telugu, Tamil, Bengali, Marathi, Kannada + English |
| Calibration Setting | Value |
|---|---|
| Samples | 512 |
| Sequence length | 2,048 tokens |
| Indic fraction | 0.50 |
| Random seed | 42 |
| Shuffle | Disabled (deterministic order) |
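The mix and the packing step can be sketched as follows. The sample counts follow directly from the shares above (treating the approximate shares as exact). The `pack` helper is one simple way to produce uniform-length sequences and is an assumption: the card does not describe how packing was actually implemented, and the integer "tokens" stand in for real token ids.

```python
TOTAL_SAMPLES = 512
SEQ_LEN = 2048
SHARES = {"indivibe": 0.50, "ultrachat_200k": 0.375, "gsm8k": 0.125}


def sample_counts(total, shares):
    """Split the sample budget across sources according to their shares."""
    return {name: round(total * share) for name, share in shares.items()}


def pack(token_streams, seq_len):
    """Concatenate tokenised samples and cut them into equal-length chunks.

    A trailing remainder shorter than seq_len is dropped.
    """
    flat = [tok for stream in token_streams for tok in stream]
    return [flat[i:i + seq_len] for i in range(0, len(flat) - seq_len + 1, seq_len)]


counts = sample_counts(TOTAL_SAMPLES, SHARES)
print(counts)

# Toy token streams of uneven length (integers stand in for token ids).
streams = [list(range(n)) for n in (1500, 900, 3000)]
chunks = pack(streams, SEQ_LEN)
print(len(chunks), "chunks of", SEQ_LEN, "tokens")
```

With 512 samples the split works out to 256 Indic, 192 UltraChat, and 64 GSM8K samples.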
Recipe (recipe.yaml)
The llm-compressor recipe used to produce this checkpoint:
default_stage:
  default_modifiers:
    AutoRoundModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:^model\.layers\.0(?:\..*)?$', 're:.*\.attention\..*', 're:.*self_attn.*',
        're:.*shared_expert.*']
      scheme: W4A16
      bypass_divisibility_checks: false
      iters: 200
      enable_torch_compile: false
      batch_size: 1
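A quick way to sanity-check the `ignore` patterns is to run them against a few module names. The sketch below assumes that entries prefixed with `re:` are applied with `re.match` (anchored at the start of the name) and that other entries are exact names. The module names are hypothetical, chosen only to exercise each pattern; the real names come from the loaded model.

```python
import re

IGNORE = [
    r"re:.*lm_head",
    r"re:^model\.layers\.0(?:\..*)?$",
    r"re:.*\.attention\..*",
    r"re:.*self_attn.*",
    r"re:.*shared_expert.*",
]


def is_ignored(module_name, patterns=IGNORE):
    """True if any ignore entry matches the module name.

    ASSUMPTION: 're:' entries are regexes applied with re.match;
    anything else must equal the module name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
EXAMPLES = [
    "lm_head",
    "model.layers.0.mlp.gate_proj",
    "model.layers.5.attention.query_key_value",
    "model.layers.5.mlp.shared_experts.up_proj",
    "model.layers.5.mlp.experts.3.gate_proj",
    "model.layers.10.mlp.experts.0.down_proj",
]

for name in EXAMPLES:
    print(f"{name}: {'kept at full precision' if is_ignored(name) else 'quantised'}")
```

Note that the layer-0 pattern ends the layer index at a dot or end of string, so it does not accidentally match layers 10 to 18.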
Repository Files
| File | Purpose |
|---|---|
| `model-0000{1-6}-of-00006.safetensors` | Quantised model weights (6 shards, ~25 GB total) |
| `model.safetensors.index.json` | Shard index mapping parameter names → files |
| `config.json` | Model config (includes quantization_config) |
| `vllm_config.yaml` | vLLM serving parameters (use with --config) |
| `recipe.yaml` | llm-compressor recipe used to produce this model |
| `configuration_sarvam_moe.py` | Custom HuggingFace config class |
| `modeling_sarvam_moe.py` | Custom HuggingFace model class |
| `sarvam.py` | vLLM Sarvam MoE model implementation |
| `tokenizer.json` / `tokenizer_config.json` | Tokeniser |
| `special_tokens_map.json` | Special token mappings |
| `chat_template.jinja` | Chat template (thinking-enabled) |
| `generation_config.json` | Default generation settings |
| `hotpatch_vllm.py` | Optional vLLM registry patch for older vLLM builds |
Inference
vLLM (Recommended)
Evaluation uses vLLM 0.19.1 (or latest). From the model root directory:
vllm serve --config vllm_config.yaml
Equivalent explicit invocation:
vllm serve . \
--trust-remote-code \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.85 \
--max-model-len 55000 \
--enable-chunked-prefill \
--dtype auto
Included vllm_config.yaml:
model: .
trust_remote_code: true
tensor_parallel_size: 1
gpu_memory_utilization: 0.85
max_model_len: 55000
dtype: auto
enable_chunked_prefill: true
OpenAI-compatible API example:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": ".",
"messages": [{"role": "user", "content": "What is the capital of India?"}],
"temperature": 1.0,
"top_p": 1.0,
"max_tokens": 2048
}'
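The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_request` and `chat` are illustrative, and it assumes the server is listening on `localhost:8000` and was started from the model directory so that the served model name is `.`.

```python
import json
import urllib.request


def build_request(prompt, base_url="http://localhost:8000", model="."):
    """Build the same chat-completions POST as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 1.0,
        "max_tokens": 2048,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of India?")` returns the reply as a string.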
Note: This model uses the `compressed-tensors` W4A16 pack-quantized format with a custom Sarvam MoE architecture. vLLM is the recommended and tested inference backend. llama.cpp support for this exact format may be limited or unavailable; use vLLM for competition-aligned serving.
Evaluation Generation Parameters
Recommended task-specific sampling settings for evaluation:
| Benchmark Suite | temperature | top_p | max_new_tokens |
|---|---|---|---|
| Math500 / MMLU / GPQA / AIME / reasoning | 1.0 | 1.0 | 50,000 |
| Writing Bench | 0.7 | 0.8 | 16,000 (top_k=20) |
| Agentic (BrowseComp / SWE-bench / τ²-bench) | 0.5 | 1.0 | 32,768 |
Default vLLM serving context is max_model_len: 55000 (see vllm_config.yaml). If no config file is provided, organisers use gpu-memory-utilization: 0.85 and max-model-len: 50000.
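Since prompt and generated tokens share the same context window, the generation budgets in the table limit how long a prompt can be. The sketch below works through that arithmetic. The suite keys are illustrative names, and `max_new_tokens` is mapped to the `max_tokens` request field used in the API example.

```python
MAX_MODEL_LEN = 55_000  # from vllm_config.yaml

SAMPLING = {
    "reasoning": {"temperature": 1.0, "top_p": 1.0, "max_tokens": 50_000},
    "writing": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_tokens": 16_000},
    "agentic": {"temperature": 0.5, "top_p": 1.0, "max_tokens": 32_768},
}


def prompt_budget(suite, max_model_len=MAX_MODEL_LEN):
    """Tokens left for the prompt once the full generation budget is reserved."""
    return max_model_len - SAMPLING[suite]["max_tokens"]


for suite in SAMPLING:
    print(suite, prompt_budget(suite))
```

With the included config, reasoning runs leave 5,000 tokens for the prompt. Under the organisers' fallback of 50,000 tokens, `prompt_budget("reasoning", 50_000)` is 0, so the full 50,000-token generation budget cannot be used alongside a prompt.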
Known Issues
- `trust_remote_code` required: Sarvam MoE uses custom `configuration_sarvam_moe.py` and `modeling_sarvam_moe.py`. Always pass `--trust-remote-code` (vLLM) or equivalent.
- Chunked prefill recommended: Enable `enable_chunked_prefill: true` (included in `vllm_config.yaml`) for long-context stability.
- Older vLLM builds: vLLM ≥ 0.19 includes native Sarvam MoE support. For older versions, run `hotpatch_vllm.py` to register `SarvamMoEForCausalLM` in the vLLM model registry and download `sarvam.py`.
- llama.cpp compatibility: The `compressed-tensors` W4A16 MoE format is not guaranteed to load in llama.cpp. Use vLLM for reliable inference.
- Partial quantisation: Attention, shared experts, layer 0, and `lm_head` remain at full precision. Effective compression is on routed expert MLP weights only (~98% of expert parameter bytes).
- Mixed-precision MoE not supported: Uniform W4A16 is required; mixed W4/W8 schemes inside MoE experts cause vLLM load failures.
License
Apache 2.0 — same as the original Sarvam-30B model.