Sarvam-30B W4A16 (AutoRound INT4)

Model Description

This is a W4A16 INT4 quantized version of Sarvam-30B, a Mixture-of-Experts (MoE) model with 128 routed experts (6 active per token) plus 1 shared expert. Weights are compressed with AutoRound via llm-compressor and stored in the compressed-tensors pack-quantized format for efficient serving on NVIDIA A100 (Marlin INT4 kernels).

| Property | Value |
| --- | --- |
| Base Model | sarvamai/sarvam-30b |
| Architecture | SarvamMoEForCausalLM |
| Parameters (total) | ~30B |
| Layers | 19 (layer 0 dense; layers 1–18 MoE) |
| Hidden Size | 4096 |
| Attention Heads | 64 (4 KV heads, GQA) |
| Experts | 128 routed + 1 shared |
| Active Experts/Token | 6 |
| Native Max Context | 131,072 tokens |
| Quantized Size | ~25 GB (6 shards) |
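
To make the "128 routed experts, 6 active per token" figure concrete, here is a minimal sketch of top-k expert selection for a single token. It assumes generic softmax gating over the selected experts; the actual router in modeling_sarvam_moe.py may score and normalise differently, and `route_token` is a hypothetical helper, not part of this repository.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128  # from the table above
TOP_K = 6                 # active routed experts per token


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (indices, weights).

    Assumption: weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda e: router_logits[e], reverse=True)
    chosen = order[:top_k]
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return chosen, [v / total for v in exps]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
experts, weights = route_token(logits)
print(experts, [round(w, 3) for w in weights])
```

The shared expert is not routed: it runs for every token in addition to the six selected experts.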

Compression Technique

Method: AutoRound W4A16 (uniform INT4)

AutoRound is a block-wise weight-rounding optimisation method. Instead of a one-shot quantisation pass (as in AWQ or GPTQ), AutoRound iteratively adjusts rounding decisions using signed-gradient optimisation over calibration activations, minimising layer output error at 4-bit precision.
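
The idea can be illustrated with a toy NumPy sketch for a single linear layer. This is a simplified illustration, not the AutoRound implementation: it learns only a per-weight rounding offset (the real method also tunes clipping ranges and works block-wise), and the group size of 4, the step size of 1/iters, and the function names are assumptions chosen for readability.

```python
import numpy as np

QMIN, QMAX = -8, 7  # signed INT4 code range


def fake_quant(w, scale, v):
    """Quantise then dequantise w, with a rounding offset v in [-0.5, 0.5]."""
    codes = np.clip(np.round(w / scale + v), QMIN, QMAX)
    return codes * scale


def sign_round_sketch(w, x, group_size=4, iters=200):
    """Toy signed-gradient rounding search for one layer y = x @ w.T."""
    out_f, in_f = w.shape
    # One symmetric min-max scale per group of input channels.
    groups = np.abs(w).reshape(out_f, in_f // group_size, group_size)
    scale = np.maximum(groups.max(axis=-1, keepdims=True) / QMAX, 1e-8)
    scale = np.repeat(scale, group_size, axis=-1).reshape(out_f, in_f)

    y_ref = x @ w.T
    v = np.zeros_like(w)  # v = 0 is plain round-to-nearest
    lr = 1.0 / iters      # assumed step size

    def output_mse(offsets):
        return float(np.mean((x @ fake_quant(w, scale, offsets).T - y_ref) ** 2))

    rtn_loss = output_mse(v)
    best_v, best_loss = v.copy(), rtn_loss
    for _ in range(iters):
        err = x @ fake_quant(w, scale, v).T - y_ref  # (samples, out_f)
        grad_v = (err.T @ x) * scale                 # straight-through estimate
        v = np.clip(v - lr * np.sign(grad_v), -0.5, 0.5)
        loss = output_mse(v)
        if loss < best_loss:                         # keep the best offsets seen
            best_v, best_loss = v.copy(), loss
    return fake_quant(w, scale, best_v), rtn_loss, best_loss


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(64, 16))
w_q, rtn_loss, tuned_loss = sign_round_sketch(w, x)
print(f"round-to-nearest MSE={rtn_loss:.5f}  tuned MSE={tuned_loss:.5f}")
```

Because the sketch keeps the best offsets seen, including the round-to-nearest starting point, the tuned error can never be worse than plain rounding on the calibration inputs.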

Quantization Configuration

| Component | Precision | Strategy | Details |
| --- | --- | --- | --- |
| Routed expert MLP weights (gate_proj, up_proj, down_proj) | INT4 (4-bit) | Per-group, symmetric | Group size 128, static scales (memoryless_minmax observer) |
| Activations | FP16/BF16 (16-bit) | | W4A16: weights quantised, activations at full precision |
| Format | compressed-tensors / pack-quantized | | vLLM Marlin INT4 kernel path on Ampere+ |
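
As a rough worked example of what group size 128 costs in storage, the snippet below computes the effective bits per quantised weight. It assumes one 16-bit scale per group and no stored zero-point (the scheme is symmetric); the exact on-disk layout of the pack-quantized format is not described here, so treat the numbers as an estimate.

```python
def effective_bits_per_weight(weight_bits=4, group_size=128, scale_bits=16):
    """Bits per weight once the per-group scale is amortised over the group."""
    return weight_bits + scale_bits / group_size


bits = effective_bits_per_weight()
ratio_vs_bf16 = round(16 / bits, 2)
print(bits)           # 4.125
print(ratio_vs_bf16)  # 3.88
```

This ratio applies only to the quantised routed-expert weights; modules kept at full precision do not shrink.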

AutoRound Hyperparameters

| Parameter | Value |
| --- | --- |
| Scheme | W4A16 |
| Iterations | 200 per quantised block |
| Batch Size | 1 (MoE memory constraint) |
| Targets | All Linear layers not in ignore list |
| Torch Compile | Disabled |
| Tooling | llm-compressor AutoRoundModifier + oneshot() |
| MoE calibration | moe_calibrate_all_experts=True (all 128 experts see calibration data) |

Layers/Modules Kept at Full Precision

The following modules are left unquantised to preserve quality on reasoning, attention, and always-on paths:

  • lm_head (output projection)
  • Layer 0 (the only dense, non-MoE decoder layer)
  • All attention layers (query_key_value, dense, layernorms)
  • All shared expert layers (shared_experts.gate_proj, up_proj, down_proj)

Why AutoRound?

We chose AutoRound W4A16 over alternatives for two reasons:

  1. Accuracy at 4-bit: AutoRound's iterative block-wise rounding consistently outperforms one-shot methods (AWQ, GPTQ) on MMLU, math, and Indic benchmarks at INT4 — critical for a model evaluated on reasoning and multilingual tasks.
  2. Energy on A100: W4A16 uses the native Marlin INT4 kernel on Ampere (A100), halving weight memory bandwidth vs FP8. FP8 W8A8 on A100 is emulated (no native FP8 tensor cores), so INT4 delivers real compute and memory savings during decode.

Calibration Datasets

Calibration uses 512 samples at 2048 tokens each, rendered with the native Sarvam chat template (thinking enabled). Samples are packed to uniform length to satisfy AutoRound's batch stacking requirements.
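
One common way to get uniform-length samples is to concatenate tokenised documents and cut the stream into fixed-size blocks. The sketch below shows that approach on toy data; it is an assumption about how packing could be done, not the script used for this checkpoint, and `pack_to_length` is a hypothetical helper.

```python
def pack_to_length(token_lists, seq_len=2048):
    """Concatenate token lists and split them into blocks of exactly seq_len.

    Any tokens left over at the end (fewer than seq_len) are dropped.
    """
    buffer, packed = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed


# Toy example with seq_len=4 so the result is easy to read.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_to_length(docs, seq_len=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With every block the same length, samples can be stacked into a batch tensor without padding.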

| Source | Share | Domain |
| --- | --- | --- |
| sarvamai/indivibe (chat, code, math, stem) | 50% | Indic multilingual: chat, code, math, STEM |
| HuggingFaceH4/ultrachat_200k | ~37.5% | English general instruction / knowledge / writing |
| openai/gsm8k | ~12.5% | English math word problems |
| Seed prompts (multilingual fallback) | ≤6 | Hindi, Telugu, Tamil, Bengali, Marathi, Kannada + English |

| Calibration Setting | Value |
| --- | --- |
| Samples | 512 |
| Sequence length | 2,048 tokens |
| Indic fraction | 0.50 |
| Random seed | 42 |
| Shuffle | Disabled (deterministic order) |
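
For reference, the shares above translate into the following per-source sample counts. This is plain arithmetic on the stated figures; the table marks two shares as approximate, so the real counts may differ by a few samples, and the seed prompts are not included.

```python
TOTAL_SAMPLES = 512

# Shares as listed in the calibration source table.
SHARES = {
    "sarvamai/indivibe": 0.50,
    "HuggingFaceH4/ultrachat_200k": 0.375,
    "openai/gsm8k": 0.125,
}

counts = {name: round(TOTAL_SAMPLES * share) for name, share in SHARES.items()}
for name, n in counts.items():
    print(f"{name}: {n}")
# sarvamai/indivibe: 256
# HuggingFaceH4/ultrachat_200k: 192
# openai/gsm8k: 64
```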

Recipe (recipe.yaml)

The llm-compressor recipe used to produce this checkpoint:

default_stage:
  default_modifiers:
    AutoRoundModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:^model\.layers\.0(?:\..*)?$', 're:.*\.attention\..*', 're:.*self_attn.*',
        're:.*shared_expert.*']
      scheme: W4A16
      bypass_divisibility_checks: false
      iters: 200
      enable_torch_compile: false
      batch_size: 1
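
To check which modules the ignore patterns exclude, the regexes can be tried against module names in plain Python. The sketch assumes that `re:` patterns are applied with `re.match` against the full module name, and the module names below are hypothetical examples shaped after the component names in this card, not read from the checkpoint.

```python
import re

# Patterns copied from the recipe, with the "re:" prefix removed.
IGNORE_PATTERNS = [
    r".*lm_head",
    r"^model\.layers\.0(?:\..*)?$",
    r".*\.attention\..*",
    r".*self_attn.*",
    r".*shared_expert.*",
]


def is_ignored(module_name):
    """Return True if any ignore pattern matches the module name."""
    return any(re.match(p, module_name) for p in IGNORE_PATTERNS)


examples = [
    "lm_head",
    "model.layers.0.mlp.down_proj",
    "model.layers.3.attention.query_key_value",
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.3.mlp.experts.17.gate_proj",
    "model.layers.10.mlp.experts.0.up_proj",
]
for name in examples:
    print(f"{name}: {'kept at full precision' if is_ignored(name) else 'quantised'}")
```

Note that the layer-0 pattern is anchored so that it does not also catch layers 10 to 18.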

Repository Files

| File | Purpose |
| --- | --- |
| model-0000{1-6}-of-00006.safetensors | Quantised model weights (6 shards, ~25 GB total) |
| model.safetensors.index.json | Shard index mapping parameter names → files |
| config.json | Model config (includes quantization_config) |
| vllm_config.yaml | vLLM serving parameters (use with --config) |
| recipe.yaml | llm-compressor recipe used to produce this model |
| configuration_sarvam_moe.py | Custom HuggingFace config class |
| modeling_sarvam_moe.py | Custom HuggingFace model class |
| sarvam.py | vLLM Sarvam MoE model implementation |
| tokenizer.json / tokenizer_config.json | Tokeniser |
| special_tokens_map.json | Special token mappings |
| chat_template.jinja | Chat template (thinking-enabled) |
| generation_config.json | Default generation settings |
| hotpatch_vllm.py | Optional vLLM registry patch for older vLLM builds |

Inference

vLLM (Recommended)

Evaluation uses vLLM 0.19.1 (or the latest release). Run the following from the model root directory:

vllm serve --config vllm_config.yaml

Equivalent explicit invocation:

vllm serve . \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 55000 \
  --enable-chunked-prefill \
  --dtype auto

Included vllm_config.yaml:

model: .
trust_remote_code: true
tensor_parallel_size: 1
gpu_memory_utilization: 0.85
max_model_len: 55000
dtype: auto
enable_chunked_prefill: true

OpenAI-compatible API example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": ".",
    "messages": [{"role": "user", "content": "What is the capital of India?"}],
    "temperature": 1.0,
    "top_p": 1.0,
    "max_tokens": 2048
  }'
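
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style `choices[0].message.content` field; `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=2048):
    """Build the POST request that mirrors the curl example above."""
    payload = {
        "model": ".",  # matches the path passed to `vllm serve`
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 1.0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server running:
#   print(ask("What is the capital of India?"))
```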

Note: This model uses the compressed-tensors W4A16 pack-quantized format with a custom Sarvam MoE architecture. vLLM is the recommended and tested inference backend. llama.cpp support for this exact format may be limited or unavailable; use vLLM for competition-aligned serving.

Evaluation Generation Parameters

Recommended task-specific sampling settings for evaluation:

| Benchmark Suite | temperature | top_p | max_new_tokens |
| --- | --- | --- | --- |
| Math500 / MMLU / GPQA / AIME / reasoning | 1.0 | 1.0 | 50,000 |
| Writing Bench | 0.7 | 0.8 | 16,000 (top_k=20) |
| Agentic (BrowseComp / SWE-bench / τ²-bench) | 0.5 | 1.0 | 32,768 |

Default vLLM serving context is max_model_len: 55000 (see vllm_config.yaml). If no config file is provided, organisers use gpu-memory-utilization: 0.85 and max-model-len: 50000.
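
Since the serving context has to hold both the prompt and the generated tokens, these settings imply a prompt budget per suite. The sketch below does that arithmetic; it assumes the context limit is shared between prompt and output, and the short suite keys are hypothetical labels for the three rows of the table.

```python
# Sampling settings taken from the evaluation table.
SUITES = {
    "reasoning": {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 50_000},
    "writing": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_new_tokens": 16_000},
    "agentic": {"temperature": 0.5, "top_p": 1.0, "max_new_tokens": 32_768},
}


def prompt_budget(max_model_len, max_new_tokens):
    """Tokens left for the prompt once the generation budget is reserved."""
    return max_model_len - max_new_tokens


for context in (55_000, 50_000):
    for suite, params in SUITES.items():
        left = prompt_budget(context, params["max_new_tokens"])
        print(f"max_model_len={context} {suite}: {left} prompt tokens")
```

With max_model_len 55000, the reasoning suite leaves 5,000 tokens for the prompt; under the 50000 fallback it leaves none, so the generation limit may need to be lowered in that case.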

Known Issues

  1. trust_remote_code required: Sarvam MoE uses custom configuration_sarvam_moe.py and modeling_sarvam_moe.py. Always pass --trust-remote-code (vLLM) or equivalent.
  2. Chunked prefill recommended: Enable enable_chunked_prefill: true (included in vllm_config.yaml) for long-context stability.
  3. Older vLLM builds: vLLM ≥ 0.19 includes native Sarvam MoE support. For older versions, run hotpatch_vllm.py to register SarvamMoEForCausalLM in the vLLM model registry and download sarvam.py.
  4. llama.cpp compatibility: The compressed-tensors W4A16 MoE format is not guaranteed to load in llama.cpp. Use vLLM for reliable inference.
  5. Partial quantisation: Attention, shared experts, layer 0, and lm_head remain at full precision. Effective compression is on routed expert MLP weights only (~98% of expert parameter bytes).
  6. Mixed-precision MoE not supported: Uniform W4A16 is required; mixed W4/W8 schemes inside MoE experts cause vLLM load failures.

License

Apache 2.0 — same as the original Sarvam-30B model.

