Instructions to use Pranay2412/sarvam-30b-W4A16-Autoround with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Pranay2412/sarvam-30b-W4A16-Autoround with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Pranay2412/sarvam-30b-W4A16-Autoround", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Pranay2412/sarvam-30b-W4A16-Autoround", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use Pranay2412/sarvam-30b-W4A16-Autoround with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Pranay2412/sarvam-30b-W4A16-Autoround" --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Pranay2412/sarvam-30b-W4A16-Autoround",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Pranay2412/sarvam-30b-W4A16-Autoround

SGLang

How to use Pranay2412/sarvam-30b-W4A16-Autoround with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Pranay2412/sarvam-30b-W4A16-Autoround" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Pranay2412/sarvam-30b-W4A16-Autoround",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Pranay2412/sarvam-30b-W4A16-Autoround" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Pranay2412/sarvam-30b-W4A16-Autoround with Docker Model Runner:
```
docker model run hf.co/Pranay2412/sarvam-30b-W4A16-Autoround
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Sarvam-30B W4A16 (AutoRound INT4)

Model Description

This is a W4A16 INT4 quantized version of Sarvam-30B, a Mixture-of-Experts (MoE) model with 128 routed experts (6 active per token) plus 1 shared expert. Weights are compressed with AutoRound via llm-compressor and stored in the compressed-tensors pack-quantized format for efficient serving on NVIDIA A100 (Marlin INT4 kernels).

Property	Value
Base Model	sarvamai/sarvam-30b
Architecture	`SarvamMoEForCausalLM`
Parameters (total)	~30B
Layers	19 (layer 0 dense; layers 1–18 MoE)
Hidden Size	4096
Attention Heads	64 (4 KV heads, GQA)
Experts	128 routed + 1 shared
Active Experts/Token	6
Native Max Context	131,072 tokens
Quantized Size	~25 GB (6 shards)
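To make the "128 routed experts, 6 active per token" figure concrete, here is a minimal routing sketch in plain Python. It assumes softmax gating with renormalised top-k weights; the actual Sarvam router may score or normalise differently, and the router logits below are made up for illustration. The shared expert is not routed at all: it runs for every token, so it does not appear in the selection.

```python
import math

def route_token(router_logits, top_k=6):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Assumption: softmax gating with renormalised top-k weights.
    """
    # Numerically stable softmax over all routed experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k largest probabilities.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalise so the selected weights sum to 1.
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

# 128 hypothetical router logits for a single token.
logits = [math.sin(0.37 * i) for i in range(128)]
selected = route_token(logits)
print(selected)
```

Only the six selected expert MLPs are evaluated for this token, which is why the routed expert weights dominate storage but not per-token compute.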

Compression Technique

Method: AutoRound W4A16 (uniform INT4)

AutoRound is a block-wise weight-rounding optimisation method. Instead of a one-shot quantisation pass (as in AWQ or GPTQ), AutoRound iteratively adjusts rounding decisions using signed-gradient optimisation over calibration activations, minimising layer output error at 4-bit precision.
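The idea can be illustrated with a toy version in plain Python. This is a sketch under simplifying assumptions, not the AutoRound implementation: it tunes one rounding offset per weight for a single weight row, uses a straight-through estimator for the gradient, and omits the learned clipping range. The function names, the step size, and the random data are all hypothetical.

```python
import random

QMIN, QMAX = -8, 7  # signed INT4 range

def quantize(weights, scale, offsets):
    """Fake-quantise: round(w/scale + v), clip to INT4, then rescale."""
    return [scale * min(QMAX, max(QMIN, round(w / scale + v)))
            for w, v in zip(weights, offsets)]

def output_error(weights, quantized, samples):
    """Squared error between full-precision and quantised dot products."""
    err = 0.0
    for x in samples:
        diff = sum((w - q) * xi for w, q, xi in zip(weights, quantized, x))
        err += diff * diff
    return err

def sign_round(weights, samples, iters=200, lr=0.005):
    """Tune per-weight rounding offsets with signed-gradient steps."""
    scale = max(abs(w) for w in weights) / QMAX
    offsets = [0.0] * len(weights)  # v = 0 is plain round-to-nearest
    best_offsets = list(offsets)
    best_err = output_error(weights, quantize(weights, scale, offsets), samples)
    for _ in range(iters):
        q = quantize(weights, scale, offsets)
        # Gradient of the error w.r.t. each offset; the straight-through
        # estimator treats round() as identity, so d(q_i)/d(v_i) = scale.
        grads = [0.0] * len(weights)
        for x in samples:
            diff = sum((w - qi) * xi for w, qi, xi in zip(weights, q, x))
            for i, xi in enumerate(x):
                grads[i] += -2.0 * diff * xi * scale
        # Signed-gradient step, keeping each offset inside [-0.5, 0.5].
        offsets = [min(0.5, max(-0.5, v - lr * ((g > 0) - (g < 0))))
                   for v, g in zip(offsets, grads)]
        err = output_error(weights, quantize(weights, scale, offsets), samples)
        if err < best_err:
            best_err, best_offsets = err, list(offsets)
    return best_err, best_offsets

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(16)]
samples = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
scale = max(abs(w) for w in weights) / QMAX
rtn_err = output_error(weights, quantize(weights, scale, [0.0] * 16), samples)
tuned_err, _ = sign_round(weights, samples)
print(rtn_err, tuned_err)
```

Because the loop keeps the best offsets seen so far and starts from round-to-nearest, the tuned error can never be worse than the round-to-nearest baseline on the calibration samples.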

Quantization Configuration

Component	Precision	Strategy	Details
Routed expert MLP weights (`gate_proj`, `up_proj`, `down_proj`)	INT4 (4-bit)	Per-group, symmetric	Group size 128, static scales (`memoryless_minmax` observer)
Activations	FP16/BF16 (16-bit)	—	W4A16: weights quantised, activations at full precision
Format	`compressed-tensors` / `pack-quantized`	—	vLLM Marlin INT4 kernel path on Ampere+
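As a rough illustration of the table above, the sketch below applies per-group symmetric INT4 quantisation with group size 128 and then packs eight 4-bit values into each 32-bit word. It is plain Python for readability. The scale convention (maximum magnitude divided by 7) and the nibble order are assumptions made for this example and are not taken from the compressed-tensors specification.

```python
import math

GROUP_SIZE = 128
QMIN, QMAX = -8, 7  # signed 4-bit range

def quantize_groups(row, group_size=GROUP_SIZE):
    """Return (int4_values, scales) with one symmetric scale per group."""
    values, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        # Symmetric: zero-point fixed at 0, scale from the group's max magnitude.
        # Dividing by QMAX is an assumed convention for this sketch.
        scale = max(abs(w) for w in group) / QMAX or 1.0
        scales.append(scale)
        values.extend(min(QMAX, max(QMIN, round(w / scale))) for w in group)
    return values, scales

def dequantize_groups(values, scales, group_size=GROUP_SIZE):
    """Map INT4 values back to floats using each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(values)]

def pack_int4(values):
    """Pack eight 4-bit values into each 32-bit word (low nibble first)."""
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for slot, q in enumerate(values[start:start + 8]):
            word |= (q & 0xF) << (4 * slot)
        words.append(word)
    return words

def unpack_int4(words, count):
    """Inverse of pack_int4, sign-extending each nibble."""
    out = []
    for word in words:
        for slot in range(8):
            nibble = (word >> (4 * slot)) & 0xF
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]

# Hypothetical weight row spanning two groups with different magnitudes.
row = [math.sin(0.1 * i) * (1 + i // GROUP_SIZE) for i in range(256)]
values, scales = quantize_groups(row)
restored = dequantize_groups(unpack_int4(pack_int4(values), len(values)), scales)
print(len(scales), max(abs(a - b) for a, b in zip(row, restored)))
```

Each group gets its own scale, so a group with larger weights does not inflate the rounding error of its neighbours.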

AutoRound Hyperparameters

Parameter	Value
Scheme	`W4A16`
Iterations	200 per quantised block
Batch Size	1 (MoE memory constraint)
Targets	All `Linear` layers not in ignore list
Torch Compile	Disabled
Tooling	`llm-compressor` `AutoRoundModifier` + `oneshot()`
MoE calibration	`moe_calibrate_all_experts=True` (all 128 experts see calibration data)

Layers/Modules Kept at Full Precision

The following modules are kept at full precision to preserve quality in attention and in the paths that are active for every token:

lm_head (output projection)
Layer 0 (the only dense, non-MoE decoder layer)
All attention layers (query_key_value, dense, layernorms)
All shared expert layers (shared_experts.gate_proj, up_proj, down_proj)

Why AutoRound?

We chose AutoRound W4A16 over alternatives for two reasons:

Accuracy at 4-bit: AutoRound's iterative block-wise rounding consistently outperforms one-shot methods (AWQ, GPTQ) on MMLU, math, and Indic benchmarks at INT4 — critical for a model evaluated on reasoning and multilingual tasks.
Energy on A100: W4A16 runs on the Marlin INT4 kernel on Ampere (A100) and roughly halves weight memory traffic relative to FP8. A100 has no native FP8 tensor cores, so FP8 W8A8 would be emulated there, whereas INT4 weights deliver real memory-bandwidth savings during decode.
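The "halving" figure can be checked with a short calculation. This sketch assumes one 16-bit scale per group of 128 weights for INT4, and treats scale overhead for FP8 and BF16 as negligible; both are assumptions made for the arithmetic, not measurements.

```python
def effective_bits(weight_bits, group_size, scale_bits=16):
    """Bits stored per weight, counting one scale per group."""
    return weight_bits + scale_bits / group_size

int4_bits = effective_bits(4, 128)  # 4-bit weights, one 16-bit scale per 128 weights
fp8_bits = 8.0                      # assumed: scale overhead negligible
bf16_bits = 16.0

print(int4_bits, fp8_bits / int4_bits, bf16_bits / int4_bits)
```

Under these assumptions a quantised weight costs 4.125 bits, so FP8 moves about 1.94 times as many weight bytes and BF16 about 3.88 times as many. This applies only to the quantised routed-expert weights; the modules kept at full precision are unaffected.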

Calibration Datasets

Calibration uses 512 samples at 2048 tokens each, rendered with the native Sarvam chat template (thinking enabled). Samples are packed to uniform length to satisfy AutoRound's batch stacking requirements.

Source	Share	Domain
sarvamai/indivibe (`chat`, `code`, `math`, `stem`)	50%	Indic multilingual — chat, code, math, STEM
HuggingFaceH4/ultrachat_200k	~37.5%	English general instruction / knowledge / writing
openai/gsm8k	~12.5%	English math word problems
Seed prompts (multilingual fallback)	≤6 samples	Hindi, Telugu, Tamil, Bengali, Marathi, Kannada + English

Calibration Setting	Value
Samples	512
Sequence length	2,048 tokens
Indic fraction	0.50
Random seed	42
Shuffle	Disabled (deterministic order)
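The shares above translate into nominal per-source sample counts as follows. This is simple arithmetic on the stated figures; the real split may differ slightly because two shares are given as approximate and a few seed prompts can be used as fallback.

```python
TOTAL_SAMPLES = 512
SEQ_LEN = 2048

shares = {
    "sarvamai/indivibe": 0.50,
    "HuggingFaceH4/ultrachat_200k": 0.375,
    "openai/gsm8k": 0.125,
}

# Nominal number of calibration samples drawn from each source.
counts = {name: round(TOTAL_SAMPLES * share) for name, share in shares.items()}
# Total calibration tokens when every sample is packed to the full length.
total_tokens = TOTAL_SAMPLES * SEQ_LEN

for name, count in counts.items():
    print(f"{name}: {count} samples")
print(f"total tokens: {total_tokens}")
```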

Recipe (`recipe.yaml`)

The llm-compressor recipe used to produce this checkpoint:

default_stage:
  default_modifiers:
    AutoRoundModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:^model\.layers\.0(?:\..*)?$', 're:.*\.attention\..*', 're:.*self_attn.*',
        're:.*shared_expert.*']
      scheme: W4A16
      bypass_divisibility_checks: false
      iters: 200
      enable_torch_compile: false
      batch_size: 1
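To see which modules the `ignore` patterns exclude, the sketch below tests them against a few module names. It assumes that entries prefixed with `re:` are matched from the start of the module name with `re.match`, and that other entries are exact names. The module names are hypothetical and only loosely modelled on the layer names mentioned in this card.

```python
import re

IGNORE = [
    r"re:.*lm_head",
    r"re:^model\.layers\.0(?:\..*)?$",
    r"re:.*\.attention\..*",
    r"re:.*self_attn.*",
    r"re:.*shared_expert.*",
]

def is_ignored(module_name, patterns=IGNORE):
    """True if any pattern matches; 're:' entries are regexes, others exact names."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False

# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.layers.0.mlp.gate_proj",
    "model.layers.10.mlp.experts.3.gate_proj",
    "model.layers.4.attention.query_key_value",
    "model.layers.4.mlp.shared_experts.up_proj",
    "model.layers.4.mlp.experts.17.down_proj",
]
for name in examples:
    status = "kept at full precision" if is_ignored(name) else "quantised"
    print(f"{name}: {status}")
```

Note that the layer-0 pattern is anchored with `^` and `$`, so it excludes `model.layers.0` and its children without also catching `model.layers.10` and above.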

Repository Files

File	Purpose
`model-0000{1-6}-of-00006.safetensors`	Quantised model weights (6 shards, ~25 GB total)
`model.safetensors.index.json`	Shard index mapping parameter names → files
`config.json`	Model config (includes `quantization_config`)
`vllm_config.yaml`	vLLM serving parameters (use with `--config`)
`recipe.yaml`	llm-compressor recipe used to produce this model
`configuration_sarvam_moe.py`	Custom HuggingFace config class
`modeling_sarvam_moe.py`	Custom HuggingFace model class
`sarvam.py`	vLLM Sarvam MoE model implementation
`tokenizer.json` / `tokenizer_config.json`	Tokeniser
`special_tokens_map.json`	Special token mappings
`chat_template.jinja`	Chat template (thinking-enabled)
`generation_config.json`	Default generation settings
`hotpatch_vllm.py`	Optional vLLM registry patch for older vLLM builds
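Before serving from a local copy, it can help to confirm that the download is complete. The following is a small sketch that expands the shard naming pattern from the table and reports missing files; the helper names are hypothetical, and the list covers only a subset of the files above.

```python
from pathlib import Path

def shard_names(num_shards=6):
    """Shard file names following the model-0000N-of-0000M pattern."""
    return [f"model-{i:05d}-of-{num_shards:05d}.safetensors"
            for i in range(1, num_shards + 1)]

# Files needed to load the model and tokeniser (subset of the table above).
EXPECTED_FILES = shard_names() + [
    "model.safetensors.index.json",
    "config.json",
    "configuration_sarvam_moe.py",
    "modeling_sarvam_moe.py",
    "tokenizer.json",
    "tokenizer_config.json",
]

def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# Check the current directory, e.g. when run from the model root.
print(missing_files("."))
```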

Inference

vLLM (Recommended)

Evaluation uses vLLM 0.19.1 (or latest). From the model root directory:

vllm serve --config vllm_config.yaml

Equivalent explicit invocation:

vllm serve . \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 55000 \
  --enable-chunked-prefill \
  --dtype auto

Included vllm_config.yaml:

model: .
trust_remote_code: true
tensor_parallel_size: 1
gpu_memory_utilization: 0.85
max_model_len: 55000
dtype: auto
enable_chunked_prefill: true
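The correspondence between the config file and the explicit invocation follows a simple naming rule, sketched below: underscores in a key become dashes in the flag, `true` booleans become bare switches, and `model` is the positional argument. The rule is inferred from the two listings above rather than from the vLLM documentation, and the helper name is hypothetical.

```python
config = {
    "model": ".",
    "trust_remote_code": True,
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.85,
    "max_model_len": 55000,
    "dtype": "auto",
    "enable_chunked_prefill": True,
}

def to_cli_args(cfg):
    """Translate config keys to `vllm serve` arguments (underscores become dashes)."""
    args = ["vllm", "serve", str(cfg["model"])]
    for key, value in cfg.items():
        if key == "model":
            continue  # passed positionally, not as a flag
        flag = "--" + key.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean switches take no value
        elif value is not False:
            args.extend([flag, str(value)])
    return args

print(" ".join(to_cli_args(config)))
```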

OpenAI-compatible API example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": ".",
    "messages": [{"role": "user", "content": "What is the capital of India?"}],
    "temperature": 1.0,
    "top_p": 1.0,
    "max_tokens": 2048
  }'
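The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the function names are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8000", model=".",
                  temperature=1.0, top_p=1.0, max_tokens=2048):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of India?")` returns the reply as a string. The `model` value must match the name the server was started with, which is `.` when serving from the model root directory.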

Note: This model uses the compressed-tensors W4A16 pack-quantized format with a custom Sarvam MoE architecture. vLLM is the recommended and tested inference backend. llama.cpp support for this exact format may be limited or unavailable; use vLLM for competition-aligned serving.

Evaluation Generation Parameters

Recommended task-specific sampling settings for evaluation:

Benchmark Suite	temperature	top_p	max_new_tokens
Math500 / MMLU / GPQA / AIME / reasoning	1.0	1.0	50,000
Writing Bench	0.7	0.8	16,000 (`top_k=20`)
Agentic (BrowseComp / SWE-bench / τ²-bench)	0.5	1.0	32,768

The default vLLM serving context is max_model_len: 55000 (see vllm_config.yaml). If no config file is provided, organisers use --gpu-memory-utilization 0.85 and --max-model-len 50000.
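These limits interact with the generation budgets in the table. The sketch below computes how many prompt tokens remain for each suite, assuming the server requires prompt tokens plus requested new tokens to fit within the context length. The suite labels are shorthand for the table rows.

```python
def prompt_budget(max_model_len, max_new_tokens):
    """Prompt tokens left once the generation budget is reserved."""
    return max(0, max_model_len - max_new_tokens)

# max_new_tokens per benchmark suite, from the table above.
presets = {"reasoning": 50_000, "writing": 16_000, "agentic": 32_768}

budgets = {
    context: {suite: prompt_budget(context, new_tokens)
              for suite, new_tokens in presets.items()}
    for context in (55_000, 50_000)
}

for context, per_suite in budgets.items():
    for suite, left in per_suite.items():
        print(f"max_model_len={context} {suite}: {left} prompt tokens")
```

Under that assumption, the reasoning preset leaves 5,000 prompt tokens at the default context of 55,000 and none at 50,000, so a smaller generation budget would be needed when the fallback settings apply.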

Known Issues

trust_remote_code required: Sarvam MoE uses the custom configuration_sarvam_moe.py and modeling_sarvam_moe.py files. Always pass --trust-remote-code (vLLM) or trust_remote_code=True (Transformers).
Chunked prefill recommended: Enable enable_chunked_prefill: true (included in vllm_config.yaml) for long-context stability.
Older vLLM builds: vLLM ≥ 0.19 includes native Sarvam MoE support. For older versions, run hotpatch_vllm.py to register SarvamMoEForCausalLM in the vLLM model registry and download sarvam.py.
llama.cpp compatibility: The compressed-tensors W4A16 MoE format is not guaranteed to load in llama.cpp. Use vLLM for reliable inference.
Partial quantisation: Attention, shared experts, layer 0, and lm_head remain at full precision. Effective compression is on routed expert MLP weights only (~98% of expert parameter bytes).
Mixed-precision MoE not supported: Uniform W4A16 is required; mixed W4/W8 schemes inside MoE experts cause vLLM load failures.

License

Apache 2.0 — same as the original Sarvam-30B model.
