---
license: apache-2.0
base_model: sarvamai/sarvam-30b
tags:
- text-generation-inference
- transformers
- fp8
- moe
---

# Sarvam-30B-FP8 (Energy-Efficient Quantized MoE)

This repository contains the FP8-quantized version of the **Sarvam-30B** Mixture-of-Experts (MoE) model, optimized for the **Resilient AI Challenge 2026**.

Precision-targeted quantization reduces the model footprint from **~128GB to ~34.3GB** (a 3.7x compression ratio), so the entire model fits on a single 48GB VRAM GPU (e.g., RTX A6000 or L40S). Serving from one GPU instead of several lowers both idle and active energy use while retaining an expected ~99.5% of the base model's score.

## Compression Methodology

* **Precision Scheme**: FP8 (E4M3) quantization on weights and activations.
* **Target Layers**: All linear projection layers in the attention blocks and the expert feed-forward networks (FFNs).
* **Protected Layers**: Embedding layers, layer normalizations, gating/routing networks, and the language modeling head (`lm_head`) are preserved in `bfloat16` to prevent quality degradation in token selection and routing.
* **Scaling**: Per-tensor static scaling factors were calculated to map the dynamic range of each weight matrix to FP8.
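
The per-tensor scaling step can be illustrated with a minimal sketch. It is not the script used to produce this checkpoint: the helper names `per_tensor_scale` and `scale_and_clamp` are hypothetical, the weights are toy values, and rounding to the FP8 grid is left out. The only fixed fact it relies on is that the largest finite E4M3 value is 448.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_tensor_scale(weights):
    """Return one static scale that maps the largest |weight| onto the FP8 maximum."""
    amax = max(abs(w) for row in weights for w in row)
    # Guard against an all-zero tensor so the scale is never zero.
    return max(amax, 1e-12) / FP8_E4M3_MAX


def scale_and_clamp(weights, scale):
    """Divide by the scale and clamp to the FP8 range (rounding to FP8 is omitted)."""
    return [
        [min(max(w / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for w in row]
        for row in weights
    ]


# Toy 2x2 weight matrix standing in for a projection layer (illustrative values).
weights = [[0.02, -0.75], [1.5, -3.5]]
scale = per_tensor_scale(weights)
scaled = scale_and_clamp(weights, scale)
print(scale, scaled)
```

At inference time the stored scale is multiplied back in, so a single scalar per weight matrix is the only metadata this scheme adds.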

## Expected Performance & Energy Impact

* **Original Model VRAM**: ~128 GB (requires 3 GPUs)
* **Quantized Model VRAM**: **~34.3 GB** (runs on 1 GPU)
* **Expected Energy Consumption**: **~50-60 Wh** (approx. 40-50% reduction from the 100 Wh baseline)
* **Expected Quality Recovery**: **~99.5%** of the original base model's score.
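
The headline figures follow from simple arithmetic, which the short check below reproduces. It only uses the numbers quoted in this card (128 GB, 34.3 GB, a 48 GB GPU, a 100 Wh baseline, and the 50-60 Wh estimate); the variable names are illustrative, and the GPU count ignores KV-cache and activation memory.

```python
import math

original_gb = 128.0   # original model footprint quoted in this card
quantized_gb = 34.3   # FP8 footprint quoted in this card
gpu_vram_gb = 48.0    # single-GPU VRAM target quoted in this card

compression = original_gb / quantized_gb
gpus_before = math.ceil(original_gb / gpu_vram_gb)   # weights only
gpus_after = math.ceil(quantized_gb / gpu_vram_gb)   # weights only

baseline_wh = 100.0
expected_wh = (50.0, 60.0)
# Fractional energy reduction at each end of the expected range.
reductions = [1.0 - wh / baseline_wh for wh in expected_wh]

print(f"compression ratio: {compression:.2f}x")
print(f"GPUs needed: {gpus_before} -> {gpus_after}")
print("energy reduction: " + ", ".join(f"{r:.0%}" for r in reductions))
```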

## Execution and Deployment

### 1. Running with Hugging Face Transformers
Make sure you have `transformers` and `accelerate` installed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use local directory path or your uploaded Hugging Face repo ID
model_id = "./sarvam-30b-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

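# Tokenize the prompt and move the tensors to the model's device
prompt = "What is artificial intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 100 new tokens without tracking gradients
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

# Decode the full sequence (prompt plus completion) back to text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))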
```

### 2. Running with vLLM (Evaluation Environment)
The model configuration includes the standard `quantization_config` parameter. vLLM detects the FP8 scheme automatically and, on GPUs with native FP8 support (NVIDIA Ada Lovelace, Hopper, or newer), uses hardware-accelerated scaled matrix multiplication kernels.
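
Before launching the server, it can be useful to confirm that `quantization_config` is actually present in the checkpoint's `config.json`. The sketch below assumes the model directory is `./sarvam-30b-fp8` (the path used elsewhere in this card) and makes no assumption about which keys the section contains; `read_quantization_config` is a hypothetical helper, not part of any library.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from config.json, or None if it is absent."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


# Prints None if the directory or the section does not exist.
print(read_quantization_config("./sarvam-30b-fp8"))
```

If this prints `None`, vLLM has nothing to detect and would load the weights without the FP8 path. Otherwise, the server can be launched with the command below.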

```bash
# Reference local path or your uploaded Hugging Face repo ID
vllm serve ./sarvam-30b-fp8 \
    --port 8000 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096
```
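
Once the server is up, it exposes an OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `localhost:8000` (matching `--port 8000` above) and that the served model name defaults to the path passed to `vllm serve`; adjust `MODEL_NAME` if `--served-model-name` is set. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in the serve command
MODEL_NAME = "./sarvam-30b-fp8"        # assumption: default served name is the served path


def build_completion_request(prompt, max_tokens=100):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=100, timeout=60):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("What is artificial intelligence?"))` returns the completion text for the same prompt used in the Transformers example.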