---
license: apache-2.0
base_model: sarvamai/sarvam-30b
tags:
- text-generation-inference
- transformers
- fp8
- moe
---

# Sarvam-30B-FP8 (Energy-Efficient Quantized MoE)

This repository contains the FP8 quantized version of the **Sarvam-30B** Mixture of Experts (MoE) model, optimized for the **Resilient AI Challenge 2026**.

Through precision-targeted quantization, the model footprint has been reduced from **~128GB to ~34.3GB** (a 3.7x compression ratio). The entire model now fits on a single 48GB VRAM GPU (e.g., RTX A6000 or L40) instead of three, which lowers both idle and active energy use while retaining an expected ~99.5% of the base model's score.

## Compression Methodology

* **Precision Scheme**: FP8 (E4M3) quantization on weights and activations.
* **Target Layers**: All projection linear layers in the attention block and expert feed-forward networks (FFNs).
* **Protected Layers**: Embedding layers, layer normalizations, gating/routing networks, and the language modeling head (`lm_head`) are preserved in `bfloat16` to prevent quality degradation in token selection and routing.
* **Scaling**: Per-tensor static scaling factors were calculated to map the dynamic range of each weight matrix to FP8.

## Expected Performance & Energy Impact

* **Original Model VRAM**: ~128 GB (requires 3 GPUs)
* **Quantized Model VRAM**: **~34.3 GB** (runs on 1 GPU)
* **Expected Energy Consumption**: **~50-60 Wh** (a 40-50% reduction from the 100 Wh baseline)
* **Expected Quality Recovery**: **~99.5%** of the original base model's score.

## Execution and Deployment

### 1. Running with Hugging Face Transformers

Make sure you have `transformers` and `accelerate` installed.

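To confirm that the packages are importable before attempting a multi-gigabyte model load, a minimal sketch using only the standard library might look like this. The helper name `missing_packages` is hypothetical, and `torch` is added to the list as an assumption because the example in this section imports it directly.

```python
import importlib.util

# Packages the Transformers example relies on.
# `torch` is an assumption added here: the loading example imports it directly.
REQUIRED = ("torch", "transformers", "accelerate")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

Once nothing is reported missing, the model can be loaded as follows.
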
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use local directory path or your uploaded Hugging Face repo ID
model_id = "./sarvam-30b-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # dtype for the layers kept in bfloat16
    device_map="auto",           # let accelerate place the weights on available devices
)

prompt = "What is artificial intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### 2. Running with vLLM (Evaluation Environment)

The model configuration includes the standard `quantization_config` parameter. vLLM detects the FP8 scheme from it automatically and, on GPUs with native FP8 support, uses hardware-accelerated scaled matrix multiplication kernels.

```bash
# Reference local path or your uploaded Hugging Face repo ID
vllm serve ./sarvam-30b-fp8 \
  --port 8000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```
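
Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library and rests on a few assumptions: the server is reachable on `localhost`, the port matches the `--port 8000` flag above, and no `--served-model-name` was set, so the model name is the path passed to `vllm serve`. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumptions: server on localhost, port taken from the `vllm serve` command above.
BASE_URL = "http://localhost:8000"
# Assumption: without --served-model-name, vLLM registers the model under
# the path or repo ID that was passed to `vllm serve`.
MODEL_NAME = "./sarvam-30b-fp8"


def build_completion_request(prompt, max_tokens=100, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `print(complete("What is artificial intelligence?"))` should return a completion. Keep the prompt length plus `max_tokens` within the `--max-model-len 4096` limit configured above.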