---
license: apache-2.0
base_model: Qwen/Qwen2.5-32B
tags:
- auto-round
- quantized
- w4a16
- intel
- vllm
---

# Qwen3.5-27B-heretic-v3-autoround-w4a16

Quantized version of Qwen3.5-27B-heretic-v3 using Intel AutoRound (W4A16).

## Quantization Details
- **Method**: AutoRound (Weight-only INT4)
- **Precision**: W4A16 (4-bit weights, 16-bit activations)
- **Framework**: Intel Neural Compressor

## Performance
- **Context Length**: 150k tokens
- **Speed**: ~63 tokens/sec on 2x RTX 3090
- **KV Cache**: 97,216 tokens

## Quality Benchmarks
| Test | Result |
|------|--------|
| Logic (widgets) | ✅ Correct |
| Math (derivatives) | ✅ Correct |
| Coding | ✅ Clean |
| Tricky reasoning | ✅ Pass |

## Usage

### vLLM
```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./Qwen3.5-27B-heretic-v3-autoround-w4a16 \
  --host 0.0.0.0 \
  --port 1234 \
  --dtype bfloat16 \
  --max-model-len 150000 \
  --quantization auto-round \
  --allow-deprecated-quantization \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
```

### Python
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16"
)
```

## Hardware Requirements
- **Minimum**: 2x GPU with 24GB VRAM each (for 150k context)
- **Recommended**: 2x RTX 3090 / 4090 or equivalent

## Credits
- Base model: Qwen Team
- Quantization: Intel AutoRound
- Fine-tuning: Heretic v3