--- license: apache-2.0 base_model: Qwen/Qwen2.5-32B tags: - auto-round - quantized - w4a16 - intel - vllm --- # Qwen3.5-27B-heretic-v3-autoround-w4a16 Quantized version of Qwen3.5-27B-heretic-v3 using Intel AutoRound (W4A16). ## Quantization Details - **Method**: AutoRound (Weight-only INT4) - **Precision**: W4A16 (4-bit weights, 16-bit activations) - **Framework**: Intel Neural Compressor ## Performance - **Context Length**: 150k tokens - **Speed**: ~63 tokens/sec on 2x RTX 3090 - **KV Cache**: 97,216 tokens ## Quality Benchmarks | Test | Result | |------|--------| | Logic (widgets) | ✅ Correct | | Math (derivatives) | ✅ Correct | | Coding | ✅ Clean | | Tricky reasoning | ✅ Pass | ## Usage ### vLLM ```bash python -m vllm.entrypoints.openai.api_server \ --model ./Qwen3.5-27B-heretic-v3-autoround-w4a16 \ --host 0.0.0.0 \ --port 1234 \ --dtype bfloat16 \ --max-model-len 150000 \ --quantization auto-round \ --allow-deprecated-quantization \ --tensor-parallel-size 2 \ --gpu-memory-utilization 0.95 ``` ### Python ```python from transformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained( "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16", device_map="auto" ) tokenizer = AutoTokenizer.from_pretrained( "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16" ) ``` ## Hardware Requirements - **Minimum**: 2x GPU with 24GB VRAM each (for 150k context) - **Recommended**: 2x RTX 3090 / 4090 or equivalent ## Credits - Base model: Qwen Team - Quantization: Intel AutoRound - Fine-tuning: Heretic v3