metadata
language:
  - en
license: apache-2.0
base_model: Qwen/Qwen3.5-9B
tags:
  - qwen
  - gptq
  - quantized
  - math
  - causal-lm
library_name: transformers
pipeline_tag: text-generation

Qwen3.5-9B-GPTQ-INT8

This model is a GPTQ-quantized version of Qwen/Qwen3.5-9B with a normalized text-only config.json.

Quantization

  • Method: GPTQ
  • Bits: 8
  • Group size: 128
  • desc_act: False
  • damp_percent: 0.1
  • Calibration preset: math_qa_cot
  • Calibration dataset: zwhe99/DeepMath-103K (train split)
  • Max calibration samples: 128
  • Max sequence length: 16384
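
The calibration settings above can be pictured with a small sketch. It assumes each dataset row is a mapping with `question` and `r1_solution_1` fields (the column names passed in the reproduction command), and it uses a hypothetical "Question / Solution" template, since the exact text format of the `math_qa_cot` preset is not documented in this card. Truncation to 16384 tokens needs the model tokenizer and is not shown.

```python
from typing import Iterable, Mapping

MAX_CALIBRATION_SAMPLES = 128  # "Max calibration samples" from this card


def build_calibration_texts(
    rows: Iterable[Mapping[str, str]],
    question_column: str = "question",
    answer_column: str = "r1_solution_1",
    max_samples: int = MAX_CALIBRATION_SAMPLES,
) -> list[str]:
    """Join question and worked solution into one calibration string per row.

    The template below is a hypothetical stand-in for the math_qa_cot preset.
    """
    texts: list[str] = []
    for row in rows:
        question = (row.get(question_column) or "").strip()
        answer = (row.get(answer_column) or "").strip()
        if not question or not answer:
            continue  # skip incomplete rows instead of calibrating on partial text
        texts.append(f"Question: {question}\n\nSolution: {answer}")
        if len(texts) >= max_samples:
            break  # stop once the sample cap is reached
    return texts
```

In practice `rows` could be the train split of zwhe99/DeepMath-103K loaded with the `datasets` library; any iterable of dict-like rows works with this helper.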

Reproduction

uv run python quantization/quantize_qwen35_9b_gptq.py \
  --model-name Qwen/Qwen3.5-9B \
  --output-dir /workspace/lowbit-math-reasoning/experiments/models/Qwen3.5-9B-GPTQ-INT8 \
  --dataset-name zwhe99/DeepMath-103K \
  --dataset-config '' \
  --dataset-split train \
  --calibration-preset math_qa_cot \
  --question-column question \
  --answer-column r1_solution_1 \
  --text-column r1_solution_1 \
  --max-calibration-samples 128 \
  --max-seq-len 16384 \
  --bits 8 \
  --group-size 128 \
  --damp-percent 0.1
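
For scripted runs, the same invocation can be assembled in Python. This is a minimal sketch that only mirrors the flags shown above; the helper name `build_quantize_command` and the example output directory are illustrative assumptions, and the script's behavior for other flag values is not covered here.

```python
import shlex

SCRIPT = "quantization/quantize_qwen35_9b_gptq.py"


def build_quantize_command(output_dir: str) -> list[str]:
    """Return the argv list for the quantization script with this card's settings."""
    options = {
        "--model-name": "Qwen/Qwen3.5-9B",
        "--output-dir": output_dir,
        "--dataset-name": "zwhe99/DeepMath-103K",
        "--dataset-config": "",  # empty string, as in the shell command
        "--dataset-split": "train",
        "--calibration-preset": "math_qa_cot",
        "--question-column": "question",
        "--answer-column": "r1_solution_1",
        "--text-column": "r1_solution_1",
        "--max-calibration-samples": 128,
        "--max-seq-len": 16384,
        "--bits": 8,
        "--group-size": 128,
        "--damp-percent": 0.1,
    }
    argv = ["uv", "run", "python", SCRIPT]
    for flag, value in options.items():
        argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    # Hypothetical output directory; substitute your own path.
    cmd = build_quantize_command("out/Qwen3.5-9B-GPTQ-INT8")
    print(shlex.join(cmd))
    # To launch the run: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, which matters for the empty `--dataset-config` value.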

After save_pretrained() returns, the current quantization script rewrites config.json so that the exported checkpoint uses the same text-only qwen3_5_text layout as the working INT4 checkpoint.
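
What such a rewrite might look like is sketched below. This is not the author's script: it assumes the saved config nests the language-model settings under a `text_config` key (a common pattern for multimodal configs in transformers), and the list of top-level keys carried over is a guess. Only the `qwen3_5_text` model type comes from this card. A real rewrite may also need to adjust other fields, such as `architectures`, which the card does not specify.

```python
import json
from pathlib import Path

# Assumption: these top-level keys should survive the flattening step.
CARRY_OVER_KEYS = ("quantization_config", "torch_dtype", "transformers_version")


def normalize_to_text_only(config: dict, text_model_type: str = "qwen3_5_text") -> dict:
    """Promote a nested text_config to the top level (hypothetical layout)."""
    text_config = config.get("text_config")
    if not isinstance(text_config, dict):
        return dict(config)  # nothing nested, so return an unchanged copy
    flat = dict(text_config)
    flat["model_type"] = text_model_type
    for key in CARRY_OVER_KEYS:
        if key in config and key not in flat:
            flat[key] = config[key]
    return flat


def rewrite_config_file(checkpoint_dir: str) -> None:
    """Rewrite config.json in place inside an exported checkpoint directory."""
    path = Path(checkpoint_dir) / "config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    normalized = normalize_to_text_only(config)
    path.write_text(json.dumps(normalized, indent=2) + "\n", encoding="utf-8")
```

Keeping the transformation in a pure function makes it easy to check on a small dictionary before touching a real checkpoint.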

Validation

This normalized-config checkpoint was re-evaluated on GSM8K. It matched the original INT8 checkpoint's exact-match (EM) accuracy of 0.96 while raising throughput from 105.98 to 150.84 tok/s, roughly 42% higher.

  • Original INT8: EM 0.96, 105.98 tok/s
  • Fixed-config INT8: EM 0.96, 150.84 tok/s
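
The relative throughput change follows directly from the two figures above. A short calculation, using only the numbers reported in this card:

```python
original_tok_s = 105.98  # Original INT8
fixed_tok_s = 150.84     # Fixed-config INT8

speedup = fixed_tok_s / original_tok_s
percent_gain = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.2f}x")                # speedup: 1.42x
print(f"throughput gain: {percent_gain:.1f}%")   # throughput gain: 42.3%
```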

Notes

  • This repository contains quantized weights only.
  • The checkpoint is intended for text-only evaluation.
  • vLLM loads this checkpoint as gptq_marlin.
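
A minimal loading sketch with vLLM, under a few assumptions: `model_id` is a placeholder for the local path or Hub repository id of this checkpoint, `max_model_len` is an arbitrary example value, and a vLLM build with GPTQ Marlin support on a compatible GPU is available. Since vLLM selects `gptq_marlin` for this checkpoint on its own, passing `quantization` explicitly is optional and shown only to make the choice visible.

```python
def vllm_engine_kwargs(model_id: str, max_model_len: int = 4096) -> dict:
    """Build keyword arguments for vllm.LLM; values other than the kernel name are examples."""
    return {
        "model": model_id,
        "quantization": "gptq_marlin",  # kernel named in the notes above
        "max_model_len": max_model_len,
    }


def generate(model_id: str, prompt: str, max_tokens: int = 256) -> str:
    """Run one greedy text-only completion."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**vllm_engine_kwargs(model_id))
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    # Placeholder id: replace with the actual path or repository id.
    print(vllm_engine_kwargs("path/or/repo-id-of-this-checkpoint"))
```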