---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- quantized
base_model: TheDrummer/Precog-24B-v1
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# Precog-24B-v1: **Quantized** (compressed-tensors for vLLM)

This repository provides **quantized runtime builds** of **[TheDrummer/Precog-24B-v1](https://huggingface.co/TheDrummer/Precog-24B-v1)**, repackaged for **vLLM** using the **compressed-tensors** format.

> **TL;DR**
> - **Quantized** W4A16 (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
> - Calibration: **512** chat samples, **2048** max sequence length, from `neuralmagic/LLM_compression_calibration`.
> - Weight-only AWQ (group size **128**, **symmetric** INT4), targeting Linear layers; `lm_head` left high-precision.

---

## Revisions & Branches

> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

- **main**: placeholder / landing page
- **W4A16**: 4-bit weights / 16-bit activations (compressed-tensors)
- **W8A16**: 8-bit weights / 16-bit activations (compressed-tensors)

**Quick links**

- main: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/main
- W4A16: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/W4A16
- W8A16: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/W8A16

---

## Repository contents (per revision)

- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with **compressed-tensors** metadata (`weight_format`, `quantization`, `quantization_config`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the parent finetune's chat style)

> Exact file lists may differ between branches; see **Files and versions** for each revision.

---

## Quantization & calibration details (from the attached script)

All settings below are extracted from the provided quantization script.

### Method & scheme

- **Flow:** `llmcompressor` **oneshot** pipeline with an **AWQModifier**.
- **Targets:** `["Linear"]` (weight-only quantization).
- **Ignored layers:** `["lm_head"]` kept in higher precision.
- **Weights (W4A16 branch):** **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`) using **group** strategy with **group_size=128** (Marlin-friendly).
- **Weights (W8A16 branch):** **INT8** (`num_bits=8`, `type="int"`, `symmetric=True`) using **group** strategy with **group_size=128** (most likely activates the BitBLAS kernel on Ampere GPUs).
- **Activations:** not quantized (A16 at runtime; FP16/BF16).
- **Recipe object:** `QuantizationScheme` + `QuantizationArgs` embedded in an AWQ modifier.
- **Save:** `save_compressed=True` so vLLM can load the **compressed-tensors** layout directly.

### Calibration dataset & preprocessing

- **Dataset:** `neuralmagic/LLM_compression_calibration`, split **"train"**.
- **Sample size:** **NUM_CALIBRATION_SAMPLES = 512** (random subset, **seed=42**).
- **Sequence length:** **MAX_SEQUENCE_LENGTH = 2048** (truncate, **no padding**, `add_special_tokens=False`).
- **Chat rendering:** each sample's `messages` list is rendered with `tokenizer.apply_chat_template(..., tokenize=False)` to reflect real chat formatting.
- **Batch processing:** preprocessing and tokenization run in batches with multi-process mapping.

### One-shot compression call

- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset.
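
To make the group-wise symmetric scheme above more concrete, here is a minimal pure-Python sketch of the arithmetic: each run of 128 weights shares one scale, and the zero-point is fixed at 0. It is an illustration only. It is not the `llmcompressor` implementation, it omits AWQ's activation-aware rescaling entirely, and the helper names (`quantize_group`, `dequantize_group`, `quantize_row`) and the `max_abs / QMAX` scale rule are assumptions made for this example; the library's exact scale formula may differ.

```python
import random

GROUP_SIZE = 128                   # matches group_size=128 in the recipe
NUM_BITS = 4                       # INT4 weights (W4A16 branch)
QMAX = 2 ** (NUM_BITS - 1) - 1     # largest signed 4-bit code: 7
QMIN = -(2 ** (NUM_BITS - 1))      # smallest signed 4-bit code: -8


def quantize_group(weights):
    """Symmetric quantization of one group: a single scale, zero-point fixed at 0."""
    max_abs = max(abs(w) for w in weights)
    # Assumed scale rule for this sketch: map the largest magnitude to QMAX.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split one weight row into fixed-size groups and quantize each independently."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    return [
        quantize_group(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]


# Hypothetical weight row: 512 values, so 4 groups of 128.
rng = random.Random(42)
row = [rng.gauss(0.0, 0.02) for _ in range(512)]

groups = quantize_row(row)
restored = [v for codes, scale in groups for v in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups; max abs round-trip error:", max_err)
```

Because every group carries its own scale, a single outlier weight only coarsens the 128 values that share its group rather than the whole row, which is the usual motivation for group-wise over per-tensor quantization.
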

> These choices aim to preserve long-form dialog behavior by calibrating on **chat-templated** text at **2048** tokens, with group-wise symmetric INT4 quantization for stable, high-throughput serving.

---

## Context length

- **Calibration context:** up to **2048 tokens** per sample (per the script).
- **Model context window:** inherited from **TheDrummer/Precog-24B-v1**. Quantization does **not** alter rope/positional embeddings; it only changes weight representation.

---

## Quickstart: vLLM (compressed-tensors)

Install vLLM (latest recommended):

    pip install vllm

Serve (adjust to your hardware; `--revision` selects the quant branch, since `main` holds no weights):

    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    vllm serve TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors \
      --revision W4A16 \
      --quantization compressed-tensors \
      --tensor-parallel-size 4 \
      --max-model-len 2048 \
      --gpu-memory-utilization 0.70 \
      --dtype bfloat16

Example Chat Completions:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors",
        "messages": [
          {"role":"system","content":"You are Precog: helpful, precise, and safe."},
          {"role":"user","content":"List three strategies to reduce KV-cache memory growth at long context."}
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95
      }'

> **Note:** `compressed-tensors` is a **vLLM runtime** format. Loading directly with vanilla 🤗 Transformers is **not supported**.
> For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision parent model.

---

## Prompting / chat template

This package follows the **parent finetune's** chat conventions. If a `chat_template.jinja` file is present, libraries that support `apply_chat_template` will automatically format messages.

Guidelines:

- Keep a concise **system** message (style, constraints, tone).
- Structure **user** prompts clearly; enumerate steps for multi-part tasks.
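
As a rough illustration of the guidelines above, the following Python sketch builds a Chat Completions request with a short system message and an enumerated user prompt, using only the standard library. The endpoint path, model name, and sampling values come from the Quickstart; the helper names `build_payload` and `send_chat`, and the example message text, are hypothetical. `send_chat` assumes a vLLM server is already listening on `localhost:8000`, so the sketch defines it but does not call it.

```python
import json
import urllib.request

MODEL = "TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors"


def build_payload(system, user, max_tokens=512, temperature=0.7, top_p=0.95):
    """Assemble an OpenAI-style chat request body (hypothetical helper)."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the assistant's reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Concise system message; user prompt with explicitly enumerated steps.
payload = build_payload(
    system="Answer precisely. Use numbered lists for multi-part answers.",
    user=(
        "Do the following in order:\n"
        "1. List three strategies to reduce KV-cache memory growth at long context.\n"
        "2. For each strategy, name one drawback.\n"
        "3. Recommend one strategy for a single-GPU setup."
    ),
)
print(json.dumps(payload, indent=2))
```

Once the server from the Quickstart is running, `print(send_chat(payload))` sends the request; the chat template is applied server-side, so the client only supplies the `messages` list.
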

---

## Intended use & notes

- General **instruction-following**, long-form drafting, and summarization
- RAG/agent pipelines (pair with a retriever/tool layer)

> Always review the parent/base model license and evaluate on your domain before production use.

---

## Lineage

- **Finetuned parent:** https://huggingface.co/TheDrummer/Precog-24B-v1
- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)

---

## Hardware tips

- 24B models benefit from **multi-GPU** tensor parallel for throughput.
- Long context is **KV-cache** heavy; tune `--max-model-len` and batch size.
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Enable P2P/NVLink where available; consider CUDA Graphs if stable.

---

## Changelog

- **v1 (current)**: Initial **compressed-tensors W4A16** quantization of `TheDrummer/Precog-24B-v1` with **512-sample / 2048-token** AWQ calibration; vLLM-ready packaging.