---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
  - text-generation
  - conversational
  - compressed-tensors
  - awq
  - w4a16
  - quantized
base_model: TheDrummer/Precog-24B-v1
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# Precog-24B-v1 — **Quantized** (compressed-tensors for vLLM)

This repository provides **quantized runtime builds** of  
**[TheDrummer/Precog-24B-v1](https://huggingface.co/TheDrummer/Precog-24B-v1)**, repackaged for **vLLM** using the **compressed-tensors** format.

> **TL;DR**
> - **Quantized** W4A16 (INT4 weights / 16-bit activations) and W8A16 (INT8 weights / 16-bit activations) builds for vLLM via `--quantization compressed-tensors`.
> - Calibration: **512** chat samples, **2048** max sequence length, from `neuralmagic/LLM_compression_calibration`.
> - Weight-only AWQ (group size **128**, **symmetric** INT4 or INT8 depending on the branch), targeting Linear layers; `lm_head` left high-precision.

---

## Revisions & Branches

> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

- **main** — placeholder / landing page  
- **W4A16** — 4-bit weights / 16-bit activations (compressed-tensors)
- **W8A16** — 8-bit weights / 16-bit activations (compressed-tensors)

**Quick links**

- main: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/main  
- W4A16: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/W4A16
- W8A16: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/W8A16
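
To help choose between the two branches, the following is a rough back-of-the-envelope sketch of the weight footprint. It assumes a nominal 24e9 parameters (taken from the "24B" in the model name), one 16-bit scale per group of 128 weights, and it ignores the unquantized `lm_head` and embeddings, so real checkpoint sizes will differ somewhat.

```python
def quantized_weight_gib(num_params, weight_bits, group_size=128, scale_bits=16):
    """Rough size of group-quantized weights: packed integers plus one scale per group."""
    weight_bytes = num_params * weight_bits / 8
    scale_bytes = (num_params / group_size) * scale_bits / 8
    return (weight_bytes + scale_bytes) / 2**30


PARAMS = 24e9  # assumption: nominal parameter count, not read from the checkpoint

for branch, bits in (("W4A16", 4), ("W8A16", 8)):
    print(f"{branch}: ~{quantized_weight_gib(PARAMS, bits):.1f} GiB of quantized weights")
```

Under these assumptions W8A16 needs roughly twice the weight memory of W4A16; check **Files and versions** for the actual sizes.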

---

## Repository contents (per revision)

- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)  
- `config.json` with **compressed-tensors** metadata (`weight_format`, `quantization`, `quantization_config`, etc.)  
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)  
- Optional: `chat_template.jinja` (inherits the parent finetune’s chat style)

> Exact file lists may differ between branches — see **Files and versions** for each revision.

---

## Quantization & calibration details (from the attached script)

All settings below are extracted from the provided quantization script.

### Method & scheme
- **Flow:** `llmcompressor` **oneshot** pipeline with an **AWQModifier**.  
- **Targets:** `["Linear"]` (weight-only quantization).  
- **Ignored layers:** `["lm_head"]` kept in higher precision.  
- **Weights (W4A16 branch):** **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`) using **group** strategy with **group_size=128** (Marlin-friendly).  
- **Weights (W8A16 branch):** **INT8** (`num_bits=8`, `type="int"`, `symmetric=True`) using **group** strategy with **group_size=128** (kernel selection is left to vLLM; most likely a BitBLAS kernel on Ampere).  
- **Activations:** not quantized (A16 at runtime; FP16/BF16).  
- **Recipe object:** `QuantizationScheme` + `QuantizationArgs` embedded in an AWQ modifier.  
- **Save:** `save_compressed=True` so vLLM can load the **compressed-tensors** layout directly.
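
As an illustration of what "group strategy, symmetric, `group_size=128`" means numerically, here is a minimal pure-Python sketch of symmetric group-wise rounding. It is not the AWQ algorithm (AWQ additionally rescales channels based on activation statistics before rounding), and the scale convention used here (`max_abs / qmax`) is an assumption that may differ from what the library does internally.

```python
import random


def quantize_group_symmetric(weights, num_bits=4, group_size=128):
    """Quantize a flat list of floats group by group.

    Returns (integer codes, per-group scales). Symmetric means there is
    no zero-point: each group stores only one scale.
    """
    qmax = 2 ** (num_bits - 1) - 1    # 7 for INT4, 127 for INT8
    qmin = -(2 ** (num_bits - 1))     # -8 for INT4, -128 for INT8
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=128):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(256)]  # two groups of 128
codes, scales = quantize_group_symmetric(weights)
restored = dequantize(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(scales)} max_abs_error={max_err:.4f}")
```

Smaller groups give each scale fewer weights to cover, which lowers rounding error at the cost of storing more scales.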

### Calibration dataset & preprocessing
- **Dataset:** `neuralmagic/LLM_compression_calibration`, split **"train"**.  
- **Sample size:** **NUM_CALIBRATION_SAMPLES = 512** (random subset, **seed=42**).  
- **Sequence length:** **MAX_SEQUENCE_LENGTH = 2048** (truncate, **no padding**, `add_special_tokens=False`).  
- **Chat rendering:** each sample’s `messages` list is rendered with `tokenizer.apply_chat_template(..., tokenize=False)` to reflect real chat formatting.  
- **Batch processing:** preprocessing and tokenization run in batches with multi-process mapping.
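
The preprocessing steps above can be sketched roughly as follows. The helper names (`pick_calibration_rows`, `render_chat`, `tokenize`) are hypothetical and this is not the original script; the functions only assume a Hugging Face style tokenizer that provides `apply_chat_template` and is callable on text.

```python
import random

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048


def pick_calibration_rows(rows, num_samples=NUM_CALIBRATION_SAMPLES, seed=42):
    """Seeded random subset of the dataset rows (stand-in for shuffle + select)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(num_samples, len(rows)))


def render_chat(example, tokenizer):
    """Render the `messages` list to text with the model's chat template."""
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"text": text}


def tokenize(example, tokenizer, max_length=MAX_SEQUENCE_LENGTH):
    """Truncate to max_length, no padding, no extra special tokens."""
    return tokenizer(
        example["text"],
        padding=False,
        truncation=True,
        max_length=max_length,
        add_special_tokens=False,
    )


# Hypothetical wiring with the real libraries (not executed here):
#   ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
#   ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
#   ds = ds.map(lambda ex: render_chat(ex, tokenizer))
#   ds = ds.map(lambda ex: tokenize(ex, tokenizer), remove_columns=ds.column_names)
```

Special tokens are not added during tokenization because the chat template already inserts them when rendering the text.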

### One-shot compression call
- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset.

> These choices aim to preserve long-form dialog behavior by calibrating on **chat-templated** text at **2048** tokens, with group-wise symmetric INT4 quantization for stable, high-throughput serving.

---

## Context length

- **Calibration context:** up to **2048 tokens** per sample (per the script).  
- **Model context window:** inherited from **TheDrummer/Precog-24B-v1**. Quantization does **not** alter RoPE/positional embeddings; it only changes the weight representation. The `--max-model-len 2048` used in the quickstart is a serving setting, not a model limit.

---

## Quickstart — vLLM (compressed-tensors)

Install vLLM (latest recommended):

    pip install vllm

Serve a quantized branch (adjust to your hardware). Because `main` holds no weights, select a branch with `--revision`:

    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    vllm serve TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors \
      --revision W4A16 \
      --quantization compressed-tensors \
      --tensor-parallel-size 4 \
      --max-model-len 2048 \
      --gpu-memory-utilization 0.70 \
      --dtype bfloat16

Example Chat Completions:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors",
        "messages": [
          {"role":"system","content":"You are Precog — helpful, precise, and safe."},
          {"role":"user","content":"List three strategies to reduce KV-cache memory growth at long context."}
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95
      }'
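
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server from the previous step is listening on `localhost:8000`; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the OpenAI-compatible `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=512, temperature=0.7, top_p=0.95):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000",
    "TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors",
    [
        {"role": "system", "content": "You are Precog, helpful, precise, and safe."},
        {"role": "user", "content": "List three strategies to reduce KV-cache memory growth at long context."},
    ],
)
```

Calling `print(chat(request))` sends the request once the server is up; building the request separately makes it easy to inspect the payload first.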

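> **Note:** these builds are packaged and tested for the **vLLM** runtime. Loading them with 🤗 Transformers is **not covered** by this repository; recent Transformers releases can read `compressed-tensors` checkpoints when the `compressed-tensors` package is installed, but that path is unverified here.  
> For Transformers, prefer a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision parent model.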

---

## Prompting / chat template

This package follows the **parent finetune’s** chat conventions. If a `chat_template.jinja` file is present, libraries that support `apply_chat_template` will automatically format messages.

Guidelines:
- Keep a concise **system** message (style, constraints, tone).  
- Structure **user** prompts clearly; enumerate steps for multi-part tasks.

---

## Intended use & notes

- General **instruction-following**, long-form drafting, and summarization  
- RAG/agent pipelines (pair with a retriever/tool layer)

> Always review the parent/base model license and evaluate on your domain before production use.

---

## Lineage

- **Finetuned parent:** https://huggingface.co/TheDrummer/Precog-24B-v1  
- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)

---

## Hardware tips

- 24B models benefit from **multi-GPU** tensor parallelism (`--tensor-parallel-size`) for throughput.  
- Long context is **KV-cache** heavy — tune `--max-model-len` and batch size.  
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.  
- Enable P2P/NVLink where available. vLLM captures CUDA Graphs by default; pass `--enforce-eager` to disable them if they cause instability.
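
To make the KV-cache tip concrete, the sketch below estimates per-sequence cache size as a function of context length. The shape values (40 layers, 8 KV heads, head dimension 128) are placeholder assumptions for illustration only; read the real ones from the model's `config.json` (`num_hidden_layers`, `num_key_value_heads`, `head_dim`) before relying on the numbers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and token.

    The leading factor 2 accounts for storing both K and V;
    bytes_per_value=2 corresponds to FP16/BF16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Placeholder shape, NOT read from this model's config.json.
ASSUMED_SHAPE = dict(num_layers=40, num_kv_heads=8, head_dim=128)

for seq_len in (2048, 8192, 32768):
    gib = kv_cache_bytes(seq_len=seq_len, **ASSUMED_SHAPE) / 2**30
    print(f"{seq_len:>6} tokens: ~{gib:.2f} GiB per sequence")
```

The cost grows linearly with both context length and the number of concurrent sequences, which is why `--max-model-len` and batch size are the first settings to tune.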

---

## Changelog

- **v1 (current)** — Initial **compressed-tensors W4A16** quantization of `TheDrummer/Precog-24B-v1` with **512-sample / 2048-token** AWQ calibration; vLLM-ready packaging.