How to use from
Hermes Agent
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Run Hermes
hermes
Quick Links

中文介绍

模型简介

本模型基于 Google 的 gemma-4-31b-it 指令微调模型构建,是一个去限制(decensored)版本,并进一步通过基于 Imatrix 的自适应混合量化策略进行深度压缩优化。

父模型通过 Heretic v1.2.0+custom 工具链,并结合 Arbitrary-Rank Ablation(ARA)方法进行处理,实现对齐约束的移除,从而提升模型输出自由度。

在此基础上,本模型引入了一套“结构感知 + 数据驱动”的量化策略,在大幅降低显存占用的同时,尽可能保留模型的推理能力与语言能力。

核心特性

去限制(Decensored) 移除了原始模型中的对齐限制,使模型具备更高的表达自由度和响应范围。 Imatrix 驱动量化 使用权重重要性矩阵(Imatrix)对模型参数进行评分,实现数据驱动的量化分配。 自适应混合精度 不同模块采用不同量化精度,而非统一量化,从而在压缩率与性能之间取得最佳平衡。 极致压缩 大部分参数压缩至 3-bit(IQ3_S),在可用性范围内实现极限体积优化。

量化策略说明

输出层与 Embedding(强保护) 以下模块使用 Q5_K 量化: lm_head output token_embd / embed_tokens

原因: 这些层直接决定模型输出质量与稳定性,量化过低会导致明显退化。

Attention 机制(核心保护) 以下模块统一使用 IQ4_NL: q_proj / k_proj / v_proj attn_q / attn_k / attn_v o_proj / attn_out

原因: Attention 是模型推理能力的核心,IQ4_NL 在 4-bit 下结合 imatrix 可最大化保留性能。

FFN(MLP)分层量化

基于 Imatrix 计算每个权重的重要性(score = mean(abs(weight))),并进行排序:

前 15%(最重要权重):IQ4_NL 后 85%(冗余权重):IQ3_S

效果: 在极大压缩模型体积的同时,尽量避免关键能力损失。

高精度保留层(不量化)

以下参数不参与低比特量化,保持 FP16 或 FP32:

norm(归一化层) bias rope / position

Imatrix 重要性计算

每个权重张量的重要性通过以下方式计算:

score = mean(abs(weight))

该方法能够有效反映权重能量分布,是一种稳定且计算成本低的近似指标。

量化分布(典型)

Q5_K:极少(仅输出相关层) IQ4_NL:约 15% - 30%(Attention + 高重要性 FFN) IQ3_S:约 70% - 85%(低重要性权重)

推理性能特点

显存占用显著低于传统 Q4_K / Q5_K 混合方案 推理能力接近 FP16 水平(尤其在推理与代码任务中) 适合 16GB / 24GB 显存环境部署 支持长上下文推理

使用要求(极其重要)

必须使用 Imatrix 进行量化或加载,否则模型性能会严重下降。

示例命令:

./llama-quantize --imatrix model.imatrix.gguf --override-kv "$(cat tensor_types.txt | paste -sd ',' -)" input.gguf output.gguf iq3_s

原因: IQ4_NL 与 IQ3_S 属于 imatrix-aware 量化方法,若缺失 imatrix,将导致权重缩放失真,模型表现明显劣化。

已知局限

在极端压缩下: 长链推理能力可能略有下降 数学精度略有损失 强依赖 Imatrix 不适用于训练,仅适用于推理

适用场景

对话系统 代码生成 推理任务 知识问答

风险提示

本模型为去限制版本,可能生成未过滤或潜在不安全内容。使用者需自行评估风险并承担相应责任。

English Description

Model Overview

This model is a decensored derivative of gemma-4-31b-it, further optimized using an Imatrix-driven adaptive mixed-precision quantization pipeline.

The base model was modified using Heretic v1.2.0+custom with the Arbitrary-Rank Ablation (ARA) method, removing alignment constraints and increasing output freedom.

On top of that, this release introduces a structure-aware and data-driven quantization strategy to significantly reduce memory footprint while preserving reasoning and language capabilities.

Key Features

Decensored Model Alignment restrictions have been removed, enabling broader and less constrained outputs. Imatrix-driven Quantization A weight importance matrix (Imatrix) is used to guide quantization decisions. Adaptive Mixed Precision Different layers use different quantization formats instead of a uniform scheme. Extreme Compression Most parameters are compressed to 3-bit (IQ3_S), achieving high compression ratios.

Quantization Strategy

Output & Embedding Protection The following layers use Q5_K: lm_head output embed_tokens

Reason: These layers are highly sensitive to quantization and directly affect output quality.

Attention Protection The following layers use IQ4_NL: q_proj / k_proj / v_proj attention outputs

Reason: Attention layers are critical for reasoning performance.

FFN Rank-based Quantization

Weights are ranked using:

score = mean(abs(weight))

Then assigned as follows:

Top 15%: IQ4_NL Remaining 85%: IQ3_S High-Precision Layers

The following are kept in FP16/FP32:

normalization layers bias positional encodings

Quantization Distribution

Q5_K: minimal IQ4_NL: ~15–30% IQ3_S: ~70–85%

Inference Characteristics

Significantly reduced VRAM usage Near-FP16 reasoning performance Suitable for 16GB–24GB GPUs Supports long-context inference

Critical Requirement

Imatrix MUST be used during quantization or inference.

Example:

./llama-quantize --imatrix model.imatrix.gguf --override-kv "$(cat tensor_types.txt | paste -sd ',' -)" input.gguf output.gguf iq3_s

Without imatrix:

Quantization breaks Severe performance degradation

Limitations

Slight degradation in long-chain reasoning Reduced mathematical precision Strong dependency on imatrix Not suitable for training

Use Cases

Chat / roleplay Coding Reasoning tasks Knowledge QA

Disclaimer

This is a decensored model and may generate unfiltered or unsafe content. Users are responsible for evaluating and managing associated risks.

Downloads last month
2
GGUF
Model size
31B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support