Instructions to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW", filename="gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW # Run inference directly in the terminal: llama-cli -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW # Run inference directly in the terminal: llama-cli -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW # Run inference directly in the terminal: ./llama-cli -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW # Run inference directly in the terminal: ./build/bin/llama-cli -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Use Docker
docker model run hf.co/coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
- LM Studio
- Jan
- Ollama
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with Ollama:
ollama run hf.co/coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
- Unsloth Studio
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW to start chatting
- Pi
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Run Hermes
hermes
- Docker Model Runner
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with Docker Model Runner:
docker model run hf.co/coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
- Lemonade
How to use coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW
Run and chat with the model
lemonade run user.gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW-{{QUANT_TAG}}List all available models
lemonade list
Run and chat with the model
lemonade run user.gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW-{{QUANT_TAG}}List all available models
lemonade list中文介绍
模型简介
本模型基于 Google 的 gemma-4-31b-it 指令微调模型构建,是一个去限制(decensored)版本,并进一步通过基于 Imatrix 的自适应混合量化策略进行深度压缩优化。
父模型通过 Heretic v1.2.0+custom 工具链,并结合 Arbitrary-Rank Ablation(ARA)方法进行处理,实现对齐约束的移除,从而提升模型输出自由度。
在此基础上,本模型引入了一套“结构感知 + 数据驱动”的量化策略,在大幅降低显存占用的同时,尽可能保留模型的推理能力与语言能力。
核心特性
去限制(Decensored) 移除了原始模型中的对齐限制,使模型具备更高的表达自由度和响应范围。 Imatrix 驱动量化 使用权重重要性矩阵(Imatrix)对模型参数进行评分,实现数据驱动的量化分配。 自适应混合精度 不同模块采用不同量化精度,而非统一量化,从而在压缩率与性能之间取得最佳平衡。 极致压缩 大部分参数压缩至 3-bit(IQ3_S),在可用性范围内实现极限体积优化。
量化策略说明
输出层与 Embedding(强保护) 以下模块使用 Q5_K 量化: lm_head output token_embd / embed_tokens
原因: 这些层直接决定模型输出质量与稳定性,量化过低会导致明显退化。
Attention 机制(核心保护) 以下模块统一使用 IQ4_NL: q_proj / k_proj / v_proj attn_q / attn_k / attn_v o_proj / attn_out
原因: Attention 是模型推理能力的核心,IQ4_NL 在 4-bit 下结合 imatrix 可最大化保留性能。
FFN(MLP)分层量化
基于 Imatrix 计算每个权重的重要性(score = mean(abs(weight))),并进行排序:
前 15%(最重要权重):IQ4_NL 后 85%(冗余权重):IQ3_S
效果: 在极大压缩模型体积的同时,尽量避免关键能力损失。
高精度保留层(不量化)
以下参数不参与低比特量化,保持 FP16 或 FP32:
norm(归一化层) bias rope / position
Imatrix 重要性计算
每个权重张量的重要性通过以下方式计算:
score = mean(abs(weight))
该方法能够有效反映权重能量分布,是一种稳定且计算成本低的近似指标。
量化分布(典型)
Q5_K:极少(仅输出相关层) IQ4_NL:约 15% - 30%(Attention + 高重要性 FFN) IQ3_S:约 70% - 85%(低重要性权重)
推理性能特点
显存占用显著低于传统 Q4_K / Q5_K 混合方案 推理能力接近 FP16 水平(尤其在推理与代码任务中) 适合 16GB / 24GB 显存环境部署 支持长上下文推理
使用要求(极其重要)
必须使用 Imatrix 进行量化或加载,否则模型性能会严重下降。
示例命令:
./llama-quantize --imatrix model.imatrix.gguf --override-kv "$(cat tensor_types.txt | paste -sd ',' -)" input.gguf output.gguf iq3_s
原因: IQ4_NL 与 IQ3_S 属于 imatrix-aware 量化方法,若缺失 imatrix,将导致权重缩放失真,模型表现明显劣化。
已知局限
在极端压缩下: 长链推理能力可能略有下降 数学精度略有损失 强依赖 Imatrix 不适用于训练,仅适用于推理
适用场景
对话系统 代码生成 推理任务 知识问答
风险提示
本模型为去限制版本,可能生成未过滤或潜在不安全内容。使用者需自行评估风险并承担相应责任。
English Description
Model Overview
This model is a decensored derivative of gemma-4-31b-it, further optimized using an Imatrix-driven adaptive mixed-precision quantization pipeline.
The base model was modified using Heretic v1.2.0+custom with the Arbitrary-Rank Ablation (ARA) method, removing alignment constraints and increasing output freedom.
On top of that, this release introduces a structure-aware and data-driven quantization strategy to significantly reduce memory footprint while preserving reasoning and language capabilities.
Key Features
Decensored Model Alignment restrictions have been removed, enabling broader and less constrained outputs. Imatrix-driven Quantization A weight importance matrix (Imatrix) is used to guide quantization decisions. Adaptive Mixed Precision Different layers use different quantization formats instead of a uniform scheme. Extreme Compression Most parameters are compressed to 3-bit (IQ3_S), achieving high compression ratios.
Quantization Strategy
Output & Embedding Protection The following layers use Q5_K: lm_head output embed_tokens
Reason: These layers are highly sensitive to quantization and directly affect output quality.
Attention Protection The following layers use IQ4_NL: q_proj / k_proj / v_proj attention outputs
Reason: Attention layers are critical for reasoning performance.
FFN Rank-based Quantization
Weights are ranked using:
score = mean(abs(weight))
Then assigned as follows:
Top 15%: IQ4_NL Remaining 85%: IQ3_S High-Precision Layers
The following are kept in FP16/FP32:
normalization layers bias positional encodings
Quantization Distribution
Q5_K: minimal IQ4_NL: ~15–30% IQ3_S: ~70–85%
Inference Characteristics
Significantly reduced VRAM usage Near-FP16 reasoning performance Suitable for 16GB–24GB GPUs Supports long-context inference
Critical Requirement
Imatrix MUST be used during quantization or inference.
Example:
./llama-quantize --imatrix model.imatrix.gguf --override-kv "$(cat tensor_types.txt | paste -sd ',' -)" input.gguf output.gguf iq3_s
Without imatrix:
Quantization breaks Severe performance degradation
Limitations
Slight degradation in long-chain reasoning Reduced mathematical precision Strong dependency on imatrix Not suitable for training
Use Cases
Chat / roleplay Coding Reasoning tasks Knowledge QA
Disclaimer
This is a decensored model and may generate unfiltered or unsafe content. Users are responsible for evaluating and managing associated risks.
- Downloads last month
- 2
We're not able to determine the quantization variants.
Pull the model
# Download Lemonade from https://lemonade-server.ai/lemonade pull coooold/gemma-4-31b-it-heretic-ara.i1-MP-3.98BPW