m-i/Qwen3.5-397B-A17B-2.416bit

This model m-i/Qwen3.5-397B-A17B-2.416bit was converted to MLX format from Qwen/Qwen3.5-397B-A17B using mlx-lm version 0.31.3.

Quant Predicate

import mlx.core as mx
mx.set_default_device(mx.cpu)
from mlx_vlm import convert


def qwen397b_predicate(path: str, module):
    """
    Multi-bit quantization predicate for Qwen/Qwen3.5-397B-A17B.

    Maps GGUF _exps patterns (MoE experts) to HF switch_mlp paths.
    """
    # ── SKIP: Critical for numerical stability ─────────────────────────────
    if any(kw in path for kw in [
        "layernorm", "mlp.gate.", "shared_expert_gate.",
        "A_log", "dt_bias", "conv1d", ".bias", ".scales",
    ]):
        return False

    # ── 2-bit: MoE routed experts (switch_mlp) ← THIS HANDLES _exps ───────
    # GGUF: ffn_{gate,up,down}_exps.weight → HF: switch_mlp.{gate,up,down}_proj.weight
    if "switch_mlp" in path:
        if any(proj in path for proj in ["gate_proj", "up_proj"]):
            return {"group_size": 128, "bits": 2, "mode": "affine"}  # IQ2_XXS equivalent
        if "down_proj" in path:
            # down_proj stays at 2 bits but uses a smaller group size (64) for finer scales.
            # Optional: raise to 3 bits to mirror GGUF's IQ2_S > IQ2_XXS ordering.
            return {"group_size": 64, "bits": 2, "mode": "affine"}

    # ── 4-bit: Token embeddings ────────────────────────────────────────────
    if "embed_tokens" in path:
        return {"group_size": 32, "bits": 4, "mode": "affine"}

    # ── 5-bit: Shared expert gate/up ───────────────────────────────────────
    if "shared_expert" in path and any(p in path for p in ["gate_proj", "up_proj"]):
        return {"group_size": 128, "bits": 5, "mode": "affine"}

    # ── 5-bit (group size 32): Shared expert down ──────────────────────────
    if "shared_expert" in path and "down_proj" in path:
        return {"group_size": 32, "bits": 5, "mode": "affine"}

    # ── 5-bit: Linear/full attention projections ───────────────────────────
    if "linear_attn" in path and any(p in path for p in ["in_proj", "out_proj"]):
        return {"group_size": 128, "bits": 5, "mode": "affine"}
    if "self_attn" in path and any(p in path for p in ["q_proj", "k_proj", "v_proj", "o_proj"]):
        return {"group_size": 128, "bits": 5, "mode": "affine"}

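    # ── 6-bit: SSM dynamics, lm_head & fallback ────────────────────────────
    # Note: a path containing both "linear_attn" and "in_proj_a"/"in_proj_b" is
    # already matched by the 5-bit "in_proj" rule above and never reaches this check.
    if any(kw in path for kw in ["in_proj_a", "in_proj_b", "ssm_alpha", "ssm_beta"]):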
        return {"group_size": 128, "bits": 6, "mode": "affine"}
    if "lm_head" in path:
        return {"group_size": 128, "bits": 6, "mode": "affine"}

    return {"group_size": 128, "bits": 6, "mode": "affine"}



repo = "Qwen/Qwen3.5-397B-A17B"
upload_repo = "m-i/Qwen3.5-397B-A17B-2.416bit"

convert(repo, quantize=True, upload_repo=upload_repo, quant_predicate=qwen397b_predicate)
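
The group sizes in the predicate above affect storage cost as well as accuracy, because group-wise affine quantization keeps extra metadata for every group. The sketch below estimates the effective bits per weight for each setting. It assumes that each group stores one 16-bit scale and one 16-bit bias (32 bits of overhead per group); that overhead figure and the helper name `effective_bits` are illustrative assumptions, not values taken from this model card.

```python
def effective_bits(bits: int, group_size: int, meta_bits: int = 32) -> float:
    """Approximate storage cost per weight for group-wise affine quantization.

    meta_bits is the ASSUMED per-group overhead: one 16-bit scale plus one
    16-bit bias. Adjust it if the storage format differs.
    """
    return bits + meta_bits / group_size


# (bits, group_size) pairs copied from the predicate above.
settings = {
    "routed experts gate/up": (2, 128),
    "routed experts down": (2, 64),
    "token embeddings": (4, 32),
    "shared expert gate/up": (5, 128),
    "shared expert down": (5, 32),
    "fallback": (6, 128),
}

for name, (bits, group_size) in settings.items():
    print(f"{name}: {effective_bits(bits, group_size):.2f} bits per weight")
```

Under this assumption, halving the group size from 128 to 64 at 2 bits raises the cost from 2.25 to 2.5 bits per weight, which is why the smaller group size is reserved for selected tensors.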

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("m-i/Qwen3.5-397B-A17B-2.416bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
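
Before loading the model, it can help to estimate how much memory the weights alone will need. The sketch below is a back-of-the-envelope calculation only: it assumes that the "2.416bit" in the repository name is the average number of bits per weight and that "397B" is the total parameter count. It ignores the KV cache, activations, and any runtime overhead, and `weight_footprint_gb` is a hypothetical helper, not part of mlx-lm.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bits = num_params * bits_per_weight
    return total_bits / 8 / 1e9


# Both inputs are read off the repository name and are assumptions.
approx_gb = weight_footprint_gb(num_params=397e9, bits_per_weight=2.416)
print(f"Approximate weight footprint: {approx_gb:.1f} GB")
```

The actual requirement at inference time will be higher than this figure, so treat it as a lower bound when choosing hardware.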
