quantized_by: ubergarm
pipeline_tag: text-generation
base_model: Qwen/Qwen3.5-397B-A17B
base_model_relation: quantized
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
tags:
  - imatrix
  - conversational
  - qwen3_5_moe
  - ik_llama.cpp

WIP

ik_llama.cpp does not support this model yet, though there is an open issue for it.

For now, to help with testing, I used mainline llama.cpp to compute an imatrix (GGUF format), which others can use to make their own custom imatrix quants.

Check the logs/ directory for details on imatrix calculation.

I'll upload more if/when ik_llama.cpp support is merged.

Inference seems very slow on CPU-only. It probably needs at least one GPU to handle the attention/kv-cache/delta-net tensors, as even hybrid CPU+GPU is much faster.
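As a rough sketch of what a hybrid CPU+GPU launch could look like with mainline llama.cpp: offload everything to the GPU, then override the routed expert tensors back to CPU. The model path and context size are hypothetical, and whether the `exps=CPU` pattern is the best split for this architecture is an assumption, not something tested here. The script only prints the command; it does not run it.

```shell
#!/bin/sh
# Dry run: build and print a hybrid CPU+GPU launch command. Nothing is executed.
model="/path/to/Qwen3.5-397B-A17B-Q3_K.gguf"   # hypothetical path

# -ngl 99      : offload all layers to the GPU by default
# -ot exps=CPU : then keep tensors whose names match "exps" (routed experts) on CPU
# -c 8192      : context size, hypothetical value
cmd="./build/bin/llama-server --model $model -ngl 99 -ot exps=CPU -c 8192"

echo "$cmd"
```

The idea is that attention, kv-cache and delta-net tensors stay on the GPU while the large expert tensors sit in system RAM. Adjust the pattern and context size to fit your VRAM.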

Q3_K 179.97 GiB (3.90 BPW)
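The BPW figure can be sanity-checked from the file size. This is a minimal sketch that assumes the nominal 397B total parameter count from the model name. The exact count differs slightly, so expect the result to land close to, but not exactly on, the reported value.

```shell
#!/bin/sh
# Estimate bits per weight (BPW) from file size and parameter count.
size_gib=179.97          # file size from this model card
params=397000000000      # assumption: nominal total from the "397B" in the model name

# BPW = size in bits / number of weights
bpw=$(awk -v s="$size_gib" -v p="$params" \
    'BEGIN { printf "%.2f", s * 1024 * 1024 * 1024 * 8 / p }')

echo "approx BPW: $bpw"
```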

TODO Perplexity Calculations

👈 Secret Recipe
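# Routed expert tensors get the q3_K/q4_K overrides below.
# Tensors without an override fall back to the Q8_0 base type.
# The trailing 128 is the number of threads.
./build/bin/llama-quantize \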
    --tensor-type ffn_down_exps=q4_K \
    --tensor-type ffn_gate_exps=q3_K \
    --tensor-type ffn_up_exps=q3_K \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    --imatrix /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/imatrix-Qwen3.5-397B-A17B-BF16-mainline.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-BF16-00001-of-00017.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q3_K.gguf \
    Q8_0 \
    128

References