Hy-MT2-30B-A3B GGUF Models

Chinese documentation: README.zh-CN.md

This repository contains GGUF conversions and quantizations of Tencent's tencent/Hy-MT2-30B-A3B translation model for llama.cpp-compatible inference.

Files

| File | Type / Quantization | Size | Notes |
|---|---|---|---|
| Hy-MT2-30B-A3B-BF16.gguf | BF16 / FP16-class source GGUF | ~56 GB | Highest-fidelity source GGUF; useful for re-quantization or maximum-quality inference |
| Hy-MT2-30B-A3B-Q2_K.gguf | Q2_K | ~11 GB | Smallest, lowest-memory option; lowest quality |
| Hy-MT2-30B-A3B-Q3_K_M.gguf | Q3_K_M | ~14 GB | Very low-memory option; better quality than Q2_K |
| Hy-MT2-30B-A3B-Q4_K_M.gguf | Q4_K_M | ~17 GB | Recommended balanced, low-memory option |
| Hy-MT2-30B-A3B-Q5_K_M.gguf | Q5_K_M | ~20 GB | Better quality than Q4_K_M for moderate extra memory |
| Hy-MT2-30B-A3B-Q6_K.gguf | Q6_K | ~24 GB | Better quality; higher VRAM/RAM usage |
| Hy-MT2-30B-A3B-Q8_0.gguf | Q8_0 | ~30 GB | Highest fidelity among the quantized files |

SHA256 sidecar files are provided for the GGUF files when available.
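
A download can be checked against its sidecar before loading. The following is a minimal sketch, assuming the sidecar is named `<file>.sha256` and uses the usual `sha256sum` layout (hex digest first on the line); the naming in this repository may differ. The helper name `verify_gguf` is hypothetical.

```shell
# Verify a GGUF file against its SHA256 sidecar.
# Assumption: the sidecar defaults to "<file>.sha256" and its first
# whitespace-separated field is the expected hex digest.
verify_gguf() {
  file="$1"
  sidecar="${2:-$file.sha256}"
  if [ ! -f "$file" ] || [ ! -f "$sidecar" ]; then
    echo "missing file or sidecar" >&2
    return 2
  fi
  expected=$(awk '{print $1; exit}' "$sidecar")
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$expected" = "$actual" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}
```

For example, `verify_gguf Hy-MT2-30B-A3B-Q4_K_M.gguf` prints `OK: ...` on a match and returns a non-zero status otherwise. On systems without `sha256sum` (for example macOS), `shasum -a 256` is the usual substitute.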

Quantization notes

The Q2_K, Q3_K_M, and Q5_K_M files were quantized directly from the BF16 GGUF source, not requantized from another low-bit GGUF.

Because Hy-MT2-30B-A3B uses the hy_v3 architecture, conversion and quantization require llama.cpp tooling that supports hy_v3. A generic llama.cpp quantizer may fail with unknown model architecture: 'hy_v3'.
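
To reproduce or extend the quantized set from the BF16 source, a run might look like the sketch below. It only prints the `llama-quantize` commands so they can be reviewed first. The binary path and the `QUANTIZE_BIN` / `SRC_GGUF` variables are assumptions, and the binary must come from a llama.cpp tree with hy_v3 support. This illustrates the general workflow, not necessarily the exact commands used to build the files here.

```shell
# Print llama-quantize commands that quantize directly from the BF16 source.
# Assumptions: QUANTIZE_BIN points at a hy_v3-capable llama-quantize binary,
# and the output names follow this repository's naming pattern.
QUANTIZE_BIN="${QUANTIZE_BIN:-./llama-quantize}"
SRC_GGUF="${SRC_GGUF:-Hy-MT2-30B-A3B-BF16.gguf}"

quantize_cmd() {
  qtype="$1"
  printf '%s %s %s %s\n' "$QUANTIZE_BIN" "$SRC_GGUF" "Hy-MT2-30B-A3B-$qtype.gguf" "$qtype"
}

# One command per target type; pipe the output to sh to actually run them.
for qtype in Q2_K Q3_K_M Q5_K_M; do
  quantize_cmd "$qtype"
done
```

Always passing the BF16 file as input avoids compounding the error of an already quantized file.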

Important compatibility note

Hy-MT2-30B-A3B uses the hy_v3 architecture. It requires a llama.cpp build that supports this architecture. If your llama.cpp build does not support it, loading the GGUF may fail with:

unknown model architecture: 'hy_v3'

Use a compatible llama.cpp build/branch for Hy-V3/Hy-MT2 models.
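
When scripting deployments, it can help to tell this failure apart from other load errors. The helper below is a small sketch that searches a captured log for the error text quoted above. The function name `has_arch_error` and the log file name are hypothetical.

```shell
# Return 0 if the given log contains the unsupported-architecture error.
# Assumption: the runtime's stderr/stdout was captured to this file.
has_arch_error() {
  grep -q "unknown model architecture: 'hy_v3'" "$1"
}

# Example (assuming load.log holds the output of a failed load attempt):
#   has_arch_error load.log && echo "this llama.cpp build lacks hy_v3 support"
```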

Recommended llama.cpp server usage

Example:

./llama-server \
  -m Hy-MT2-30B-A3B-Q4_K_M.gguf \
  --alias tencent/Hy-MT2-30B-A3B-GGUF:Q4_K_M \
  --host 0.0.0.0 \
  --port 18080 \
  -c 131072 \
  --n-gpu-layers 60 \
  --jinja \
  -r '<eos:6124c78e>'

Notes:

  • --jinja is recommended so llama.cpp uses the chat template correctly.
  • -r '<eos:6124c78e>' is recommended as a reverse prompt / stop marker because this model may emit the textual EOS marker if the runtime does not treat it as a native EOS token.
  • Adjust --n-gpu-layers according to your GPU memory.
  • For long context, make sure you have enough VRAM/RAM. KV-cache quantization may be useful on smaller GPUs if supported by your runtime.
  • BF16/FP16-class GGUF requires much more RAM/VRAM than the quantized files.
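
As an illustration of the memory-related notes above, the sketch below prints a `llama-server` command with a shorter context and a quantized KV cache. The 32768-token context, the `q8_0` cache type, and the layer count of 40 are illustrative assumptions, not settings recommended by this repository. A quantized V cache generally requires flash attention to be enabled, and the spelling of that flag differs between llama.cpp builds, so check `llama-server --help` for yours.

```shell
# Print a llama-server command line for a smaller GPU.
# Assumptions: context size and KV-cache type are example values only;
# server_cmd is a hypothetical helper, not part of llama.cpp.
server_cmd() {
  model="$1"
  ngl="${2:-60}"
  printf '%s ' ./llama-server \
    -m "$model" \
    --host 0.0.0.0 --port 18080 \
    -c 32768 \
    --n-gpu-layers "$ngl" \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --jinja \
    -r "'<eos:6124c78e>'"
  printf '\n'
}

# Show the command for the Q4_K_M file with 40 layers offloaded.
server_cmd Hy-MT2-30B-A3B-Q4_K_M.gguf 40
```

Printing the command rather than running it makes it easy to adjust the values before starting the server.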

Example OpenAI-compatible API request

curl http://127.0.0.1:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "user", "content": "Translate to English: 今天天气很好。"}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'

The response is a JSON object; its choices[0].message.content field should contain a direct translation such as:

The weather is very nice today.
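
If the server was started without the `-r` stop marker, the OpenAI-compatible `stop` field can carry the same marker per request. The sketch below builds such a payload. The helper name `build_payload` is hypothetical, and it assumes single-line input in which only backslashes and double quotes need escaping. For arbitrary text, a real JSON tool is the safer choice.

```shell
# Build a chat-completions payload that passes the EOS marker as a stop string.
# Assumption: the input is single-line text; only backslashes and double
# quotes are escaped here.
build_payload() {
  text=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":128,"stop":["<eos:6124c78e>"]}\n' "$text"
}

# Example request against the server started above:
#   curl -s http://127.0.0.1:18080/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d "$(build_payload 'Translate to English: 今天天气很好。')"
```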

Source model and license

  • Upstream model: https://huggingface.co/tencent/Hy-MT2-30B-A3B
  • These GGUF files are converted and quantized from the upstream model.
  • Please follow the upstream model license and usage terms. A copy of the upstream LICENSE.txt is included when available.