qwen3.5-2b-id-meeting-summarizer (MLX 4-bit)

MLX-quantized port of acul3/qwen3.5-2b-id-meeting-summarizer for on-device inference on Apple Silicon (Mac, iPad, iPhone via mlx-swift-lm).

  • Quantization: Q4, group_size=64 (4.503 bits/weight effective)
  • Size: 1.0 GB on disk
  • Speed: ~30 tok/s measured on an M2 Pro Mac. Estimated ~10–15 tok/s on iPhone 14 Pro.
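
The effective bits/weight figure above can be approximated by hand. The sketch below assumes MLX's affine quantization stores one 16-bit scale and one 16-bit bias per group of weights, which adds 32/64 = 0.5 bits per weight on top of the 4-bit payload. The function name and its parameters are illustrative and not part of mlx-lm.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Payload bits plus per-group scale/bias overhead, amortized per weight.

    Assumption: one scale and one bias are stored per group, each 16 bits wide.
    """
    return bits + (scale_bits + bias_bits) / group_size


print(effective_bits_per_weight(bits=4, group_size=64))  # 4.5
```

The reported 4.503 is slightly higher than 4.5, presumably because a few tensors (for example normalization weights) stay unquantized. That explanation is an inference, not something the source states.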

Three tasks, same model

The fine-tune supports three prompt-selected tasks. See the source README for verbatim training prompts.

| Task | Format pass | Notes |
| --- | --- | --- |
| paragraph | 100% | single-paragraph Indonesian summary |
| title_generator | 100% | takes a summary as input, returns ≤7-word title |
| rich_summary | 50% (raw) / 85–95% (with guards) | full markdown breakdown with Overview / Conclusion / Action Items |
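
The source does not spell out what a "format pass" or a "guard" checks. As a rough sketch of what such checks could look like, the hypothetical helpers below validate a title's word count and verify that a rich summary contains the three expected sections. The heading strings and function names are assumptions. The model's actual headings may differ, so adjust `REQUIRED_SECTIONS` to match real output.

```python
# Assumed section names, taken from the task table; the real headings may differ.
REQUIRED_SECTIONS = ("Overview", "Conclusion", "Action Items")


def title_ok(title: str, max_words: int = 7) -> bool:
    """True if the title is non-empty and has at most `max_words` words."""
    words = title.split()
    return 0 < len(words) <= max_words


def rich_summary_ok(text: str) -> bool:
    """True if every required section name appears on a markdown heading line."""
    headings = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            headings.append(stripped.lstrip("#").strip().lower())
    return all(any(name.lower() in h for h in headings)
               for name in REQUIRED_SECTIONS)
```

A caller could regenerate or fall back to the paragraph task when `rich_summary_ok` returns False.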

Critical: always use repetition_penalty=1.1

Q4 quantization shifts the next-token distribution just enough that pure greedy decoding (do_sample=False, temp=0) falls into local repetition loops after the first few sentences. The source model at bf16/fp16 doesn't have this problem.

A repetition penalty of 1.1 fully fixes the loop without degrading content quality. This is required, not optional.
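
To illustrate why a small penalty breaks loops, here is a pure-Python sketch of the commonly used repetition-penalty rule: logits of recently generated tokens are divided by the penalty when positive and multiplied by it when negative, so repeated tokens always become less likely. mlx-lm's own processor is assumed to follow the same rule on arrays. The function below is illustrative only and is not the library's code.

```python
def apply_repetition_penalty(logits, recent_token_ids, penalty=1.1):
    """Return a copy of `logits` with recently seen tokens made less likely."""
    out = list(logits)
    for tok in set(recent_token_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty   # shrink positive logits
        else:
            out[tok] = out[tok] * penalty   # push negative logits further down
    return out


logits = [2.2, -1.0, 0.5]
# Tokens 0 and 1 were generated recently; token 2 was not and stays unchanged.
penalized = apply_repetition_penalty(logits, recent_token_ids=[0, 1])
print(penalized)
```

Under greedy decoding, even this small shift can be enough to change which token has the highest logit once a phrase starts repeating.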

Usage (Python, mlx-lm)

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4")

PROMPT = """You are a helpful assistant expert in writing.
You answer only with the result without explanation or pretext.
Please follow the instructions word by word obediently.
<transcript>
Audio Transcript:
{transcript}
</transcript>

Analyze and generate a summary based on the audio transcript above written in 1 paragraph.
... [see source README for the full verbatim prompt]
"""

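# Placeholder: replace with the meeting transcript text to summarize.
my_transcript = "..."

# Qwen3.5's chat template emits <think></think> by default. The fine-tune was
# trained on non-thinking outputs, so pass enable_thinking=False.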
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": PROMPT.format(transcript=my_transcript)}],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,
)

out = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=make_sampler(temp=0.0),
    logits_processors=make_logits_processors(repetition_penalty=1.1),  # required
)
print(out)
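
As a sanity check that the penalty is actually taking effect, the output can be scanned for repeated word n-grams. The sketch below is a hypothetical helper, not part of mlx-lm, and the default window size and threshold are arbitrary starting points to tune rather than measured values.

```python
from collections import Counter


def max_ngram_repeats(text: str, n: int = 8) -> int:
    """Highest number of times any run of `n` consecutive words occurs."""
    words = text.split()
    if len(words) < n:
        return 0
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(counts.values())


def looks_looped(text: str, n: int = 8, threshold: int = 3) -> bool:
    """Heuristic: flag output where some n-gram occurs `threshold`+ times."""
    return max_ngram_repeats(text, n) >= threshold
```

Calling `looks_looped(out)` on the generated summary gives a cheap signal for retrying or logging a failure.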

Usage (Swift, mlx-swift-lm)

import MLXLLM
import MLXLMCommon

let config = ModelConfiguration(
    id: "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4",
    defaultPrompt: "Ringkas teks berikut."
)
let container = try await #huggingFaceLoadModelContainer(configuration: config)

let session = ChatSession(
    container,
    generateParameters: .init(
        maxTokens: 2048,
        temperature: 0.0,
        repetitionPenalty: 1.1,  // required, see above
        repetitionContextSize: 64
    )
)
// `prompt` is the full task prompt with the transcript filled in, as in the Python example.
let summary = try await session.respond(to: prompt)

Conversion procedure

The source HF repo ships an Unsloth save artifact with two non-standard features that block mainline mlx-lm from converting directly:

  1. Triple-nested key prefix: weights are saved as model.language_model.language_model.language_model.X instead of model.X. mlx-lm's loader can't map these.
  2. Unused vision tower: the base is a Qwen3.5 VL model, but the fine-tune used no vision data. The vision tower (297 weight tensors) is dead weight on disk.

The conversion script remaps text keys, drops vision tensors, flattens the VL config wrapper into a flat qwen3_5 text config, sets tie_word_embeddings: true (no lm_head in source), then runs mlx_lm.convert:

# Step 1: remap (see tools/convert-summarizer/remap_weights.py in the transkrip repo)
python remap_weights.py \
  --src ~/.cache/huggingface/hub/models--acul3--qwen3.5-2b-id-meeting-summarizer/snapshots/<hash> \
  --dst ./qwen35-2b-id-text-only-bf16

# Step 2: quantize with mainline mlx-lm
mlx_lm.convert \
  --hf-path ./qwen35-2b-id-text-only-bf16 \
  --mlx-path ./qwen35-2b-id-meeting-mlx-q4 \
  -q --q-bits 4 --q-group-size 64
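
The key remapping in step 1 can be sketched as follows. This is not the actual remap_weights.py, which is not reproduced here. It only illustrates stripping the triple-nested prefix and dropping vision tensors. The `"visual"` marker used to detect vision-tower keys is an assumption about the checkpoint's naming, and the function names are hypothetical.

```python
NESTED_PREFIX = "model.language_model.language_model.language_model."
VISION_MARKER = "visual"  # assumption: vision-tower keys contain this substring


def remap_key(key: str):
    """Return the flattened text-model key, or None if the tensor is dropped."""
    if VISION_MARKER in key:
        return None  # unused vision tower: drop
    if key.startswith(NESTED_PREFIX):
        return "model." + key[len(NESTED_PREFIX):]
    return key  # already in the expected layout


def remap_state_dict(weights: dict) -> dict:
    """Apply remap_key to every entry, skipping dropped tensors."""
    remapped = {}
    for key, tensor in weights.items():
        new_key = remap_key(key)
        if new_key is not None:
            remapped[new_key] = tensor
    return remapped
```

The config changes described above (flattening the VL wrapper and setting tie_word_embeddings) are a separate step and are not covered by this sketch.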

License

apache-2.0 (inherited from source).

Acknowledgements
