qwen3.5-2b-id-meeting-summarizer (MLX 4-bit)
MLX-quantized port of acul3/qwen3.5-2b-id-meeting-summarizer for on-device inference on Apple Silicon (Mac, iPad, iPhone via mlx-swift-lm).
- Quantization: Q4, group_size=64 (4.503 bits/weight effective)
- Size: 1.0 GB on disk
- Speed: ~30 tok/s measured on an M2 Pro Mac. The iPhone 14 Pro figure of ~10–15 tok/s is an estimate, not a measurement.
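The effective bits/weight figure can be roughly reproduced from the quantization settings. The sketch below assumes that each quantization group stores one 16-bit scale and one 16-bit bias alongside its 4-bit weights; that storage layout is an assumption, not something stated in this card, and the helper name is hypothetical. The remaining ~0.003 bits presumably come from tensors left unquantized, which the sketch does not model.

```python
def effective_bits_per_weight(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Bits stored per quantized weight, including per-group metadata.

    overhead_bits assumes one 16-bit scale plus one 16-bit bias per group.
    """
    return bits + overhead_bits / group_size


# Q4 with group_size=64: 4 payload bits + 32/64 bits of scale/bias overhead.
print(effective_bits_per_weight(4, 64))  # 4.5
```

Under the same assumption, a smaller group size improves fidelity but raises the overhead: group_size=32 would cost 5.0 bits/weight.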
Three tasks, same model
The fine-tune supports three prompt-selected tasks. See the source README for verbatim training prompts.
| Task | Format pass | Notes |
|---|---|---|
| paragraph | 100% | single-paragraph Indonesian summary |
| title_generator | 100% | takes a summary as input, returns ≤7-word title |
| rich_summary | 50% (raw) / 85–95% (with guards) | full markdown breakdown with Overview / Conclusion / Action Items |
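This card does not say what the "guards" for rich_summary consist of. One plausible form is a post-generation check that the expected sections are present, so the caller can detect malformed output. The sketch below is an assumption along those lines: the function name is hypothetical, and it assumes the three section names appear literally in markdown heading lines.

```python
REQUIRED_SECTIONS = ("Overview", "Conclusion", "Action Items")


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section names that appear in no heading line."""
    headings = [
        line.strip().lstrip("#").strip().lower()
        for line in markdown_text.splitlines()
        if line.strip().startswith("#")
    ]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(name.lower() in heading for heading in headings)
    ]


good = "## Overview\nText\n## Conclusion\nText\n## Action Items\n- Item"
bad = "## Overview\nText that mentions Conclusion only in passing"
print(missing_sections(good))  # []
print(missing_sections(bad))   # ['Conclusion', 'Action Items']
```

An empty list means the output passed. What to do on failure (regenerate, fall back to the paragraph task, or surface an error) is left to the caller.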
Critical: always use repetition_penalty=1.1
Q4 quantization shifts the next-token distribution just enough that pure greedy decoding (do_sample=False, temp=0) falls into local repetition loops after the first few sentences. The source model at bf16/fp16 doesn't have this problem.
A repetition penalty of 1.1 fully fixes the loop without degrading content quality. This is required, not optional.
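To make the mechanism concrete, here is a minimal pure-Python sketch of the commonly used repetition-penalty rule: logits of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative. It illustrates the idea under that assumption and is not mlx-lm's actual code. The function names and the toy logits are hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Return a copy of logits with already-generated tokens made less likely."""
    penalized = list(logits)
    for token_id in set(seen_token_ids):
        value = penalized[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        penalized[token_id] = value / penalty if value > 0 else value * penalty
    return penalized


def argmax(values):
    """Index of the largest value, i.e. the greedy-decoding choice."""
    return max(range(len(values)), key=values.__getitem__)


# Toy example: token 0 narrowly wins under greedy decoding and was just emitted.
logits = [2.0, 1.9, -1.0]
print(argmax(logits))                                 # 0
print(argmax(apply_repetition_penalty(logits, [0])))  # 1
```

With a penalty as mild as 1.1, only near-ties like this one are flipped, which fits the observation that the loops disappear without a visible change in content.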
Usage (Python, mlx-lm)
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors
model, tokenizer = load("acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4")
PROMPT = """You are a helpful assistant expert in writing.
You answer only with the result without explanation or pretext.
Please follow the instructions word by word obediently.
<transcript>
Audio Transcript:
{transcript}
</transcript>
Analyze and generate a summary based on the audio transcript above written in 1 paragraph.
... [see source README for the full verbatim prompt]
"""
# Qwen3.5's chat template emits <think></think> by default, but the fine-tune
# was trained on non-thinking outputs, so thinking is disabled here.
# my_transcript must be a str holding the meeting transcript.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": PROMPT.format(transcript=my_transcript)}],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,
)
out = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=make_sampler(temp=0.0),
    logits_processors=make_logits_processors(repetition_penalty=1.1),  # required
)
print(out)
Usage (Swift, mlx-swift-lm)
import MLXLLM
import MLXLMCommon
let config = ModelConfiguration(
    id: "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4",
    defaultPrompt: "Ringkas teks berikut."
)
let container = try await #huggingFaceLoadModelContainer(configuration: config)
let session = ChatSession(
    container,
    generateParameters: .init(
        maxTokens: 2048,
        temperature: 0.0,
        repetitionPenalty: 1.1,  // required, see above
        repetitionContextSize: 64
    )
)
let summary = try await session.respond(to: prompt)
Conversion procedure
The source HF repo ships an Unsloth save artifact with two non-standard features that block mainline mlx-lm from converting directly:
- Triple-nested key prefix: weights are saved as model.language_model.language_model.language_model.X instead of model.X. mlx-lm's loader can't map these.
- Unused vision tower: the base is a Qwen3.5 VL model, but the fine-tune used no vision data. The vision tower (297 weight tensors) is dead weight on disk.
The conversion script remaps text keys, drops vision tensors, flattens the VL config wrapper into a flat qwen3_5 text config, sets tie_word_embeddings: true (no lm_head in source), then runs mlx_lm.convert:
# Step 1: remap (see tools/convert-summarizer/remap_weights.py in the transkrip repo)
python remap_weights.py \
    --src ~/.cache/huggingface/hub/models--acul3--qwen3.5-2b-id-meeting-summarizer/snapshots/<hash> \
    --dst ./qwen35-2b-id-text-only-bf16
# Step 2: quantize with mainline mlx-lm
mlx_lm.convert \
    --hf-path ./qwen35-2b-id-text-only-bf16 \
    --mlx-path ./qwen35-2b-id-meeting-mlx-q4 \
    -q --q-bits 4 --q-group-size 64
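The key-remapping part of step 1 can be sketched on plain key names as follows. This is not the actual remap_weights.py: the function name is hypothetical, and the "visual" substring used to spot vision-tower tensors is an assumption about how they are named in the checkpoint. The config flattening and the tie_word_embeddings change are not covered.

```python
NESTED_PREFIX = "model.language_model.language_model.language_model."
FLAT_PREFIX = "model."
VISION_MARKER = "visual"  # assumption: vision-tower keys contain this substring


def remap_keys(weights: dict) -> dict:
    """Drop vision tensors and collapse the triple-nested text prefix."""
    remapped = {}
    for key, tensor in weights.items():
        if VISION_MARKER in key:
            continue  # the vision tower is unused by the fine-tune
        if key.startswith(NESTED_PREFIX):
            key = FLAT_PREFIX + key[len(NESTED_PREFIX):]
        remapped[key] = tensor
    return remapped


# Placeholder strings stand in for real tensors.
example = {
    "model.language_model.language_model.language_model.layers.0.mlp.up_proj.weight": "t0",
    "model.visual.blocks.0.attn.qkv.weight": "t1",
}
print(remap_keys(example))  # {'model.layers.0.mlp.up_proj.weight': 't0'}
```

In a real run the dictionary values would be the loaded tensors, and the result would be written back out as safetensors before calling mlx_lm.convert.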
License
apache-2.0 (inherited from source).
Acknowledgements
- Fine-tune: acul3/qwen3.5-2b-id-meeting-summarizer
- Base: unsloth/Qwen3.5-2B → upstream Qwen/Qwen3.5-2B
- Quantization: mlx-lm