Libraries

How to use acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4

Run Hermes

hermes

MLX LM

How to use acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
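
The same request can be made from Python. The sketch below uses only the standard library and rests on two assumptions: the server is listening on port 8080 (the mlx_lm.server default), and it accepts a `repetition_penalty` field in the request body. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

# Assumptions: mlx_lm.server is running locally on port 8080 and honours
# `repetition_penalty` in the request body. Adjust if your setup differs.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4"


def build_chat_request(user_content, max_tokens=2048):
    """Build (but do not send) a chat-completions request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_content}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
        # The model card asks for this penalty on the Q4 weights.
        "repetition_penalty": 1.1,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_content):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_content)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the reply. Building the request separately from sending it makes the payload easy to inspect before any network call.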

qwen3.5-2b-id-meeting-summarizer (MLX 4-bit)

MLX-quantized port of acul3/qwen3.5-2b-id-meeting-summarizer for on-device inference on Apple Silicon (Mac, iPad, iPhone via mlx-swift-lm).

Quantization: Q4, group_size=64 (4.503 bits/weight effective)
Size: 1.0 GB on disk
Speed: ~30 tok/s measured on an M2 Pro Mac; estimated ~10–15 tok/s on iPhone 14 Pro.
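
As a rough check on the bits/weight figure: the calculation below assumes group-wise affine quantization in which every group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the 4-bit values. That storage layout is an assumption here, and `effective_bits_per_weight` is an illustrative helper rather than an mlx-lm function.

```python
def effective_bits_per_weight(q_bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight for group-wise affine quantization.

    Each group of `group_size` weights carries one scale and one bias,
    so their cost is amortized across the group.
    """
    overhead = (scale_bits + bias_bits) / group_size
    return q_bits + overhead


print(effective_bits_per_weight(4, 64))  # 4.5
```

The reported 4.503 is slightly above 4.5, which would be consistent with a few tensors being kept unquantized, though the card does not say which.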

Three tasks, same model

The fine-tune supports three prompt-selected tasks. See the source README for verbatim training prompts.

| Task | Format pass | Notes |
|---|---|---|
| `paragraph` | 100% | single-paragraph Indonesian summary |
| `title_generator` | 100% | takes a summary as input, returns ≤7-word title |
| `rich_summary` | 50% (raw) / 85–95% (with guards) | full markdown breakdown with Overview / Conclusion / Action Items |
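
The card does not define how "format pass" is scored or what the guards do. As one possible reading, a structural check per task could look like the sketch below. The function `passes_format` and the section names in `RICH_SECTIONS` are hypothetical: they are taken from the Notes column, and the headings the model actually emits may be worded differently.

```python
import re

# Hypothetical section names, taken from the Notes column of the table.
# The model's real headings may differ (for example, Indonesian wording).
RICH_SECTIONS = ("overview", "conclusion", "action items")


def passes_format(task, text):
    """Return True if `text` looks structurally valid for `task`."""
    text = text.strip()
    if not text:
        return False
    if task == "paragraph":
        # A single paragraph contains no blank line.
        return re.search(r"\n\s*\n", text) is None
    if task == "title_generator":
        # One line, at most seven whitespace-separated words.
        return "\n" not in text and len(text.split()) <= 7
    if task == "rich_summary":
        lowered = text.lower()
        return all(name in lowered for name in RICH_SECTIONS)
    raise ValueError(f"unknown task: {task}")
```

A caller could regenerate or fall back to the `paragraph` task when `passes_format("rich_summary", output)` returns False. That retry policy is likewise an assumption, not the author's described guard.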

Critical: always use `repetition_penalty=1.1`

Q4 quantization shifts the next-token distribution just enough that pure greedy decoding (do_sample=False, temp=0) falls into local repetition loops after the first few sentences. The source model at bf16/fp16 doesn't have this problem.

A repetition penalty of 1.1 fully fixes the loop without degrading content quality. This is required, not optional.
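
To see why a small penalty is enough to break a greedy loop, here is a pure-Python sketch of the commonly used formulation: logits of tokens that have already appeared are divided by the penalty when positive and multiplied by it when negative. This is illustrative only. It is not mlx-lm's code, and mlx-lm applies the penalty over a window of recent tokens rather than the whole history.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty=1.1):
    """Return a copy of `logits` with already-seen tokens made less likely."""
    out = list(logits)
    for tok in set(previous_tokens):
        if out[tok] > 0:
            out[tok] /= penalty  # shrink positive logits toward zero
        else:
            out[tok] *= penalty  # push negative logits further down
    return out


def argmax(values):
    """Index of the largest value (what greedy decoding picks)."""
    return max(range(len(values)), key=values.__getitem__)


# Token 0 narrowly wins under plain greedy decoding...
logits = [2.0, 1.9, -1.0]
# ...but loses once it has been penalized for appearing before.
penalized = apply_repetition_penalty(logits, previous_tokens=[0, 2])
print(argmax(logits), argmax(penalized))
```

When the gap between the top two candidates is small, as in the near-tie above, a penalty of 1.1 is enough to flip the choice away from the repeated token while leaving clearly dominant tokens untouched.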

Usage (Python, mlx-lm)

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4")

PROMPT = """You are a helpful assistant expert in writing.
You answer only with the result without explanation or pretext.
Please follow the instructions word by word obediently.
<transcript>
Audio Transcript:
{transcript}
</transcript>

Analyze and generate a summary based on the audio transcript above written in 1 paragraph.
... [see source README for the full verbatim prompt]
"""

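# Placeholder input: replace with the meeting transcript to summarize.
my_transcript = "..."

# Qwen3.5's chat template emits <think></think> by default, but the fine-tune
# was trained on non-thinking outputs, so thinking is disabled below.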
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": PROMPT.format(transcript=my_transcript)}],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,
)

out = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=make_sampler(temp=0.0),
    logits_processors=make_logits_processors(repetition_penalty=1.1),  # required
)
print(out)

Usage (Swift, mlx-swift-lm)

import MLXLLM
import MLXLMCommon

let config = ModelConfiguration(
    id: "acul3/qwen3.5-2b-id-meeting-summarizer-mlx-q4",
    defaultPrompt: "Ringkas teks berikut."
)
let container = try await #huggingFaceLoadModelContainer(configuration: config)

let session = ChatSession(
    container,
    generateParameters: .init(
        maxTokens: 2048,
        temperature: 0.0,
        repetitionPenalty: 1.1,  // required, see above
        repetitionContextSize: 64
    )
)
let summary = try await session.respond(to: prompt)

Conversion procedure

The source HF repo ships an Unsloth save artifact with two non-standard features that block mainline mlx-lm from converting directly:

Triple-nested key prefix: weights are saved as model.language_model.language_model.language_model.X instead of model.X. mlx-lm's loader can't map these.
Unused vision tower: the base is a Qwen3.5 VL model, but the fine-tune used no vision data. The vision tower (297 weight tensors) is dead weight on disk.

The conversion script remaps text keys, drops vision tensors, flattens the VL config wrapper into a flat qwen3_5 text config, sets tie_word_embeddings: true (no lm_head in source), then runs mlx_lm.convert:

# Step 1: remap (see tools/convert-summarizer/remap_weights.py in the transkrip repo)
python remap_weights.py \
  --src ~/.cache/huggingface/hub/models--acul3--qwen3.5-2b-id-meeting-summarizer/snapshots/<hash> \
  --dst ./qwen35-2b-id-text-only-bf16

# Step 2: quantize with mainline mlx-lm
mlx_lm.convert \
  --hf-path ./qwen35-2b-id-text-only-bf16 \
  --mlx-path ./qwen35-2b-id-meeting-mlx-q4 \
  -q --q-bits 4 --q-group-size 64
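
The remapping in step 1 can be sketched as follows. This is not the contents of remap_weights.py, which is not reproduced here; it is a minimal illustration of the transformations described above. Several details are assumptions: that vision tensors can be recognized by `visual` in their key, that the text settings sit under a `text_config` key in the VL config, and that "flat qwen3_5 text config" means `model_type` is set to `qwen3_5`.

```python
NESTED_PREFIX = "model.language_model.language_model.language_model."
# Hypothetical marker for vision-tower tensors; check the real key names.
VISION_MARKER = "visual"


def remap_keys(keys):
    """Map source tensor names to text-only names, dropping vision tensors."""
    mapping = {}
    for key in keys:
        if VISION_MARKER in key:
            continue  # the fine-tune used no vision data
        if key.startswith(NESTED_PREFIX):
            # Collapse the triple-nested prefix down to "model."
            mapping[key] = "model." + key[len(NESTED_PREFIX):]
        else:
            mapping[key] = key
    return mapping


def flatten_config(vl_config):
    """Lift the nested text config to the top level (assumed layout)."""
    text_cfg = dict(vl_config["text_config"])  # assumed key name
    text_cfg["model_type"] = "qwen3_5"  # assumed meaning of "flat qwen3_5"
    text_cfg["tie_word_embeddings"] = True  # the source has no lm_head
    return text_cfg
```

In a real script, `remap_keys` would be applied to the tensor names of each safetensors shard before the tensors are re-saved, and the result of `flatten_config` would be written out as the new config.json.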

License

apache-2.0 (inherited from source).

Acknowledgements

Fine-tune: acul3/qwen3.5-2b-id-meeting-summarizer
Base: unsloth/Qwen3.5-2B → upstream Qwen/Qwen3.5-2B
Quantization: mlx-lm
