--- library_name: transformers language: - en pipeline_tag: text-generation model-index: - name: Mellum2 Instruct results: - task: type: text-generation name: Text Generation dataset: type: livecodebench name: LiveCodeBench v6 metrics: - name: pass@1 type: pass@1 value: 37.2 verified: false - task: type: text-generation name: Text Generation dataset: type: evalplus name: EvalPlus (HumanEval+ / MBPP+ mean) metrics: - name: pass@1 type: pass@1 value: 78.4 verified: false - task: type: text-generation name: Text Generation dataset: type: multipl-e name: MultiPL-E (7 languages) metrics: - name: pass@1 type: pass@1 value: 67.1 verified: false - task: type: text-generation name: Text Generation dataset: type: bfcl name: BFCL v3 metrics: - name: accuracy type: acc value: 66.3 verified: false - task: type: text-generation name: Text Generation dataset: type: bfcl name: BFCL v4 (macro-avg of 5 subtasks) metrics: - name: accuracy type: acc value: 44.2 verified: false - task: type: text-generation name: Text Generation dataset: type: aime name: "AIME 2025+2026 (mean, 30 questions each)" metrics: - name: exact match type: exact_match value: 41.7 verified: false - task: type: text-generation name: Text Generation dataset: type: gsm-plus name: GSM-Plus metrics: - name: exact match type: exact_match value: 80.5 verified: false - task: type: text-generation name: Text Generation dataset: type: mmlu-redux name: MMLU-Redux metrics: - name: accuracy type: acc value: 78.1 verified: false - task: type: text-generation name: Text Generation dataset: type: gpqa name: GPQA Diamond metrics: - name: accuracy type: acc value: 40.9 verified: false - task: type: text-generation name: Text Generation dataset: type: ifeval name: IFEval (prompt-level strict accuracy) metrics: - name: accuracy type: acc value: 75.8 verified: false - task: type: text-generation name: Text Generation dataset: type: mixeval name: MixEval metrics: - name: accuracy type: acc value: 62.2 verified: false - task: type: text-generation name: Text Generation dataset: type: bs-bench name: BS-Bench (detection rate) metrics: - name: detection rate type: detection_rate value: 18.0 verified: false - task: type: text-generation name: Text Generation dataset: type: harmbench name: "HarmBench (harmful rate, lower is better)" metrics: - name: harmful rate type: harmful_rate value: 23.1 verified: false - task: type: text-generation name: Text Generation dataset: type: xstest name: XSTest (safe compliance) metrics: - name: safe compliance type: safe_compliance value: 81.2 verified: false license: apache-2.0 --- Mellum # Mellum2 Instruct > [!Note] > Use this model when you want direct, low-latency answers without an explicit chain of thought — interactive chat, code assistance, tool use, and instruction following. If you need explicit reasoning before the answer (complex debugging, planning, multi-step agentic flows), use [Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking) instead. ## Mellum2 Instruct Highlights Mellum2 Instruct is a post-trained assistant model trained by JetBrains. The model uses a Mixture-of-Experts architecture with 64 experts and activates 8 experts per token. It uses a combination of sliding-window and full attention layers, with a context length of 131,072 tokens. It is produced from [`Mellum2-12B-A2.5B-Base`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Base) by supervised fine-tuning followed by reinforcement learning with verifiable rewards (RLVR) on math, executable coding, tool use, instruction following, reasoning, and knowledge tasks. Mellum2 Instruct answers directly, without an externalized chain of thought. ## Mellum2 Model Family This repository contains one checkpoint from the Mellum2 family. | Checkpoint | Description | |---|---| | [Base Pretrain](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Base-Pretrain) | Base checkpoint before long-context extension | | [Base](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Base) | Final base model | | [Instruct SFT](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-SFT) | Supervised instruction-tuned checkpoint | | [Thinking SFT](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-SFT) | Supervised thinking checkpoint | | Instruct | RL-tuned instruction model | | [Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking) | RL-tuned thinking model | ## Model Overview **Mellum2 Instruct** has the following features: - Number of Layers: 28 - Hidden Size: 2304 - Intermediate Size: 7168 - MoE Intermediate Size: 896 - Number of Experts: 64 - Number of Activated Experts: 8 - Number of Attention Heads (GQA): 32 for Q and 4 for KV - Context Length: 131,072 - Sliding Window: 1,024 - Vocabulary Size: 98,304 - Precision: bfloat16 ## Serving with vLLM ```sh # Without tool calling vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct --max-model-len 131072 # With tool calling vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \ --max-model-len 131072 \ --enable-auto-tool-choice \ --tool-call-parser hermes ``` ## Quickstart Text-Only Input ```python from openai import OpenAI # Configured by environment variables client = OpenAI() messages = [ {"role": "user", "content": "Write a Python function to reverse a string."}, ] chat_response = client.chat.completions.create( model="JetBrains/Mellum2-12B-A2.5B-Instruct", messages=messages, max_tokens=81920, temperature=0.6, top_p=0.95, extra_body={ "top_k": 20, }, ) print("Chat response:", chat_response) ``` ## Evaluation Post-training evaluation for the instruct (no-thinking) variants. All values are percentages; higher is better except HarmBench, where lower is better. All values self-reported by JetBrains. | Benchmark | Mellum2 Instruct SFT | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | OLMo-3 (7B) | Ministral 3 (14B) | Seed-Coder (8B) | | :----------------- | --------------------: | ----------------: | -----------: | -----------: | ----------: | ----------------: | --------------: | | **Coding** | | | | | | | | | LiveCodeBench v6 | 30.9 | 37.2 | 51.0 | 63.7 | 28.2 | 42.4 | 28.1 | | EvalPlus | 76.2 | 78.4 | 69.4 | 71.8 | 67.3 | 74.1 | 73.8 | | MultiPL-E | 64.6 | 67.1 | 51.0 | 67.1 | 36.1 | 71.5 | 77.0 | | **Tool Use** | | | | | | | | | BFCL v4 | 31.8 | 44.2 | 52.0 | 60.6 | 19.8 | 38.8 | — | | BFCL v3 | 43.1 | 66.3 | 64.1 | 70.5 | 41.9 | 52.7 | — | | **Math** | | | | | | | | | AIME | 29.9 | 41.7 | 38.3 | 58.3 | 40.0 | 33.3 | 0.0 | | GSM-Plus | 73.0 | 80.5 | 85.2 | 87.9 | 85.8 | 86.6 | 50.4 | | **Knowledge** | | | | | | | | | MMLU-Redux | 77.4 | 78.1 | 87.5 | 91.1 | 71.8 | 85.9 | 38.1 | | GPQA Diamond | 38.9 | 40.9 | 76.8 | 79.8 | 40.9 | 58.6 | 20.2 | | **Conversational** | | | | | | | | | IFEval | 69.3 | 75.8 | 82.1 | 83.9 | 83.2 | 67.3 | 56.2 | | JetBrains pairwise | 66.7 | 68.1 | 60.6 | 77.8 | 44.4 | 72.4 | 43.0 | | MixEval | 62.9 | 62.2 | 65.9 | 71.1 | 59.4 | 71.2 | 37.2 | | BS-Bench | 24.0 | 18.0 | 56.9 | 61.0 | 22.0 | 9.0 | 5.0 | | **Safety** | | | | | | | | | HarmBench (↓) | 8.4 | 23.1 | 20.3 | 20.9 | 14.7 | 56.5 | 40.0 | | XSTest | 78.3 | 81.2 | 93.2 | 91.2 | 91.2 | 96.8 | 86.3 | Notes: - **EvalPlus** is the mean of HumanEval+ and MBPP+. - **AIME** is the mean of AIME 2025 and AIME 2026 (30 questions each). - **BFCL v4** is the macro-average of five subtasks: v1, v2, v3, web search, memory. - **JetBrains pairwise** is win rate against `Qwen2.5-7B-Instruct` on an internal benchmark. - `—` indicates the model lacks native tool calling. For more details, see the [Mellum2 Technical Report](https://arxiv.org/abs/2605.31268). ## License Released under the Apache 2.0 license.