metadata
library_name: transformers
language:
  - en
pipeline_tag: text-generation
model-index:
  - name: Mellum2 Instruct
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: livecodebench
          name: LiveCodeBench v6
        metrics:
          - name: pass@1
            type: pass@1
            value: 37.2
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: evalplus
          name: EvalPlus (HumanEval+ / MBPP+ mean)
        metrics:
          - name: pass@1
            type: pass@1
            value: 78.4
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: multipl-e
          name: MultiPL-E (7 languages)
        metrics:
          - name: pass@1
            type: pass@1
            value: 67.1
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: bfcl
          name: BFCL v3
        metrics:
          - name: accuracy
            type: acc
            value: 66.3
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: bfcl
          name: BFCL v4 (macro-avg of 5 subtasks)
        metrics:
          - name: accuracy
            type: acc
            value: 44.2
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: aime
          name: AIME 2025+2026 (mean, 30 questions each)
        metrics:
          - name: exact match
            type: exact_match
            value: 41.7
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: gsm-plus
          name: GSM-Plus
        metrics:
          - name: exact match
            type: exact_match
            value: 80.5
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: mmlu-redux
          name: MMLU-Redux
        metrics:
          - name: accuracy
            type: acc
            value: 78.1
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: gpqa
          name: GPQA Diamond
        metrics:
          - name: accuracy
            type: acc
            value: 40.9
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: ifeval
          name: IFEval (prompt-level strict accuracy)
        metrics:
          - name: accuracy
            type: acc
            value: 75.8
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: mixeval
          name: MixEval
        metrics:
          - name: accuracy
            type: acc
            value: 62.2
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: bs-bench
          name: BS-Bench (detection rate)
        metrics:
          - name: detection rate
            type: detection_rate
            value: 18
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: harmbench
          name: HarmBench (harmful rate, lower is better)
        metrics:
          - name: harmful rate
            type: harmful_rate
            value: 23.1
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: xstest
          name: XSTest (safe compliance)
        metrics:
          - name: safe compliance
            type: safe_compliance
            value: 81.2
            verified: false
license: apache-2.0

Mellum2 Instruct

Use this model when you want direct, low-latency answers without an explicit chain of thought: interactive chat, code assistance, tool use, and instruction following. If you need explicit reasoning before the answer (complex debugging, planning, multi-step agentic flows), use Mellum2 Thinking instead.

Mellum2 Instruct Highlights

Mellum2 Instruct is a post-trained assistant model from JetBrains.

The model uses a Mixture-of-Experts architecture with 64 experts and activates 8 experts per token. It uses a combination of sliding-window and full attention layers, with a context length of 131,072 tokens.

It is produced from Mellum2-12B-A2.5B-Base by supervised fine-tuning followed by reinforcement learning with verifiable rewards (RLVR) on math, executable coding, tool use, instruction following, reasoning, and knowledge tasks. Mellum2 Instruct answers directly, without an externalized chain of thought.

Mellum2 Model Family

This repository contains one checkpoint from the Mellum2 family.

| Checkpoint | Description |
|---|---|
| Base Pretrain | Base checkpoint before long-context extension |
| Base | Final base model |
| Instruct SFT | Supervised instruction-tuned checkpoint |
| Thinking SFT | Supervised thinking checkpoint |
| Instruct | RL-tuned instruction model |
| Thinking | RL-tuned thinking model |

Model Overview

Mellum2 Instruct has the following features:

  • Number of Layers: 28
  • Hidden Size: 2304
  • Intermediate Size: 7168
  • MoE Intermediate Size: 896
  • Number of Experts: 64
  • Number of Activated Experts: 8
  • Number of Attention Heads (GQA): 32 for Q and 4 for KV
  • Context Length: 131,072
  • Sliding Window: 1,024
  • Vocabulary Size: 98,304
  • Precision: bfloat16
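
A few derived quantities follow directly from the figures above. The sketch below is plain arithmetic on the listed values; the variable names are illustrative and are not taken from the model's configuration file.

```python
# Values copied from the feature list above.
num_experts = 64
num_active_experts = 8
moe_intermediate_size = 896
intermediate_size = 7168
num_q_heads = 32
num_kv_heads = 4
context_length = 131_072
sliding_window = 1_024

# Fraction of the experts that run for each token.
active_expert_fraction = num_active_experts / num_experts

# Combined intermediate width of the experts activated for one token.
active_moe_width = num_active_experts * moe_intermediate_size

# Grouped-query attention: number of query heads sharing each key/value head.
q_heads_per_kv_head = num_q_heads // num_kv_heads

# Number of sliding-window spans that fit into the full context.
windows_per_context = context_length // sliding_window

print(active_expert_fraction, active_moe_width, q_heads_per_kv_head, windows_per_context)
```

One in eight experts is active per token, and the combined width of the eight active experts (8 × 896) equals the listed Intermediate Size of 7168.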

Serving with vLLM

# Without tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct --max-model-len 131072

# With tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
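
Once the server runs with tool calling enabled, a client request might look like the sketch below. It assumes the vLLM default address `http://localhost:8000/v1` and a placeholder API key. `get_weather`, `TOOL_REGISTRY`, `build_request`, and `run_tool_call` are hypothetical names invented for this example and are not part of the model or of vLLM.

```python
import json

MODEL = "JetBrains/Mellum2-12B-A2.5B-Instruct"


def get_weather(city: str) -> str:
    """Hypothetical tool used only for illustration."""
    return f"Sunny in {city}"


# Maps tool names exposed to the model onto local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

# Tool schema in the OpenAI function-calling format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def build_request(user_text: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "tools": TOOLS,
        "tool_choice": "auto",
    }


def run_tool_call(name: str, arguments_json: str) -> str:
    """Run one tool call; the API delivers the arguments as a JSON string."""
    return TOOL_REGISTRY[name](**json.loads(arguments_json))


def main() -> None:
    # Imported here so the helpers above work without the openai package.
    from openai import OpenAI

    # Assumed address of the local vLLM server; the key is a placeholder.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.chat.completions.create(
        **build_request("What is the weather in Prague?")
    )
    # tool_calls is None when the model answers in plain text instead.
    for call in response.choices[0].message.tool_calls or []:
        result = run_tool_call(call.function.name, call.function.arguments)
        print(call.function.name, "->", result)
```

Call `main()` after the server has started. In a full tool loop, the result would be appended to the conversation as a `tool` message and sent back to the model for a final answer.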

Quickstart

Text-Only Input

from openai import OpenAI
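# The client reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment.
# For a local vLLM server, OPENAI_BASE_URL is typically http://localhost:8000/v1.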
client = OpenAI()

messages = [
    {"role": "user", "content": "Write a Python function to reverse a string."},
]

chat_response = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Instruct",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    },
)
print("Chat response:", chat_response)
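
For interactive use it is often more convenient to stream the reply and keep only the generated text. The following is a minimal sketch, reusing the sampling settings from the example above. `collect_stream`, `stream_reply`, and the stand-in chunks are hypothetical helpers for this example, and `max_tokens=1024` is an arbitrary choice.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print streamed text as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices or no text (for example the final one).
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            print(text, end="", flush=True)
    print()
    return "".join(parts)


def stream_reply(prompt: str) -> str:
    """Send one prompt to the server and stream the answer."""
    # Imported here so collect_stream can be tried without the openai package.
    from openai import OpenAI

    client = OpenAI()  # configured by environment variables
    stream = client.chat.completions.create(
        model="JetBrains/Mellum2-12B-A2.5B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
        temperature=0.6,
        top_p=0.95,
        extra_body={"top_k": 20},
        stream=True,
    )
    return collect_stream(stream)


def _fake_chunk(text):
    """Stand-in with the same attribute layout as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline stand-ins so the collector can be exercised without a server.
demo_chunks = [
    _fake_chunk("Hello"),
    _fake_chunk(None),
    _fake_chunk(", world"),
    SimpleNamespace(choices=[]),
]
```

With a running server, `stream_reply("Write a Python function to reverse a string.")` prints the answer incrementally and returns it as one string.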

Evaluation

Post-training evaluation for the instruct (no-thinking) variants. All values are percentages; higher is better except for HarmBench, where lower is better. All values are self-reported by JetBrains.

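| Benchmark | Mellum2 Instruct SFT | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | OLMo-3 (7B) | Ministral 3 (14B) | Seed-Coder (8B) |
|---|---|---|---|---|---|---|---|
| **Coding** | | | | | | | |
| LiveCodeBench v6 | 30.9 | 37.2 | 51.0 | 63.7 | 28.2 | 42.4 | 28.1 |
| EvalPlus | 76.2 | 78.4 | 69.4 | 71.8 | 67.3 | 74.1 | 73.8 |
| MultiPL-E | 64.6 | 67.1 | 51.0 | 67.1 | 36.1 | 71.5 | 77.0 |
| **Tool Use** | | | | | | | |
| BFCL v4 | 31.8 | 44.2 | 52.0 | 60.6 | 19.8 | 38.8 | |
| BFCL v3 | 43.1 | 66.3 | 64.1 | 70.5 | 41.9 | 52.7 | |
| **Math** | | | | | | | |
| AIME | 29.9 | 41.7 | 38.3 | 58.3 | 40.0 | 33.3 | 0.0 |
| GSM-Plus | 73.0 | 80.5 | 85.2 | 87.9 | 85.8 | 86.6 | 50.4 |
| **Knowledge** | | | | | | | |
| MMLU-Redux | 77.4 | 78.1 | 87.5 | 91.1 | 71.8 | 85.9 | 38.1 |
| GPQA Diamond | 38.9 | 40.9 | 76.8 | 79.8 | 40.9 | 58.6 | 20.2 |
| **Conversational** | | | | | | | |
| IFEval | 69.3 | 75.8 | 82.1 | 83.9 | 83.2 | 67.3 | 56.2 |
| JetBrains pairwise | 66.7 | 68.1 | 60.6 | 77.8 | 44.4 | 72.4 | 43.0 |
| MixEval | 62.9 | 62.2 | 65.9 | 71.1 | 59.4 | 71.2 | 37.2 |
| BS-Bench | 24.0 | 18.0 | 56.9 | 61.0 | 22.0 | 9.0 | 5.0 |
| **Safety** | | | | | | | |
| HarmBench (↓) | 8.4 | 23.1 | 20.3 | 20.9 | 14.7 | 56.5 | 40.0 |
| XSTest | 78.3 | 81.2 | 93.2 | 91.2 | 91.2 | 96.8 | 86.3 |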

Notes:

  • EvalPlus is the mean of HumanEval+ and MBPP+.
  • AIME is the mean of AIME 2025 and AIME 2026 (30 questions each).
  • BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, memory.
  • JetBrains pairwise is win rate against Qwen2.5-7B-Instruct on an internal benchmark.
  • A missing value indicates that the model lacks native tool calling.
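
To see what the RL stage changes relative to the SFT checkpoint, the two Mellum2 columns of the evaluation table can be compared directly. The sketch below copies those scores and reports a sign-adjusted difference, so that a positive number always means the RL-tuned model did better. `SCORES`, `LOWER_IS_BETTER`, and `improvement` are illustrative names, and the sign adjustment is a convention chosen for this example.

```python
# (Instruct SFT, Instruct) scores copied from the evaluation table.
SCORES = {
    "LiveCodeBench v6": (30.9, 37.2),
    "EvalPlus": (76.2, 78.4),
    "MultiPL-E": (64.6, 67.1),
    "BFCL v4": (31.8, 44.2),
    "BFCL v3": (43.1, 66.3),
    "AIME": (29.9, 41.7),
    "GSM-Plus": (73.0, 80.5),
    "MMLU-Redux": (77.4, 78.1),
    "GPQA Diamond": (38.9, 40.9),
    "IFEval": (69.3, 75.8),
    "JetBrains pairwise": (66.7, 68.1),
    "MixEval": (62.9, 62.2),
    "BS-Bench": (24.0, 18.0),
    "HarmBench": (8.4, 23.1),
    "XSTest": (78.3, 81.2),
}

# Benchmarks where a lower score is the better outcome.
LOWER_IS_BETTER = {"HarmBench"}


def improvement(name: str) -> float:
    """Percentage-point change from SFT to RL; positive means RL is better."""
    sft, rl = SCORES[name]
    delta = round(rl - sft, 1)
    return -delta if name in LOWER_IS_BETTER else delta


for name in SCORES:
    print(f"{name:20s} {improvement(name):+.1f}")
```

The largest gains appear on the tool-use and math rows, while MixEval, BS-Bench, and HarmBench move in the unfavorable direction.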

For more details, see the Mellum2 Technical Report.

License

Released under the Apache 2.0 license.