Qwen3.5-2B	Qwen3.5-2B-NVFP4
Instruct (Non-Thinking) Mode
MMLU-Pro	55.3	54.5
MMLU-Redux	69.2	67.8
C-Eval	65.2	63.6
SuperGPQA	30.4	30.1
IFEval	61.2	59.5
MMMLU	56.9	55.4
Knowledge & STEM (Thinking)
MMLU-Pro	66.5	65.3
MMLU-Redux	79.6	77.6
C-Eval	73.2	72.2
SuperGPQA	37.5	36.8
GPQA	51.6	50.7
Instruction Following (Thinking）
IFEval	78.6	77.2
IFBench	41.3	40.8
MultiChallenge	33.7	33.2
Long Context (Thinking）
AA-LCR	25.6	25.2
LongBench v2	38.7	38.1
Reasoning (Thinking）
HMMT Feb 25	22.9	22.6
HMMT Nov 25	19.6	19.4
General Agent (Thinking）
BFCL-V4	43.6	42.8
TAU2-Bench	48.8	48.1
Multilingualism (Thinking）
MMMLU	63.1	61.9
MMLU-ProX	52.3	51.3
NOVA-63	46.4	45.6
INCLUDE	55.4	54.0
Global PIQA	69.3	66.7
PolyMATH	26.1	25.2
WMT24++	45.8	44.9
MAXIFE	60.6	59.5

* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
* MMLU-ProX: we report the averaged accuracy on 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
* MAXIFE: we report the accuracy on English + multilingual original prompts (totally 23 settings).
* Experimental settings: top_p=0.95, top_k=20, presence_penalty=1.5, and temperature=1.0 were used.
* Empty cells (--) indicate scores not yet available or not applicable.

	Qwen3.5-2B	Qwen3.5-2B-NVFP4
STEM and Puzzle
MMMU	64.2/64.2	64.2/64.2
MMMU-Pro	50.3/47.7	50.3/47.7
Mathvista(mini)	76.7/73.9	76.7/73.9
DynaMath	73.6/69.6	73.6/69.6
ZEROBench	1/0	1/0
ZEROBench_sub	17.1/18.6	17.1/18.6
VlmsAreBlind	75.8/74.3	75.8/74.3
General VQA
RealWorldQA	74.5/71.2	74.5/71.2
MMStar	71.7/68.0	71.7/68.0
MMBench_EN-DEV-v1.1	83.3/81.3	83.3/81.3
SimpleVQA	38.5/39.5	38.5/39.5
HallusionBench	58.0/51.3	58.0/51.3
Text Recognition and Document Understanding
MMLongBench-Doc	45.4/38.8	45.4/38.8
AI2D_TEST	83.3/81.5	83.3/81.5
CC-OCR	72.9/75.8	72.9/75.8
OmniDocBench1.5	79.8/80.9	79.8/80.9
CharXiv(RQ)	58.8/52.6	58.8/52.6
OCRBench	84.5/85.4	84.5/85.4
Spatial Intelligence
RefCOCO(avg)	84.8/84.3	84.8/84.3
CountBench	91.4/86.8	91.4/86.8
ODInW13	35.9/40.5	35.9/40.5
ERQA	43.8/33.0	43.8/33.0
EmbSpatialBench	77.9/66.4	77.9/66.4
RefSpatialBench	32.9/30.0	32.9/30.0
Hypersim	12.4/12.4	12.4/12.4
SUNRGBD	28.7/25.6	28.7/25.6
Nuscene	6.9/8.5	6.9/8.5
Video Understanding
VideoMME_{(w sub.)}	75.6/--	75.6/--
VideoMME_{(w/o sub.)}	69.0/--	69.0/--
VideoMMMU	62.1/--	62.1/--
MLVU	76.2/--	76.2/--
MVBench	64.9/--	64.9/--
LVBench	57.1/--	57.1/--
MMVU	48.6/--	48.6/--
Visual Agent
ScreenSpot Pro	--/54.5	--/54.5
Medical VQA
SLAKE	74.4/67.5	74.4/67.5
PMC-VQA	48.8/54.0	48.8/54.0
MedXpertQA-MM	26.9/19.1	26.9/19.1

* Scores of Qwen3.5 models are reported as Thinking / Non-thinking.
* MathVision: our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.
* Experimental settings: For the Video benchmarks, we used top_p=0.95, top_k=20, presence_penalty=1.5, and temperature=1.0. All other benchmarks adopted the same hyperparameter configuration but with temperature=0.6 under the thinking mode. Under the no-thinking mode, the inference hyperparameters were set to top_p=0.8, top_k=20, presence_penalty=1.5, and temperature=0.7.
* Empty cells (--) indicate scores not yet available or not applicable.