mratsim commited on Feb 17

Commit

2b24661

verified ·

1 Parent(s): b9912a5

Upload folder using huggingface_hub

Browse files

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50) hide show

.gitattributes +2 -0
README.md +341 -0
calibrate_software_engineer.yaml +442 -0
chat_template.jinja +159 -0
config.json +223 -0
configuration_minimax_m2.py +200 -0
generation_config.json +8 -0
merges.txt +0 -0
model-00000-of-00126.safetensors +3 -0
model-00001-of-00126.safetensors +3 -0
model-00002-of-00126.safetensors +3 -0
model-00003-of-00126.safetensors +3 -0
model-00004-of-00126.safetensors +3 -0
model-00005-of-00126.safetensors +3 -0
model-00006-of-00126.safetensors +3 -0
model-00007-of-00126.safetensors +3 -0
model-00008-of-00126.safetensors +3 -0
model-00009-of-00126.safetensors +3 -0
model-00010-of-00126.safetensors +3 -0
model-00011-of-00126.safetensors +3 -0
model-00012-of-00126.safetensors +3 -0
model-00013-of-00126.safetensors +3 -0
model-00014-of-00126.safetensors +3 -0
model-00015-of-00126.safetensors +3 -0
model-00016-of-00126.safetensors +3 -0
model-00017-of-00126.safetensors +3 -0
model-00018-of-00126.safetensors +3 -0
model-00019-of-00126.safetensors +3 -0
model-00020-of-00126.safetensors +3 -0
model-00021-of-00126.safetensors +3 -0
model-00022-of-00126.safetensors +3 -0
model-00023-of-00126.safetensors +3 -0
model-00024-of-00126.safetensors +3 -0
model-00025-of-00126.safetensors +3 -0
model-00026-of-00126.safetensors +3 -0
model-00027-of-00126.safetensors +3 -0
model-00028-of-00126.safetensors +3 -0
model-00029-of-00126.safetensors +3 -0
model-00030-of-00126.safetensors +3 -0
model-00031-of-00126.safetensors +3 -0
model-00032-of-00126.safetensors +3 -0
model-00033-of-00126.safetensors +3 -0
model-00034-of-00126.safetensors +3 -0
model-00035-of-00126.safetensors +3 -0
model-00036-of-00126.safetensors +3 -0
model-00037-of-00126.safetensors +3 -0
model-00038-of-00126.safetensors +3 -0
model-00039-of-00126.safetensors +3 -0
model-00040-of-00126.safetensors +3 -0
model-00041-of-00126.safetensors +3 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED Viewed

	@@ -0,0 +1,341 @@

+---
+pipeline_tag: text-generation
+license: other
+license_name: modified-mit
+license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
+library_name: llm-compressor
+tags:
+- fp8
+- awq
+- conversational
+- vllm
+- code
+- devops
+- software engineering
+- engineer
+- developer
+- architect
+- stem
+- agent
+datasets:
+- HuggingFaceH4/ultrachat_200k
+- databricks/databricks-dolly-15k
+- neuralmagic/calibration
+- HuggingFaceH4/no_robots
+- nvidia/HelpSteer
+- garage-bAInd/Open-Platypus
+- PJMixers/grimulkan_physical-reasoning-ShareGPT
+- PJMixers/grimulkan_theory-of-mind-ShareGPT
+- HuggingFaceH4/Multilingual-Thinking
+- ServiceNow-AI/M2Lingual
+- droussis/euroblocks_sft_1sample_per_lang
+- interstellarninja/hermes_reasoning_tool_use
+- deepmind/code_contests
+- dh02391735/stackoverflow-kubernetes-questions
+- diversoailab/humaneval-rust
+- ammarnasr/the-stack-rust-clean
+- CSJianYang/CodeArena
+- nvidia/OpenCodeInstruct
+- nvidia/Llama-Nemotron-Post-Training-Dataset
+- nvidia/Nemotron-Competitive-Programming-v1
+- rombodawg/code_bagel_hermes-2.5
+- MathArena/project_euler
+- nvidia/Nemotron-Math-Proofs-v1
+- nvidia/OpenMathInstruct-2
+- nvidia/OpenScienceReasoning-2
+- MegaScience/MegaScience
+- OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
+- ccdv/pubmed-summarization
+- gbharti/finance-alpaca
+- vladlen32230/summarization-yahoo-stock-finance-article-text
+- fka/awesome-chatgpt-prompts
+- theoldmandthesea/17k_business_book
+- ruggsea/stanford-encyclopedia-of-philosophy_instruct
+- mlfoundations-dev/stackexchange_philosophy
+- FreedomIntelligence/SocraticChat
+- Gryphe/Opus-WritingPrompts
+- anthracite-org/nopm_claude_writing_fixed
+- zerofata/Roleplay-Anime-Characters
+- zerofata/Instruct-Anime
+- zerofata/Instruct-Anime-CreativeWriting
+- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
+- PocketDoc/Dans-Prosemaxx-Adventure
+- anthracite-org/stheno-filtered-v1.1
+- KaraKaraWitch/TvTroper-2025
+- AquaV/US-Army-Survival-Sharegpt
+- AquaV/Interrogation-Sharegpt
+- AquaV/Multi-Environment-Operations-Sharegpt
+- AquaV/Resistance-Sharegpt
+- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
+base_model:
+- MiniMaxAI/MiniMax-M2.5
+---
+# MiniMax M2.5 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)
+This strives to be the highest quality quant that can run on 192GiB VRAM
+> [!TIP]
+> 💡 A non-FP8 version is available at [mratsim/MiniMax-M2.5-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ) \
+> That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet) for an extra 3GiB in VRAM. \
+> This FP8+INT4 AWQ was build by merging the original FP8 self-attention weights and [mratsim/MiniMax-M2.5-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ) experts.
+It features:
+- That model has ensured that all experts are calibrated, not doing so is extremely detrimental, PR: https://github.com/vllm-project/llm-compressor/pull/2171
+  <details>
+  <summary>**[Click me!]** Visual showcase of why ensuring quantization of all MoE experts is important</summary>
+  - Source: https://avtc.github.io/aquarium-side-by-side/
+  - Context: https://github.com/ModelCloud/GPTQModel/pull/2235
+  ![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
+  </details>
+- Mixed precision with:
+  - self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
+  - experts weights quantized using AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, scaling factor per group of 32 weights)
+- High-quality large and diverse dataset with programming and devops focus
+  as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations because we never code in a vacuum. And we want to make sure all experts are calibrated to the full range of their activations.
+- Calibration explicitly tests multilingual capabilities:
+  - Asia: Chinese, Hindi, Korean, Japanese
+  - Europe: French, German, Portuguese, Russian, Spanish
+  - Middle-East: Arabic, Hebrew, Turkish
+- Calibration explicitly tests 60 programming languages and not just Python:
+  - Imperative programming: C, C++, Go, Zig, ...
+  - Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
+  - Web-focused: HTML/CSS, Typescript, PHP, ...
+  - Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
+  - Theorem provers: Coq, Lean
+  - Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
+  - GPU Programming: Cuda, Vulkan, Apple Metal
+  - Game Programming: GDScript, GLSL
+  - Domain-specific: MATLAB, Julia, Solidity, R
+- Calibration tries to ensure coverage for a wide variety of experience (from explaining concepts to your grandmother to debugging Kubernetes logs)
+- Built by a dev, for devs (and it looks very good for STEM as well)
+It uses my new declarative quantization framework https://github.com/mratsim/quantizers which facilitates highly-tuned calibration sets: [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
+<details>
+<summary>This has taken several days and contribution and bug reports to the ecosystem, I hope you find it useful.</summary>
+- https://github.com/vllm-project/llm-compressor/pull/2171
+- https://github.com/vllm-project/llm-compressor/issues/2172
+- https://github.com/vllm-project/vllm/issues/31623
+- https://github.com/sgl-project/sglang/issues/16276
+- https://github.com/sgl-project/sglang/issues/16295
+</details>
+## 📥 Usage & Running Instructions
+The model was tested with vLLM + 2x RTX Pro 6000, here is a script suitable for such configuration with the maximum 196,608 context length. This uses 92.5GiB of VRAM with the flashinfer backend.
+> [!WARNING]
+> ⚠️ Due to rope_parameters change, at the moment this model is incompatible with transformers V5.\
+This makes it incompatible with GLM-4.6V which requires transformers V5. Use different Docker images.
+> [!WARNING]
+> ⚠️ SGLang does not support this model due to missing mixed precision support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.\
+> Please use [mratsim/MiniMax-M2.5-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ) in the meantime.
+### Running script
+`--trust-remote-code` is necessary until the transformers team merges github.com/huggingface/transformers/pull/42028
+You have 2 reasoning parsers;
+- `minimax_m2`, puts the reasoning content in a special field like DeepSeek models that is usually rendered in a specific manner in frontends.
+- `minimax_m2_append_think`, puts the reasoning into `<think>reasoning_content</think>` and that is sent as normal text. Few frontends properly render that, I'm aware of [Cherry Studio](https://github.com/CherryHQ/cherry-studio) on Desktop and [ChatterUI](https://github.com/Vali-98/ChatterUI) on Android.
+The reason why `minimax_m2_append_think` was introduced was Interleaved Thinking and having the model build upon it's previous thinking (usually frontends discard the thinking trace)
+> [!TIP]
+> 💡With the recommended parameters the model tends to get stuck in repetition loops.\
+> It seems like repetition_penalty: 1.10, frequency_penalty: 0.40 avoids that
+```bash
+# Model configuration (Mandatory)
+MODEL="mratsim/MiniMax-M2.5-FP8-INT4-AWQ"
+MODELNAME="MiniMax-M2.5"
+GPU_UTIL=0.93
+SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
+# Prevent memory fragmentation
+export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
+# Prevent vLLM from using 100% CPU when idle (Very Recommended)
+export VLLM_SLEEP_WHEN_IDLE=1
+vllm serve "${MODEL}" \
+  --served-model-name "${MODELNAME}" \
+  --trust-remote-code \
+  --gpu-memory-utilization ${GPU_UTIL} \
+  --tp 2 \
+  --override-generation-config "${SAMPLER_OVERRIDE}" \
+  --enable-auto-tool-choice \
+  --tool-call-parser minimax_m2 \
+  --reasoning-parser minimax_m2
+  # --reasoning-parser minimax_m2_append_think
+```
+## Performance
+On dual RTX Pro 6000, I can reach over 5500 prefill/prompt/context processing and over 100 tok/s token generation for a single request.
+![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/YbP1qw_YhcaM0aywJHSjG.png)
+With PagedAttention in action you can reach over 25000 tok/s in prompt processing speed.
+![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/haCbdHZWScsgGiGCj768i.png)
+When batching, with default config, you can reach over 6000 even 8000 tok/s and 1200 tok/s generation speed.\
+Tune prefill vs decode prioritization with `--max_num_batched_tokens` see [Performance & Tuning | vLLM](https://docs.vllm.ai/en/v0.4.2/models/performance.html)
+![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/ma7oVnEGbj15Rk4EG0h5B.png)
+In a steady state with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and 800 tok/s generation
+![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/Gbc2pz5Tpm8gF-MV_UPDe.png)
+Note: vLLM supports prefill-decode disaggregation for high throughput serving if you have double the minimum hardware:
+- https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
+- https://github.com/vllm-project/production-stack
+  - Prefill/decode disaggregation
+  - Multi-Tier KV-cache via [LMCache](https://github.com/LMCache/LMCache) (GPU > CPU > Local Disk)
+  - Cache aware router
+  - Multi-model dispatch via single interface
+## 🔬 Quantization method
+Quantization was quite complex for this model and was done in 3 steps:
+1. Original weights are in FP8, they were dequantized to FP16 due to llm-compressor not being able to process FP8.
+2. llm-compressor was used to quantize the MLP experts projection using AWQ, with [PR #2171](https://github.com/vllm-project/llm-compressor/pull/2171) to ensure they were all activated.
+3. Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8, with the experts-only AWQ weights.
+The llmcompressor library was used with the following recipe:
+```yaml
+default_stage:
+  default_modifiers:
+    AWQModifier:
+      config_groups:
+        mlp_experts_projections:
+          # Include only MLP expert weights for 4-bit quantization
+          targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
+          weights:
+            num_bits: 4
+            type: int
+            symmetric: true
+            group_size: 32
+            strategy: group
+            dynamic: false
+            # actorder: group
+            observer: memoryless_minmax
+      mappings:
+        - smooth_layer: re:.*post_attention_layernorm$
+          balance_layers: ["re:.*w1$", "re:.*w3$"]
+        - smooth_layer: re:.*w3$
+          balance_layers: ["re:.*w2$"]
+      duo_scaling: true
+```
+The calibration set had 590 examples, 8192 sequence length, 60 programming languages, 12 spoken languages and is detailed at [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
+## Quantization theory and heuristics for manual tuning
+<details>
+<summary>In-depth overview of quantization theory and heuristics for manual tuning</summary>
+### Layers to quantize
+Quantization should be focused on Linear layers (also called Dense or Fully-Connected layers i.e. MatMul+Bias)
+In particular quantizing LayerNorm/RMSnorm layer is strongly discouraged, see [1]
+> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the
+> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression.
+> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
+This is also reported in Intel and Nvidia repo:
+- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
+- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950
+### Tensors to up-quantize
+If there is enough bits, down projections should be prioritized.
+According to [4]
+> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B.
+> Each color represent a different projection and we clearly see that down_proj has the biggest
+> spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model
+According to [5]
+> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting
+> that weight outliers are concentrated in the down-projection matrices Wdown
+> ℓ of the second layer and
+> the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last
+> two layers.
+### Mixture-of-Experts quantization (MoE)
+Mixture-of-Experts require specific quantization techniques.
+#### Mixed-precision quantization
+Some layers have a higher impact on LLM performance.
+According to [2], spending more bits in attention layers results in large gain compared to spending them in FFN layers.
+According to [3] on 2-bit quantization:
+- quantizing expert FFN layers do not seriously impact model quality
+- quantizing cross-attention has some impact
+- quantizing self-attention has a large impact
+- quantizing dense FFN has a very significant impact
+Hence to preserve model quality we should choose not to quantize dense FFN layers and self-attention layers.
+We notice that:
+- official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
+  - https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
+- NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
+  - https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json
+#### Layers with high-impact
+According to [2], giving more bits to the first `k` blocks have a significantly higher impact on model quality than for the same last `k` blocks.
+#### Expert quantization
+When quantizing MoE, quantizing activations is tricky as only a subset of experts are activated per request. You have to make sure all experts are calibrated.
+<details>
+<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
+- Source: https://avtc.github.io/aquarium-side-by-side/
+- Context: https://github.com/ModelCloud/GPTQModel/pull/2235
+![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
+</details>
+## References
+1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\
+  Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\
+  https://arxiv.org/pdf/2506.12044
+2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\
+  Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\
+  https://arxiv.org/pdf/2406.08155v1
+3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\
+  Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\
+  https://arxiv.org/pdf/2310.02410
+4. Precision Where It Matters: A Novel Spike\
+   Aware Mixed-Precision Quantization Strategy for\
+   LLaMA-based Language Models (2025)\
+   Lucas Maisonnave, Cyril Moineau, Olivier Bichler, and Fabrice Rastello\
+   https://arxiv.org/pdf/2504.21553
+5. Systematic Outliers in Large Language Models (2025)\
+   Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang\
+   https://arxiv.org/pdf/2502.06415v2
+</details>

calibrate_software_engineer.yaml ADDED Viewed

	@@ -0,0 +1,442 @@

+calibration_set:
+  _templates:
+    programming_languages: &programming_languages "Solve the following problem using {{ ['Zephyr', 'Prolog', 'Cobol', 'Apex', 'Crystal', 'Fortran', 'Nim', 'Delphi', 'Ada', 'Objective-C', 'VBA', 'Perl', 'Groovy', 'MATLAB', 'Solidity', 'Visual Basic', 'OCaml', 'Erlang', 'Julia', 'Lisp', 'F#', 'Clojure', 'GDScript', 'Scala', 'R', 'Haskell', 'Ruby', 'Elixir', 'Lua', 'Zig', 'Dart', 'Swift', 'Metal', 'PowerShell', 'PHP', 'Kotlin', 'C', 'Java', 'C++', 'C#', 'Bash/Shell', 'Go', 'Rust', 'TypeScript', 'HTML/CSS', 'SQL', 'JavaScript', 'Python', 'Lean', 'Coq', 'Pony', 'D', 'Racket', 'Haxe', 'x86-64 ASM', 'ARM-64 ASM', 'LLVM IR', 'GLSL', 'CUDA', 'Vulkan'][hash(row|string) % 60] }}\n***\n"
+    spoken_languages: &spoken_languages "Answer in {{ ['Arabic', 'Chinese', 'French', 'German', 'Greek', 'Hebrew', 'Hindi', 'Japanese', 'Korean', 'Portuguese', 'Russian', 'Spanish', 'Turkish'][hash(row|string) % 13] }}\n***\n"
+  max_seq_length: 8192
+  shuffle: true
+  seed: 42
+  datasets:
+    # Category Summary (Total: 624 samples)
+    # =====================================================
+    # General chat (24 samples - 3.85%)
+    # Instruction and Reasoning tuning (14 samples - 2.24%)
+    # Multilingual (70 samples - 11.22%)
+    # Tool use (100 samples - 16.03%)
+    # Code / Programming / Software Engineering / Devops (328 samples - 52.56%)
+    # Math (12 samples - 1.92%)
+    # Sciences (16 samples - 2.56%)
+    # Medical (8 samples - 1.28%)
+    # Finance (8 samples - 1.28%)
+    # Business (16 samples - 2.56%)
+    # Humanities and Philosophy (8 samples - 1.28%)
+    # Creative Writing, Adventure, Roleplay (13 samples - 2.08%)
+    # General Knowledge and Pop Culture (2 samples - 0.32%)
+    # Behavioral skills (4 samples - 0.64%)
+    # Misc (1 sample - 0.16%)
+    # =====================================================
+    # Research
+    # =====================================================
+    # According to this presentation https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf
+    # AWQ only needs 64 samples to identify salient weights that need to be preserved.
+    #
+    # This research predates the boom of MoE (Mixture-of-Experts) models
+    # and it's safer to assume that 64 samples of a general dataset
+    # cannot properly identify salient weights of experts.
+    # General chat (24 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: HuggingFaceH4/ultrachat_200k
+      columns: [messages]
+      split: train_sft
+      formatter: chat_completion
+      num_samples: 8
+      streaming: true
+    - dataset: databricks/databricks-dolly-15k
+      split: train
+      columns: [instruction, response]
+      formatter: prompt_answer
+      num_samples: 8
+    - dataset: neuralmagic/calibration
+      subset: LLM
+      split: train
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 8
+    # Instruction and Reasoning tuning (14 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: HuggingFaceH4/no_robots
+      split: train
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 2
+    - dataset: nvidia/HelpSteer
+      split: train
+      columns: [prompt, response]
+      formatter: prompt_answer
+      num_samples: 2
+      streaming: true
+    - dataset: garage-bAInd/Open-Platypus
+      split: train
+      columns: [instruction, output]
+      formatter: prompt_answer
+      num_samples: 2
+    - dataset: PJMixers/grimulkan_physical-reasoning-ShareGPT
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 4
+    - dataset: PJMixers/grimulkan_theory-of-mind-ShareGPT
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 4
+    # Multilingual (70 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: HuggingFaceH4/Multilingual-Thinking
+      split: train
+      columns: [user]
+      formatter: raw_text
+      num_samples: 32
+      formatter_params:
+        prefix: *spoken_languages
+    - dataset: ServiceNow-AI/M2Lingual
+      subset: full_data
+      split: train
+      columns: [conversation]
+      formatter: chat_completion
+      num_samples: 4
+      streaming: true
+    - dataset: droussis/euroblocks_sft_1sample_per_lang
+      split: train
+      columns: [conversations]
+      formatter: chat_completion
+      num_samples: 34
+    # Tool use (include commented out ToolAce) (100 samples)
+    # ---------------------------------------------------------------------------
+    # Fail with minimax!
+    # jinja2.exceptions.TemplateError: Message has tool role, but there was no previous assistant message with a tool call!
+    # - dataset: Team-ACE/ToolACE
+    #   split: train
+    #   columns: [system, conversations]
+    #   formatter: chat_completion_with_sysprompt
+    #   num_samples: 100
+    - dataset: interstellarninja/hermes_reasoning_tool_use
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 100
+      streaming: true
+    # Code / Programming / Software Engineering / Devops (336 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: deepmind/code_contests
+      split: train
+      columns: [name]
+      formatter: deepmind_code_contests
+      num_samples: 50
+      streaming: true
+    - dataset: dh02391735/stackoverflow-kubernetes-questions
+      split: train
+      columns: [instruction]
+      formatter: raw_text
+      num_samples: 8
+      streaming: true
+    - dataset: diversoailab/humaneval-rust
+      split: train
+      columns: [prompt]
+      formatter: raw_text
+      num_samples: 100
+      formatter_params: # The dataset actually doesn't hardcode the language
+        prefix: *programming_languages
+    - dataset: ammarnasr/the-stack-rust-clean
+      split: train
+      columns: [content]
+      formatter: raw_text
+      num_samples: 8
+      streaming: true
+      formatter_params:
+        prefix: "Explain this code and comment it for a junior dev.\n***\n"
+    - dataset: CSJianYang/CodeArena
+      split: test
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 8
+    - dataset: nvidia/OpenCodeInstruct
+      split: train
+      columns: [input, output]
+      formatter: prompt_answer
+      num_samples: 8
+      streaming: true
+    - dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
+      split: code
+      columns: [input]
+      formatter: chat_completion
+      num_samples: 8
+      streaming: true
+    - dataset: nvidia/Nemotron-Competitive-Programming-v1
+      split: competitive_coding_cpp_part00
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 8
+      streaming: true
+    # The conversations columns has another "conversations" field :/
+    # - dataset: sr5434/CodegebraGPT_data
+    #   subset: 100k-text
+    #   split: train
+    #   columns: [conversations]
+    #   formatter: sharegpt
+    #   num_samples: 8
+    - dataset: rombodawg/code_bagel_hermes-2.5
+      split: train
+      columns: [input, output]
+      formatter: prompt_answer
+      num_samples: 100
+      streaming: true
+    - dataset: MathArena/project_euler
+      split: train
+      columns: [problem]
+      formatter: raw_text
+      num_samples: 30
+      formatter_params:
+        prefix: *programming_languages
+    # Math (12 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
+      split: math
+      columns: [input]
+      formatter: chat_completion
+      num_samples: 4
+      streaming: true
+    - dataset: nvidia/Nemotron-Math-Proofs-v1
+      split: lean
+      columns: [formal_statement]
+      formatter: raw_text
+      num_samples: 4
+      streaming: true
+      formatter_params:
+        prefix: "Can you improve, document and add comment to this Lean proof for a non-mathematician?\n***\n"
+    - dataset: nvidia/OpenMathInstruct-2
+      split: train
+      columns: [problem, generated_solution]
+      formatter: prompt_answer
+      num_samples: 4
+      streaming: true
+    # Sciences (16 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
+      split: science
+      columns: [input]
+      formatter: chat_completion
+      num_samples: 4
+      streaming: true
+    - dataset: nvidia/OpenScienceReasoning-2
+      split: train
+      columns: [input, output]
+      formatter: prompt_answer
+      num_samples: 8
+      streaming: true
+    - dataset: MegaScience/MegaScience
+      split: train
+      columns: [question, answer]
+      formatter: prompt_answer
+      num_samples: 4
+      streaming: true
+    # Medical (8 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
+      split: train
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 4
+      streaming: true
+    - dataset: ccdv/pubmed-summarization
+      subset: section
+      split: train
+      columns: [article]
+      formatter: raw_text
+      num_samples: 4
+      streaming: true
+      formatter_params:
+        prefix: "Summarize this:\n***\n"
+    # Finance (8 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: gbharti/finance-alpaca
+      split: train
+      columns: [instruction, output]
+      formatter: prompt_answer
+      num_samples: 4
+    - dataset: vladlen32230/summarization-yahoo-stock-finance-article-text
+      split: train
+      columns: [text]
+      formatter: raw_text
+      num_samples: 4
+      formatter_params:
+        prefix: "Summarize this:\n***\n"
+    # Business (16 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: fka/awesome-chatgpt-prompts
+      split: train
+      columns: [prompt]
+      formatter: raw_text
+      num_samples: 8
+    - dataset: theoldmandthesea/17k_business_book
+      split: train
+      columns: [question, answer]
+      formatter: prompt_answer
+      num_samples: 8
+    # Humanities and Philosophy (8 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: ruggsea/stanford-encyclopedia-of-philosophy_instruct
+      split: train
+      columns: [question, answer]
+      formatter: prompt_answer
+      num_samples: 2
+      streaming: true
+    - dataset: mlfoundations-dev/stackexchange_philosophy
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 2
+    - dataset: FreedomIntelligence/SocraticChat
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 4
+      streaming: true
+    # Creative Writing, Adventure, Roleplay (13 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: Gryphe/Opus-WritingPrompts
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 2
+    - dataset: anthracite-org/nopm_claude_writing_fixed
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 2
+    - dataset: zerofata/Roleplay-Anime-Characters
+      split: train
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 1
+    - dataset: zerofata/Instruct-Anime
+      split: train
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 1
+    - dataset: zerofata/Instruct-Anime-CreativeWriting
+      split: train
+      columns: [messages]
+      formatter: chat_completion
+      num_samples: 1
+    - dataset: sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
+      split: train
+      columns: [chosen]
+      formatter: chat_completion
+      num_samples: 2
+    - dataset: PocketDoc/Dans-Prosemaxx-Adventure
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 2
+    - dataset: anthracite-org/stheno-filtered-v1.1
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 2
+      streaming: true
+    # General Knowledge and Pop Culture (2 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: KaraKaraWitch/TvTroper-2025
+      split: train
+      columns: [article]
+      formatter: raw_text
+      num_samples: 2
+      streaming: true
+      formatter_params:
+        prefix: "Explain this trope like I'm your grandmother\n***\n"
+    # Behavioral skills (4 samples)
+    # ---------------------------------------------------------------------------
+    - dataset: AquaV/US-Army-Survival-Sharegpt
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 1
+    - dataset: AquaV/Interrogation-Sharegpt
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 1
+    - dataset: AquaV/Multi-Environment-Operations-Sharegpt
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 1
+    - dataset: AquaV/Resistance-Sharegpt
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 1
+    # Misc (1 sample)
+    # ---------------------------------------------------------------------------
+    - dataset: PocketDoc/Dans-Kinomaxx-VanillaBackrooms
+      split: train
+      columns: [conversations]
+      formatter: sharegpt
+      num_samples: 1

chat_template.jinja ADDED Viewed

	@@ -0,0 +1,159 @@

+{# ----------‑‑‑ special token variables ‑‑‑---------- #}
+{%- set toolcall_begin_token   = '<minimax:tool_call>'         -%}
+{%- set toolcall_end_token     = '</minimax:tool_call>'        -%}
+{#- Tool Rendering Functions ============================================== -#}
+{%- macro render_tool_namespace(namespace_name, tool_list) -%}
+{%- for tool in tool_list -%}
+<tool>{{ tool.function | tojson(ensure_ascii=False) }}</tool>
+{% endfor -%}
+{%- endmacro -%}
+{%- macro visible_text(content) -%}
+    {%- if content is string -%}
+        {{ content }}
+    {%- elif content is iterable and content is not mapping -%}
+        {%- for item in content -%}
+            {%- if item is mapping and item.type == 'text' -%}
+                {{- item.text }}
+            {%- elif item is string -%}
+                {{- item }}
+            {%- endif -%}
+        {%- endfor -%}
+    {%- else -%}
+        {{- content }}
+    {%- endif -%}
+{%- endmacro -%}
+{#- System Message Construction ============================================ -#}
+{%- macro build_system_message(system_message) -%}
+    {%- if system_message and system_message.content -%}
+        {{- visible_text(system_message.content) }}
+    {%- else -%}
+        {%- if model_identity is not defined -%}
+            {%- set model_identity = "You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax." -%}
+        {%- endif -%}
+        {{- model_identity }}
+    {%- endif -%}
+    {#- Handle current_date -#}
+    {%- if system_message and system_message.current_date -%}
+        {{- '\n' ~ 'Current date: ' + system_message.current_date }}
+    {%- endif -%}
+    {#- Handle current_location -#}
+    {%- if system_message and system_message.current_location -%}
+        {{- '\n' ~ 'Current location: ' + system_message.current_location }}
+    {%- endif -%}
+{%- endmacro -%}
+{#- Main Template Logic ================================================= -#}
+{#- Extract system message (only first message if it's system) -#}
+{%- set system_message = none -%}
+{%- set conversation_messages = messages -%}
+{%- if messages and messages[0].role == "system" -%}
+    {%- set system_message = messages[0] -%}
+    {%- set conversation_messages = messages[1:] -%}
+{%- endif -%}
+{#- Get the last user message turn, for interleved thinking -#}
+{%- set ns = namespace(last_user_index=-1) %}
+{% for m in conversation_messages %}
+    {%- if m.role == 'user' %}
+        {% set ns.last_user_index = loop.index0 -%}
+    {%- endif %}
+{%- endfor %}
+{#- Render system message -#}
+{{- ']~!b[' ~ ']~b]system' ~ '\n' }}
+{{- build_system_message(system_message) }}
+{#- Render tools if available -#}
+{%- if tools -%}
+    {{- '\n\n' ~ '# Tools' ~ '\n' ~ 'You may call one or more tools to assist with the user query.\nHere are the tools available in JSONSchema format:' ~ '\n' }}
+    {{- '\n' ~ '<tools>' ~ '\n' }}
+    {{- render_tool_namespace("functions", tools) }}
+    {{- '</tools>' ~ '\n\n' }}
+{{- 'When making tool calls, use XML format to invoke tools and pass parameters:' ~ '\n' }}
+{{- '\n' ~ toolcall_begin_token }}
+<invoke name="tool-name-1">
+<parameter name="param-key-1">param-value-1</parameter>
+<parameter name="param-key-2">param-value-2</parameter>
+...
+</invoke>
+{{- '\n' ~ toolcall_end_token }}
+{%- endif -%}
+{{- '[e~[\n' }}
+{#- Render messages -#}
+{%- set last_tool_call = namespace(name=none) -%}
+{%- for message in conversation_messages -%}
+    {%- if message.role == 'assistant' -%}
+        {#- Only render reasoning_content if no user message follows -#}
+        {{- ']~b]ai' ~ '\n' }}
+        {%- set reasoning_content = '' %}
+        {%- set content = visible_text(message.content) %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].strip('\n').split('<think>')[-1].strip('\n') %}
+                {%- set content = content.split('</think>')[-1].strip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- if reasoning_content and loop.index0 > ns.last_user_index -%}
+            {{- '<think>' ~ '\n' ~ reasoning_content ~ '\n' ~ '</think>' ~ '\n\n' }}
+        {%- endif -%}
+        {%- if content -%}
+            {{- content }}
+        {%- endif -%}
+        {%- if message.tool_calls -%}
+            {{- '\n' ~ toolcall_begin_token ~ '\n' }}
+            {%- for tool_call in message.tool_calls -%}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<invoke name="' + tool_call.name + '">' }}
+                {% set _args = tool_call.arguments %}
+                {%- for k, v in _args.items() %}
+                {{- '<parameter name="' + k + '">' }}
+                {{- v | tojson(ensure_ascii=False) if v is not string else v }}
+                {{- '</parameter>' }}
+                {% endfor %}
+                {{- '</invoke>' ~ '\n' }}
+            {%- endfor -%}
+            {{- toolcall_end_token}}
+            {%- set last_tool_call.name = message.tool_calls[-1].name -%}
+        {%- else -%}
+            {%- set last_tool_call.name = none -%}
+        {%- endif -%}
+        {{- '[e~[' ~ '\n' }}
+    {%- elif message.role == 'tool' -%}
+    {%- if last_tool_call.name is none -%}
+        {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
+    {%- endif -%}
+    {%- if loop.first or (conversation_messages[loop.index0 - 1].role != 'tool') -%}
+        {{- ']~b]tool' }}
+    {%- endif -%}
+    {%- if message.content is string -%}
+        {{- '\n<response>' }}
+        {{- message.content }}
+        {{- '</response>' }}
+    {%- else -%}
+        {%- for tr in message.content -%}
+            {{- '\n<response>' }}
+            {{- tr.output if tr.output is defined else (tr.text if tr.type == 'text' and tr.text is defined else tr) }}
+            {{- '\n</response>' }}
+        {%- endfor -%}
+    {%- endif -%}
+    {%- if loop.last or (conversation_messages[loop.index0 + 1].role != 'tool') -%}
+        {{- '[e~[\n' -}}
+    {%- endif -%}
+    {%- elif message.role == 'user' -%}
+        {{- ']~b]user' ~ '\n' }}
+        {{- visible_text(message.content) }}
+        {{- '[e~[' ~ '\n' }}
+    {%- endif -%}
+{%- endfor -%}
+{#- Generation prompt -#}
+{%- if add_generation_prompt -%}
+{{- ']~b]ai' ~ '\n' ~ '<think>' ~ '\n' }}
+{%- endif -%}

config.json ADDED Viewed

	@@ -0,0 +1,223 @@

+{
+  "architectures": [
+    "MiniMaxM2ForCausalLM"
+  ],
+  "attn_type_list": [
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1,
+    1
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_minimax_m2.MiniMaxM2Config",
+    "AutoModelForCausalLM": "modeling_minimax_m2.MiniMaxM2ForCausalLM"
+  },
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 3072,
+  "intermediate_size": 1536,
+  "max_position_embeddings": 196608,
+  "model_type": "minimax_m2",
+  "mtp_transformer_layers": 1,
+  "num_attention_heads": 48,
+  "num_experts_per_tok": 8,
+  "num_hidden_layers": 62,
+  "num_key_value_heads": 8,
+  "num_local_experts": 256,
+  "num_mtp_modules": 3,
+  "qk_norm_type": "per_layer",
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 5000000,
+  "rotary_dim": 64,
+  "scoring_func": "sigmoid",
+  "shared_intermediate_size": 0,
+  "tie_word_embeddings": false,
+  "transformers_version": "4.46.1",
+  "use_cache": true,
+  "use_mtp": true,
+  "use_qk_norm": true,
+  "use_routing_bias": true,
+  "vocab_size": 200064,
+  "quantization_config": {
+    "quant_method": "compressed-tensors",
+    "format": "mixed-precision",
+    "quantization_status": "compressed",
+    "config_groups": {
+      "self_attention_projections": {
+        "targets": [
+          "Linear",
+          "re:.*self_attn\\.(k|q|o|v)_proj$",
+          "re:.*self_attn\\.qkv_proj$"
+        ],
+        "weights": {
+          "type": "float",
+          "num_bits": 8,
+          "strategy": "block",
+          "block_structure": [
+            128,
+            128
+          ],
+          "symmetric": true,
+          "dynamic": false
+        },
+        "input_activations": {
+          "type": "float",
+          "num_bits": 8,
+          "strategy": "token",
+          "symmetric": true,
+          "dynamic": true
+        },
+        "format": "float-quantized"
+      },
+      "mlp_experts_projections": {
+        "format": "pack-quantized",
+        "input_activations": null,
+        "output_activations": null,
+        "targets": [
+          "Linear",
+          "re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"
+        ],
+        "weights": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": false,
+          "group_size": 32,
+          "num_bits": 4,
+          "observer": "memoryless_minmax",
+          "observer_kwargs": {},
+          "strategy": "group",
+          "symmetric": true,
+          "type": "int"
+        }
+      }
+    },
+    "ignore": [
+      "model.layers.0.block_sparse_moe.gate",
+      "model.layers.1.block_sparse_moe.gate",
+      "model.layers.2.block_sparse_moe.gate",
+      "model.layers.3.block_sparse_moe.gate",
+      "model.layers.4.block_sparse_moe.gate",
+      "model.layers.5.block_sparse_moe.gate",
+      "model.layers.6.block_sparse_moe.gate",
+      "model.layers.7.block_sparse_moe.gate",
+      "model.layers.8.block_sparse_moe.gate",
+      "model.layers.9.block_sparse_moe.gate",
+      "model.layers.10.block_sparse_moe.gate",
+      "model.layers.11.block_sparse_moe.gate",
+      "model.layers.12.block_sparse_moe.gate",
+      "model.layers.13.block_sparse_moe.gate",
+      "model.layers.14.block_sparse_moe.gate",
+      "model.layers.15.block_sparse_moe.gate",
+      "model.layers.16.block_sparse_moe.gate",
+      "model.layers.17.block_sparse_moe.gate",
+      "model.layers.18.block_sparse_moe.gate",
+      "model.layers.19.block_sparse_moe.gate",
+      "model.layers.20.block_sparse_moe.gate",
+      "model.layers.21.block_sparse_moe.gate",
+      "model.layers.22.block_sparse_moe.gate",
+      "model.layers.23.block_sparse_moe.gate",
+      "model.layers.24.block_sparse_moe.gate",
+      "model.layers.25.block_sparse_moe.gate",
+      "model.layers.26.block_sparse_moe.gate",
+      "model.layers.27.block_sparse_moe.gate",
+      "model.layers.28.block_sparse_moe.gate",
+      "model.layers.29.block_sparse_moe.gate",
+      "model.layers.30.block_sparse_moe.gate",
+      "model.layers.31.block_sparse_moe.gate",
+      "model.layers.32.block_sparse_moe.gate",
+      "model.layers.33.block_sparse_moe.gate",
+      "model.layers.34.block_sparse_moe.gate",
+      "model.layers.35.block_sparse_moe.gate",
+      "model.layers.36.block_sparse_moe.gate",
+      "model.layers.37.block_sparse_moe.gate",
+      "model.layers.38.block_sparse_moe.gate",
+      "model.layers.39.block_sparse_moe.gate",
+      "model.layers.40.block_sparse_moe.gate",
+      "model.layers.41.block_sparse_moe.gate",
+      "model.layers.42.block_sparse_moe.gate",
+      "model.layers.43.block_sparse_moe.gate",
+      "model.layers.44.block_sparse_moe.gate",
+      "model.layers.45.block_sparse_moe.gate",
+      "model.layers.46.block_sparse_moe.gate",
+      "model.layers.47.block_sparse_moe.gate",
+      "model.layers.48.block_sparse_moe.gate",
+      "model.layers.49.block_sparse_moe.gate",
+      "model.layers.50.block_sparse_moe.gate",
+      "model.layers.51.block_sparse_moe.gate",
+      "model.layers.52.block_sparse_moe.gate",
+      "model.layers.53.block_sparse_moe.gate",
+      "model.layers.54.block_sparse_moe.gate",
+      "model.layers.55.block_sparse_moe.gate",
+      "model.layers.56.block_sparse_moe.gate",
+      "model.layers.57.block_sparse_moe.gate",
+      "model.layers.58.block_sparse_moe.gate",
+      "model.layers.59.block_sparse_moe.gate",
+      "model.layers.60.block_sparse_moe.gate",
+      "model.layers.61.block_sparse_moe.gate",
+      "lm_head"
+    ],
+    "kv_cache_scheme": null,
+    "global_compression_ratio": null,
+    "sparsity_config": {},
+    "transform_config": {},
+    "version": "0.13.1.dev0+g797d301.d20251228"
+  }
+}

configuration_minimax_m2.py ADDED Viewed

	@@ -0,0 +1,200 @@

+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+#           This file was automatically generated from src/transformers/models/minimax_m2/modular_minimax_m2.py.
+#               Do NOT edit this file manually as any edits will be overwritten by the generation of
+#             the file from the modular. If any change should be done, please apply the change to the
+#                          modular_minimax_m2.py file directly. One of our CI enforces this.
+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+# coding=utf-8
+# Copyright 2025 the HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from transformers.configuration_utils import PretrainedConfig
+class MiniMaxM2Config(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`MiniMaxM2Model`]. It is used to instantiate an
+    MiniMaxM2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the MiniMaxM2-7B-v0.1 or MiniMaxM2-7B-Instruct-v0.1.
+    [minimax_m2ai/MiniMaxM2-8x7B](https://huggingface.co/minimax_m2ai/MiniMaxM2-8x7B)
+    [minimax_m2ai/MiniMaxM2-7B-Instruct-v0.1](https://huggingface.co/minimax_m2ai/MiniMaxM2-7B-Instruct-v0.1)
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 32000):
+            Vocabulary size of the MiniMaxM2 model. Defines the number of different tokens that can be represented by the
+            `inputs_ids` passed when calling [`MiniMaxM2Model`]
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 14336):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        num_key_value_heads (`int`, *optional*, defaults to 8):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by meanpooling all the original heads within that group. For more details, check out [this
+            paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `8`.
+        head_dim (`int`, *optional*, defaults to `hidden_size // num_attention_heads`):
+            The attention head dimension.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to `4096*32`):
+            The maximum sequence length that this model might ever be used with. MiniMaxM2's sliding window attention
+            allows sequence of up to 4096*32 tokens.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        pad_token_id (`int`, *optional*):
+            The id of the padding token.
+        bos_token_id (`int`, *optional*, defaults to 1):
+            The id of the "beginning-of-sequence" token.
+        eos_token_id (`int`, *optional*, defaults to 2):
+            The id of the "end-of-sequence" token.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied.
+        rope_theta (`float`, *optional*, defaults to 1000000.0):
+            The base period of the RoPE embeddings.
+        sliding_window (`int`, *optional*):
+            Sliding window attention window size. If not specified, will default to `4096`.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        num_experts_per_tok (`int`, *optional*, defaults to 2):
+            The number of experts to route per-token, can be also interpreted as the `top-k` routing
+            parameter
+        num_local_experts (`int`, *optional*, defaults to 8):
+            Number of experts per Sparse MLP layer.
+        output_router_logits (`bool`, *optional*, defaults to `False`):
+            Whether or not the router logits should be returned by the model. Enabling this will also
+            allow the model to output the auxiliary loss. See [here]() for more details
+        router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
+            The aux loss factor for the total loss.
+        router_jitter_noise (`float`, *optional*, defaults to 0.0):
+            Amount of noise to add to the router.
+    ```python
+    >>> from transformers import MiniMaxM2Model, MiniMaxM2Config
+    >>> # Initializing a MiniMaxM2 7B style configuration
+    >>> configuration = MiniMaxM2Config()
+    >>> # Initializing a model from the MiniMaxM2 7B style configuration
+    >>> model = MiniMaxM2Model(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "minimax_m2"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        "layers.*.self_attn.k_proj": "colwise",
+        "layers.*.self_attn.v_proj": "colwise",
+        "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.block_sparse_moe.gate": "colwise_rep",  # we need to replicate here to correctly route experts
+        "layers.*.block_sparse_moe.experts.*.w1": "colwise",
+        "layers.*.block_sparse_moe.experts.*.w2": "rowwise",
+        "layers.*.block_sparse_moe.experts.*.w3": "colwise",
+    }
+    base_model_pp_plan = {
+        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "norm": (["hidden_states"], ["hidden_states"]),
+    }
+    def __init__(
+        self,
+        vocab_size=32000,
+        hidden_size=4096,
+        intermediate_size=14336,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=8,
+        head_dim=None,
+        hidden_act="silu",
+        max_position_embeddings=4096 * 32,
+        initializer_range=0.02,
+        rms_norm_eps=1e-5,
+        use_cache=True,
+        pad_token_id=None,
+        bos_token_id=1,
+        eos_token_id=2,
+        tie_word_embeddings=False,
+        rope_theta=1e6,
+        sliding_window=None,
+        attention_dropout=0.0,
+        num_experts_per_tok=2,
+        num_local_experts=8,
+        output_router_logits=False,
+        router_aux_loss_coef=0.001,
+        router_jitter_noise=0.0,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.sliding_window = sliding_window
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.attention_dropout = attention_dropout
+        self.head_dim = head_dim
+        self.num_experts_per_tok = num_experts_per_tok
+        self.num_local_experts = num_local_experts
+        self.output_router_logits = output_router_logits
+        self.router_aux_loss_coef = router_aux_loss_coef
+        self.router_jitter_noise = router_jitter_noise
+        self.use_qk_norm = kwargs.pop("use_qk_norm", False)
+        self.rotary_dim = kwargs.pop("rotary_dim", self.head_dim)
+        self.partial_rotary_factor = kwargs.pop("partial_rotary_factor", 1)
+        if self.head_dim is not None:
+            self.partial_rotary_factor = self.rotary_dim / self.head_dim
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+__all__ = ["MiniMaxM2Config"]

generation_config.json ADDED Viewed

	@@ -0,0 +1,8 @@

+{
+  "bos_token_id": 200019,
+  "do_sample": true,
+  "eos_token_id": 200020,
+  "top_k": 40,
+  "top_p": 0.95,
+  "transformers_version": "4.57.3"
+}

merges.txt ADDED Viewed

The diff for this file is too large to render. See raw diff

model-00000-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2ab3425a58e19c5937397c2e650fa5c2eca31d8510854b85f9f1f86038e61d01
+size 2635579896

model-00001-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2216aa0d02dc9f9b24af9f43044af5c1cf6c1afa667b3d1c7390e0be59d825b5
+size 679579608

model-00002-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:397e30f6587539cf2d8ab054333e02c89a95308b68c5fec1f657cd5c815546e7
+size 1406386576

model-00003-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c23c95eeaab36e3041ddf9c1d931f97bf5fcdc77ad3f3720a9fde9f1c10fcc30
+size 679579608

model-00004-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f8007a0d220e3cad06535d61e9af55319a7c8e16f5f3f7a6a0d536634a2c053f
+size 1406386576

model-00005-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8357379b03591695f22dd8623e85f9ee344dc50e2adaac9182446fa6678451c9
+size 679579608

model-00006-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a2304172be40d611560eec65b92a8e67dbdd76db6242f4b1a8f70dbc0dc119ea
+size 1406386576

model-00007-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b847fcbbfac12ea04d89aab6f9bf8fa9f8a0ca43afb037d10ce0b03761d88c96
+size 679579608

model-00008-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:49b7716786362a2bda42cc37e74de9246c0da8bae3b8dd541e223a4f4a80f9cc
+size 1406386576

model-00009-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9a7178b968a409f64445ddba49d5ac42ffca93d30a5271649285f8350a81b61a
+size 679579608

model-00010-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5153cab8d648eee39c3164cb8ae60437a2865fd57788ace734d0a8e4452ae76a
+size 1406386576

model-00011-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8989077984e9fd1036d9848373daf2b76c2def9da0a7e8db81b068b59a20465b
+size 679579608

model-00012-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f361ed1b4fc43c71a0ec6df3298828cd5b39ff28486717663497b22e8d2cec3a
+size 1406386576

model-00013-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9c16d7af5f4a82c8a6f0879faf883ea1e21deff74f7b3477ed46909f9dab8bf0
+size 679579608

model-00014-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:95ee92624cc8f902a33a46d921babd224c00602736dc43b0d80438e62497b703
+size 1406386576

model-00015-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fac258378c41b643c8a6533ddfb19f80f28b2e4ad895b161a9259951044d573f
+size 679579608

model-00016-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0de568683b7f823d37b706beab45de331687399dd1e07afc94c7dd8dc725dca2
+size 1406386576

model-00017-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d6c5b094ae52fa3cbd8cbcaf955cf49bc5a8b262ae21f7bc0b98397717f609c1
+size 679579608

model-00018-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:db9b60e07f20da7bbccfcc7869b25f9502330fda3c9600ee555d2c1fc8c39771
+size 1406386576

model-00019-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2e4663e6ea7dac4dd0b49bde8eca5e2c3b96d2913c0997371a383cd80855ec14
+size 679579608

model-00020-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:97de867f6abe61daaa1feae6767c1d3c8db4c2e888124b834cd3bd9a32b269c6
+size 1406388128

model-00021-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:70996131b5f441319a24e5b04fffd21fcf16fccf16246ef1d04c5094d481ac07
+size 679580376

model-00022-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f23f667295c6e4623859e5e9e68554a48c15614106fb076ffdef57a302ac70df
+size 1406388128

model-00023-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fe6d49a0a4da5c96c6e44f8bb3976b859cbd5e3f806144d19c7a5d2971bc3562
+size 679580376

model-00024-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:615dfb553e208187841c3f49af967f1537d32ea7915e78873515819b9c10d203
+size 1406388128

model-00025-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4b631abe29100d76512c54b30e845ac077e19740d20d5384ec3576ca1c20981b
+size 679580376

model-00026-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6aff9daaeaf1b44541286daba889264169b4327dff893e1e681ceeae0ef4b971
+size 1406388128

model-00027-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c67d132dff423f512c986e461a13e4bc32debe8d479f9a37cbba8235051f14d0
+size 679580376

model-00028-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b164fb2b5121eb460a1f450f98bce94dfffb8948c987634c0fbb6db789ce1160
+size 1406388128

model-00029-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d0fb7b93d05a8853e03ddc65070e73be1bbd1742c8da7b31ae05efe7e7af0d7b
+size 679580376

model-00030-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6aac5cf99a35483dffef5b3f7c88446932cb99a5ada0eee3cfb42737b46c332f
+size 1406388128

model-00031-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b00d17d98b3340d90acb59bd7bd333623dc86b9ed0bbafcef1c08a5afe2084dc
+size 679580376

model-00032-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:65791b5f25d12df6cc5b96528445190f9d041ca9a8f5ada6d205c4c2d4f2abbe
+size 1406388128

model-00033-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:17ba972e688ad1dea052a414993664ffd1cf7d977d86bba977176bd88cf8e616
+size 679580376

model-00034-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0c2b98a81ee0bebf55808833fef1f75664402ade20952fe14f618adc6c13b953
+size 1406388128

model-00035-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9bc11127c7dd9cb7731a6045edac6eabb26c378ce7990d14de83507bea841dff
+size 679580376

model-00036-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8ee6346f9fd286195a7801240a5d334bee6e234714d7965f46c057a0cc41fe5c
+size 1406388128

model-00037-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:60ad909251a4af09afe66716632e80cd70df295f6c12d6e7ffa86d5037b6e208
+size 679580376

model-00038-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8fd4b5d62565d0f41a75b22b43ede83d170fd013da124c1b1878bf5e995ad68d
+size 1406388128

model-00039-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:172835f6c4dee07b34d92de53c942e27066e2b61bfad2e3cef892d652245f9ab
+size 679580376

model-00040-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8fb479c10ef909d12ee2cfcc73a6661104a24c9c30f6f42df2945213a857572b
+size 1406388128

model-00041-of-00126.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c0abbd93045df3f1409ed3a8e18bfce814123d3c2d52108a3bd67c57f72e27dc
+size 679580376