---
language:
- en
- zh
- ko
- ja
license: apache-2.0
base_model: google/gemma-4-31B-it
tags:
- gemma
- gemma4
- instruction-tuned
- reasoning
- alignment
pipeline_tag: text-generation
---

# Gemopus-4-31B-it

> [!NOTE]
> **Gemopus** is an attempt at fine-tuning Gemma 4 with a core philosophy of "stability first".
>
> While preserving the original reasoning order of **Gemma 4** as much as possible, I made targeted refinements for answer quality, structure, clarity, and consistency.
>
> This model was trained in a post-fix **Unsloth** environment, after Unsloth's official gradient-accumulation and loss-accounting fixes for Gemma-family training. In practice, I used a bug-fixed stack aligned with `unsloth_zoo>=2026.4.6` and `transformers==5.5.0`, in order to avoid misleading loss inflation under gradient accumulation and to obtain more reliable optimization behavior for **Gemma 4 31B** fine-tuning.
>
> **Therefore, my fine-tuning strategy does not follow other teams in aggressive direct distillation from Claude. Instead, I opted for a more conservative and controllable path.**

## Development Motivation & Industry Insights

**Gemopus-4-31B-it** is a supervised fine-tuned version of the Gemma 4 31B Instruction model.

* Although this model has "Opus" in its name, that mainly continues an existing naming convention.
* The goal here is not to deny that reasoning SFT can generalize under the right conditions, but to avoid naive or superstitious replication of **"Claude-style chain of thought (CoT)"** from public distillation corpora. Recent evidence suggests that whether reasoning supervision transfers depends on optimization, data quality, and model capability. In practice, many publicly available reasoning traces do not necessarily reflect the teacher model's true, faithful, and transferable internal process; they are often closer to polished summaries than to genuinely connected reasoning.
A series of recent studies have also shown that models can exhibit post-hoc rationalization in natural settings, and that CoT faithfulness varies substantially across model families and training regimes. In other words, text that merely **looks** like reasoning is not automatically a high-quality, transferable supervision signal for reasoning.

---

## Supporting Evidence

Recent work: **Ren et al., 2026, *Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability*** ([arXiv:2604.06628](https://arxiv.org/abs/2604.06628))

Short-epoch reasoning SFT can underestimate generalization: in-domain gains may appear early, while out-of-domain improvements often require sufficient optimization.

This paper suggests that generalization in reasoning SFT is **not fixed, but conditional**: it is shaped jointly by optimization dynamics, training data quality, and base-model capability.

Key takeaways:

- Reasoning SFT can generalize when sufficiently optimized, often showing a **dip-then-recovery** pattern rather than a monotonic curve.
- **High-quality long-CoT data** can support cross-domain transfer, whereas weak or noisy reasoning traces may not.
- **Stronger models** are more likely to internalize transferable reasoning structure instead of merely imitating longer outputs.
- The gains are **asymmetric**: reasoning ability may improve while safety behavior can degrade.

For **Gemopus-4-31B-it**, this evidence supports a more conditional interpretation of reasoning supervision. My strategy is therefore not based on the simplistic claim that reasoning SFT never generalizes, but on a practical judgment about **which kind of reasoning supervision is worth applying to Gemma 4**.

Since **Gemma 4 31B** already has a relatively orderly and restrained reasoning-chat prior, I chose not to aggressively overwrite it with public "Claude-style" traces of uneven quality. Instead, the SFT objective focuses on preserving Gemma 4's native reasoning order while improving **answer quality, structure, clarity, and interaction consistency**.

This also suggests that reasoning SFT should be viewed as a **dynamic optimization process**, rather than a static training outcome. For this project, that means prioritizing **data quality, optimization discipline, and compatibility with the base model's native strengths**, rather than assuming that longer visible reasoning alone will automatically produce a better student.


---

## Model Features & Alignment Optimization

Based on the reasoning above, I focused my optimization efforts on the lower-risk, more consistently rewarding areas of **final answer quality and interactive experience**:

* **Overall Style Consistency:** Eliminated the stiff "machine translation tone" and redundant preachiness inherent in the base model, making conversations more natural, clear, and organized.
* **Structural & Completeness Enhancements:** Significantly improved the organization of long responses. The model uses Markdown syntax (e.g., lists, bolding) more proficiently for hierarchical structuring and noise reduction, so that key points stand out visually and the response is easier to read.
* **Expressive Rigor & Depth of Explanation:** In technical and popular science responses, improved the rigor of professional terminology and the ability to explain complex concepts simply, while avoiding mechanical, encyclopedia-like recitation.

---

## Evaluation Benchmarks (TBD)

---

## Best Practices

For the best performance, use these configurations and best practices:

### 1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

* `temperature=1.0`
* `top_p=0.95`
* `top_k=64`

### 2. Thinking Mode Configuration

Compared to Gemma 3, the models use standard `system`, `assistant`, and `user` roles. To properly manage the thinking process, use the following control tokens:

* **Trigger Thinking:** Thinking is enabled by including the `<|think|>` token at the start of the system prompt. To disable thinking, remove the token.
* **Standard Generation:** When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure: `<|channel>thought\n` **[Internal reasoning]** `