Noeum-1-Nano

---
license: apache-2.0
language:
- en
tags:
- moe
- mixture-of-experts
- reasoning
- chain-of-thought
- cot
- system-2-thinking
- nlp
- text-generation
- conversational
- instruct
- sft
- dpo
- grpo
- rlhf
- math
- logic
- scientific-reasoning
- efficient
- low-resource
- data-efficient
- from-scratch
- pretrained
- 0.6b
- nano-model
- small-model
- european-ai
- austria
- independent-research
- arxiv
- python
- coding
- step-by-step
- self-correction
- hallucination-reduction
- educational
- research
- benchmark
- thinking-mode
- mental-models
- deductive-reasoning
- analytical
- problem-solving
pipeline_tag: text-generation
library_name: transformers
datasets:
- wikipedia
- c4
- fineweb-edu
- arxiv
- stack-exchange
---

<div align="center">
  <img src="https://noeum.ai/wp-content/uploads/2025/11/noeum.png" alt="Noeum Logo" width="200"/>
  
  <h1>Noeum-1-Nano</h1>
  <p><em>A 0.6B MoE model trained entirely from scratch.</em></p>
  
  <p>
    <a href="https://noeum.ai">Website</a> •
    <a href="#performance">Benchmarks</a> •
    <a href="#quickstart">Quickstart</a> •
    <a href="#training">Training</a> •
    <a href="#about-noeum">About Noeum</a>
  </p>
</div>

---

##  Overview

**Noeum-1-Nano** is a nano-scale Mixture-of-Experts (MoE) model (0.6B total / 0.2B active) trained on only **18 billion tokens**. 

It has proven its efficiency and reasoning quality by matching the capabilities of major labs’ nano-class models, despite utilizing a fraction of the data. Built entirely from scratch—with no pretrained weights and no inherited shortcuts—this independent, self-funded effort demonstrates that innovative techniques and intelligent design can rival brute-force scale.

*   **Data Efficiency:** Achieves competitive reasoning with **20x to 667x less data** than standard models like Qwen2 or TinyLlama.
*   **System 2 Reasoning:** Features a dedicated `<think>` mode for logic, math, and self-correction.

---

##  Performance & Benchmarks

The benchmarks below demonstrate Noeum-1-Nano achieving above-average performance despite an extreme disparity in training volume. While standard models typically require 2 Trillion to 12 Trillion tokens, Noeum achieves competitive results with just 18 billion high-signal tokens.

### Quantitative Benchmarks (lm-eval-harness)
### ALL benchmarks conducted with Noeum thinking mode DISABLED to ensure fair comparison




| Task | Metric | Noeum-1-Nano (0.6B) | Note |
|:-----|:-------|:-------------------:|:-----|
| **SciQ** | Accuracy | **77.5%** | *Exceptional scientific knowledge retrieval* |
| **MRPC** | F1 Score | **81.2%** | *Rank #1 vs comparable models on semantic equivalence* |
| **BoolQ** | Accuracy | **62.0%** | *Strong yes/no reasoning on complex text* |
| **PIQA** | Accuracy | **62.9%** | *Physical interaction reasoning* |
| **ARC-Easy**| Accuracy | 47.1% | |


***

###  Internal Evaluation & Best Practices

Based on our internal automated benchmarks (100-question comparative deep dive), **Noeum-1-Nano** performs exceptionally well on specific task types when the reasoning engine is properly configured.

*   **Scientific Fact Retrieval:** The model demonstrates high retention of constants and definitions (Physics, Biology).
*   **Step-by-Step Word Problems:** Unlike standard small models which guess numbers, Noeum successfully sets up equations (e.g., $Distance = Speed \times Time$).
*   **Logical Deduction:** It correctly handles transitive logic puzzles (e.g., *If A > B and B > C, who is tallest?*).

**⚠ Critical Configuration:**
These results are conditional on specific generation parameters. Our tests confirm that a **Thinking Budget of 128 tokens** combined with a **Temperature of 0.1** is the "sweet spot." Lower budgets cut off reasoning prematurely, while higher temperatures introduce instability.

---

##  Dataset Composition

To achieve competitive performance with only **18 Billion tokens**, we prioritized data density over volume. We curated a "high-signal" mixture designed to maximize reasoning density per token.

The pre-training mixture includes:
*   **Academic & Reasoning:** arXiv papers (math/cs subsets), portions of **CC-Math-Finest**, and curated math datasets.
*   **Coding:** High-quality **Python** repositories and **StackExchange** discussions.
*   **General Knowledge:** **Wikipedia** (specifically filtered for long-context articles >2k tokens), **C4**, and **FineWeb-Edu** (High quality subset).
*   **Synthetic Data:** Custom-generated synthetic reasoning traces designed to bootstrap the model's cognitive capabilities, including the ability to engage in deliberative reasoning before responding, explore contradictory perspectives, apply first-principles analysis, generate divergent solutions, and employ lateral thinking strategies."* 


### Tiny model but with Thinking option and impact of extra Reasoning (A/B Test)

Noeum-1-Nano features a specific **Thinking Mode**. When enabled (temp=0.1), the model engages a hidden chain-of-thought process that grounds facts and solves multi-step problems.

#### 1. Hallucination Correction
*Standard generation guesses; reasoning verifies.*

**User:** "What is the capital of Spain?"

| Mode | Output | Verdict            |
|:---|:---|:-------------------|
| **Standard** | "La Muerte is the capital of Spain" |  **Hallucination** |
| **Reasoning** | `<think>` The capital of Spain is Madrid. It is known for its rich history... `</think>` <br> **"Madrid is the capital of Spain."** | ✅ **Correct**      |

#### 2. Mathematical Logic
*Standard generation struggles with arithmetic; reasoning sets up equations.*

**User:** "If a train travels 60 km in 1 hour, how far in 3 hours?"

| Mode | Output | Verdict             |
|:---|:---|:--------------------|
| **Standard** | "Therefore, the distance traveled by the train is 60 kilometers." |  **Repeated Input** |
| **Reasoning** | `<think>` Distance = Speed × Time. <br> 60 km × 3 hours = 180 km `</think>` <br> **"So, the train travels 180 kilometers in 3 hours."** | ✅ **Correct**       |

---

## Architecture & Configuration

| Component | Specification |
|-----------|---------------|
| **Type** | Mixture-of-Experts (MoE) |
| **Total Params** | 0.6B |
| **Active Params** | ~0.2B |
| **Experts** | 8 Routed, 1 Shared (Top-2 Active) |
| **Layers** | 24 |
| **Attention** | 12 Heads (GQA), 768 Hidden Dim |
| **Context** | 2048 Tokens (RoPE + YaRN) |

---

## 🛠️ Training Stack

This model was not fine-tuned from an existing checkpoint. It was built from the ground up to test the efficiency of our custom stack.

1. **Pre-training:** Two-phase training (512 ctx $\to$ 2048 ctx) on high-signal data.
2. **Post-Training:** 
    *   **SFT:** Supervised Fine-Tuning for instruction following.
    *   **GRPO:** Group Relative Policy Optimization for reasoning capabilities.
    *   **DPO:** Direct Preference Optimization for alignment.
3. **Hardware:** Trained efficiently on **8x NVIDIA RTX 5090s**.

---

## Quickstart

### Chat Format
The model supports two distinct modes via system prompts or flags:

*   **`/think`**: Activates System 2 reasoning (Recommended for logic/math).
*   **`/no think`**: Standard fast text generation.


## 🛠️ Advanced Usage: Full Benchmarking & Chat Script

To fully explore **Noeum-1-Nano**, we provide the complete all-in-one inference script used to generate the benchmarks above. This script is not just a chat interface; it is a comprehensive evaluation tool.

**Capabilities of this script:**
1.  **Interactive Reasoning Chat:** Talk to the model with streaming output. Toggle "Thinking Mode" on/off dynamically using commands like `/think on` or `/think off`.
2.  **Deep Dive Analysis:** Select Option 2 to run a single prompt through multiple configurations (Temperature 0.1 vs 0.7, Budget 32 vs 256) simultaneously to see how the model's logic changes.
3.  **Automated Benchmarking:** Select Option 3 to run the full internal test suite (Math, Logic, History, Science) and generate A/B comparison logs.

**How to use:**
Save the code below as `run_noeum.py` and execute it. It handles token streaming, template application, and logging automatically.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
import sys
from datetime import datetime


class TeeLogger:

    def __init__(self, filename):
        self.terminal = sys.stdout
        self.log = open(filename, 'w', encoding='utf-8')

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
        self.log.flush()

    def flush(self):
        self.terminal.flush()
        self.log.flush()

    def close(self):
        self.log.close()


# Start logging
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_filename = f"benchmark_results_{timestamp}.txt"
logger = TeeLogger(log_filename)
sys.stdout = logger

print(f"Logging started: {log_filename}")
print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# ============================================================================
# MODEL SETUP
# ============================================================================
MODEL_PATH = "./Noeum-0.6B-hf-nano"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"\nLoading model from {MODEL_PATH} on {DEVICE}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True).to(DEVICE)
model.eval()

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

EOS_ID = tokenizer.eos_token_id


def streaming_generate(model, prompt_ids, max_new_tokens=512, temperature=0.7, top_p=0.9, device="cuda"):
    input_ids = prompt_ids.to(device)

    for _ in range(max_new_tokens):
        with torch.inference_mode():
            outputs = model(input_ids)
            logits = outputs.logits[:, -1, :]

            # Greedy decoding
            if temperature <= 0:
                next_token = torch.argmax(logits, dim=-1, keepdim=True)
            else:
                logits = logits / temperature

                # Top-p (nucleus sampling)
                if top_p < 1.0:
                    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                    sorted_probs = F.softmax(sorted_logits, dim=-1)
                    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

                    # Remove tokens with cumulative probability above threshold
                    sorted_indices_to_remove = cumulative_probs > top_p
                    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                    sorted_indices_to_remove[..., 0] = 0

                    # Set removed tokens to -inf
                    sorted_logits[sorted_indices_to_remove] = -float("inf")
                    # Scatter back to original positions
                    logits = torch.zeros_like(logits).scatter(1, sorted_indices, sorted_logits)

                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)

        # Decode token
        tok_id = int(next_token.item())
        token_text = tokenizer.decode([tok_id], skip_special_tokens=False)
        yield token_text

        # Stop if EOS token
        if tok_id == EOS_ID:
            break

        # Append to input for next iteration
        input_ids = torch.cat([input_ids, next_token], dim=-1)


def chat(
        question: str,
        # chat_history argument removed/ignored
        thinking: bool = True,
        think_budget: int = 128,
        temperature: float = 0.1,
        top_p: float = 0.9,
        system_prompt: str = "You are a helpful assistant.",
        verbose: bool = True
):
    """
    Main chat function with streaming support.
    MEMORY DISABLED: Each call is a fresh context.
    """

    # Build conversation - STRICTLY SYSTEM + CURRENT QUESTION
    messages = [{'role': 'system', 'content': system_prompt}]

    # Add current question with /think flag if enabled
    user_message = f"{question} /think" if thinking else question
    messages.append({'role': 'user', 'content': user_message})

    # Apply chat template
    prompt_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    prompt_ids = tokenizer.encode(prompt_text, return_tensors='pt')

    # Generate
    thinking_content = ""
    answer_content = ""
    current_mode = None

    generator = streaming_generate(
        model=model,
        prompt_ids=prompt_ids,
        max_new_tokens=think_budget + 256 if thinking else 512,
        temperature=temperature,
        top_p=top_p,
        device=DEVICE
    )

    if verbose:
        print("\n" + "=" * 80)

    for chunk in generator:
        if chunk == '</s>':
            break

        # Track mode switches
        if chunk == '<think>':
            current_mode = 'thinking'
            if verbose:
                print("\n💭 THINKING:")
            continue
        elif chunk == '</think>':
            current_mode = None
            continue
        elif chunk == '<answer>':
            current_mode = 'answer'
            if verbose:
                print("\n✅ ANSWER:")
            continue
        elif chunk == '</answer>':
            current_mode = None
            continue

        # Accumulate content
        if current_mode == 'thinking':
            thinking_content += chunk
            if verbose:
                print(chunk, end='', flush=True)
        elif current_mode == 'answer':
            answer_content += chunk
            if verbose:
                print(chunk, end='', flush=True)

    if verbose:
        print("\n" + "=" * 80)

    return {
        'thinking': thinking_content.strip(),
        'answer': answer_content.strip(),
        'full_thinking': thinking_content.strip(),
        'full_answer': answer_content.strip()
    }


# ============================================================================
# BENCHMARK FUNCTIONS
# ============================================================================
def benchmark_single_question(question: str, temperatures=[0.1, 0.3, 0.7], budgets=[32, 128, 256]):
    """
    Run a single question through all configurations - STREAMING
    """
    print("\n" + "=" * 100)
    print(f"QUESTION: {question}")
    print("=" * 100)

    # NO THINK
    print("\n🚫 NO THINK MODE (temperature=0.7)")
    print("-" * 100)
    result = chat(question, thinking=False, temperature=0.7, verbose=True)

    # THINK MODE - Different temperatures
    for temp in temperatures:
        print(f"\n💭 THINK MODE - Temperature: {temp}, Budget: 128")
        print("-" * 100)
        result = chat(question, thinking=True, think_budget=128, temperature=temp, verbose=True)

    # THINK MODE - Different budgets (at temp=0.7)
    for budget in budgets:
        print(f"\n💭 THINK MODE - Temperature: 0.7, Budget: {budget}")
        print("-" * 100)
        result = chat(question, thinking=True, think_budget=budget, temperature=0.7, verbose=True)


def benchmark_all_questions(questions: list, config: dict):
    results = []

    print("\n" + "=" * 100)
    print(f"BENCHMARK: {config}")
    print("=" * 100)

    for i, q in enumerate(questions, 1):
        print(f"\n[{i}/{len(questions)}] Q: {q}")
        result = chat(
            q,
            thinking=config.get('thinking', True),
            think_budget=config.get('think_budget', 128),
            temperature=config.get('temperature', 0.1),
            verbose=True
        )

        results.append({
            'question': q,
            'thinking': result['full_thinking'],
            'answer': result['full_answer']
        })

    return results


def compare_configurations(questions: list):
    configurations = [
        {'name': 'NO THINK', 'thinking': False, 'temperature': 0.7},
        {'name': 'THINK - Temp 0.1', 'thinking': True, 'think_budget': 128, 'temperature': 0.1},
        {'name': 'THINK - Temp 0.3', 'thinking': True, 'think_budget': 128, 'temperature': 0.3},
        {'name': 'THINK - Temp 0.7', 'thinking': True, 'think_budget': 128, 'temperature': 0.7},
        {'name': 'THINK - Budget 32', 'thinking': True, 'think_budget': 32, 'temperature': 0.7},
        {'name': 'THINK - Budget 256', 'thinking': True, 'think_budget': 256, 'temperature': 0.7},
    ]

    all_results = {}

    for config in configurations:
        name = config.pop('name')
        print(f"\n{'#' * 100}")
        print(f"RUNNING CONFIGURATION: {name}")
        print(f"{'#' * 100}")
        all_results[name] = benchmark_all_questions(questions, config)

    # Print FULL comparison table
    print("\n" + "=" * 100)
    print("COMPARISON SUMMARY - FULL OUTPUTS")
    print("=" * 100)

    for i, q in enumerate(questions):
        print(f"\n{'=' * 100}")
        print(f"Q{i + 1}: {q}")
        print(f"{'=' * 100}")

        for config_name, results in all_results.items():
            print(f"\n{config_name}:")
            print(f"  💭 THINKING: {results[i]['thinking']}")
            print(f"  ✅ ANSWER: {results[i]['answer']}")
            print("-" * 100)


# ============================================================================
# INTERACTIVE CHAT LOOP
# ============================================================================
def interactive_chat():
    """
    Interactive chat session in terminal - STATELESS (No Memory)
    """
    print("\n" + "=" * 80)
    print("INTERACTIVE CHAT SESSION (NO MEMORY)")
    print("=" * 80)
    print("Commands:")
    print("  /quit - Exit chat")
    print("  /think on/off - Toggle thinking mode")
    print("  /budget <number> - Set thinking budget (tokens)")
    print("  /temp <number> - Set temperature (0-1)")
    print("=" * 80 + "\n")

    # chat_history removed
    thinking_enabled = True
    think_budget = 128
    temperature = 0.1

    while True:
        try:
            user_input = input("\n👤 You: ").strip()

            if not user_input:
                continue

            # Handle commands
            if user_input == '/quit':
                print("Goodbye!")
                break
            elif user_input.startswith('/think'):
                parts = user_input.split()
                if len(parts) > 1:
                    thinking_enabled = parts[1].lower() == 'on'
                print(f"Thinking mode: {'ON' if thinking_enabled else 'OFF'}")
                continue
            elif user_input.startswith('/budget'):
                parts = user_input.split()
                if len(parts) > 1:
                    think_budget = int(parts[1])
                print(f"Thinking budget set to: {think_budget} tokens")
                continue
            elif user_input.startswith('/temp'):
                parts = user_input.split()
                if len(parts) > 1:
                    temperature = float(parts[1])
                print(f"Temperature set to: {temperature}")
                continue

            # /clear command removed as there is no memory

            # Get response - NO HISTORY PASSED
            result = chat(
                question=user_input,
                thinking=thinking_enabled,
                think_budget=think_budget,
                temperature=temperature,
                verbose=True
            )

            # Chat history update logic removed

        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break
        except Exception as e:
            print(f"\nError: {e}")


# ============================================================================
# MAIN - RUN BENCHMARKS
# ============================================================================
if __name__ == '__main__':
    try:
        # Test questions
        test_questions = {
            # CATEGORY 1: Simple Math (Addition/Subtraction)
            "Simple Math": [
                "What is 15 + 27?",
                "What is 100 - 37?",
                "What is 45 + 55?",
                "What is 82 - 19?",
                "What is 7 + 8?",
                "What is 50 - 25?",
                "What is 123 + 456?",
                "What is 200 - 88?",
                "What is 9 + 6?",
                "What is 75 - 30?"
            ],

            # CATEGORY 2: Multiplication
            "Multiplication": [
                "What is 8 × 7?",
                "What is 12 × 12?",
                "What is 9 × 6?",
                "What is 15 × 4?",
                "What is 7 × 9?",
                "What is 11 × 11?",
                "What is 6 × 8?",
                "What is 13 × 5?",
                "What is 25 × 4?",
                "What is 20 × 3?"
            ],

            # CATEGORY 3: Division & Fractions
            "Division & Fractions": [
                "What is 56 ÷ 8?",
                "Which is larger: 1/2 or 1/3?",
                "What is 100 ÷ 4?",
                "Which is larger: 2/3 or 3/4?",
                "What is 81 ÷ 9?",
                "Which is larger: 1/4 or 1/5?",
                "What is 144 ÷ 12?",
                "What is 1/2 + 1/4?",
                "What is 50 ÷ 2?",
                "Which is larger: 3/5 or 2/5?"
            ],

            # CATEGORY 4: Word Problems
            "Word Problems": [
                "If a train travels 60 km in 1 hour, how far in 3 hours?",
                "If John has 5 apples and gives 2 to Mary, how many does he have left?",
                "If a book costs $12 and you buy 3 books, how much do you spend?",
                "If there are 24 students and 6 tables, how many students per table?",
                "If a car uses 8 liters per 100km, how much for 300km?",
                "If you earn $15 per hour and work 8 hours, how much do you earn?",
                "If a pizza has 8 slices and 4 people share equally, how many slices each?",
                "If a pen costs $2 and you have $20, how many pens can you buy?",
                "If a movie is 2 hours long and starts at 3pm, when does it end?",
                "If you save $10 per week, how much in 10 weeks?"
            ],

            # CATEGORY 5: Prime Numbers & Math Concepts
            "Math Concepts": [
                "Is 17 a prime number?",
                "Is 20 a prime number?",
                "Is 13 a prime number?",
                "What is the square root of 64?",
                "What is 5 squared (5²)?",
                "Is 1 a prime number?",
                "What is 10% of 100?",
                "Is 2 the only even prime number?",
                "What is the square root of 49?",
                "What is 3 cubed (3³)?"
            ],

            # CATEGORY 6: History
            "History": [
                "Who wrote Romeo and Juliet?",
                "Who was the first president of the United States?",
                "In what year did World War 2 end?",
                "Who discovered America in 1492?",
                "Who painted the Mona Lisa?",
                "What year did the Titanic sink?",
                "Who was the first man on the moon?",
                "In what year did World War 1 start?",
                "Who wrote the Declaration of Independence?",
                "Who was Julius Caesar?"
            ],

            # CATEGORY 7: Geography
            "Geography": [
                "What is the capital of France?",
                "Which is bigger: Russia or Canada?",
                "What is the capital of Italy?",
                "Is the Nile or Amazon river longer?",
                "Which ocean is largest: Atlantic or Pacific?",
                "What is the capital of Japan?",
                "Is Australia a continent?",
                "What is the tallest mountain in the world?",
                "How many continents are there?",
                "What is the capital of Spain?"
            ],

            # CATEGORY 8: Science & Nature
            "Science & Nature": [
                "Does the Sun orbit the Earth?",
                "What gas do plants produce during photosynthesis?",
                "Is iron magnetic?",
                "What is H2O?",
                "Does ice float in water?",
                "How many legs does a spider have?",
                "How many legs does an ant have?",
                "What is the speed of light approximately?",
                "Is the Earth flat or round?",
                "What planet is closest to the Sun?"
            ],

            # CATEGORY 9: Logical Reasoning
            "Logical Reasoning": [
                "If all cats are animals, and Fluffy is a cat, is Fluffy an animal?",
                "If today is Monday, what day is tomorrow?",
                "If you have 3 red balls and 2 blue balls, how many balls total?",
                "Complete the pattern: 2, 4, 6, 8, ?",
                "Which is the odd one out: apple, banana, car, orange?",
                "If A is taller than B, and B is taller than C, who is tallest?",
                "True or False: All birds can fly?",
                "If it's raining, the ground is wet. The ground is wet. Is it raining?",
                "Complete the pattern: Monday, Tuesday, Wednesday, ?",
                "If 5 > 3 and 3 > 1, is 5 > 1?"
            ],

            # CATEGORY 10: General Knowledge
            "General Knowledge": [
                "How many days are in a week?",
                "Which is heavier: gold or aluminum?",
                "Is the Eiffel Tower in London?",
                "Is Bitcoin a cryptocurrency?",
                "How many hours are in a day?",
                "What color is the sky on a clear day?",
                "How many months are in a year?",
                "What is the freezing point of water in Celsius?",
                "How many wheels does a bicycle have?",
                "What is the boiling point of water in Celsius?"
            ]
        }

        # Flatten all questions for quick testing
        all_questions = []
        for category, questions in test_questions.items():
            all_questions.extend(questions)

        print(f"\nTotal questions: {len(all_questions)}")
        print(f"Categories: {len(test_questions)}")
        print(f"Questions per category: {len(test_questions['Simple Math'])}")

        # Choose what to run
        print("\n" + "=" * 80)
        print("SELECT BENCHMARK MODE")
        print("=" * 80)
        print("1. Simple test (original 3 questions with default settings)")
        print("2. Deep dive single question (all configs)")
        print("3. Compare all configurations (by category) - FULL OUTPUTS")
        print("4. Interactive chat (NO MEMORY)")
        print("=" * 80)

        choice = input("\nEnter choice (1-4): ").strip()

        if choice == '1':
            # Simple test
            print("\n" + "=" * 80)
            print("SIMPLE TEST")
            print("=" * 80)

            for q in list(test_questions.values())[0][:3]:
                print(f"\nQ: {q}")
                result = chat(q, thinking=True, think_budget=128, temperature=0.7, verbose=True)

        elif choice == '2':
            # Deep dive on single question
            question = input("\nEnter question (or press Enter for '15 + 27'): ").strip()
            if not question:
                question = "What is 15 + 27?"

            benchmark_single_question(question)

        elif choice == '3':
            # Test by category
            print("\nSelect category to test:")
            categories = list(test_questions.keys())
            for i, cat in enumerate(categories, 1):
                print(f"{i}. {cat}")
            print(f"{len(categories) + 1}. ALL CATEGORIES (100 questions)")

            cat_choice = input("\nEnter choice: ").strip()

            if cat_choice == str(len(categories) + 1):
                # Test all
                compare_configurations(all_questions)
            else:
                # Test specific category
                cat_idx = int(cat_choice) - 1
                cat_name = categories[cat_idx]
                print(f"\nTesting category: {cat_name}")
                compare_configurations(test_questions[cat_name])

        elif choice == '4':
            # Interactive chat
            interactive_chat()

        else:
            print("Invalid choice. Starting interactive chat...")
            interactive_chat()

    finally:
        print(f"\n\nLogging completed. Results saved to: {log_filename}")
        logger.close()
        sys.stdout = logger.terminal 

```

---

##  Limitations & Bias

While Noeum-1-Nano demonstrates impressive reasoning for its size, users should be aware of the following:
*   **Hallucinations:** Like all small models, it can generate plausible but incorrect information, especially when the `<think>` mode is disabled.
*   **Arithmetic:** While it can derive formulas correctly, it may struggle with calculating large numbers precisely.
*   **Scope:** The model is optimized for English and general reasoning. It is not intended for medical, legal, or safety-critical advice.

---

## About Noeum

<div align="center">
  <img src="https://noeum.ai/wp-content/uploads/2025/11/noeum.png" alt="Noeum" width="100"/>
</div>

**Noeum** is an independent AI research & engineering lab based in Austria, building the next generation of intelligent systems. We are one of the few labs in Europe executing the full AI pipeline—from  pre-training and alignment—entirely in-house.

***

###  The Vision & Future Roadmap

This project, spearheaded by **[Bledar Ramo](https://www.linkedin.com/in/ramobledar)**, is not just a nano-model—it is a validation of a high-efficiency scaling hypothesis. We have proven that rapid iteration on small-scale "proxy" models is a reliable predictor of large-scale performance, allowing us to innovate faster than labs burdened by massive training runs.

**Our Core Philosophy:** *Iterate fast at nano-scale; scale only what works.*

With the right compute infrastructure and backing, we plan to scale these validated recipes to a **1 Trillion+ token** frontier model. Our roadmap includes integrating cutting-edge techniques inspired by our internal research and recent literature:

*   **Recursive Reasoning Architectures:** Moving beyond static Chain-of-Thought to **Recursive Language Models (RLMs)** that treat prompts as dynamic environments, solving problems far exceeding standard context windows
*   **Agentic Data Synthesis:** Implementing large-scale, self-correcting synthetic data pipelines that simulate real-world tool use and multi-step reasoning
*   **Stability at Scale:** Utilizing advanced optimization techniques like **MuonClip** (QK-Norm/Clip) to ensure stability during massive training runs without loss spikes.
*   **Hyper-Efficient Architectures:** Further refining our MoE routing and **Multi-head Latent Attention (MLA)** to maximize active parameter efficiency.

**Noeum** (derived from *"mind," "meaning,"* and *"thought"*) is building the next generation of genuine reasoning systems—not by brute-force, but by architectural intelligence.

***

🌐 **Website:** [noeum.ai](https://noeum.ai)
📧 **Contact:** contact@noeum.ai