---
license: apache-2.0
language:
  - ko
  - en
tags:
  - moe
  - mixture-of-experts
  - custom
  - aether
  - latent-thought
  - multi-token-prediction
library_name: transformers
pipeline_tag: text-generation
---

# AETHER-Micro 0.5B (Phase 1 Checkpoint)

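AETHER-Micro is an experimental mixture-of-experts (MoE) language model with a Korean, English, and code vocabulary. It has 2.08B total parameters, of which roughly 0.5B are active per token.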

## Model Details

| Item | Value |
|------|-------|
| Architecture | MoE big.LITTLE + LTL + MTP |
| Total Parameters | 2.08B |
| Active Parameters | ~0.5B per token |
| Hidden Size | 1024 |
| Layers | 24 |
| Attention | GQA 16 heads, 4 KV heads |
| Experts | 5 Big + 15 Small + 2 Shared |
| Vocab Size | 64,000 (Korean + English + Code) |
| Context Length | 8,192 tokens (RoPE) |
| Training Step | 57,000 / 100,000 |
| Training Loss | ~3.54 |
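
A few derived quantities follow from the table. The sketch below assumes that the per-head dimension equals the hidden size divided by the number of attention heads, and that every layer keeps a standard key/value cache. Neither is stated in this card, so treat the results as rough estimates.

```python
# Values copied from the Model Details table.
hidden_size = 1024
num_layers = 24
num_query_heads = 16
num_kv_heads = 4
context_length = 8192
total_params = 2.08e9
active_params = 0.5e9  # approximate, per token

# Assumption: head_dim = hidden_size / num_query_heads (not stated in the card).
head_dim = hidden_size // num_query_heads
# GQA: each KV head is shared by this many query heads.
queries_per_kv_head = num_query_heads // num_kv_heads
# Fraction of the weights used for any single token.
active_fraction = active_params / total_params

# Assumption: every layer caches one key and one value vector per KV head.
kv_values_per_token = 2 * num_layers * num_kv_heads * head_dim
kv_cache_mib_fp32 = kv_values_per_token * context_length * 4 / 2**20

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}")
print(f"active fraction={active_fraction:.1%}")
print(f"KV cache at full context (FP32)={kv_cache_mib_fp32:.0f} MiB")
```

Under these assumptions, GQA with 4 KV heads stores a quarter of the key/value entries that 16 full KV heads would need.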

## Architecture Features

- **big.LITTLE MoE**: 5 large experts (2048 intermediate) + 15 small experts (1024 intermediate) + 2 shared experts (always active)
- **Latent Thought Layer (LTL)**: K-step latent reasoning (K = 0, 1, or 2) with Gumbel-Softmax selection
- **Multi-Token Prediction (MTP)**: predicts 4 tokens ahead, replacing the standard next-token prediction (NTP) loss
- **Wu-Xing Router**: Five-element inspired expert routing
- **Quality Head**: 4-dimensional quality assessment

## Training

- **Phase**: 1 of 3 (step 57,000 of 100,000, 57% complete)
- **Data**: 13.1B tokens (Korean 22%, English 25%, Code 21%, Math 24%, Dialogue 8%)
- **Optimizer**: AdamW (lr=1e-4, cosine decay)
- **Precision**: FP32
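
The training figures above can be turned into per-domain token counts and a learning-rate curve. The sketch below is illustrative only: it assumes the cosine schedule spans the full 100,000 steps, with no warmup and a final learning rate of zero, none of which this card confirms. The names `tokens_per_domain` and `cosine_lr` are hypothetical helpers.

```python
import math

TOTAL_TOKENS = 13.1e9
MIX = {"Korean": 0.22, "English": 0.25, "Code": 0.21, "Math": 0.24, "Dialogue": 0.08}


def tokens_per_domain(total=TOTAL_TOKENS, mix=MIX):
    """Split the total token budget according to the stated data mix."""
    return {name: share * total for name, share in mix.items()}


def cosine_lr(step, peak_lr=1e-4, total_steps=100_000, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr (assumed: no warmup, min_lr = 0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for name, count in tokens_per_domain().items():
    print(f"{name}: {count / 1e9:.3f}B tokens")

print(f"assumed learning rate at step 57,000: {cosine_lr(57_000):.2e}")
```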

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Be2Jay/AETHER-Micro-0.5B",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Be2Jay/AETHER-Micro-0.5B")
```

> **Note**: This is a Phase 1 training checkpoint. The model is still in early training and not yet suitable for production use.

## License

Apache 2.0