Cognica-PoE-v1.0-1.5B-base
cognica/Cognica-PoE-v1.0-1.5B-base is a 1.5B parameter base language model trained with Cognica's Product-of-Experts (PoE) residual Log-OP local-learning objective. It is a base model, not an instruction-tuned assistant.
This release is intended to expose the PoE local-learning and inference surface, not just a single dense generate() path. The checkpoint contains a transformer trunk plus one additive prediction head per PoE stage. Each stage is trained as a locally predictive expert and the stage distributions are combined with Log-OP, so the same trained weights can be used as a full model, a prefix-pruned model, a single-stage drafter, an adaptive-depth WAND model, or a Log-OP composition of selected stages.
Architecture
- Parameters: 1,507,591,082
- Layers: 24
- Hidden size: 1536
- Attention heads: 12 query heads, 6 KV heads, head dim 128
- Context length: 2048
- Tokenizer: 48K multilingual nanochat/tiktoken BPE
- Attention pattern: SSSL
- PoE stage layout: [12, 5, 4, 3]
- Stage boundary layers: [11, 16, 20, 23]
- Stage heads: lm_head_stages[0..3]
- Router: disabled
- Training objective: residual_logop
- Inference combiner: geometric-mean Log-OP, poe_alpha=0.0
The stage layout means:
| stage | layers evaluated | boundary | intended role |
|---|---|---|---|
| s0 | 0-11 | 11 | base predictor / cheap drafter |
| s1 | 0-16 | 16 | first verification/refinement stage |
| s2 | 0-20 | 20 | second verification/refinement stage |
| s3 | 0-23 | 23 | final full-depth stage |
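The boundary column follows from the stage sizes by a running sum. A minimal sketch of that arithmetic; the helper name `stage_boundaries` is illustrative and not part of the model's API:

```python
from itertools import accumulate


def stage_boundaries(stage_sizes):
    """Return the 0-based index of the last layer in each stage."""
    return [end - 1 for end in accumulate(stage_sizes)]


layout = [12, 5, 4, 3]
print(stage_boundaries(layout))  # [11, 16, 20, 23]
```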
Each stage is a complete next-token predictor:
logits_k = shared_lm_head(norm(h_k)) + stage_lm_head_k(norm(h_k))
p_k = softmax(logits_k)
Full PoE inference combines the stage distributions in log space:
score_full = (1 / K) * sum_k log p_k
token = argmax(score_full)
For ranking and generation, this is the shrinkage-neutral geometric mean of the experts. score_full is an unnormalized log-score, so for BPB or calibrated probability reporting, renormalize it by subtracting logsumexp(score_full) over the vocabulary.
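The combination rule can be sketched in plain Python on toy logits. This is an illustration of the formulas in this section, not the model's implementation; the function names and the toy values are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def logop_scores(stage_logits):
    """Geometric-mean Log-OP: average the per-stage log-probabilities."""
    stage_logp = [log_softmax(row) for row in stage_logits]
    k = len(stage_logp)
    vocab = len(stage_logp[0])
    return [sum(lp[v] for lp in stage_logp) / k for v in range(vocab)]


def renormalize(scores):
    """Turn raw Log-OP scores into log-probabilities via logsumexp."""
    return log_softmax(scores)


# Toy example: 2 stages, vocabulary of 3 tokens (values are made up).
stage_logits = [[2.0, 1.0, 0.0], [1.5, 1.5, 0.0]]
scores = logop_scores(stage_logits)
token = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
log_probs = renormalize(scores)  # use these for BPB / calibrated reporting
```

The raw scores are enough for argmax, since subtracting a constant does not change the ranking. Their exponentials sum to less than 1 whenever the stages disagree, which is why probability reporting needs the renormalized values.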
Local Learning
The model is not trained only by a final-layer next-token loss. It uses a stage-partitioned local-learning setup:
- the 24-layer trunk is split into four PoE stages: [12, 5, 4, 3];
- every stage boundary has its own additive prediction head;
- each stage learns to be a usable local next-token predictor;
- the training objective is residual Log-OP, so later stages learn residual evidence relative to the product of earlier experts;
- inference can use any prefix of stages or the full geometric-mean product.
This is the local-learning interpretation of the checkpoint: learning pressure is exposed at intermediate stage boundaries instead of being applied only through the final transformer block.
Training Data
This model was trained on the Frontier V2 multilingual pretraining mix plus a train-only dialog overlay. The v2 dataset was rebuilt after finding that the earlier frontier mix had much lower CJK coverage than intended; this release uses the corrected multilingual data and a 48K tokenizer trained for that corrected mix.
The base multilingual mix provides the validation split. The dialog overlay is used only for training, so the reported validation metrics are anchored to the base Frontier V2 multilingual distribution rather than to the added dialog overlay.
Base mix recipe:
| Source | Target share |
|---|---|
| FineWeb-Edu | 28.0% |
| DCLM-Baseline | 20.0% |
| Stack/code | 15.0% |
| ProofPile | 4.0% |
| OpenWebMath | 4.0% |
| Wikipedia EN | 5.0% |
| CulturaX Korean | 4.0% |
| CulturaX Chinese | 2.5% |
| CulturaX Japanese | 2.5% |
| CulturaX Spanish | 1.5% |
| CulturaX French | 1.5% |
| Gutenberg | 4.0% |
| PG-19 | 2.0% |
| UltraChat | 2.0% |
| OpenHermes | 4.0% |
Total explicit CulturaX multilingual share is 12.0%, with Korean intentionally the largest non-English component.
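The shares in the table can be checked mechanically, and the same dictionary can drive a share-proportional source sampler. A minimal sketch; `sample_source` is a hypothetical helper and says nothing about how the actual data loader mixes sources.

```python
import random

# Target shares (percent) copied from the base mix table above.
BASE_MIX = {
    "FineWeb-Edu": 28.0,
    "DCLM-Baseline": 20.0,
    "Stack/code": 15.0,
    "ProofPile": 4.0,
    "OpenWebMath": 4.0,
    "Wikipedia EN": 5.0,
    "CulturaX Korean": 4.0,
    "CulturaX Chinese": 2.5,
    "CulturaX Japanese": 2.5,
    "CulturaX Spanish": 1.5,
    "CulturaX French": 1.5,
    "Gutenberg": 4.0,
    "PG-19": 2.0,
    "UltraChat": 2.0,
    "OpenHermes": 4.0,
}


def culturax_share(mix):
    """Sum the shares of all CulturaX entries."""
    return sum(share for name, share in mix.items() if name.startswith("CulturaX"))


def sample_source(mix, rng):
    """Pick one source name with probability proportional to its target share."""
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=1)[0]


rng = random.Random(0)
print(sum(BASE_MIX.values()), culturax_share(BASE_MIX), sample_source(BASE_MIX, rng))
```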
Dialog overlay:
| Source | Use |
|---|---|
| Open-Orca/OpenOrca | train-only dialog/instruction overlay |
| Open-Orca/SlimOrca | fallback overlay source if needed |
Training data settings:
| Setting | Value |
|---|---|
| Training steps | 60,000 |
| Global tokens per step | 1,048,576 |
| Approximate token budget | 62.91B tokens |
| Context length | 2,048 |
| Case augmentation | 0.15 probability per document |
| Validation cadence | every 1,000 steps |
| Validation tokens | 4,194,304 |
Case augmentation lowercases or uppercases sampled documents during training to improve robustness to case-shifted prompts.
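A per-document case augmentation step might look like the sketch below. The 0.15 probability comes from the table above; the even split between lowercasing and uppercasing, and the choice to transform the whole document, are assumptions, and `case_augment` is a hypothetical name.

```python
import random


def case_augment(doc, rng, p=0.15):
    """With probability p, return the document fully lowercased or uppercased."""
    if rng.random() >= p:
        return doc  # most documents pass through unchanged
    # Assumed 50/50 split between the two case shifts.
    return doc.lower() if rng.random() < 0.5 else doc.upper()


rng = random.Random(0)
samples = [case_augment("Hello World", rng) for _ in range(10)]
print(samples)
```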
Loading
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "cognica/Cognica-PoE-v1.0-1.5B-base"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
ids = tok("대한민국의 수도는", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=32, do_sample=True, temperature=0.8, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))
The custom tokenizer prepends <|bos|> by default when add_special_tokens=True. This matters because the checkpoint was trained and validated with BOS-prepended prompts.
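When token ids come from outside the bundled tokenizer (for example, a cached pre-tokenized dataset), it may be worth enforcing the BOS convention explicitly. A minimal sketch; `ensure_bos` and the `BOS_ID` value are hypothetical, and the real id has to be looked up from the tokenizer.

```python
def ensure_bos(token_ids, bos_id):
    """Return token_ids with bos_id at position 0, without duplicating it."""
    ids = list(token_ids)
    if ids and ids[0] == bos_id:
        return ids
    return [bos_id] + ids


BOS_ID = 0  # placeholder value (assumption); use the id of <|bos|> from the tokenizer

print(ensure_bos([17, 42], BOS_ID))
print(ensure_bos([BOS_ID, 17, 42], BOS_ID))
```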
Inference Modes
1. Full PoE
This is the default forward() / generate() path. It evaluates all stage boundaries and combines all four stage distributions by Log-OP.
ids = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=32, do_sample=False)
Use this when quality matters more than latency. It is the mode used for the final validation metrics below.
2. Prefix-Pruned Cumulative PoE
Use only the first k stages and aggregate them as a smaller PoE:
out_s0 = model.generate_prefix(ids.input_ids, max_stages=1, max_new_tokens=32)
out_s0_s1 = model.generate_prefix(ids.input_ids, max_stages=2, max_new_tokens=32)
out_s0_s2 = model.generate_prefix(ids.input_ids, max_stages=3, max_new_tokens=32)
out_full = model.generate_prefix(ids.input_ids, max_stages=4, max_new_tokens=32)
This directly supports the s0 -> s1 -> s2 -> s3 progression:
| max_stages | active experts | evaluated layers | approximate trunk compute |
|---|---|---|---|
| 1 | s0 | 12 / 24 | 50.0% |
| 2 | s0 + s1 | 17 / 24 | 70.8% |
| 3 | s0 + s1 + s2 | 21 / 24 | 87.5% |
| 4 | s0 + s1 + s2 + s3 | 24 / 24 | 100.0% |
This is the main prefix-pruning mode: stop early when a task can be answered by the earlier experts, keep going when more verification depth is needed.
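The compute column is the share of trunk layers evaluated, so it can be used to choose `max_stages` under a latency budget. A rough sketch that counts layers only and ignores the cost of the prediction heads; both helper names are hypothetical.

```python
def trunk_fraction(stage_sizes, max_stages):
    """Fraction of trunk layers evaluated when using the first max_stages stages."""
    if not 1 <= max_stages <= len(stage_sizes):
        raise ValueError("max_stages out of range")
    return sum(stage_sizes[:max_stages]) / sum(stage_sizes)


def stages_within_budget(stage_sizes, budget):
    """Largest max_stages whose trunk fraction fits the budget (0 if none fits)."""
    best = 0
    for k in range(1, len(stage_sizes) + 1):
        if trunk_fraction(stage_sizes, k) <= budget:
            best = k
    return best


layout = [12, 5, 4, 3]
for k in range(1, 5):
    print(k, f"{trunk_fraction(layout, k):.1%}")
```

For example, `stages_within_budget(layout, 0.75)` selects two stages, which could then be passed as `max_stages=2` to `generate_prefix`.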
3. Single-Stage Prediction
Use one stage endpoint as an independent predictor:
logits_s0 = model.forward_stage(ids.input_ids, stage=0)
logits_s3 = model.forward_stage(ids.input_ids, stage=3)
out = model.generate_stage(ids.input_ids, stage=0, max_new_tokens=32)
This is useful for probing stage specialization, using s0 as a cheap draft model, or measuring how much each stage changes the answer. stage=i evaluates the trunk up to that stage boundary and applies that stage's own additive head.
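One simple way to measure how much a later stage changes the answer is top-1 agreement between two stages' logits. A minimal sketch on toy nested lists; the `[position][vocab]` layout is an assumption about how the stage logits would be arranged after conversion from tensors.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(logits_a, logits_b):
    """Fraction of positions where two stages pick the same top-1 token."""
    if len(logits_a) != len(logits_b) or not logits_a:
        raise ValueError("inputs must be non-empty and the same length")
    matches = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return matches / len(logits_a)


# Toy logits for 4 positions over a 3-token vocabulary (values are made up).
stage0 = [[2.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.3, 0.2, 0.9], [1.0, 0.9, 0.0]]
stage3 = [[2.5, 0.0, 0.0], [0.0, 1.5, 0.1], [0.1, 0.2, 1.2], [0.8, 1.1, 0.0]]
print(top1_agreement(stage0, stage3))
```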
4. WAND Adaptive Depth
WAND mode evaluates stages incrementally and exits early when the current top-1 margin is larger than a calibrated upper bound on what remaining stages can change:
out, stages_used = model.generate_wand(
    ids.input_ids,
    max_new_tokens=32,
    safety=1.0,
    return_stages_used=True,
)
Interpretation:
- If s0 is already decisive, emit from s0.
- If not, consult s1, then s2, then s3.
- stages_used records which stage emitted each generated token.
- Higher safety is more conservative.
- For strict deployment, calibrate p99_bounds on the target validation distribution and pass them explicitly.
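The exit rule described above can be sketched as a margin test per stage. This is an interpretation, not the model's implementation: it assumes the margin is taken on cumulative log-space scores, that `bounds[k]` caps how much the stages after `k` can change that margin, and that `safety` multiplies the bound. All names and numbers are illustrative.

```python
def top1_margin(scores):
    """Gap between the best and second-best score."""
    top, second = sorted(scores, reverse=True)[:2]
    return top - second


def wand_exit_stage(cumulative_scores, bounds, safety=1.0):
    """Return the first stage whose margin exceeds safety * bound, else the last stage."""
    if not cumulative_scores:
        raise ValueError("need at least one stage")
    last = len(cumulative_scores) - 1
    for k, scores in enumerate(cumulative_scores):
        if k == last:
            return k  # nothing left to consult
        if top1_margin(scores) > safety * bounds[k]:
            return k  # remaining stages cannot flip the top-1 token
    return last


# Toy cumulative scores after stages s0..s3 and made-up bounds for s0..s2.
scores_by_stage = [
    [0.3, 0.0, -1.0],
    [0.8, 0.0, -1.0],
    [0.9, 0.0, -1.0],
    [1.0, 0.0, -1.0],
]
bounds = [1.0, 0.5, 0.2]
print(wand_exit_stage(scores_by_stage, bounds, safety=1.0))
```

Raising `safety` in this sketch pushes the exit to a later stage, which matches the "more conservative" reading in the list above.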
5. PoE Speculative Decoding
Use an early stage as the drafter and the full PoE path as verifier:
out, accept_rate = model.generate_speculative(
    ids.input_ids,
    draft_stage=0,
    k_draft=3,
    max_new_tokens=64,
    return_acceptance=True,
)
This preserves the full-model decision rule while exploiting the fact that s0 is already a trained predictor. The current implementation is greedy and uses the full PoE path for verification.
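Greedy draft-and-verify can be illustrated with two toy next-token functions standing in for the s0 drafter and the full PoE verifier. This is a generic sketch of greedy speculative decoding, not the model's `generate_speculative` code; it verifies tokens one call at a time for clarity, whereas a real implementation would typically verify all drafted positions in one forward pass.

```python
def speculative_step(prefix, draft_next, verify_next, k_draft=3):
    """One greedy speculative step. Returns (new_tokens, n_accepted_drafts)."""
    # Draft k tokens autoregressively with the cheap predictor.
    drafted, ctx = [], list(prefix)
    for _ in range(k_draft):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # Accept drafts while they match the verifier's greedy choice.
    new_tokens, ctx = [], list(prefix)
    for t in drafted:
        v = verify_next(ctx)
        if v != t:
            new_tokens.append(v)  # first mismatch: emit the verifier's token
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(t)
        ctx.append(t)
    new_tokens.append(verify_next(ctx))  # all accepted: one bonus token
    return new_tokens, k_draft


def greedy_speculative_generate(prefix, draft_next, verify_next, max_new_tokens, k_draft=3):
    """Generate tokens; the output equals plain greedy decoding with verify_next."""
    out, accepted, drafted = list(prefix), 0, 0
    while len(out) - len(prefix) < max_new_tokens:
        new, n_acc = speculative_step(out, draft_next, verify_next, k_draft)
        out.extend(new)
        accepted += n_acc
        drafted += k_draft
    out = out[: len(prefix) + max_new_tokens]
    return out, (accepted / drafted if drafted else 0.0)


# Toy predictors: the verifier counts upward; the drafter is wrong after a 4.
verify = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
tokens, accept_rate = greedy_speculative_generate([0], draft, verify, max_new_tokens=8)
print(tokens, accept_rate)
```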
6. Parallel Stage Composition
Compose arbitrary stage experts in Log-OP:
out_all = model.generate_parallel_composition(
    ids.input_ids,
    stages=[0, 1, 2, 3],
    max_new_tokens=32,
)
out_weighted = model.generate_parallel_composition(
    ids.input_ids,
    stages=[0, 2, 3],
    stage_weights=[0.5, 1.0, 1.0],
    max_new_tokens=32,
)
This is the explicit stage-composition API. On a single GPU the reference implementation emits all selected boundary logits in one forward pass and combines them in log space. In a serving system, the same factorization is the hook for distributed stage-parallel execution: compute the shared prefix, dispatch selected stage continuations/heads, then reduce the returned log-probabilities with the same Log-OP rule.
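The reduce step of such a stage-parallel setup could look like the sketch below, which combines log-probabilities returned by separate stage workers. Normalizing the weights by their sum is an assumption (the card does not say how `stage_weights` are scaled), and `compose_logop` is a hypothetical name.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def compose_logop(stage_log_probs, stage_weights=None):
    """Weighted Log-OP over per-stage log-probabilities, renormalized.

    Weights are divided by their sum (assumption), so equal weights
    reduce to the plain geometric mean.
    """
    k = len(stage_log_probs)
    weights = [1.0] * k if stage_weights is None else list(stage_weights)
    if len(weights) != k or sum(weights) <= 0:
        raise ValueError("need one positive-sum weight per stage")
    total = sum(weights)
    vocab = len(stage_log_probs[0])
    scores = [
        sum(w * lp[v] for w, lp in zip(weights, stage_log_probs)) / total
        for v in range(vocab)
    ]
    return log_softmax(scores)


# Toy log-probabilities "returned" by three stage workers (values are made up).
returned = [log_softmax(row) for row in ([2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.5, 2.5, 0.0])]
combined = compose_logop(returned, stage_weights=[0.5, 1.0, 1.0])
print(max(range(len(combined)), key=combined.__getitem__))
```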
Validation Metrics
Final checkpoint: step 60,000. Validation used the matching 48K tokenizer.
| metric | value |
|---|---|
| full/geomean BPB | 0.787527 |
| entropy-weighted BPB | 0.787502 |
| full next-token accuracy | 0.4478 |
| stage BPB | 0.799041 / 0.800914 / 0.800918 / 0.810400 |
| s2-s3 top-1 agreement | 0.8973 |
| prompt target ranks, full vs shared | avg 2.0 vs 115.8, wins/ties/losses 9/0/1 |
Per-stage agreement and prompt-rank artifacts are included under eval/.
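For readers reproducing the BPB rows, bits per byte is conventionally the total negative log-likelihood in bits divided by the number of UTF-8 bytes in the scored text. A minimal sketch of that conversion; it assumes per-token NLL values in nats taken from renormalized log-probabilities, and it is not the evaluation script behind the table.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Convert summed per-token NLL (nats) to bits per UTF-8 byte of text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes


# Toy example: two tokens costing 2 bits each over a 4-byte string.
nll = [2 * math.log(2), 2 * math.log(2)]
print(bits_per_byte(nll, "abcd"))
```

Because the denominator counts bytes rather than tokens, the metric stays comparable across tokenizers, and multi-byte scripts such as Korean contribute three bytes per syllable.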
Practical Notes
- This is a research base checkpoint. It is not RLHF/SFT aligned.
- Korean and English continuations work, but long-form instruction following and repetition control are base-model quality.
- trust_remote_code=True is required because PoE aggregation and stage inference modes are implemented in the custom model class.
- For scoring/BPB, normalize Log-OP scores with logsumexp; raw Log-OP scores are ranking scores.