Cognica-PoE-v1.0-1.5B-base

cognica/Cognica-PoE-v1.0-1.5B-base is a 1.5B parameter base language model trained with Cognica's Product-of-Experts (PoE) residual Log-OP local-learning objective. It is a base model, not an instruction-tuned assistant.

This release is intended to expose the PoE local-learning and inference surface, not just a single dense generate() path. The checkpoint contains a transformer trunk plus one additive prediction head per PoE stage. Each stage is trained as a locally predictive expert and the stage distributions are combined with Log-OP, so the same trained weights can be used as a full model, a prefix-pruned model, a single-stage drafter, an adaptive-depth WAND model, or a Log-OP composition of selected stages.

Architecture

  • Parameters: 1,507,591,082
  • Layers: 24
  • Hidden size: 1536
  • Attention heads: 12 query heads, 6 KV heads, head dim 128
  • Context length: 2048
  • Tokenizer: 48K multilingual nanochat/tiktoken BPE
  • Attention pattern: SSSL
  • PoE stage layout: [12, 5, 4, 3]
  • Stage boundary layers: [11, 16, 20, 23]
  • Stage heads: lm_head_stages[0..3]
  • Router: disabled
  • Training objective: residual_logop
  • Inference combiner: geometric-mean Log-OP, poe_alpha=0.0

The stage layout means:

stage  layers evaluated  boundary  intended role
s0     0-11              11        base predictor / cheap drafter
s1     0-16              16        first verification/refinement stage
s2     0-20              20        second verification/refinement stage
s3     0-23              23        final full-depth stage
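
The boundary indices follow directly from the stage layout. A minimal sketch using only the layout values listed above; the helper name `stage_boundaries` is illustrative and is not part of the released model code:

```python
from itertools import accumulate

def stage_boundaries(layout):
    """Return the 0-based index of the last trunk layer in each stage."""
    return [end - 1 for end in accumulate(layout)]

layout = [12, 5, 4, 3]                 # layers per PoE stage, from the model card
boundaries = stage_boundaries(layout)
print(boundaries)                      # [11, 16, 20, 23]
```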

Each stage is a complete next-token predictor:

logits_k = shared_lm_head(norm(h_k)) + stage_lm_head_k(norm(h_k))
p_k      = softmax(logits_k)

Full PoE inference combines the stage distributions in log space:

score_full = (1 / K) * sum_k log p_k
token      = argmax(score_full)

For ranking/generation this is the shrinkage-neutral geometric mean of experts. For BPB or calibrated probability reporting, renormalize score_full with logsumexp.
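
The difference between the raw Log-OP score and a normalized distribution can be seen on toy numbers. The sketch below is plain Python with made-up logits for two stages and a three-token vocabulary; the function names are illustrative and do not come from the released model code:

```python
import math

def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]

def logop_scores(stage_logits):
    """Geometric-mean Log-OP: average the per-stage log-probabilities."""
    stage_logp = [log_softmax(logits) for logits in stage_logits]
    k = len(stage_logp)
    return [sum(column) / k for column in zip(*stage_logp)]

# Toy logits: K=2 stages, vocabulary of 3 tokens (made-up numbers).
stage_logits = [[2.0, 1.0, 0.0], [1.5, 1.8, -1.0]]

score_full = logop_scores(stage_logits)             # ranking scores only
token = max(range(len(score_full)), key=score_full.__getitem__)

raw_mass = sum(math.exp(s) for s in score_full)     # below 1: not a distribution
log_probs = log_softmax(score_full)                 # logsumexp renormalization
total_prob = sum(math.exp(lp) for lp in log_probs)  # 1 up to rounding
```

The argmax is the same before and after renormalization, which is why greedy ranking can skip the logsumexp step while BPB reporting cannot.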

Local Learning

The model is not trained solely with a final-layer next-token loss. It uses a stage-partitioned local-learning setup:

  • the 24-layer trunk is split into four PoE stages: [12, 5, 4, 3];
  • every stage boundary has its own additive prediction head;
  • each stage learns to be a usable local next-token predictor;
  • the training objective is residual Log-OP, so later stages learn residual evidence relative to the product of earlier experts;
  • inference can use any prefix of stages or the full geometric-mean product.

This is the local-learning interpretation of the checkpoint: learning pressure is exposed at intermediate stage boundaries instead of being applied only through the final transformer block.

Training Data

This model was trained on the Frontier V2 multilingual pretraining mix plus a train-only dialog overlay. The V2 dataset was rebuilt after the earlier Frontier mix was found to have much lower CJK coverage than intended; this release uses the corrected multilingual data and a 48K tokenizer trained on that corrected mix.

The base multilingual mix provides the validation split. The dialog overlay is used only for training, so the reported validation metrics reflect the base Frontier V2 multilingual distribution rather than the added dialog overlay.

Base mix recipe:

Source              Target share
FineWeb-Edu         28.0%
DCLM-Baseline       20.0%
Stack/code          15.0%
ProofPile           4.0%
OpenWebMath         4.0%
Wikipedia EN        5.0%
CulturaX Korean     4.0%
CulturaX Chinese    2.5%
CulturaX Japanese   2.5%
CulturaX Spanish    1.5%
CulturaX French     1.5%
Gutenberg           4.0%
PG-19               2.0%
UltraChat           2.0%
OpenHermes          4.0%

Total explicit CulturaX multilingual share is 12.0%, with Korean intentionally the largest non-English component.
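
As a quick consistency check, the target shares in the recipe can be summed directly. The dictionary below only restates the numbers from the recipe; the variable names are illustrative:

```python
base_mix = {
    "FineWeb-Edu": 28.0, "DCLM-Baseline": 20.0, "Stack/code": 15.0,
    "ProofPile": 4.0, "OpenWebMath": 4.0, "Wikipedia EN": 5.0,
    "CulturaX Korean": 4.0, "CulturaX Chinese": 2.5, "CulturaX Japanese": 2.5,
    "CulturaX Spanish": 1.5, "CulturaX French": 1.5,
    "Gutenberg": 4.0, "PG-19": 2.0, "UltraChat": 2.0, "OpenHermes": 4.0,
}

total_share = sum(base_mix.values())
culturax = {name: share for name, share in base_mix.items() if name.startswith("CulturaX")}
culturax_share = sum(culturax.values())
largest_culturax = max(culturax, key=culturax.get)

print(total_share, culturax_share, largest_culturax)  # 100.0 12.0 CulturaX Korean
```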

Dialog overlay:

Source               Use
Open-Orca/OpenOrca   train-only dialog/instruction overlay
Open-Orca/SlimOrca   fallback overlay source if needed

Training data settings:

Setting                    Value
Training steps             60,000
Global tokens per step     1,048,576
Approximate token budget   62.91B tokens
Context length             2,048
Case augmentation          0.15 probability per document
Validation cadence         every 1,000 steps
Validation tokens          4,194,304
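
The token budget follows from the step count and the global batch size. The arithmetic below uses only the values in the settings table; the sequences-per-step figure additionally assumes every sequence is packed to the full context length, which the card does not state:

```python
steps = 60_000
tokens_per_step = 1_048_576
context_length = 2_048
validation_tokens = 4_194_304

total_tokens = steps * tokens_per_step                     # 62,914,560,000
sequences_per_step = tokens_per_step // context_length     # assumes full-length packing
validation_runs = steps // 1_000                           # one run every 1,000 steps
validation_sequences = validation_tokens // context_length # same packing assumption

print(f"{total_tokens / 1e9:.2f}B tokens")                 # 62.91B tokens
```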

Case augmentation lowercases or uppercases sampled documents during training to improve robustness to case-shifted prompts.
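
A minimal sketch of this kind of per-document case augmentation is shown below. The function name and the even split between lowercasing and uppercasing are assumptions; the card only gives the 0.15 per-document probability:

```python
import random

def case_augment(doc, rng, p=0.15):
    """With probability p, return the document fully lowercased or uppercased."""
    if rng.random() >= p:
        return doc  # most documents pass through unchanged
    # Assumed 50/50 split between the two transformations.
    return doc.lower() if rng.random() < 0.5 else doc.upper()

rng = random.Random(0)
docs = ["Hello World"] * 10_000
augmented = [case_augment(d, rng) for d in docs]
rate = sum(a != d for a, d in zip(augmented, docs)) / len(docs)
print(f"augmented fraction: {rate:.3f}")
```

Over many documents the augmented fraction should land close to the configured probability.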

Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "cognica/Cognica-PoE-v1.0-1.5B-base"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)

# Korean prompt meaning "The capital of the Republic of Korea is"
ids = tok("대한민국의 수도는", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=32, do_sample=True, temperature=0.8, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))

The custom tokenizer prepends <|bos|> by default when add_special_tokens=True. This matters because the checkpoint was trained and validated with BOS-prepended prompts.

Inference Modes

1. Full PoE

This is the default forward() / generate() path. It evaluates all stage boundaries and combines all four stage distributions by Log-OP.

ids = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=32, do_sample=False)

Use this when quality matters more than latency. It is the mode used for the final validation metrics below.

2. Prefix-Pruned Cumulative PoE

Use only the first k stages and aggregate them as a smaller PoE:

out_s0      = model.generate_prefix(ids.input_ids, max_stages=1, max_new_tokens=32)
out_s0_s1   = model.generate_prefix(ids.input_ids, max_stages=2, max_new_tokens=32)
out_s0_s2   = model.generate_prefix(ids.input_ids, max_stages=3, max_new_tokens=32)
out_full    = model.generate_prefix(ids.input_ids, max_stages=4, max_new_tokens=32)

This directly supports the s0 -> s1 -> s2 -> s3 progression:

max_stages  active experts       evaluated layers  approximate trunk compute
1           s0                   12 / 24           50.0%
2           s0 + s1              17 / 24           70.8%
3           s0 + s1 + s2         21 / 24           87.5%
4           s0 + s1 + s2 + s3    24 / 24           100.0%

This is the main prefix-pruning mode: stop early when a task can be answered by the earlier experts, keep going when more verification depth is needed.
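
The compute column above is the cumulative layer count divided by the trunk depth. The sketch below reproduces it from the stage layout, treating every layer as equally expensive and ignoring the prediction heads, which is why the figures are approximate:

```python
from itertools import accumulate

layout = [12, 5, 4, 3]                       # layers per stage, from the model card
total_layers = sum(layout)
prefix_layers = list(accumulate(layout))     # layers evaluated for max_stages = 1..4

for max_stages, layers in enumerate(prefix_layers, start=1):
    share = 100.0 * layers / total_layers
    print(f"max_stages={max_stages}: {layers} / {total_layers} layers, {share:.1f}%")
# max_stages=1: 12 / 24 layers, 50.0%
# max_stages=2: 17 / 24 layers, 70.8%
# max_stages=3: 21 / 24 layers, 87.5%
# max_stages=4: 24 / 24 layers, 100.0%
```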

3. Single-Stage Prediction

Use one stage endpoint as an independent predictor:

logits_s0 = model.forward_stage(ids.input_ids, stage=0)
logits_s3 = model.forward_stage(ids.input_ids, stage=3)

out = model.generate_stage(ids.input_ids, stage=0, max_new_tokens=32)

This is useful for probing stage specialization, using s0 as a cheap draft model, or measuring how much each stage changes the answer. stage=i evaluates the trunk up to that stage boundary and applies that stage's own additive head.
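
One simple way to quantify how much a later stage changes the answer is top-1 agreement between two stages. The sketch below works on plain nested lists that stand in for per-position logits; with the real model these would come from forward_stage outputs converted to lists. The helper names and toy numbers are illustrative:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def top1_agreement(logits_a, logits_b):
    """Fraction of positions where two stages pick the same top-1 token."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both stages must cover the same positions")
    same = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return same / len(logits_a)

# Toy logits: 3 positions, 4-token vocabulary (made-up numbers).
logits_s0 = [[0.1, 2.0, 0.3, 0.0], [1.0, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]]
logits_s3 = [[0.0, 2.5, 0.1, 0.0], [0.3, 1.1, 0.1, 0.0], [0.0, 0.0, 0.1, 2.0]]

agreement = top1_agreement(logits_s0, logits_s3)   # stages agree on 2 of 3 positions
```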

4. WAND Adaptive Depth

WAND mode evaluates stages incrementally and exits early when the current top-1 margin is larger than a calibrated upper bound on what remaining stages can change:

out, stages_used = model.generate_wand(
    ids.input_ids,
    max_new_tokens=32,
    safety=1.0,
    return_stages_used=True,
)

Interpretation:

  • If s0 is already decisive, emit from s0.
  • If not, consult s1, then s2, then s3.
  • stages_used records which stage emitted each generated token.
  • Higher safety is more conservative.
  • For strict deployment, calibrate p99_bounds on the target validation distribution and pass them explicitly.
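
The early-exit idea can be sketched on toy numbers. This is not the released generate_wand implementation: the function name, the use of summed stage log-probabilities, and the per-stage `bounds` values are all assumptions made for illustration. The rule shown is "stop once the top-1 margin exceeds safety times the total amount the remaining stages are assumed able to shift it":

```python
def wand_decide(stage_logp, bounds, safety=1.0):
    """Return (token, stages_evaluated) under a WAND-style early-exit rule.

    stage_logp[k][v]: log-probability of token v under stage k.
    bounds[k]: assumed upper bound on how far stage k can move the gap
               between two tokens (hypothetical calibration values).
    """
    vocab = len(stage_logp[0])
    running = [0.0] * vocab
    for k, logp in enumerate(stage_logp):
        running = [r + x for r, x in zip(running, logp)]
        ranked = sorted(running, reverse=True)
        margin = ranked[0] - ranked[1]
        remaining = sum(bounds[k + 1:])     # what later stages could still change
        if margin > safety * remaining:
            break                           # later stages cannot flip the top-1
    token = max(range(vocab), key=running.__getitem__)
    return token, k + 1

# Toy log-probabilities for 3 stages over a 3-token vocabulary (made-up numbers).
stage_logp = [[-0.1, -3.0, -4.0], [-0.2, -2.5, -3.5], [-0.3, -2.0, -3.0]]
bounds = [5.0, 1.0, 1.0]

fast = wand_decide(stage_logp, bounds, safety=1.0)      # exits after the first stage
careful = wand_decide(stage_logp, bounds, safety=2.0)   # needs one more stage
```

Raising `safety` inflates the bound, so the same input consults more stages before committing to a token.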

5. PoE Speculative Decoding

Use an early stage as the drafter and the full PoE path as verifier:

out, accept_rate = model.generate_speculative(
    ids.input_ids,
    draft_stage=0,
    k_draft=3,
    max_new_tokens=64,
    return_acceptance=True,
)

This preserves the full-model decision rule while exploiting the fact that s0 is already a trained predictor. The current implementation is greedy and uses the full PoE path for verification.
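
The sketch below illustrates greedy draft-and-verify with toy next-token functions standing in for the s0 drafter and the full PoE verifier. It is not the released generate_speculative code, and all names in it are hypothetical. It calls the verifier once per token for clarity, whereas a real implementation would score all drafted positions in one forward pass:

```python
def greedy_speculative(prompt, draft_next, verify_next, k_draft, max_new_tokens):
    """Greedy speculative decoding with toy next-token callables.

    draft_next(seq) and verify_next(seq) each return the greedy next token.
    Returns (tokens, acceptance_rate).
    """
    seq = list(prompt)
    target_len = len(prompt) + max_new_tokens
    proposed = accepted = 0
    while len(seq) < target_len:
        # Draft up to k_draft tokens with the cheap model.
        draft = []
        for _ in range(min(k_draft, target_len - len(seq))):
            draft.append(draft_next(seq + draft))
        proposed += len(draft)
        # Verify: keep drafted tokens until the first disagreement,
        # then emit the verifier's own token at that position.
        for tok in draft:
            expected = verify_next(seq)
            seq.append(expected)
            if tok != expected:
                break
            accepted += 1
    return seq, (accepted / proposed if proposed else 0.0)

def verify_next(seq):
    return (seq[-1] + 1) % 5                  # toy "full model": count modulo 5

def draft_next(seq):
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 5   # toy drafter, wrong after a 3

out, accept_rate = greedy_speculative([0], draft_next, verify_next,
                                      k_draft=3, max_new_tokens=8)
```

Because every emitted token is the verifier's greedy choice, the output matches verifier-only greedy decoding; the drafter only affects how many verifier steps could be batched.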

6. Parallel Stage Composition

Compose arbitrary stage experts in Log-OP:

out_all = model.generate_parallel_composition(
    ids.input_ids,
    stages=[0, 1, 2, 3],
    max_new_tokens=32,
)

out_weighted = model.generate_parallel_composition(
    ids.input_ids,
    stages=[0, 2, 3],
    stage_weights=[0.5, 1.0, 1.0],
    max_new_tokens=32,
)

This is the explicit stage-composition API. On a single GPU the reference implementation emits all selected boundary logits in one forward pass and combines them in log space. In a serving system, the same factorization is the hook for distributed stage-parallel execution: compute the shared prefix, dispatch selected stage continuations/heads, then reduce the returned log-probabilities with the same Log-OP rule.
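
The reduction step can be sketched independently of the model. The example below assumes that stage weights scale each stage's log-probabilities and that the result is divided by the total weight before renormalizing; the card does not say how generate_parallel_composition normalizes stage_weights, so treat this as one plausible reading. The toy probabilities are made up:

```python
import math

def weighted_logop(stage_logp, weights):
    """Weighted Log-OP: weight-normalized sum of log-probs, then logsumexp renormalization."""
    total_w = sum(weights)
    scores = [
        sum(w * lp[v] for w, lp in zip(weights, stage_logp)) / total_w
        for v in range(len(stage_logp[0]))
    ]
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

# Toy distributions returned by three selected stages (3-token vocabulary).
stage_probs = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.2, 0.7, 0.1]]
stage_logp = [[math.log(p) for p in probs] for probs in stage_probs]

combined = weighted_logop(stage_logp, weights=[0.5, 1.0, 1.0])
best = max(range(len(combined)), key=combined.__getitem__)   # token 1 wins here

only_first = weighted_logop(stage_logp, weights=[1.0, 0.0, 0.0])  # reduces to stage 0
```

Down-weighting the first stage lets the two later stages outvote it on this input, while putting all weight on one stage recovers that stage's own distribution.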

Validation Metrics

Final checkpoint: step 60,000. Validation used the matching 48K tokenizer.

metric                                 value
full/geomean BPB                       0.787527
entropy-weighted BPB                   0.787502
full next-token accuracy               0.4478
stage BPB                              0.799041 / 0.800914 / 0.800918 / 0.810400
s2-s3 top-1 agreement                  0.8973
prompt target ranks, full vs shared    avg 2.0 vs 115.8; wins/ties/losses 9/0/1

Per-stage agreement and prompt-rank artifacts are included under eval/.
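
For readers reproducing the BPB figures, bits per byte is the total negative log-likelihood converted to bits and divided by the byte length of the scored text. The sketch below assumes natural-log token probabilities that are already normalized and counts UTF-8 bytes of the raw text; the card does not specify the exact byte accounting used in evaluation:

```python
import math

def bits_per_byte(token_logprobs, text):
    """BPB = total negative log-likelihood in bits / UTF-8 byte count of the text."""
    nll_nats = -sum(token_logprobs)
    n_bytes = len(text.encode("utf-8"))
    return nll_nats / (math.log(2) * n_bytes)

# Toy example: a 5-byte string scored as two tokens with probabilities 0.5 and 0.25.
text = "hello"
token_logprobs = [math.log(0.5), math.log(0.25)]   # 1 bit + 2 bits
bpb = bits_per_byte(token_logprobs, text)          # 3 bits over 5 bytes
```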

Practical Notes

  • This is a research base checkpoint. It is not RLHF/SFT aligned.
  • Korean and English continuations work, but long-form instruction following and repetition control remain at base-model quality.
  • trust_remote_code=True is required because PoE aggregation and stage inference modes are implemented in the custom model class.
  • For scoring/BPB, normalize Log-OP scores with logsumexp; raw Log-OP scores are ranking scores.