Instructions for using cognica/Cognica-PoE-v1.0-1.5B-base with Transformers, vLLM, SGLang, and Docker Model Runner.

Libraries

How to use cognica/Cognica-PoE-v1.0-1.5B-base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cognica/Cognica-PoE-v1.0-1.5B-base", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("cognica/Cognica-PoE-v1.0-1.5B-base", trust_remote_code=True, dtype="auto")


vLLM

How to use cognica/Cognica-PoE-v1.0-1.5B-base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cognica/Cognica-PoE-v1.0-1.5B-base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cognica/Cognica-PoE-v1.0-1.5B-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'


SGLang

How to use cognica/Cognica-PoE-v1.0-1.5B-base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cognica/Cognica-PoE-v1.0-1.5B-base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cognica/Cognica-PoE-v1.0-1.5B-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cognica/Cognica-PoE-v1.0-1.5B-base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cognica/Cognica-PoE-v1.0-1.5B-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use cognica/Cognica-PoE-v1.0-1.5B-base with Docker Model Runner:
```
docker model run hf.co/cognica/Cognica-PoE-v1.0-1.5B-base
```

Cognica-PoE-v1.0-1.5B-base

cognica/Cognica-PoE-v1.0-1.5B-base is a 1.5B-parameter base language model trained with Cognica's Product-of-Experts (PoE) residual Log-OP local-learning objective. It is a base model, not an instruction-tuned assistant.

This release is intended to expose the PoE local-learning and inference surface, not just a single dense generate() path. The checkpoint contains a transformer trunk plus one additive prediction head per PoE stage. Each stage is trained as a locally predictive expert and the stage distributions are combined with Log-OP, so the same trained weights can be used as a full model, a prefix-pruned model, a single-stage drafter, an adaptive-depth WAND model, or a Log-OP composition of selected stages.

Architecture

Parameters: 1,507,591,082
Layers: 24
Hidden size: 1536
Attention heads: 12 query heads, 6 KV heads, head dim 128
Context length: 2048
Tokenizer: 48K multilingual nanochat/tiktoken BPE
Attention pattern: SSSL
PoE stage layout: [12, 5, 4, 3]
Stage boundary layers: [11, 16, 20, 23]
Stage heads: lm_head_stages[0..3]
Router: disabled
Training objective: residual_logop
Inference combiner: geometric-mean Log-OP, poe_alpha=0.0

The stage layout means:

| stage | layers evaluated | boundary | intended role |
|---|---|---|---|
| `s0` | 0-11 | 11 | base predictor / cheap drafter |
| `s1` | 0-16 | 16 | first verification/refinement stage |
| `s2` | 0-20 | 20 | second verification/refinement stage |
| `s3` | 0-23 | 23 | final full-depth stage |
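
As a quick check of how the stage layout relates to the boundary layers, the following minimal sketch derives the boundaries from the per-stage layer counts. The variable names are illustrative and not part of the model's API.

```python
from itertools import accumulate

# Layers per stage, taken from "PoE stage layout" in the architecture list.
stage_layout = [12, 5, 4, 3]

# The running total is the number of trunk layers each stage evaluates.
layers_evaluated = list(accumulate(stage_layout))

# Boundaries are 0-indexed layer positions, hence the -1.
stage_boundaries = [n - 1 for n in layers_evaluated]

print(layers_evaluated)   # [12, 17, 21, 24]
print(stage_boundaries)   # [11, 16, 20, 23]
```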

Each stage is a complete next-token predictor:

logits_k = shared_lm_head(norm(h_k)) + stage_lm_head_k(norm(h_k))
p_k      = softmax(logits_k)

Full PoE inference combines the stage distributions in log space:

score_full = (1 / K) * sum_k log p_k
token      = argmax(score_full)

For ranking and generation this is the shrinkage-neutral geometric mean of experts. The exponentiated scores do not generally sum to one, so for bits-per-byte (BPB) or calibrated probability reporting, renormalize score_full with logsumexp.
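
The combination rule above can be sketched in plain Python on toy logits. This is an illustration of the formulas only, not the model's implementation; the helper names and the toy values are assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def logop_scores(stage_logits):
    """Geometric-mean Log-OP: average the per-stage log-probabilities."""
    log_ps = [log_softmax(row) for row in stage_logits]
    k = len(log_ps)
    return [sum(col) / k for col in zip(*log_ps)]

# Toy logits for K=4 stages over a 3-token vocabulary (made-up values).
stage_logits = [
    [2.0, 1.0, 0.0],
    [1.5, 1.4, 0.1],
    [3.0, 0.5, 0.2],
    [2.2, 2.0, -1.0],
]

scores = logop_scores(stage_logits)                       # ranking scores
token = max(range(len(scores)), key=scores.__getitem__)   # argmax
log_probs = log_softmax(scores)                           # logsumexp renormalization

print(token)
print(sum(math.exp(s) for s in scores))      # below 1: raw scores are not normalized
print(sum(math.exp(lp) for lp in log_probs)) # 1 up to rounding
```

The raw scores are enough for greedy decoding because renormalization subtracts the same constant from every entry and leaves the argmax unchanged.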

Local Learning

The model is not trained only by a final-layer next-token loss. It uses a stage-partitioned local-learning setup:

the 24-layer trunk is split into four PoE stages: [12, 5, 4, 3];
every stage boundary has its own additive prediction head;
each stage learns to be a usable local next-token predictor;
the training objective is residual Log-OP, so later stages learn residual evidence relative to the product of earlier experts;
inference can use any prefix of stages or the full geometric-mean product.

This is the local-learning interpretation of the checkpoint: learning pressure is exposed at intermediate stage boundaries instead of being applied only through the final transformer block.

Training Data

This model was trained on the Frontier V2 multilingual pretraining mix plus a train-only dialog overlay. The v2 dataset was rebuilt after finding that the earlier frontier mix had much lower CJK coverage than intended; this release uses the corrected multilingual data and a 48K tokenizer trained for that corrected mix.

The base multilingual mix provides the validation split. The dialog overlay is used only for training, so the reported validation metrics are anchored to the base Frontier V2 multilingual distribution rather than to the added dialog overlay.

Base mix recipe:

| Source | Target share |
|---|---|
| FineWeb-Edu | 28.0% |
| DCLM-Baseline | 20.0% |
| Stack/code | 15.0% |
| ProofPile | 4.0% |
| OpenWebMath | 4.0% |
| Wikipedia EN | 5.0% |
| CulturaX Korean | 4.0% |
| CulturaX Chinese | 2.5% |
| CulturaX Japanese | 2.5% |
| CulturaX Spanish | 1.5% |
| CulturaX French | 1.5% |
| Gutenberg | 4.0% |
| PG-19 | 2.0% |
| UltraChat | 2.0% |
| OpenHermes | 4.0% |

Total explicit CulturaX multilingual share is 12.0%, with Korean intentionally the largest non-English component.
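
The recipe arithmetic can be checked with a few lines of Python. The dictionary below simply restates the target shares from the table; the variable names are illustrative.

```python
# Target shares (percent) restated from the base mix recipe.
base_mix = {
    "FineWeb-Edu": 28.0,
    "DCLM-Baseline": 20.0,
    "Stack/code": 15.0,
    "ProofPile": 4.0,
    "OpenWebMath": 4.0,
    "Wikipedia EN": 5.0,
    "CulturaX Korean": 4.0,
    "CulturaX Chinese": 2.5,
    "CulturaX Japanese": 2.5,
    "CulturaX Spanish": 1.5,
    "CulturaX French": 1.5,
    "Gutenberg": 4.0,
    "PG-19": 2.0,
    "UltraChat": 2.0,
    "OpenHermes": 4.0,
}

total_share = sum(base_mix.values())
culturax = {k: v for k, v in base_mix.items() if k.startswith("CulturaX")}
culturax_share = sum(culturax.values())
largest_culturax = max(culturax, key=culturax.get)

print(total_share)       # 100.0
print(culturax_share)    # 12.0
print(largest_culturax)  # CulturaX Korean
```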

Dialog overlay:

| Source | Use |
|---|---|
| Open-Orca/OpenOrca | train-only dialog/instruction overlay |
| Open-Orca/SlimOrca | fallback overlay source if needed |

Training data settings:

| Setting | Value |
|---|---|
| Training steps | 60,000 |
| Global tokens per step | 1,048,576 |
| Approximate token budget | 62.91B tokens |
| Context length | 2,048 |
| Case augmentation | 0.15 probability per document |
| Validation cadence | every 1,000 steps |
| Validation tokens | 4,194,304 |
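
The token budget follows directly from the step count and the global tokens per step. The sequence counts below additionally assume that every sequence is packed to the full 2,048-token context, which the settings do not state explicitly.

```python
steps = 60_000
tokens_per_step = 1_048_576   # 2**20
context_length = 2_048
validation_tokens = 4_194_304

token_budget = steps * tokens_per_step
print(f"{token_budget:,}")            # 62,914,560,000
print(f"{token_budget / 1e9:.2f}B")   # 62.91B

# Assumption: full-length packed sequences.
sequences_per_step = tokens_per_step // context_length
validation_sequences = validation_tokens // context_length
print(sequences_per_step)    # 512
print(validation_sequences)  # 2048
```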

Case augmentation lowercases or uppercases sampled documents during training to improve robustness to case-shifted prompts.
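
A minimal sketch of such a per-document case augmentation is shown below. The 0.15 probability comes from the settings table; the even split between lowercasing and uppercasing and the function name `case_augment` are assumptions, since the training pipeline itself is not shown here.

```python
import random

def case_augment(doc, rng, p=0.15):
    """With probability p, return the document lowercased or uppercased.

    The 50/50 choice between lower and upper is an assumption.
    """
    if rng.random() >= p:
        return doc
    return doc.lower() if rng.random() < 0.5 else doc.upper()

rng = random.Random(0)
doc = "The Capital of France is Paris."
augmented = [case_augment(doc, rng) for _ in range(10_000)]
changed_fraction = sum(a != doc for a in augmented) / len(augmented)
print(round(changed_fraction, 2))
```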

Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "cognica/Cognica-PoE-v1.0-1.5B-base"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)

ids = tok("대한민국의 수도는", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=32, do_sample=True, temperature=0.8, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))

The custom tokenizer prepends <|bos|> by default when add_special_tokens=True. This matters because the checkpoint was trained and validated with BOS-prepended prompts.
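
When token ids are assembled by hand (for example, by concatenating pre-tokenized chunks) rather than through the tokenizer call, a small guard can keep prompts BOS-prepended. The helper `ensure_bos` and the id value `BOS_ID = 0` are hypothetical; with the real tokenizer the id would have to be looked up from the `<|bos|>` token.

```python
def ensure_bos(token_ids, bos_id):
    """Return a copy of token_ids that starts with bos_id exactly once."""
    ids = list(token_ids)
    if ids and ids[0] == bos_id:
        return ids
    return [bos_id] + ids

BOS_ID = 0  # placeholder value for illustration only

print(ensure_bos([17, 42], BOS_ID))     # [0, 17, 42]
print(ensure_bos([0, 17, 42], BOS_ID))  # [0, 17, 42]
```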

Inference Modes

1. Full PoE

This is the default forward() / generate() path. It evaluates all stage boundaries and combines all four stage distributions by Log-OP.

ids = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=32, do_sample=False)

Use this when quality matters more than latency. It is the mode used for the final validation metrics below.

2. Prefix-Pruned Cumulative PoE

Use only the first k stages and aggregate them as a smaller PoE:

out_s0      = model.generate_prefix(ids.input_ids, max_stages=1, max_new_tokens=32)
out_s0_s1   = model.generate_prefix(ids.input_ids, max_stages=2, max_new_tokens=32)
out_s0_s2   = model.generate_prefix(ids.input_ids, max_stages=3, max_new_tokens=32)
out_full    = model.generate_prefix(ids.input_ids, max_stages=4, max_new_tokens=32)

This directly supports the s0 -> s1 -> s2 -> s3 progression:

| `max_stages` | active experts | evaluated layers | approximate trunk compute |
|---|---|---|---|
| 1 | `s0` | 12 / 24 | 50.0% |
| 2 | `s0 + s1` | 17 / 24 | 70.8% |
| 3 | `s0 + s1 + s2` | 21 / 24 | 87.5% |
| 4 | `s0 + s1 + s2 + s3` | 24 / 24 | 100.0% |

This is the main prefix-pruning mode: stop early when the earlier experts can answer a task, and keep going when more verification depth is needed.
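
To estimate what a mixed workload costs, the compute shares from the table can be averaged over the `max_stages` value chosen per request. This is a back-of-the-envelope sketch that counts trunk layers only and ignores heads, embeddings, and attention cost differences; `mean_trunk_compute` is an illustrative helper, not part of the model API.

```python
from itertools import accumulate

STAGE_LAYOUT = [12, 5, 4, 3]
TOTAL_LAYERS = sum(STAGE_LAYOUT)

# COMPUTE_SHARE[k - 1] is the trunk fraction evaluated with max_stages=k.
COMPUTE_SHARE = [n / TOTAL_LAYERS for n in accumulate(STAGE_LAYOUT)]

def mean_trunk_compute(max_stages_per_request):
    """Average trunk fraction over a list of per-request max_stages values."""
    shares = [COMPUTE_SHARE[k - 1] for k in max_stages_per_request]
    return sum(shares) / len(shares)

print([f"{100 * s:.1f}%" for s in COMPUTE_SHARE])
# ['50.0%', '70.8%', '87.5%', '100.0%']

# Hypothetical workload: two requests stop at s0, one at s1, one runs full depth.
mixed = mean_trunk_compute([1, 1, 2, 4])
print(f"{100 * mixed:.1f}%")  # 67.7%
```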

3. Single-Stage Prediction

Use one stage endpoint as an independent predictor:

logits_s0 = model.forward_stage(ids.input_ids, stage=0)
logits_s3 = model.forward_stage(ids.input_ids, stage=3)

out = model.generate_stage(ids.input_ids, stage=0, max_new_tokens=32)

This is useful for probing stage specialization, using s0 as a cheap draft model, or measuring how much each stage changes the answer. stage=i evaluates the trunk up to that stage boundary and applies that stage's own additive head.
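
One simple way to measure how much a later stage changes the answer is top-1 agreement between two stages' logits over the same positions. The sketch below works on plain nested lists (positions by vocabulary) with made-up values; with the real model the rows would come from the logits of two stages, converted to lists.

```python
def argmax(row):
    """Index of the largest entry in a list of floats."""
    return max(range(len(row)), key=row.__getitem__)

def top1_agreement(logits_a, logits_b):
    """Fraction of positions where both stages pick the same top-1 token."""
    matches = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return matches / len(logits_a)

# Toy logits: 4 positions, 3-token vocabulary.
toy_s0 = [[0.1, 2.0, 0.3], [1.0, 0.2, 0.1], [0.0, 0.1, 3.0], [0.5, 0.4, 0.3]]
toy_s3 = [[0.0, 1.5, 0.2], [0.1, 0.9, 0.2], [0.2, 0.0, 2.5], [0.9, 0.1, 0.0]]

agreement = top1_agreement(toy_s0, toy_s3)
print(agreement)  # 0.75
```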

4. WAND Adaptive Depth

WAND mode evaluates stages incrementally and exits early when the current top-1 margin is larger than a calibrated upper bound on what remaining stages can change:

out, stages_used = model.generate_wand(
    ids.input_ids,
    max_new_tokens=32,
    safety=1.0,
    return_stages_used=True,
)

Interpretation:

If s0 is already decisive, emit from s0.
If not, consult s1, then s2, then s3.
stages_used records which stage emitted each generated token.
Higher safety is more conservative.
For strict deployment, calibrate p99_bounds on the target validation distribution and pass them explicitly.
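
The early-exit rule described above can be sketched as follows. This is a simplified illustration, not the implementation behind `generate_wand`: the score scale, the exact form of the bound, and the way `safety` multiplies it are assumptions, and `wand_exit_stage` is a hypothetical helper.

```python
def top1_margin(scores):
    """Gap between the best and second-best score."""
    top, runner_up = sorted(scores, reverse=True)[:2]
    return top - runner_up

def wand_exit_stage(cumulative_scores, bounds, safety=1.0):
    """Return the first stage whose top-1 margin exceeds safety * bound.

    cumulative_scores[k] holds the scores after stages 0..k.
    bounds[k] is an assumed upper bound on how much the remaining stages
    can still change the margin; the last stage needs no bound.
    """
    last = len(cumulative_scores) - 1
    for k, scores in enumerate(cumulative_scores):
        if k == last or top1_margin(scores) > safety * bounds[k]:
            return k

# Toy scores over a 3-token vocabulary after each of the four stages.
cumulative_scores = [
    [-1.0, -1.2, -3.0],   # s0: margin about 0.2
    [-0.8, -1.9, -3.0],   # s0+s1: margin about 1.1
    [-0.5, -2.5, -3.5],   # s0..s2: margin 2.0
    [-0.4, -2.8, -3.6],   # full depth
]
bounds = [0.9, 0.6, 0.3]  # made-up calibration values

print(wand_exit_stage(cumulative_scores, bounds, safety=1.0))   # 1
print(wand_exit_stage(cumulative_scores, bounds, safety=2.0))   # 2
print(wand_exit_stage(cumulative_scores, bounds, safety=10.0))  # 3
```

Raising `safety` scales every bound up, so more tokens fall through to deeper stages, which matches the note that higher safety is more conservative.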

5. PoE Speculative Decoding

Use an early stage as the drafter and the full PoE path as verifier:

out, accept_rate = model.generate_speculative(
    ids.input_ids,
    draft_stage=0,
    k_draft=3,
    max_new_tokens=64,
    return_acceptance=True,
)

This preserves the full-model decision rule while exploiting the fact that s0 is already a trained predictor. The current implementation is greedy and uses the full PoE path for verification.
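
The sketch below illustrates greedy draft-and-verify decoding with toy next-token functions, to show why the output matches plain greedy decoding with the verifier. It is not the code behind `generate_speculative`: `draft_next` and `full_next` are hypothetical callables, and a real verifier would score all drafted positions in one forward pass instead of one call per token.

```python
def greedy_speculative(prompt, draft_next, full_next, k_draft=3, max_new_tokens=8):
    """Draft k tokens cheaply, then keep the verifier's token at each position."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    proposed = accepted = 0
    while len(tokens) < target_len:
        # Draft phase: the cheap predictor proposes k_draft tokens.
        draft = []
        for _ in range(k_draft):
            draft.append(draft_next(tokens + draft))
        # Verify phase: always emit the verifier's token; stop at first mismatch.
        for drafted in draft:
            if len(tokens) >= target_len:
                break
            proposed += 1
            expected = full_next(tokens)
            tokens.append(expected)
            if expected != drafted:
                break
            accepted += 1
    return tokens, accepted / max(proposed, 1)

# Toy models over the vocabulary {0..4}: the verifier counts upward,
# the drafter agrees except after token 3.
def full_next(tokens):
    return (tokens[-1] + 1) % 5

def draft_next(tokens):
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 5

out, accept_rate = greedy_speculative([0], draft_next, full_next)
print(out)          # [0, 1, 2, 3, 4, 0, 1, 2, 3]
print(accept_rate)  # 0.875
```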

6. Parallel Stage Composition

Compose arbitrary stage experts in Log-OP:

out_all = model.generate_parallel_composition(
    ids.input_ids,
    stages=[0, 1, 2, 3],
    max_new_tokens=32,
)

out_weighted = model.generate_parallel_composition(
    ids.input_ids,
    stages=[0, 2, 3],
    stage_weights=[0.5, 1.0, 1.0],
    max_new_tokens=32,
)

This is the explicit stage-composition API. On a single GPU the reference implementation emits all selected boundary logits in one forward pass and combines them in log space. In a serving system, the same factorization is the hook for distributed stage-parallel execution: compute the shared prefix, dispatch selected stage continuations/heads, then reduce the returned log-probabilities with the same Log-OP rule.
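
The reduction step can be sketched as a weighted Log-OP over the selected stages' log-probabilities. Dividing by the sum of the weights is an assumption made here so that uniform weights reduce to the geometric mean; it does not affect the argmax. The helper names and toy logits are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def weighted_logop(stage_logits, stage_weights=None):
    """Weighted average of per-stage log-probabilities (ranking scores)."""
    if stage_weights is None:
        stage_weights = [1.0] * len(stage_logits)
    log_ps = [log_softmax(row) for row in stage_logits]
    total = sum(stage_weights)
    return [
        sum(w * lp for w, lp in zip(stage_weights, col)) / total
        for col in zip(*log_ps)
    ]

# Toy logits for stages 0, 2 and 3 over a 3-token vocabulary.
selected = [[2.0, 1.0, 0.0], [0.5, 1.5, 0.0], [1.0, 1.2, 0.1]]

scores = weighted_logop(selected, stage_weights=[0.5, 1.0, 1.0])
token = max(range(len(scores)), key=scores.__getitem__)
print(token)  # 1
```

In a distributed setting, each worker would return one row of log-probabilities and the reducer would apply the same weighted sum.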

Validation Metrics

Final checkpoint: step 60,000. Validation used the matching 48K tokenizer.

| metric | value |
|---|---|
| full/geomean BPB | 0.787527 |
| entropy-weighted BPB | 0.787502 |
| full next-token accuracy | 0.4478 |
| stage BPB | 0.799041 / 0.800914 / 0.800918 / 0.810400 |
| s2-s3 top-1 agreement | 0.8973 |
| prompt target ranks, full vs shared | avg 2.0 vs 115.8, wins/ties/losses 9/0/1 |

Per-stage agreement and prompt-rank artifacts are included under eval/.

Practical Notes

This is a research base checkpoint. It is not RLHF/SFT aligned.
Korean and English continuations work, but long-form instruction following and repetition control are base-model quality.
trust_remote_code=True is required because PoE aggregation and stage inference modes are implemented in the custom model class.
For scoring/BPB, normalize Log-OP scores with logsumexp; raw Log-OP scores are ranking scores.
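
For reference, bits per byte can be computed from normalized per-token log-probabilities (in nats) and the UTF-8 byte length of the scored text. This is the standard definition written out as a sketch; it is not the evaluation script used for the numbers above, and `bits_per_byte` is an illustrative name.

```python
import math

def bits_per_byte(token_log_probs, text):
    """Total negative log-likelihood in bits divided by UTF-8 byte count.

    token_log_probs must be normalized log-probabilities in nats,
    i.e. Log-OP scores after logsumexp renormalization.
    """
    n_bytes = len(text.encode("utf-8"))
    total_nats = -sum(token_log_probs)
    return total_nats / (math.log(2) * n_bytes)

# Toy case: 4 tokens, each assigned probability 0.5, covering 4 bytes of text.
toy_bpb = bits_per_byte([math.log(0.5)] * 4, "abcd")
print(toy_bpb)  # 1.0
```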

