0.6b-4b LCLM, 16× compression

Latent Context Language Model: an encoder–decoder compressor described in End-to-End Context Compression at Scale.

Wrap the text to compress between <|memory_start|> and <|memory_end|>.

Running these checkpoints requires the LCLM codebase: https://github.com/LeonLixyz/LCLM. Standard transformers.AutoModel and vllm.LLM will not load this format on their own.

Quick load

from latent_context import LCLM

model = LCLM.from_pretrained("latent-context/0.6b-4b-LCLM-16x")

prompt = (
    "<|memory_start|>"
    "<long document, code, or text to compress>"
    "<|memory_end|> "
    "Summarize the document above."
)
# model.generate(...) — see latent_context/inference/hf.py
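
The prompt layout above can be factored into a small helper. This is a minimal sketch, not part of the LCLM codebase: `build_prompt` is a hypothetical name, and the check for marker strings inside the document is an added precaution rather than documented behavior.

```python
MEMORY_START = "<|memory_start|>"
MEMORY_END = "<|memory_end|>"


def build_prompt(document: str, instruction: str) -> str:
    """Wrap `document` in the memory markers and append `instruction`.

    Hypothetical helper for illustration; not an LCLM API.
    """
    # Assumption: a document that already contains a marker would confuse
    # the span boundaries, so reject it early.
    for marker in (MEMORY_START, MEMORY_END):
        if marker in document:
            raise ValueError(f"document already contains {marker!r}")
    # Same layout as the quick-load example: markers, one space, instruction.
    return f"{MEMORY_START}{document}{MEMORY_END} {instruction}"
```

The returned string can be used wherever the quick-load example uses `prompt`.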

vLLM serving (two-stage CLI)

The vLLM path runs the encoder and the decoder in separate processes that hand off via a .pt file on disk. Running both in one process runs out of GPU memory: vLLM reserves nearly all GPU memory at init, leaving too little for the HF encoder.

# Step 1: HF encoder over a jsonl of prompts → embeds.pt
python -m inference.vllm_inference.encode \
    --checkpoint latent-context/0.6b-4b-LCLM-16x \
    --prompts-jsonl prompts.jsonl \
    --out embeds.pt

# Step 2: vLLM decoder reads embeds.pt → completions.jsonl
python -m inference.vllm_inference.decode \
    --checkpoint latent-context/0.6b-4b-LCLM-16x \
    --embeds-pt embeds.pt \
    --out completions.jsonl

See inference/examples/README.md in the codebase for the prompts.jsonl schema and an end-to-end RULER NIAH eval driver.
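
The two stages can also be chained from a small Python driver. This is a sketch, assuming it is run from the root of the LCLM checkout so that `inference.vllm_inference` is importable; the function names are hypothetical, and only the flags shown in the commands above are used.

```python
import subprocess
import sys
from typing import List

CHECKPOINT = "latent-context/0.6b-4b-LCLM-16x"


def encode_cmd(prompts_jsonl: str, embeds_pt: str,
               checkpoint: str = CHECKPOINT) -> List[str]:
    """Build the stage 1 command (HF encoder: prompts jsonl to embeds .pt)."""
    return [sys.executable, "-m", "inference.vllm_inference.encode",
            "--checkpoint", checkpoint,
            "--prompts-jsonl", prompts_jsonl,
            "--out", embeds_pt]


def decode_cmd(embeds_pt: str, completions_jsonl: str,
               checkpoint: str = CHECKPOINT) -> List[str]:
    """Build the stage 2 command (vLLM decoder: embeds .pt to completions)."""
    return [sys.executable, "-m", "inference.vllm_inference.decode",
            "--checkpoint", checkpoint,
            "--embeds-pt", embeds_pt,
            "--out", completions_jsonl]


def run_two_stage(prompts_jsonl: str, completions_jsonl: str,
                  embeds_pt: str = "embeds.pt") -> None:
    """Run encoder, then decoder, each in its own process."""
    # check=True stops the pipeline if the encoder fails. The encoder process
    # has exited (and released its GPU memory) before the decoder starts.
    subprocess.run(encode_cmd(prompts_jsonl, embeds_pt), check=True)
    subprocess.run(decode_cmd(embeds_pt, completions_jsonl), check=True)
```

Calling `run_two_stage("prompts.jsonl", "completions.jsonl")` mirrors the two shell commands; keeping the stages as separate subprocesses preserves the memory isolation described above.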

Configuration

| field | value |
|---|---|
| encoder | Qwen/Qwen3-Embedding-0.6B |
| decoder | Qwen/Qwen3-4B-Instruct-2507 |
| compression_ratio | 16 |
| encoder_window_size | 1024 |
| pooling | mean |
| encoder_mask_type | causal |
| boundary_overlap | 0 |
| adapter_type | mlp |
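
To get a feel for what these settings imply, the arithmetic can be sketched as below. It assumes, without confirmation from this card, that the encoder processes the memory span in non-overlapping windows of `encoder_window_size` tokens, that every `compression_ratio` consecutive encoder states are mean-pooled into one latent vector, and that a trailing partial group still yields one vector. All function names are hypothetical.

```python
import math
from typing import List

COMPRESSION_RATIO = 16
ENCODER_WINDOW_SIZE = 1024


def num_windows(n_tokens: int, window: int = ENCODER_WINDOW_SIZE) -> int:
    """Number of encoder windows needed (boundary_overlap = 0 assumed)."""
    return math.ceil(n_tokens / window)


def num_latents(n_tokens: int, ratio: int = COMPRESSION_RATIO,
                window: int = ENCODER_WINDOW_SIZE) -> int:
    """Latent vectors produced when pooling happens inside each window."""
    full, rem = divmod(n_tokens, window)
    return full * math.ceil(window / ratio) + math.ceil(rem / ratio)


def mean_pool(states: List[List[float]], ratio: int) -> List[List[float]]:
    """Average each run of `ratio` consecutive state vectors into one."""
    pooled = []
    for start in range(0, len(states), ratio):
        group = states[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group)
                       for d in range(dim)])
    return pooled
```

Under these assumptions a 32,768-token document spans 32 windows and is handed to the decoder as 2,048 latent vectors, that is, 64 per full window.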

Code: https://github.com/LeonLixyz/LCLM

