Micro-Distilled GRPO+VAE Model

Model Description

This is a distilled language model trained using Group Relative Policy Optimization (GRPO) with VAE filtering.

Model Details

  • Model type: micro-distill-grpo-vae
  • Model size: 42M parameters
  • Language: English
  • License: Apache 2.0

Training Methodology

  • GRPO (Group Relative Policy Optimization): 8 groups
  • VAE Filtering: 32D latent space
  • KV-Cache Reuse: 512 cache size
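As a rough illustration of the GRPO item above, the core of the method is scoring each sampled completion relative to the other completions drawn for the same prompt. The sketch below assumes that "8 groups" means eight completions per prompt, which this card does not confirm. The function name, the reward values, and the use of population standard deviation are illustrative choices, not taken from this model's training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population standard deviation; implementations differ on
    this detail, and the card does not say which one was used.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 8 completions sampled from a single prompt.
rewards = [0.1, 0.4, 0.4, 0.9, 0.2, 0.7, 0.5, 0.8]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below the mean get a negative one. No separate value network is needed, because the group itself serves as the baseline.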

Architecture Details

  • Hidden size: 512
  • Number of layers: 8
  • Attention heads: 8
  • Vocabulary size: 50257
  • Maximum sequence length: 1024
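Two quantities follow directly from the figures above: the per-head dimension and an estimate of key/value cache memory at full context. The sketch below assumes standard multi-head attention (no grouped-query sharing), 16-bit cache values, and batch size 1; none of these are stated in this card. `ArchConfig` is a hypothetical helper, not part of the released files.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchConfig:
    # Values copied from the Architecture Details list.
    hidden_size: int = 512
    num_layers: int = 8
    num_heads: int = 8
    max_seq_len: int = 1024

    @property
    def head_dim(self) -> int:
        # Each head attends over an equal slice of the hidden vector.
        return self.hidden_size // self.num_heads

    def kv_cache_bytes(self, seq_len: int, bytes_per_value: int = 2) -> int:
        # Two tensors (K and V) per layer, each of shape seq_len x hidden_size.
        # bytes_per_value=2 assumes fp16/bf16 storage.
        return 2 * self.num_layers * seq_len * self.hidden_size * bytes_per_value

cfg = ArchConfig()
print(cfg.head_dim)                                 # 64
print(cfg.kv_cache_bytes(cfg.max_seq_len) / 2**20)  # 16.0 (MiB)
```

Under these assumptions the cache costs 16 KiB per token, so a full 1024-token context needs about 16 MiB per sequence.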

Usage

Using Transformers

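from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the full Hub repository ID or a local path to the downloaded files.
model_id = "micro-distill-grpo-vae"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the prompt and return PyTorch tensors.
inputs = tokenizer("Hello, world!", return_tensors="pt")
# max_length counts the prompt tokens as well as the generated ones.
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))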