---
license: mit
language:
  - en
library_name: pytorch
pipeline_tag: robotics
tags:
  - robotics
  - vla
  - vision-language-action
  - libero
  - llama
  - llama-3.2-vision
  - dit-regression
base_model:
  - meta-llama/Llama-3.2-11B-Vision-Instruct
datasets:
  - LIBERO
---

# LlamaOFT · LIBERO (all 4 suites, joint training, 80 k steps)

> Vision-Language-Action (VLA) checkpoint released with the
> [AlphaBrain](https://github.com/AlphaBrainGroup/AlphaBrain) framework.
> Trained jointly on **all four LIBERO suites** — Goal, Spatial, Object, and
> Long — for direct evaluation across the full LIBERO benchmark without
> retraining.

LlamaOFT couples a **Llama-3.2-11B-Vision** VLM with a **DiT-B regression
action head** (action_dim=7, horizon=8). This release is the checkpoint taken
at **step 80 000** of a run budgeted for 150 000 steps on the LIBERO
`libero_all` mix, and is the strongest multi-task LlamaOFT checkpoint in the
AlphaBrain family on LIBERO.

## Overview

| | |
|:---|:---|
| **Architecture**        | LlamaOFT (Llama 3.2 Vision 11B + DiT-B regression head) |
| **Base VLM**            | `meta-llama/Llama-3.2-11B-Vision-Instruct` |
| **Action head**         | DiT-B · `hidden_size=4096`, `action_dim=7`, `state_dim=7`, horizon 8 |
| **Training data**       | LIBERO · **all 4 suites (Goal + Spatial + Object + Long)** · `dataset_mix=libero_all` |
| **Training type**       | Supervised fine-tuning (single run; not continual learning) |
| **Attention**           | SDPA |
| **Optimiser**           | AdamW · cosine-with-min-lr |
| **Step budget**         | **80 000 (this release)** / 150 000 planned |
| **Hardware / batch**    | 4 × A800 80 GB · `per_device_batch = 4` · `grad_accum = 8` · **effective batch = 128** |
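
The effective batch size in the table follows from the three hardware settings. A minimal sketch of that arithmetic; the variable names are illustrative and are not AlphaBrain config keys, and the sample count assumes the batch size stayed constant for all 80 000 steps:

```python
# Illustrative names only; these are not AlphaBrain config keys.
num_gpus = 4            # 4 x A800 80 GB
per_device_batch = 4    # samples per GPU per forward/backward pass
grad_accum = 8          # micro-batches accumulated before each optimiser step

# Samples contributing to one optimiser step.
effective_batch = num_gpus * per_device_batch * grad_accum

# Samples consumed by this checkpoint, assuming a constant batch size.
completed_steps = 80_000
samples_seen = effective_batch * completed_steps

print(effective_batch)
print(samples_seen)
```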

## Results

Evaluated on all 4 LIBERO suites, **50 rollouts per task × 10 tasks per suite = 500 episodes per suite**.

| Suite          | Success Rate |
|:---------------|:------------:|
| LIBERO-Goal    | **97.2 %** |
| LIBERO-Spatial | **92.4 %** |
| LIBERO-Object  | **99.4 %** |
| LIBERO-10 (Long) | **82.6 %** |
| **Avg (4-suite)** | **92.9 %** |
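
The average row can be reproduced from the per-suite numbers. The sketch below assumes each reported rate is exactly successes / 500; the per-suite success counts it derives are implied by the table, not reported separately. Because every suite has the same number of episodes, the unweighted mean equals the pooled success rate.

```python
from statistics import mean

ROLLOUTS_PER_TASK = 50
TASKS_PER_SUITE = 10
episodes_per_suite = ROLLOUTS_PER_TASK * TASKS_PER_SUITE

# Success rates (%) copied from the table above.
success_rate = {
    "LIBERO-Goal": 97.2,
    "LIBERO-Spatial": 92.4,
    "LIBERO-Object": 99.4,
    "LIBERO-10 (Long)": 82.6,
}

# Implied success counts, assuming rate = successes / episodes_per_suite.
successes = {
    suite: round(rate / 100 * episodes_per_suite)
    for suite, rate in success_rate.items()
}

# Unweighted mean over suites, and the pooled rate over all episodes.
average = round(mean(success_rate.values()), 1)
pooled = round(100 * sum(successes.values()) / (episodes_per_suite * len(successes)), 1)

print(successes)
print(average, pooled)
```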

## Files

```
├── README.md                   model card
├── framework_config.yaml       AlphaBrain framework configuration
├── dataset_statistics.json     action normalization statistics
├── model.safetensors           full VLA weights (~21 GB, Llama 11B + DiT-B + DINO)
├── resume_meta.json            training metadata (completed_steps=80000, effective_bs=128)
└── llama_pretrained/           Llama-3.2-Vision tokenizer + chat_template + preprocessor configs
```
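
After downloading, it can be worth confirming that every entry in the listing above is present before starting the server. This is a minimal sketch, not part of AlphaBrain: `missing_entries` is a hypothetical helper, and it checks only that each path exists, not that the file contents are valid.

```python
from pathlib import Path

# Entries taken from the file listing above.
EXPECTED_ENTRIES = [
    "README.md",
    "framework_config.yaml",
    "dataset_statistics.json",
    "model.safetensors",
    "resume_meta.json",
    "llama_pretrained",
]


def missing_entries(ckpt_dir):
    """Return the expected entries that do not exist under ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

For example, `missing_entries("./llamaoft_libero_all")` returns an empty list when the download is complete.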

## Usage

```bash
git clone https://github.com/AlphaBrainGroup/AlphaBrain.git
cd AlphaBrain
pip install -e .

export PRETRAINED_MODELS_DIR=/path/to/models   # must contain Llama-3.2-11B-Vision-Instruct/

huggingface-cli download AlphaBrainGroup/llamaoft-libero-all4suite \
    --local-dir ./llamaoft_libero_all

python deployment/model_server/server_policy.py \
    --ckpt_path ./llamaoft_libero_all --port 10093 --use_bf16
```

For evaluation on any of the 4 LIBERO suites, see the
[LIBERO eval pipeline](https://github.com/AlphaBrainGroup/AlphaBrain/tree/dev/benchmarks/LIBERO/eval).

## Reproduction

```bash
bash scripts/run_base_vla/train.sh llama_oft_all_150k
```

Expect multi-day training on 4 × A800 80 GB for the full 150 000-step
schedule; this release corresponds to step 80 000 of that schedule. The
shipped `framework_config.yaml` is the exact training configuration used for
this checkpoint.

## Notes

- **Joint-training baseline**, not continual learning.
- **Attention: SDPA**, chosen so that the checkpoint loads without a pinned
  flash-attn wheel. If flash-attn is installed, you can switch to it by passing
  `--framework.llamavl.attn_implementation=flash_attention_2`.

## License

MIT — see the [parent repository](https://github.com/AlphaBrainGroup/AlphaBrain).

## Citation

```bibtex
@misc{alphabrain2026,
  title  = {AlphaBrain: A Modular Open-Source Framework for Embodied Intelligence Research},
  author = {AlphaBrain Team},
  year   = {2026},
  url    = {https://github.com/AlphaBrainGroup/AlphaBrain}
}
```