---
license: mit
language:
- en
library_name: pytorch
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- libero
- llama
- llama-3.2-vision
- dit-regression
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
datasets:
- LIBERO
---

# LlamaOFT · LIBERO (all 4 suites, joint training, 80 k steps)

> Vision-Language-Action (VLA) checkpoint released with the
> [AlphaBrain](https://github.com/AlphaBrainGroup/AlphaBrain) framework.
> Trained jointly on **all four LIBERO suites** (Goal, Spatial, Object, and
> Long) for direct evaluation across the full LIBERO benchmark without
> retraining.

LlamaOFT couples a **Llama-3.2-11B-Vision** VLM with a **DiT-B regression
action head** (action_dim=7, horizon=8). This release is the
**steps = 80 000** checkpoint of a 150 000-step budget run on LIBERO
`libero_all`, and is the strongest multi-task LlamaOFT checkpoint in the
AlphaBrain family on LIBERO.

## Overview

| | |
|:---|:---|
| **Architecture** | LlamaOFT (Llama 3.2 Vision 11B + DiT-B regression head) |
| **Base VLM** | `meta-llama/Llama-3.2-11B-Vision-Instruct` |
| **Action head** | DiT-B · `hidden_size=4096`, `action_dim=7`, `state_dim=7`, horizon 8 |
| **Training data** | LIBERO · **all 4 suites (Goal + Spatial + Object + Long)** · `dataset_mix=libero_all` |
| **Training type** | Supervised fine-tuning (single run; not continual learning) |
| **Attention** | SDPA |
| **Optimiser** | AdamW · cosine-with-min-lr |
| **Step budget** | **80 000 (this release)** / 150 000 planned |
| **Hardware / batch** | 4 × A800 80 GB · `per_device_batch = 4` · `grad_accum = 8` · **effective batch = 128** |

## Results

Evaluated on all 4 LIBERO suites, **50 rollouts per task × 10 tasks per suite = 500 episodes per suite**.

| Suite             | Success Rate |
|:------------------|:------------:|
| LIBERO-Goal       | **97.2 %**   |
| LIBERO-Spatial    | **92.4 %**   |
| LIBERO-Object     | **99.4 %**   |
| LIBERO-10 (Long)  | **82.6 %**   |
| **Avg (4-suite)** | **92.9 %**   |

## Files

```
├── README.md                 model card
├── framework_config.yaml     AlphaBrain framework configuration
├── dataset_statistics.json   action normalization statistics
├── model.safetensors         full VLA weights (~21 GB, Llama 11B + DiT-B + DINO)
├── resume_meta.json          training metadata (completed_steps=80000, effective_bs=128)
└── llama_pretrained/         Llama-3.2-Vision tokenizer + chat_template + preprocessor configs
```

## Usage

```bash
git clone https://github.com/AlphaBrainGroup/AlphaBrain.git
cd AlphaBrain
pip install -e .

export PRETRAINED_MODELS_DIR=/path/to/models   # must contain Llama-3.2-11B-Vision-Instruct/

huggingface-cli download AlphaBrainGroup/llamaoft-libero-all4suite \
    --local-dir ./llamaoft_libero_all

python deployment/model_server/server_policy.py \
    --ckpt_path ./llamaoft_libero_all --port 10093 --use_bf16
```

For evaluation on any of the 4 LIBERO suites, see the
[LIBERO eval pipeline](https://github.com/AlphaBrainGroup/AlphaBrain/tree/dev/benchmarks/LIBERO/eval).

## Reproduction

```bash
bash scripts/run_base_vla/train.sh llama_oft_all_150k
```

Expect multi-day training on 4 × A800 80 GB for the full 150 000-step
schedule. The shipped `framework_config.yaml` is the exact training
configuration used for this checkpoint.

## Notes

- **Joint-training baseline**, not continual learning.
- **Attention: SDPA**, chosen so the checkpoint loads without a pinned
  flash-attn wheel. Users can override to `flash_attention_2` via
  `--framework.llamavl.attn_implementation=flash_attention_2` if available.

## License

MIT. See the [parent repository](https://github.com/AlphaBrainGroup/AlphaBrain).

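
## Checking the reported numbers

The figures in the Overview and Results tables can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the AlphaBrain codebase; the function and constant names are hypothetical. It assumes that the effective batch is the product of GPU count, per-device batch, and gradient-accumulation steps, and that the 4-suite average is the unweighted mean of the per-suite success rates (which equals the pooled rate here, since every suite has the same number of episodes).

```python
# Values copied from the Overview and Results tables of this model card.
NUM_GPUS = 4
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8

TASKS_PER_SUITE = 10
ROLLOUTS_PER_TASK = 50

SUCCESS_RATE_PCT = {
    "LIBERO-Goal": 97.2,
    "LIBERO-Spatial": 92.4,
    "LIBERO-Object": 99.4,
    "LIBERO-10 (Long)": 82.6,
}


def effective_batch(num_gpus: int, per_device_batch: int, grad_accum: int) -> int:
    """Samples contributing to one optimiser step (assumed: simple product)."""
    return num_gpus * per_device_batch * grad_accum


def episodes_per_suite(tasks: int, rollouts_per_task: int) -> int:
    """Total evaluation episodes run for one suite."""
    return tasks * rollouts_per_task


def implied_successes(rate_pct: float, episodes: int) -> int:
    """Number of successful episodes implied by a percentage success rate."""
    return round(rate_pct / 100 * episodes)


def mean_rate(rates_pct: dict) -> float:
    """Unweighted mean of the per-suite success rates, in percent."""
    return sum(rates_pct.values()) / len(rates_pct)


if __name__ == "__main__":
    episodes = episodes_per_suite(TASKS_PER_SUITE, ROLLOUTS_PER_TASK)
    print("effective batch:", effective_batch(NUM_GPUS, PER_DEVICE_BATCH, GRAD_ACCUM))
    print("episodes per suite:", episodes)
    for suite, rate in SUCCESS_RATE_PCT.items():
        print(f"{suite}: {implied_successes(rate, episodes)} / {episodes} successes")
    print(f"4-suite average: {mean_rate(SUCCESS_RATE_PCT):.1f} %")
```

Each reported percentage corresponds to a whole number of successful episodes out of 500, which is a quick consistency check on the table.
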
## Citation

```bibtex
@misc{alphabrain2026,
  title  = {AlphaBrain: A Modular Open-Source Framework for Embodied Intelligence Research},
  author = {AlphaBrain Team},
  year   = {2026},
  url    = {https://github.com/AlphaBrainGroup/AlphaBrain}
}
```