---
title: Office Document Task Environment
emoji: 📊
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /dashboard/
tags:
  - openenv
  - agent-environment
  - rl-environment
  - office
  - excel
  - word
  - powerpoint
---

# Office Document Task Environment

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) **code-execution environment** for training and evaluating LLM agents on **real-world office document work**: Excel spreadsheets, Word documents, and PowerPoint decks. The agent writes Python code (`openpyxl` / `python-docx` / `python-pptx`) to read or modify authentic enterprise files and gets graded by a **multi-layer, gaming-resistant** scoring stack.

> 119 tasks across 3 file formats · 22-task eval split · Real artifacts from
> [Finch (FinWorkBench)](https://huggingface.co/datasets/FinWorkBench/Finch),
> [OSWorld-Verified](https://github.com/xlang-ai/OSWorld), and
> [PPTArena](https://github.com/michaelofengend/PPTArena) · Multi-layer
> grading + four independent reward-hacking defenses.

---

## The story (~30 sec read)

**Problem.** Office workers spend hundreds of hours/year on spreadsheet, doc, and slide work. Current LLMs are tested on each format in isolation, in synthetic settings, with diff-based graders that an agent can game by copying the gold file. Nobody trains end-to-end across the three formats on real artifacts with proper anti-hacking defenses.

**Environment.** The agent gets a real `.xlsx`/`.docx`/`.pptx`, an instruction in natural language, and a Python sandbox. It has 15 steps to read, modify, and submit the file. Per-step rewards measure *real file state*: did your code actually mutate the file? Did the file stay valid? Did its structural distance to the gold reference actually decrease? Final grade is a 2- or 3-layer composition: validity gate + structural diff + (for docx) the per-task evaluator from OSWorld.

**Training pipeline.** Kimi-K2.5 (the teacher) ran on the 97 training tasks to produce trajectories; we filtered them with a defense-in-depth pipeline (score ≥ 0.4, more than 1 step, never `1-step submit_file`) into a 53-trajectory SFT corpus. Qwen2.5-Coder-3B-Instruct (the student) was warm-started with LoRA on this corpus across two configs (4K and 8K context); both runs were logged on HF Jobs L40S, ~$0.50–0.80 each. The 8K run is online at [bpHigh/qwen3b-office-sft-kimi-long](https://huggingface.co/bpHigh/qwen3b-office-sft-kimi-long).

---

## Hero results: 22-task eval split

| Model | Avg score | Success rate | xlsx (n=10) | docx (n=4) | pptx (n=8) |
|---|---|---|---|---|---|
| **MiniMaxAI/MiniMax-M2.1** (frontier baseline) | 0.390 | 41% | 0.293 | 0.445 | 0.485 |
| **moonshotai/Kimi-K2.5** (teacher) | 0.481 | 52% | 0.370 | 0.472 | 0.673 |
| **Qwen/Qwen2.5-Coder-3B-Instruct** (student baseline) | **0.001** | 0% | 0.001 | 0.001 | 0.001 |
| **Qwen2.5-Coder-3B + LoRA SFT (4K)** | 0.006 | 0% | 0.007 | 0.005 | 0.005 |
| **Qwen2.5-Coder-3B + LoRA SFT (8K)** | 0.011 | 0% | 0.018 | 0.005 | 0.005 |

> **Reading the SFT rows.** Both adapters lift the vanilla baseline ~6–11×
> on the eval set, but every episode still bottoms out at the env's reward
> floor (0.005): the model produces *parseable* code but it doesn't mutate
> the source file in ways the grader rewards. The SFT loss is
> well-converged (0.19 on the training distribution), so the gap is a
> generalization-from-Kimi-trajectories problem, not an under-training one.
> The *next* step, GRPO continuation directly against the env's reward
> signal, is what's expected to close this. See [`train_grpo.py`](train_grpo.py)
> and the rollout-format note in [`edits.md`](edits.md) Phase 13.

Reproduce any row:

```bash
# Hosted models via HF Router
python inference.py --split eval --model MiniMaxAI/MiniMax-M2.1 \
    --output-dir runs/baseline_minimax_m21_eval

# In-process LoRA eval (no hosted endpoint needed)
python eval_lora.py \
    --adapters bpHigh/qwen3b-office-sft-kimi,bpHigh/qwen3b-office-sft-kimi-long \
    --split eval --output-dir runs/sft_eval
```

Per-task breakdown lives in `runs//summary.csv`; full step-by-step trajectories in `runs//trajectories/.jsonl`.

---

## SFT training run: what the student saw

Student model: `Qwen/Qwen2.5-Coder-3B-Instruct`. LoRA r=32 on all-linear targets, bf16, assistant-only loss masking. Two runs on 1× L40S 48GB ($1.80/hr).

![SFT loss: 4K vs 8K context length ablation](runs/sft_plots/comparison_4k_vs_8k.png)

| | 4K context | 8K context |
|---|---|---|
| Hardware | L40S 48GB | L40S 48GB |
| Runtime | 198s | 354s |
| Loss start → end | 0.412 → 0.069 | 0.384 → 0.103 |
| Final train_loss | 0.196 | 0.193 |
| Cost | ~$0.50 | ~$0.80 |
| Adapter | [bpHigh/qwen3b-office-sft-kimi](https://huggingface.co/bpHigh/qwen3b-office-sft-kimi) | [bpHigh/qwen3b-office-sft-kimi-long](https://huggingface.co/bpHigh/qwen3b-office-sft-kimi-long) |

The 8K curve has slightly higher end-loss because it sees the *long* debugging trajectories from Kimi (5–8 of 53 episodes get truncated at 4K). Same convergence target, harder distribution → which configuration generalizes better is what the eval will tell us.

**Training artifacts (every run is reproducible from these):**

| Run | Adapter on Hub | Raw stdout log | HF Job page | Loss curve |
|---|---|---|---|---|
| 4K context | [bpHigh/qwen3b-office-sft-kimi](https://huggingface.co/bpHigh/qwen3b-office-sft-kimi) | [raw_logs.txt](https://raw.githubusercontent.com/bp-high/openenv_financial_task_env/main/runs/sft_plots/qwen3b_kimi/raw_logs.txt) ([repo path](runs/sft_plots/qwen3b_kimi/raw_logs.txt)) | [Job 69ed74ae…4fc](https://huggingface.co/jobs/bpHigh/69ed74aed70108f37acdf4fc) | [PNG](runs/sft_plots/qwen3b_kimi/sft_loss_curve.png) |
| 8K context | [bpHigh/qwen3b-office-sft-kimi-long](https://huggingface.co/bpHigh/qwen3b-office-sft-kimi-long) | [raw_logs.txt](https://raw.githubusercontent.com/bp-high/openenv_financial_task_env/main/runs/sft_plots/qwen3b_kimi_long/raw_logs.txt) ([repo path](runs/sft_plots/qwen3b_kimi_long/raw_logs.txt)) | [Job 69ed7b51…ef4](https://huggingface.co/jobs/bpHigh/69ed7b51d2c8bd8662bceef4) | [PNG](runs/sft_plots/qwen3b_kimi_long/sft_loss_curve.png) |

Re-parse any HF Job's stdout into clean metrics + a loss curve PNG with [`data_pipeline/analyze_sft_logs.py`](data_pipeline/analyze_sft_logs.py); it takes a `--job-id` and emits `training_metrics.jsonl`, `summary.json`, and `sft_loss_curve.png`. Both runs above were generated this way.

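
For readers who want a feel for what such a re-parse involves, here is a minimal sketch. It assumes the raw stdout contains Trainer-style metric lines printed as Python dict literals (per-step records with a `loss` key, and one final record with a `train_loss` key). `parse_metrics`, `to_jsonl`, and `SAMPLE_LOG` are hypothetical names for illustration; the real logic lives in `analyze_sft_logs.py` and may differ.

```python
import ast
import json
import re

# Assumption: metric lines are printed as Python dict literals.
DICT_LINE = re.compile(r"\{.*\}")

# Illustrative sample in the assumed format, not real job output.
SAMPLE_LOG = """\
loading base model
{'loss': 0.412, 'epoch': 0.5}
{'loss': 0.069, 'epoch': 4.0}
{'train_runtime': 198.0, 'train_loss': 0.196, 'epoch': 4.0}
"""


def parse_metrics(log_text):
    """Return (per-step records, final summary record) from raw stdout text."""
    steps, summary = [], {}
    for line in log_text.splitlines():
        match = DICT_LINE.search(line)
        if not match:
            continue
        try:
            record = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError, TypeError):
            continue  # braces were present, but not a dict literal
        if not isinstance(record, dict):
            continue
        if "train_loss" in record:
            summary = record      # end-of-training summary record
        elif "loss" in record:
            steps.append(record)  # per-step training record
    return steps, summary


def to_jsonl(records):
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


steps, summary = parse_metrics(SAMPLE_LOG)
print(to_jsonl(steps))
print(json.dumps(summary))
```

Using `ast.literal_eval` rather than `eval` keeps the parser safe on untrusted log text, at the cost of skipping any line that is not a plain literal.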
**Eval artifacts (both SFT adapters scored against the 22-task held-out split):**

| Run | Eval results.json | Raw stdout log | HF Job page |
|---|---|---|---|
| 4K context | [results.json](runs/sft_eval_v2/bpHigh_qwen3b-office-sft-kimi/results.json) | [raw_logs.txt](runs/sft_eval_v2/raw_logs.txt) | [Job 69ed97e5…2ad](https://huggingface.co/jobs/bpHigh/69ed97e5d2c8bd8662bcf2ad) |
| 8K context | [results.json](runs/sft_eval_v2/bpHigh_qwen3b-office-sft-kimi-long/results.json) | [raw_logs.txt](runs/sft_eval_v2/raw_logs.txt) | [Job 69ed97e5…2ad](https://huggingface.co/jobs/bpHigh/69ed97e5d2c8bd8662bcf2ad) |

Both adapters were evaluated in a single HF Jobs run (L40S, ~30 min, ~$1) via [`eval_lora.py --adapters A,B`](eval_lora.py); the base model loads once and each adapter is detached/reattached without reloading.

---

## Task inventory (119 total)

| Family | Source | Train | Eval | Total | What it tests |
|---|---|---|---|---|---|
| `xlsx` | [Finch](https://huggingface.co/datasets/FinWorkBench/Finch), hand-curated (Round 1) | 10 | 0 | 10 | Diverse Finch tasks hand-picked for the original submission (QA + MODIFY mix) |
| `xlsx` | [Finch](https://huggingface.co/datasets/FinWorkBench/Finch), stratified pull (Round 2) | 40 | 10 | 50 | Stratified across 7 task-type tags |
| `docx` | [OSWorld-Verified](https://github.com/xlang-ai/OSWorld) (libreoffice_writer) | 17 | 4 | 21 | 16 distinct evaluator functions ported from `desktop_env/evaluators/metrics/docs.py` |
| `pptx` | [PPTArena](https://github.com/michaelofengend/PPTArena) | 30 | 8 | 38 | 16 distinct edit_types, including singletons (transitions, animations, A/V) |
| **Total** | | **97** | **22** | **119** | |

All 60 xlsx tasks come from **Finch (FinWorkBench)**: the 10 hand-curated Round-1 picks plus the 50 tasks from the stratified Round-2 pull. The 22-task eval set is stratified (at least 1 task per tag bucket per family) so the benchmark isn't biased toward one task type.

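
One way a "cover every bucket first, then fill" stratified split can be built is sketched below. This is an illustration under assumptions, not the repository's actual split logic: `stratified_eval_split` and the task keys `id`, `family`, and `tag` are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_eval_split(tasks, eval_size, seed=0):
    """Pick `eval_size` task ids so every (family, tag) bucket appears at least once.

    `tasks` is a list of dicts with the hypothetical keys "id", "family", "tag".
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for task in tasks:
        buckets[(task["family"], task["tag"])].append(task["id"])
    if eval_size < len(buckets):
        raise ValueError("eval_size is smaller than the number of buckets")

    chosen = []
    # Pass 1: take one task from every bucket to guarantee coverage.
    for key in sorted(buckets):
        rng.shuffle(buckets[key])
        chosen.append(buckets[key].pop())
    # Pass 2: fill the remaining slots at random from what is left.
    leftover = [tid for key in sorted(buckets) for tid in buckets[key]]
    rng.shuffle(leftover)
    chosen.extend(leftover[: eval_size - len(chosen)])
    return chosen


# Toy inventory: 2 families x 3 tags x 2 tasks each.
toy_tasks = [
    {"id": f"t{i}", "family": "xlsx" if i < 6 else "pptx", "tag": f"tag{i % 3}"}
    for i in range(12)
]
print(stratified_eval_split(toy_tasks, eval_size=8))
```

Fixing the seed keeps the split reproducible, which matters when the eval set is reported alongside published scores.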

---

## How an episode works

```
reset(task_id="finch_10")
   ↓
obs.task_description = "Per the headers and established formula logic, populate
                        formulas for columns X through AH so the timing model's
                        performance statistics for 2013–2025 are complete..."
obs.source_file      = "/tmp/financial_env_finch_10_xxx/10_src_0.xlsx"
obs.family           = "xlsx"

step(action_type="code", content="...")        # 0–15 of these
   ↓ subprocess runs the code, returns stdout/stderr
   ↓ env measures: did the file change? is it still valid? did it move toward gold?
   ↓ reward = 0.005–0.10  (dense process reward)

step(action_type="submit_file", content="")    # ends episode
   ↓ multi-layer grading
   ↓ reward = 0.001–0.999  (final grade)
```

Three action types: `"code"` (Python), `"submit"` (text answer for QA tasks), `"submit_file"` (path to a modified file). **Submit is rejected on step 1**: the agent must execute code at least once before any submission is accepted (see *Defenses* below).

---

## Reward design

Two layers, both designed for *spec-aligned* signal.

### Per-step process reward (6 components, capped at 0.10/step)

| Signal | Range | Measured from |
|---|---|---|
| `exec_health` | 0–0.020 | Subprocess exit code; bonus if stdout non-empty |
| `lib_engagement` | 0–0.010 | Code uses the family's expected library |
| `mutation` | 0–0.030 | SHA-256 of working file changed |
| `validity` | 0–0.020 | Mutated file still parses with the family's loader |
| `progress` | 0–0.040 | Structural distance to gold *decreased* |
| `eval_check` | 0–0.020 | Per-task evaluator score *increased* (docx-only) |

`progress` and `eval_check` give RL a dense gradient *toward correctness*, not just "code ran". They're disabled at eval time (`FINANCIAL_ENV_PROGRESS=0`) to keep the benchmark honest.

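
As a rough illustration of how the six per-step signals could combine, the sketch below clips each component to its range from the table, sums them, and clamps the total to the 0.005–0.10 per-step range. The additive combination, the floor handling, and the names `step_reward` and `file_sha256` are assumptions made for this example; the actual logic lives in `rewards.py` and may differ.

```python
import hashlib
from pathlib import Path

# Upper bounds copied from the per-step table.
COMPONENT_CAPS = {
    "exec_health": 0.020,
    "lib_engagement": 0.010,
    "mutation": 0.030,
    "validity": 0.020,
    "progress": 0.040,
    "eval_check": 0.020,
}
STEP_CAP = 0.10     # per-step cap from the section heading
STEP_FLOOR = 0.005  # lower end of the per-step range (assumed to act as a floor)


def file_sha256(path):
    """Hex SHA-256 of a file's bytes; comparing before/after detects mutation."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def step_reward(components):
    """Clip each component to its range, sum, then clamp to [floor, cap]."""
    total = 0.0
    for name, cap in COMPONENT_CAPS.items():
        value = components.get(name, 0.0)
        total += min(max(value, 0.0), cap)
    return min(max(total, STEP_FLOOR), STEP_CAP)


# A step that ran cleanly, changed the file, and kept it valid.
print(step_reward({"exec_health": 0.02, "mutation": 0.03, "validity": 0.02}))
# A step that earned nothing still receives the floor.
print(step_reward({}))
```

Note that the six upper bounds add up to 0.14, so the 0.10 cap only binds when nearly every signal fires on the same step.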
### Final grade (per family)

| Family | Layer 1 (gate) | Layer 2 | Layer 3 |
|---|---|---|---|
| `xlsx` | – | 30% sheet-name match | 70% cell-level diff (2% numeric tolerance) |
| `docx` | python-docx parse + byte-equality refusal | 40% paragraph diff | 60% per-task OSWorld evaluator (`compare_docx_files`, `check_tabstops`, etc.) |
| `pptx` | python-pptx parse + byte-equality refusal | 20% slide-count | 80% avg per-shape composite: 40% text + 20% style + 20% position + 20% size |

The `docx` 3rd layer is a port of OSWorld's [`metrics/docs.py`](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/metrics/docs.py) (Apache-2.0). 16 evaluator functions, including compound `or` (multi-gold) and `and` (all-must-pass) checks.

### Anti-hacking defenses (4 independent layers)

A model attempting the [Kimi-K2.5 exploit](edits.md#phase-7--live-discovered-exploit--anti-exploit-fix) (submit unmodified source on step 1, score 0.998) hits **all four** of:

| Layer | Phase | What it does |
|---|---|---|
| **Env action gate** | 9 | Refuses `submit_file` if no code step has been taken; the episode stays open for recovery |
| **Per-episode gold stash** | 4 | Gold files moved to `/tmp/oe_gold_/` at episode start; restored on close. Defeats `glob('data/**/*Gold*')` searches |
| **Grader byte-equality refusal** | 7 | If submit's bytes match source bytes → score=0.001 (unless task is OSWorld `infeasible`) |
| **SFT corpus filter** | 8 | Builder drops `n_steps==1 + submit_file` trajectories even at high score |

See [`edits.md`](edits.md) for the live story of how Kimi found the exploit during eval and the 3 fixes that followed.

---

## Action & Observation spaces

### `FinancialAction`

| Field | Type | Description |
|---|---|---|
| `action_type` | `str` | `"code"`, `"submit"`, `"submit_file"` |
| `content` | `str` | Code, answer text, or absolute file path |

### `FinancialObservation`

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | e.g. `finch_10`, `osworld_0a0faba3`, `pptarena_case_60_fix_text_placement` |
| `task_description` | `str` | Instruction + constraints + source-file summary |
| `source_file` | `str` | Path to the working file (per-episode tmpdir copy) |
| `task_type` | `str` | `"QA"` or `"MODIFY"` |
| `feedback` | `str` | Stdout/stderr of code, or grading explanation. Includes the per-step reward decomposition. |
| `current_step` / `max_steps` | `int` | 0–15 |
| `done` | `bool` | Episode finished |
| `reward` | `float` | Step or final reward in (0.001, 0.999) |

---

## Setup & usage

### Prerequisites

- Python 3.10+
- Docker (for HF Space deployment)
- LLM API key (for `inference.py`) or HF Jobs subscription (for training)

### Local dev

```bash
pip install -e ".[dev]"
PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 8000 \
    --ws-ping-interval 600 --ws-ping-timeout 600 --reload
```

### Run a baseline against a hosted model

```bash
export HF_TOKEN="hf_..."
python inference.py \
    --split eval \
    --model MiniMaxAI/MiniMax-M2.1 \
    --api-base https://router.huggingface.co/v1 \
    --env-url http://localhost:8000 \
    --task-timeout 900
```

### Run an in-process LoRA eval (no hosted endpoint needed)

```bash
python eval_lora.py \
    --base-model Qwen/Qwen2.5-Coder-3B-Instruct \
    --adapters bpHigh/qwen3b-office-sft-kimi,bpHigh/qwen3b-office-sft-kimi-long \
    --split eval \
    --output-dir runs/sft_eval
```

CLI flags worth knowing on `inference.py`:

- `--split {train,eval,all}`
- `--family {xlsx,docx,pptx,all}`
- `--task-ids id1,id2,…` (overrides split/family)
- `--limit N`
- `--resume`: merge new task results into an existing `--output-dir`
- `--skip-completed`: re-run only failed/exploit tasks (paired with `--resume`)

### Re-pull data from upstream sources

```bash
python data_pipeline/finch_pull.py
python data_pipeline/osworld_writer_pull.py
python data_pipeline/pptarena_pull.py --root /path/to/PPTArena-main
```

### Build the SFT corpus

```bash
python data_pipeline/build_sft_corpus.py \
    --runs runs/teacher_kimi_k25_train \
    --output data/sft_kimi_k25.jsonl \
    --score-threshold 0.4
```

### Train the student (HF Jobs)

```bash
hf jobs run --flavor l40sx1 --timeout 8h --secrets HF_TOKEN \
    pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel \
    bash -c "apt-get update -qq && apt-get install -y -qq git && \
        git clone https://github.com/bp-high/openenv_financial_task_env.git /work && \
        cd /work && \
        pip install -U 'trl>=0.11' peft accelerate bitsandbytes && \
        python train_sft.py \
            --dataset data/sft_kimi_k25.jsonl \
            --base-model Qwen/Qwen2.5-Coder-3B-Instruct \
            --output-dir /tmp/qwen3b-sft \
            --epochs 4 --gradient-accumulation 2 --lora-r 32 \
            --max-seq-len 8192 \
            --push-to-hub bpHigh/qwen3b-office-sft-kimi-long"
```

### Docker

```bash
docker build -t office-task-env:latest .
docker run -p 8000:8000 office-task-env:latest
```

---

## Project structure

```
.
├── data/
│   ├── manifest.jsonl              # 109 rows: 50 Finch + 21 OSWorld + 38 PPTArena
│   ├── sft_kimi_k25.jsonl          # 53 filtered teacher trajectories
│   ├── 0/, 21/, …                  # 10 hand-curated xlsx tasks
│   ├── finch_50//{src,ref}.xlsx
│   ├── osworld_writer//.docx
│   └── pptarena//{src,ref}.pptx
├── data_pipeline/
│   ├── finch_pull.py               # Phase 1: Finch xlsx tasks
│   ├── osworld_writer_pull.py      # Phase 3: OSWorld docx tasks
│   ├── pptarena_pull.py            # Phase 5: PPTArena pptx tasks
│   ├── build_sft_corpus.py         # Phase 8: trajectories → SFT JSONL
│   ├── analyze_sft_logs.py         # Phase 10.1: HF Job logs → metrics + PNG
│   └── compare_sft_runs.py         # overlay multiple SFT runs
├── graders/
│   ├── __init__.py                 # grade_xlsx + grade_docx + grade_pptx
│   └── docx_metrics.py             # 16 ported OSWorld evaluators
├── server/
│   ├── financial_environment.py    # OpenEnv environment + gold-stash + early-submit gate
│   └── app.py                      # FastAPI + WebSocket
├── rewards.py                      # 6-component RewardTracker
├── tasks.py                        # Manifest loader
├── inference.py                    # API-based eval (Router/Nebius/OpenAI)
├── eval_lora.py                    # In-process LoRA eval (no API needed)
├── train_sft.py                    # SFT trainer (TRL + PEFT, HF Jobs)
├── runs/                           # Baseline + teacher + SFT artifacts (incl. plots)
└── edits.md                        # Full Round-1 → Round-2 change log (10 phases)
```

---

## Round-1 → Round-2 change log

The journey from the original 10-task xlsx-only env to today's 3-format / 119-task / multi-layer-graded env is documented phase-by-phase in [`edits.md`](edits.md): manifest loader, RewardTracker, OSWorld docx port, PPTArena ingest, layout+style-aware pptx grader, gold-stash hardening, inference v2, anti-exploit defenses, SFT corpus builder, training script, log analyzer, in-process LoRA eval.

---

## Acknowledgments

- **Finch / FinWorkBench** ([dataset](https://huggingface.co/datasets/FinWorkBench/Finch), [paper](https://arxiv.org/abs/2512.13168)): the xlsx tasks
- **OSWorld-Verified** ([repo](https://github.com/xlang-ai/OSWorld)): the docx tasks and the evaluator functions in `graders/docx_metrics.py` (Apache-2.0)
- **PPTArena** ([repo](https://github.com/michaelofengend/PPTArena)): the pptx tasks and the `evaluation_pairs_refined.json` schema
- **Kimi-K2.5** (Moonshot AI, served via Nebius): the SFT teacher
- **Qwen2.5-Coder-3B-Instruct** (Alibaba Qwen team): the student model
- **TRL + PEFT + Unsloth + transformers**: training stack
- **OpenEnv / Meta PyTorch** ([repo](https://github.com/meta-pytorch/OpenEnv)): host framework
- **Hugging Face Jobs**: compute for SFT runs

If you use this environment in research, please cite the upstream datasets:

```bibtex
@article{dong2025finch,
  title={Finch: Benchmarking Finance \& Accounting across Spreadsheet-Centric Enterprise Workflows},
  author={Dong, Haoyu and Zhang, Pengkun and Gao, Yan and Dong, Xuanyu and Cheng, Yilin and Lu, Mingzhe and Yakefu, Adina and Zheng, Shuxin},
  journal={arXiv preprint arXiv:2512.13168},
  year={2025}
}
```