# Reproducibility Checklist

This document records the environment, artifacts, and commands needed to reproduce the TMF921 Qwen3-8B QLoRA results.

## Repositories

- Research dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation code: https://huggingface.co/nraptisss/tmf921-intent-training
- Primary stage-1 adapter: https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-QLoRA-qwen3-8b-qlora-20260501-083834
- Base model: https://huggingface.co/Qwen/Qwen3-8B

## Hardware used

- GPU: NVIDIA RTX 6000 Ada Generation
- VRAM: 48 GB
- CUDA visible devices: `CUDA_VISIBLE_DEVICES=0`

Server logs confirmed:

```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```

## Software versions observed

From the model card / training logs:

- Python: 3.13.2 (server environment)
- PyTorch: 2.6.0+cu124
- Transformers: 5.7.0
- TRL: 1.3.0
- Datasets: 4.8.5
- Tokenizers: 0.22.2
- PEFT: installed in the training environment (version not recorded here)
- bitsandbytes: installed in the training environment (version not recorded here)
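To capture exact versions for every library in the list, including the ones without a recorded version, a small standard-library script can be run inside the training environment. This is a minimal sketch, not part of the repository; `collect_versions` and the `PACKAGES` list are hypothetical names, and the package names are assumed to match their PyPI distribution names.

```python
from importlib import metadata
import platform

# Assumed PyPI distribution names; adjust if your environment differs.
PACKAGES = [
    "torch",
    "transformers",
    "trl",
    "datasets",
    "tokenizers",
    "peft",
    "bitsandbytes",
]


def collect_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Saving the printed output next to any reproduced metrics makes version drift easier to diagnose later.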

## Installation

```bash
git clone https://huggingface.co/nraptisss/tmf921-intent-training
cd tmf921-intent-training

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
bash scripts/install_rtx6000ada.sh
python scripts/check_gpu.py
```

## Environment variables

```bash
export HF_TOKEN=hf_...
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="$PWD/src"
export TOKENIZERS_PARALLELISM=false
export DISABLE_TRACKIO=1
```

Trackio was disabled for the successful main run to avoid external logging failures.
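Before launching a long run, it can help to confirm that the shell actually exported these variables. The following is a minimal sketch under the assumption that the values shown in the export block are the ones you want; `check_environment` is a hypothetical helper, not a script from the repository. `HF_TOKEN` is checked for presence only and is never printed.

```python
import os

# Expected values copied from the export block in this section.
EXPECTED = {
    "CUDA_VISIBLE_DEVICES": "0",
    "TOKENIZERS_PARALLELISM": "false",
    "DISABLE_TRACKIO": "1",
}
# Variables that must be non-empty but whose value is not compared.
REQUIRED_PRESENT = ["HF_TOKEN", "PYTHONPATH"]


def check_environment(env=None):
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_PRESENT:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name, expected in EXPECTED.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name}={actual!r}, expected {expected!r}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("environment OK" if not issues else "\n".join(issues))
```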

## Stage-1 training command

Recommended nohup command:

```bash
bash scripts/nohup_new_run.sh
```

The successful stage-1 run directory was:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Key stage-1 config:

```yaml
model_name_or_path: Qwen/Qwen3-8B
dataset_name: nraptisss/TMF921-intent-to-config-research-sota
train_split: train_sota
eval_split: validation
max_length: 2048
assistant_only_loss: true
load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: all-linear
learning_rate: 0.0002
lr_scheduler_type: constant
warmup_steps: 0
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
bf16: true
gradient_checkpointing: true
optim: paged_adamw_32bit
epochs: 2
```

If an out-of-memory (OOM) error occurs, preserve the effective batch size of 16 sequences per optimizer step (2 × 8 on one GPU) by using:

```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
```
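The arithmetic behind this fallback is short enough to check directly. The sketch below assumes the single-GPU setup recorded in the server logs; `effective_batch_size` is a hypothetical helper used only for illustration.

```python
def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Number of sequences that contribute to each optimizer step."""
    return per_device * grad_accum * num_gpus


# Main config: per_device_train_batch_size=2, gradient_accumulation_steps=8
main = effective_batch_size(2, 8)
# OOM fallback: per_device_train_batch_size=1, gradient_accumulation_steps=16
fallback = effective_batch_size(1, 16)

assert main == fallback == 16
print(main, fallback)
```

Halving the per-device batch while doubling accumulation keeps the number of sequences per optimizer step unchanged, so the learning rate does not need to be retuned. Peak activation memory drops; wall-clock time per step may increase.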

## Stage-1 evaluation

Merge the adapter into the base model for faster evaluation:

```bash
RUN_DIR="runs/qwen3-8b-qlora-20260501-083834"

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter "$RUN_DIR/outputs/adapter" \
  --output_dir "$RUN_DIR/outputs/merged"
```

Evaluate:

```bash
EVAL_BATCH_SIZE=8 \
bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"
```

Normalize metrics:

```bash
python scripts/normalize_eval_metrics.py \
  --eval_dir "$RUN_DIR/eval_merged"
```

If you used the default output directory of `nohup_eval.sh`, replace `eval_merged` with `eval` in the path above.

## Results packaging

```bash
python scripts/package_results.py \
  --stage1_eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
  --stage2_eval_dir runs/stage2-weak-20260505-080040/eval \
  --output_dir results
```

Sample qualitative failure examples:

```bash
python scripts/sample_failure_examples.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
  --output_dir analysis/stage1_examples
```

## Main results to reproduce

Stage-1 normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

## Determinism caveats

- Generation evaluation uses deterministic decoding (`temperature=0.0`) by default.
- Minor metric differences may still occur across CUDA, Transformers, bitsandbytes, and PyTorch versions.
- Training is subject to nondeterminism from GPU kernels and data processing.
- Report exact library versions with any reproduced results.
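Fixing random seeds does not remove GPU-kernel nondeterminism, but it does narrow the run-to-run variation that comes from data shuffling and initialization. The sketch below is an assumption about how one might do this, not a description of what the training scripts do; `seed_everything` and the seed value `42` are hypothetical. NumPy and PyTorch are seeded only if they can be imported.

```python
import os
import random


def seed_everything(seed=42):
    """Seed the available RNGs and return the names of the libraries seeded."""
    seeded = ["random"]
    random.seed(seed)
    # Only affects hash randomization in subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np

        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch

        # Seeds the CPU generator and the generators of all CUDA devices.
        torch.manual_seed(seed)
        seeded.append("torch")
    except ImportError:
        pass
    return seeded


if __name__ == "__main__":
    print("seeded:", ", ".join(seed_everything()))
```

Record the seed alongside the library versions when reporting reproduced results.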

## Known limitations

- No official standards validators are included yet.
- Normalized JSON metrics are a research proxy, not proof of production compliance.
- O1 NRM and A1 policy outputs require layer-specific semantic evaluators.