# Reproducibility Checklist

This document records the environment, artifacts, and commands needed to reproduce the TMF921 Qwen3-8B QLoRA results.

## Repositories

- Research dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation code: https://huggingface.co/nraptisss/tmf921-intent-training
- Primary stage-1 adapter: https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-QLoRA-qwen3-8b-qlora-20260501-083834
- Base model: https://huggingface.co/Qwen/Qwen3-8B

## Hardware used

- GPU: NVIDIA RTX 6000 Ada Generation
- VRAM: 48/50GB class
- CUDA visible devices: `CUDA_VISIBLE_DEVICES=0`

Server logs confirmed:

```text
torch=2.6.0+cu124
torch.version.cuda=12.4
CUDA_VISIBLE_DEVICES=0
cuda device_count=1
gpu0=NVIDIA RTX 6000 Ada Generation
```

## Software versions observed

From the model card / training logs:

- Python: 3.13.2 on the server environment
- PyTorch: 2.6.0+cu124
- Transformers: 5.7.0
- TRL: 1.3.0
- Datasets: 4.8.5
- Tokenizers: 0.22.2
- PEFT: installed in the training environment
- bitsandbytes: installed in the training environment

## Installation

```bash
git clone https://huggingface.co/nraptisss/tmf921-intent-training
cd tmf921-intent-training
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
bash scripts/install_rtx6000ada.sh
python scripts/check_gpu.py
```

## Environment variables

```bash
export HF_TOKEN=hf_...
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="$PWD/src"
export TOKENIZERS_PARALLELISM=false
export DISABLE_TRACKIO=1
```

Trackio was disabled for the successful main run to avoid external logging failures.

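
Before launching a long run, it can help to confirm that the variables exported above are actually present in the shell. The following is a minimal sketch, not part of the repository: the function name `check_environment` and the choice to treat all five variables as required are assumptions made for illustration.

```python
import os

# Variables exported in the "Environment variables" section.
ALL_VARS = (
    "HF_TOKEN",
    "CUDA_VISIBLE_DEVICES",
    "PYTHONPATH",
    "TOKENIZERS_PARALLELISM",
    "DISABLE_TRACKIO",
)

# Values this checklist pins explicitly; HF_TOKEN and PYTHONPATH vary per user.
EXPECTED_VALUES = {
    "CUDA_VISIBLE_DEVICES": "0",
    "TOKENIZERS_PARALLELISM": "false",
    "DISABLE_TRACKIO": "1",
}


def check_environment(env):
    """Return a list of problems found in `env`; an empty list means OK.

    Hypothetical helper: it only inspects the mapping it is given and
    does not contact the Hugging Face Hub or touch the GPU.
    """
    problems = []
    for name in ALL_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif name in EXPECTED_VALUES and value != EXPECTED_VALUES[name]:
            problems.append(
                f"{name}={value!r}, expected {EXPECTED_VALUES[name]!r}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_environment(os.environ):
        print("WARNING:", problem)
```

Passing the environment in as a mapping keeps the check easy to test with a plain dictionary instead of the real shell environment.
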
## Stage-1 training command

Recommended nohup command:

```bash
bash scripts/nohup_new_run.sh
```

The successful stage-1 run was:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Key stage-1 config:

```yaml
model_name_or_path: Qwen/Qwen3-8B
dataset_name: nraptisss/TMF921-intent-to-config-research-sota
train_split: train_sota
eval_split: validation
max_length: 2048
assistant_only_loss: true
load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: all-linear
learning_rate: 0.0002
lr_scheduler_type: constant
warmup_steps: 0
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
bf16: true
gradient_checkpointing: true
optim: paged_adamw_32bit
epochs: 2
```

If OOM occurs, preserve effective batch size by using:

```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
```

## Stage-1 evaluation

Merge adapter for faster evaluation:

```bash
RUN_DIR="runs/qwen3-8b-qlora-20260501-083834"
python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter "$RUN_DIR/outputs/adapter" \
  --output_dir "$RUN_DIR/outputs/merged"
```

Evaluate:

```bash
EVAL_BATCH_SIZE=8 \
  bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"
```

Normalize metrics:

```bash
python scripts/normalize_eval_metrics.py \
  --eval_dir "$RUN_DIR/eval_merged"
```

If using `nohup_eval.sh` default output, replace `eval_merged` with `eval`.

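
The OOM fallback in the stage-1 training config relies on the effective batch size staying the same when the per-device batch is halved and gradient accumulation is doubled. The arithmetic can be sketched as follows, assuming a single GPU (consistent with `CUDA_VISIBLE_DEVICES=0`); the helper name `effective_batch_size` is illustrative and not taken from the repository.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


# Default stage-1 config: per_device_train_batch_size=2, gradient_accumulation_steps=8
default_size = effective_batch_size(2, 8)    # 16

# OOM fallback: per_device_train_batch_size=1, gradient_accumulation_steps=16
fallback_size = effective_batch_size(1, 16)  # 16

# Both settings feed the same number of sequences into each optimizer step.
assert default_size == fallback_size == 16
```
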
## Results packaging

```bash
python scripts/package_results.py \
  --stage1_eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
  --stage2_eval_dir runs/stage2-weak-20260505-080040/eval \
  --output_dir results
```

Qualitative examples:

```bash
python scripts/sample_failure_examples.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
  --output_dir analysis/stage1_examples
```

## Main results to reproduce

Stage-1 normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

## Determinism caveats

- Generation evaluation uses deterministic decoding (`temperature=0.0`) by default.
- Minor differences may occur across CUDA, Transformers, bitsandbytes, and PyTorch versions.
- Training is subject to nondeterminism from GPU kernels and data processing.
- Report exact library versions with any reproduced results.

## Known limitations

- No official standards validators are included yet.
- Normalized JSON metrics are a research proxy, not proof of production compliance.
- O1 NRM and A1 policy require layer-specific semantic evaluators.
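
Because reproduced numbers may differ slightly across library versions, one way to compare a rerun against the stage-1 table is with an explicit tolerance. The sketch below is an illustration only: the metric key names (`json_parse`, `field_f1`, `key_f1`, `exact`), the function `compare_to_reference`, and the default tolerance of 0.01 are all assumptions, not the schema written by `normalize_eval_metrics.py` or a threshold endorsed by the authors.

```python
# Reference values copied from the "Main results to reproduce" table.
# The metric key names are hypothetical labels for the table columns.
REFERENCE = {
    "test_in_distribution": {"json_parse": 1.0000, "field_f1": 0.7956, "key_f1": 0.9811, "exact": 0.0351},
    "test_template_ood": {"json_parse": 1.0000, "field_f1": 0.7865, "key_f1": 0.9801, "exact": 0.0177},
    "test_use_case_ood": {"json_parse": 0.9998, "field_f1": 0.7907, "key_f1": 0.9805, "exact": 0.0253},
    "test_sector_ood": {"json_parse": 1.0000, "field_f1": 0.7697, "key_f1": 0.9818, "exact": 0.0293},
    "test_adversarial": {"json_parse": 1.0000, "field_f1": 0.9697, "key_f1": 1.0000, "exact": 0.9697},
}


def compare_to_reference(reproduced, tolerance=0.01):
    """Return (split, metric, expected, actual) tuples that are out of tolerance.

    `reproduced` maps split name -> metric name -> value. A missing split
    or metric is reported with actual=None. The tolerance is an assumed
    value chosen for illustration.
    """
    mismatches = []
    for split, metrics in REFERENCE.items():
        for name, expected in metrics.items():
            actual = reproduced.get(split, {}).get(name)
            if actual is None or abs(actual - expected) > tolerance:
                mismatches.append((split, name, expected, actual))
    return mismatches
```

An empty return value means every reported metric is within the chosen tolerance; any mismatches should be reported together with the exact library versions used for the rerun.
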