+# Reproducibility Checklist
+This document records the environment, artifacts, and commands needed to reproduce the TMF921 Qwen3-8B QLoRA results.
+## Repositories
+- Research dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
+- Training/evaluation code: https://huggingface.co/nraptisss/tmf921-intent-training
+- Primary stage-1 adapter: https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-QLoRA-qwen3-8b-qlora-20260501-083834
+- Base model: https://huggingface.co/Qwen/Qwen3-8B
+## Hardware used
+- GPU: NVIDIA RTX 6000 Ada Generation
+- VRAM: 48 GB (nominal)
+- CUDA visible devices: `CUDA_VISIBLE_DEVICES=0`
+Server logs confirmed:
+```text
+torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
+cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
+```
+## Software versions observed
+From the model card and training logs:
+- Python: 3.13.2 in the server environment
+- PyTorch: 2.6.0+cu124
+- Transformers: 5.7.0
+- TRL: 1.3.0
+- Datasets: 4.8.5
+- Tokenizers: 0.22.2
+- PEFT: installed in the training environment (version not recorded here)
+- bitsandbytes: installed in the training environment (version not recorded here)
+## Installation
+```bash
+git clone https://huggingface.co/nraptisss/tmf921-intent-training
+cd tmf921-intent-training
+python -m venv .venv
+source .venv/bin/activate
+python -m pip install -U pip
+bash scripts/install_rtx6000ada.sh
+python scripts/check_gpu.py
+```
+## Environment variables
+```bash
+export HF_TOKEN=hf_...
+export CUDA_VISIBLE_DEVICES=0
+export PYTHONPATH="$PWD/src"
+export TOKENIZERS_PARALLELISM=false
+export DISABLE_TRACKIO=1
+```
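Before launching a long run, it can help to confirm that these variables are actually set in the launching shell. The following is a minimal sketch, assuming all five variables above are required for your setup (for example, `HF_TOKEN` may be unnecessary if the repositories are public). `missing_env_vars` is a hypothetical helper, not part of the repository.

```python
import os

# Variables exported in the "Environment variables" section above.
# Treating all of them as required is an assumption of this sketch.
REQUIRED_VARS = (
    "HF_TOKEN",
    "CUDA_VISIBLE_DEVICES",
    "PYTHONPATH",
    "TOKENIZERS_PARALLELISM",
    "DISABLE_TRACKIO",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.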
+Trackio was disabled for the successful main run to avoid external logging failures.
+## Stage-1 training command
+Recommended nohup command:
+```bash
+bash scripts/nohup_new_run.sh
+```
+The successful stage-1 run directory was:
+```text
+runs/qwen3-8b-qlora-20260501-083834
+```
+Key stage-1 config:
+```yaml
+model_name_or_path: Qwen/Qwen3-8B
+dataset_name: nraptisss/TMF921-intent-to-config-research-sota
+train_split: train_sota
+eval_split: validation
+max_length: 2048
+assistant_only_loss: true
+load_in_4bit: true
+bnb_4bit_quant_type: nf4
+bnb_4bit_use_double_quant: true
+lora_r: 64
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_modules: all-linear
+learning_rate: 0.0002
+lr_scheduler_type: constant
+warmup_steps: 0
+per_device_train_batch_size: 2
+gradient_accumulation_steps: 8
+bf16: true
+gradient_checkpointing: true
+optim: paged_adamw_32bit
+epochs: 2
+```
+If an out-of-memory (OOM) error occurs, keep the effective batch size at 16 (2 × 8 = 1 × 16) by using:
+```yaml
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 16
+```
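The fallback works because the effective batch size is the product of the per-device batch size, the gradient accumulation steps, and the device count. Below is a small sketch of that arithmetic for the single-GPU setup recorded in this document. `effective_batch_size` is an illustrative helper, not a function from the repository.

```python
def effective_batch_size(per_device, grad_accum_steps, num_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device * grad_accum_steps * num_devices


# Values from the key stage-1 config and from the OOM fallback.
main_config = effective_batch_size(per_device=2, grad_accum_steps=8)
oom_fallback = effective_batch_size(per_device=1, grad_accum_steps=16)

print(main_config, oom_fallback)  # 16 16
```

Both settings yield 16 sequences per optimizer step, so the fallback lowers peak memory without changing the number of sequences per update.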
+## Stage-1 evaluation
+Merge the adapter into the base model for faster evaluation:
+```bash
+RUN_DIR="runs/qwen3-8b-qlora-20260501-083834"
+python scripts/merge_adapter.py \
+  --base_model Qwen/Qwen3-8B \
+  --adapter "$RUN_DIR/outputs/adapter" \
+  --output_dir "$RUN_DIR/outputs/merged"
+```
+Evaluate:
+```bash
+EVAL_BATCH_SIZE=8 \
+bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"
+```
+Normalize metrics:
+```bash
+python scripts/normalize_eval_metrics.py \
+  --eval_dir "$RUN_DIR/eval_merged"
+```
+If `nohup_eval.sh` wrote to its default output directory, replace `eval_merged` with `eval` in the `--eval_dir` path.
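When scripting the normalization step, the directory choice can be made explicit. The sketch below assumes that the two directory names mentioned above (`eval_merged` and `eval`) sit directly under the run directory. `resolve_eval_dir` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def resolve_eval_dir(run_dir, candidates=("eval_merged", "eval")):
    """Return the first existing evaluation directory under run_dir."""
    run_path = Path(run_dir)
    for name in candidates:
        candidate = run_path / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(
        f"No evaluation directory found in {run_path}; tried {list(candidates)}"
    )
```

The candidates are tried in order, so `eval_merged` wins when both directories exist. Reorder the tuple if the default output should take priority.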
+## Results packaging
+```bash
+python scripts/package_results.py \
+  --stage1_eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
+  --stage2_eval_dir runs/stage2-weak-20260505-080040/eval \
+  --output_dir results
+```
+Qualitative examples:
+```bash
+python scripts/sample_failure_examples.py \
+  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
+  --output_dir analysis/stage1_examples
+```
+## Main results to reproduce
+Stage-1 normalized metrics:
+| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
+|---|---:|---:|---:|---:|
+| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
+| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
+| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
+| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
+| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
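A reproduced run can be checked against this table programmatically. The sketch below hard-codes the normalized field F1 column from the table and flags splits that deviate by more than a tolerance. The tolerance of 0.01 and the `compare_to_reference` helper are assumptions for illustration, not values or code from the repository.

```python
# Normalized field F1 values copied from the stage-1 table above.
REFERENCE_FIELD_F1 = {
    "test_in_distribution": 0.7956,
    "test_template_ood": 0.7865,
    "test_use_case_ood": 0.7907,
    "test_sector_ood": 0.7697,
    "test_adversarial": 0.9697,
}


def compare_to_reference(reproduced, reference=REFERENCE_FIELD_F1, tol=0.01):
    """Map each split to (difference, within_tolerance); missing splits map to None."""
    report = {}
    for split, expected in reference.items():
        if split not in reproduced:
            report[split] = None
            continue
        diff = reproduced[split] - expected
        # tol is an illustrative threshold, not a documented acceptance criterion.
        report[split] = (round(diff, 4), abs(diff) <= tol)
    return report
```

How large a deviation counts as a successful reproduction is a judgment call. The determinism caveats in this document suggest that small differences across library versions are expected.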
+## Determinism caveats
+- Generation evaluation uses deterministic decoding (`temperature=0.0`) by default.
+- Minor differences may occur across CUDA, Transformers, bitsandbytes, and PyTorch versions.
+- Training is subject to nondeterminism from GPU kernels and data processing.
+- Report exact library versions with any reproduced results.
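To make the last item easy to follow, the installed versions can be captured with the standard library alone. This is a minimal sketch: the package list mirrors the libraries named under "Software versions observed", and `collect_versions` is an illustrative helper, not a script shipped with the repository.

```python
import platform
from importlib import metadata

# Distribution names for the libraries listed under "Software versions observed".
PACKAGES = (
    "torch",
    "transformers",
    "trl",
    "datasets",
    "tokenizers",
    "peft",
    "bitsandbytes",
)


def collect_versions(packages=PACKAGES):
    """Return {package: version string}, or 'not installed' when absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    print(f"python=={platform.python_version()}")
    for name, version in collect_versions().items():
        print(f"{name}=={version}")
```

Saving this output next to the evaluation metrics records the environment alongside any reproduced numbers.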
+## Known limitations
+- No official standards validators are included yet.
+- Normalized JSON metrics are a research proxy, not proof of production compliance.
+- O1 NRM and A1 policy outputs still require layer-specific semantic evaluators.
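To illustrate why a structural JSON metric is only a proxy, here is one plausible reading of a key-level F1 score. It is an assumption for illustration only: the repository's actual definitions live in its evaluation scripts and may differ, and `flatten_keys` and `key_f1` are hypothetical names.

```python
import json


def flatten_keys(obj, prefix=""):
    """Collect dotted key paths from nested dicts and lists."""
    keys = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            keys.add(path)
            keys |= flatten_keys(value, path)
    elif isinstance(obj, list):
        for item in obj:
            # List indices are ignored so element order does not affect the key set.
            keys |= flatten_keys(item, prefix)
    return keys


def key_f1(prediction_text, reference_text):
    """F1 over key paths; returns 0.0 if the prediction is not valid JSON."""
    try:
        predicted = flatten_keys(json.loads(prediction_text))
    except json.JSONDecodeError:
        return 0.0
    expected = flatten_keys(json.loads(reference_text))
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)
```

Under this reading, a prediction with the right keys but wrong values still scores 1.0. A score of this kind therefore says nothing about whether a configuration is semantically valid for its target layer.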