PEFT
qlora
sft
trl
qwen3
tmf921
intent-based-networking
network-slicing
rtx-6000-ada
ml-intern
Commit 25ac503 by nraptisss · verified · 1 Parent(s): 06d564f

Add reproducibility checklist

Files changed (1)
  1. REPRODUCIBILITY.md +180 -0
REPRODUCIBILITY.md ADDED
@@ -0,0 +1,180 @@
+ # Reproducibility Checklist
+
+ This document records the environment, artifacts, and commands needed to reproduce the TMF921 Qwen3-8B QLoRA results.
+
+ ## Repositories
+
+ - Research dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
+ - Training/evaluation code: https://huggingface.co/nraptisss/tmf921-intent-training
+ - Primary stage-1 adapter: https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-QLoRA-qwen3-8b-qlora-20260501-083834
+ - Base model: https://huggingface.co/Qwen/Qwen3-8B
+
+ ## Hardware used
+
+ - GPU: NVIDIA RTX 6000 Ada Generation
+ - VRAM: 48 GB
+ - CUDA visible devices: `CUDA_VISIBLE_DEVICES=0`
+
+ Server logs confirmed:
+
+ ```text
+ torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
+ cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
+ ```
+
+ ## Software versions observed
+
+ From the model card / training logs:
+
+ - Python: 3.13.2 on the server environment
+ - PyTorch: 2.6.0+cu124
+ - Transformers: 5.7.0
+ - TRL: 1.3.0
+ - Datasets: 4.8.5
+ - Tokenizers: 0.22.2
+ - PEFT: installed in the training environment (version not recorded here)
+ - bitsandbytes: installed in the training environment (version not recorded here)
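Since the PEFT and bitsandbytes versions are not pinned in the list above, a reproduction attempt may want to capture them directly from the active environment. The following is a minimal sketch that uses only the standard library; the `PACKAGES` list and the `installed_versions` helper are illustrative names and are not part of the repository.

```python
from importlib import metadata
import platform

# PyPI distribution names for the libraries listed above (illustrative list).
PACKAGES = [
    "torch", "transformers", "trl", "datasets",
    "tokenizers", "peft", "bitsandbytes",
]


def installed_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its version, or "not installed"."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Print in a requirements-style format that can be pasted into a report.
    print(f"python=={platform.python_version()}")
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Saving this output next to any reproduced metrics makes later version comparisons straightforward.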
+
+ ## Installation
+
+ ```bash
+ git clone https://huggingface.co/nraptisss/tmf921-intent-training
+ cd tmf921-intent-training
+
+ python -m venv .venv
+ source .venv/bin/activate
+ python -m pip install -U pip
+ bash scripts/install_rtx6000ada.sh
+ python scripts/check_gpu.py
+ ```
+
+ ## Environment variables
+
+ ```bash
+ export HF_TOKEN=hf_...
+ export CUDA_VISIBLE_DEVICES=0
+ export PYTHONPATH="$PWD/src"
+ export TOKENIZERS_PARALLELISM=false
+ export DISABLE_TRACKIO=1
+ ```
+
+ Trackio was disabled for the successful main run to avoid external logging failures.
+
+ ## Stage-1 training command
+
+ Recommended nohup command:
+
+ ```bash
+ bash scripts/nohup_new_run.sh
+ ```
+
+ The successful stage-1 run was:
+
+ ```text
+ runs/qwen3-8b-qlora-20260501-083834
+ ```
+
+ Key stage-1 config:
+
+ ```yaml
+ model_name_or_path: Qwen/Qwen3-8B
+ dataset_name: nraptisss/TMF921-intent-to-config-research-sota
+ train_split: train_sota
+ eval_split: validation
+ max_length: 2048
+ assistant_only_loss: true
+ load_in_4bit: true
+ bnb_4bit_quant_type: nf4
+ bnb_4bit_use_double_quant: true
+ lora_r: 64
+ lora_alpha: 16
+ lora_dropout: 0.05
+ lora_target_modules: all-linear
+ learning_rate: 0.0002
+ lr_scheduler_type: constant
+ warmup_steps: 0
+ per_device_train_batch_size: 2
+ gradient_accumulation_steps: 8
+ bf16: true
+ gradient_checkpointing: true
+ optim: paged_adamw_32bit
+ epochs: 2
+ ```
+
+ If an out-of-memory (OOM) error occurs, keep the effective batch size at 16 (2 × 8 on one GPU) by using:
+
+ ```yaml
+ per_device_train_batch_size: 1
+ gradient_accumulation_steps: 16
+ ```
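The equivalence of the two settings can be checked with a small calculation. This is a sketch that assumes the single-GPU setup reported in the server logs (`device_count=1`); the `effective_batch_size` helper is a hypothetical name used only for illustration.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


# Main config: per_device_train_batch_size=2, gradient_accumulation_steps=8
main_config = effective_batch_size(2, 8)

# OOM fallback: per_device_train_batch_size=1, gradient_accumulation_steps=16
oom_fallback = effective_batch_size(1, 16)

print(f"main={main_config} fallback={oom_fallback}")
assert main_config == oom_fallback == 16
```

The fallback halves the per-step activation memory while each optimizer update still averages gradients over 16 sequences, at the cost of more forward/backward passes per update.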
+
+ ## Stage-1 evaluation
+
+ Merge the adapter into the base model for faster evaluation:
+
+ ```bash
+ RUN_DIR="runs/qwen3-8b-qlora-20260501-083834"
+
+ python scripts/merge_adapter.py \
+   --base_model Qwen/Qwen3-8B \
+   --adapter "$RUN_DIR/outputs/adapter" \
+   --output_dir "$RUN_DIR/outputs/merged"
+ ```
+
+ Evaluate:
+
+ ```bash
+ EVAL_BATCH_SIZE=8 \
+   bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"
+ ```
+
+ Normalize metrics:
+
+ ```bash
+ python scripts/normalize_eval_metrics.py \
+   --eval_dir "$RUN_DIR/eval_merged"
+ ```
+
+ If you use the default output directory of `nohup_eval.sh`, replace `eval_merged` with `eval`.
+
+ ## Results packaging
+
+ ```bash
+ python scripts/package_results.py \
+   --stage1_eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
+   --stage2_eval_dir runs/stage2-weak-20260505-080040/eval \
+   --output_dir results
+ ```
+
+ Qualitative examples:
+
+ ```bash
+ python scripts/sample_failure_examples.py \
+   --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
+   --output_dir analysis/stage1_examples
+ ```
+
+ ## Main results to reproduce
+
+ Stage-1 normalized metrics:
+
+ | Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
+ |---|---:|---:|---:|---:|
+ | `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
+ | `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
+ | `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
+ | `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
+ | `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
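The diff does not define how the key F1 column is computed, so the following is only an illustrative sketch of one plausible reading: F1 over the sets of nested key paths in the predicted and reference JSON. The helpers `key_paths` and `key_f1` are hypothetical and may differ from what `scripts/normalize_eval_metrics.py` actually does.

```python
import json


def key_paths(obj, prefix=""):
    """Collect dotted key paths for every key in a nested JSON value."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(obj, list):
        # List items share one path segment, so positions are ignored.
        for item in obj:
            paths |= key_paths(item, f"{prefix}[]")
    return paths


def key_f1(predicted_text, reference_text):
    """F1 between predicted and reference key-path sets (assumed definition)."""
    try:
        predicted = json.loads(predicted_text)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no credit in this sketch
    reference = json.loads(reference_text)
    pred_keys, ref_keys = key_paths(predicted), key_paths(reference)
    if not pred_keys and not ref_keys:
        return 1.0
    overlap = len(pred_keys & ref_keys)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_keys)
    recall = overlap / len(ref_keys)
    return 2 * precision * recall / (precision + recall)
```

Under this reading, a prediction can score high on key F1 while still failing exact match, because key paths ignore the values stored under them. That is consistent with the gap between the key F1 and exact-match columns in the table, though the repository's own definition should be checked before drawing conclusions.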
+
+ ## Determinism caveats
+
+ - Generation evaluation uses deterministic decoding (`temperature=0.0`) by default.
+ - Minor metric differences may occur across CUDA, Transformers, bitsandbytes, and PyTorch versions.
+ - Training is subject to nondeterminism from GPU kernels and data processing.
+ - Report exact library versions with any reproduced results.
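Fixing random seeds narrows, but does not eliminate, the training nondeterminism noted above. The sketch below shows one common way to do this; the `seed_everything` helper and the seed value 42 are assumptions for illustration, since the diff does not state which seed the original run used.

```python
import os
import random


def seed_everything(seed=42):
    """Seed whichever RNGs are available; numpy and torch are optional."""
    random.seed(seed)
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
assert first == second  # same seed, same Python-level random stream
```

Even with identical seeds, some GPU kernels remain nondeterministic, so small run-to-run differences in training loss and final metrics are still expected.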
+
+ ## Known limitations
+
+ - No official standards validators are included yet.
+ - Normalized JSON metrics are a research proxy, not proof of production compliance.
+ - O1 NRM and A1 policy require layer-specific semantic evaluators.