# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Journal conventions

Each entry should include:

1. **Date/time**
2. **Goal**
3. **Action**
4. **Evidence / result**
5. **Interpretation**
6. **Decision / next step**

For research claims, prefer numeric evidence over qualitative statements.

---

## 2026-04-30 — Dataset cloned and audited

### Goal

Clone and scientifically audit `nraptisss/TMF921-intent-to-config-augmented` before training.

### Action

The dataset was cloned in the sandbox and a comprehensive audit was run over schema, missingness, ChatML formatting, JSON validity, duplicates/leakage, distribution balance, numeric KPI ranges, train/test similarity, and scientific validity.
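
The train/test similarity part of the audit could be sketched roughly as follows. The journal does not say which similarity measure or n-gram size was used, so this sketch assumes Jaccard similarity over character trigrams; the function names are hypothetical and are not taken from the audit code.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_train_similarity(test_prompt: str, train_prompts: list[str], n: int = 3) -> float:
    """Highest similarity between one test prompt and any train prompt."""
    test_grams = char_ngrams(test_prompt, n)
    return max(
        (jaccard(test_grams, char_ngrams(p, n)) for p in train_prompts),
        default=0.0,
    )


def count_above(test_prompts: list[str], train_prompts: list[str], threshold: float) -> int:
    """Number of test prompts whose nearest train prompt reaches the threshold."""
    return sum(
        max_train_similarity(t, train_prompts) >= threshold for t in test_prompts
    )
```

This brute-force version compares every test prompt with every train prompt, which is fine for illustration but slow at 2,521 x 39,294 pairs; a real audit would more likely use a vectorized or indexed approach.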

### Evidence / result

Dataset size:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**

Quality checks:

- Missing values: **0**
- Duplicate IDs: **0**
- Duplicate full conversations: **0**
- Assistant JSON parse validity: **41,815 / 41,815 = 100%**
- Role sequence: `system -> user -> assistant` for all rows

Leakage / similarity findings:

- Exact train/test user-prompt overlap: **0**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high:
  - test prompts with char-ngram similarity >= 0.90 to train: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**

Distribution findings:

- `create` lifecycle operation: **40,090 / 41,815 = 95.9%**
- non-create lifecycle rows: **1,725 = 4.1%**
- adversarial rows: **166 = 0.397%**
- only **31 unique JSON structure signatures** across 41,815 rows

### Interpretation

The source dataset is technically clean and suitable for SFT, but the original split is mainly an in-distribution/template-compliance split, not a strong OOD benchmark. JSON validity is excellent, but scientific benchmark validity requires OOD splits and normalized/semantic evaluation.

### Decision / next step

Create a research-grade derivative dataset with:

- OOD splits,
- train/eval provenance columns,
- token-length audit,
- validation flags,
- lifecycle/adversarial upsampling for training only,
- no fabricated continuous-KPI or cross-layer-paired examples without a validated generator.

---

## 2026-04-30 — Research SOTA dataset created

### Goal

Implement the audit recommendations while preserving scientific soundness.

### Action

Created `nraptisss/TMF921-intent-to-config-research-sota`.

Implemented:

- `train_base`
- `train_sota`
- `validation`
- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

Added columns:

- `system`, `prompt`, `completion`
- `prompt_template_id`
- `scenario_id`
- `json_structure_id`
- `json_root_family`
- `messages_format_valid`
- `assistant_is_valid_json`
- `slice_sst_valid`
- `kpi_profile_valid`
- `semantic_rule_valid_v1`
- `qwen3_chat_template_tokens`
- `fits_2048_qwen3`
- `fits_4096_qwen3`
- `sampling_weight_*`
- `is_augmented`, `augmentation_type`, `source_id`, `conversation_type`

### Evidence / result

Published dataset:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training split with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit:

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**

`train_sota` balancing:

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**

### Interpretation

`max_length=2048` is justified for Qwen3-8B. `train_sota` improves rare-class exposure. OOD splits allow scientifically meaningful generalization reporting.

### Decision / next step

Build a training/evaluation repository for a single RTX 6000 Ada server using Qwen3-8B QLoRA.
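
The held-out splits in this entry rely on group-level holdouts: every row that shares a held-out template, use case, or sector goes to the OOD test split. The sketch below shows one way to do that for a single grouping column. It is an assumption-laden illustration, not the script that built the published splits; the helper names and toy rows are hypothetical, and only the column name `prompt_template_id` comes from the journal.

```python
def split_by_group(rows: list[dict], group_key: str, held_out: set[str]):
    """Route every row whose group value is held out to the OOD split."""
    train, ood = [], []
    for row in rows:
        (ood if row[group_key] in held_out else train).append(row)
    return train, ood


def assert_no_group_leakage(train: list[dict], ood: list[dict], group_key: str) -> None:
    """Raise if any group value appears on both sides of the split."""
    overlap = {r[group_key] for r in train} & {r[group_key] for r in ood}
    if overlap:
        raise ValueError(f"groups present in both splits: {sorted(overlap)}")


# Toy rows for illustration only; the real dataset has many more columns.
rows = [
    {"id": 1, "prompt_template_id": "t1"},
    {"id": 2, "prompt_template_id": "t2"},
    {"id": 3, "prompt_template_id": "t3"},
    {"id": 4, "prompt_template_id": "t3"},
]
train_rows, template_ood_rows = split_by_group(rows, "prompt_template_id", {"t3"})
assert_no_group_leakage(train_rows, template_ood_rows, "prompt_template_id")
```

Splitting on the group value rather than on individual rows is what keeps near-duplicate prompts from the same template out of both train and test.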

---

## 2026-04-30 / 2026-05-01 — Training/evaluation repo created

### Goal

Create a reproducible repo for training and evaluation on RTX 6000 Ada 48/50GB.

### Action

Created `nraptisss/tmf921-intent-training` with:

- QLoRA SFT training script,
- evaluation script,
- merge script,
- RTX 6000 Ada install script,
- GPU preflight,
- nohup run scripts,
- resumable checkpoints,
- unique run directories.

Default recipe:

- model: `Qwen/Qwen3-8B`
- method: QLoRA NF4 + double quant
- LoRA target modules: `all-linear`
- LoRA rank: `64`
- LoRA alpha: `16`
- LoRA dropout: `0.05`
- LR: `2e-4`
- scheduler: constant
- max length: `2048`
- assistant-only loss: enabled
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
- eval split: `validation`

### Evidence / result

Repo:

- https://huggingface.co/nraptisss/tmf921-intent-training

### Interpretation

The training approach is consistent with QLoRA literature and fits the memory constraints of a 48/50GB RTX 6000 Ada GPU.

### Decision / next step

Run training under `nohup`, require CUDA preflight, and ensure unique output directories to avoid overwriting results.

---

## 2026-05-01 — Runtime issues fixed

### Goal

Resolve server-side training errors and ensure training uses GPU.

### Issues encountered and fixes

#### 1. CPU/GPU uncertainty

Observed concern that training might not use GPU.

Fix:

- Added `scripts/check_gpu.py`
- Added `scripts/install_rtx6000ada.sh`
- Added fail-fast CUDA checks to training/evaluation scripts.

Evidence from server logs:

```text
torch=2.6.0+cu124
torch.version.cuda=12.4
CUDA_VISIBLE_DEVICES=0
cuda device_count=1
gpu0=NVIDIA RTX 6000 Ada Generation
```

Conclusion: GPU setup confirmed.

#### 2. TRL conversational dataset detection error

Error:

```text
ValueError: You set assistant_only_loss=True, but the dataset is not conversational.
```

Cause:

The dataset contains `messages` plus convenience `prompt`/`completion` columns.
TRL inferred prompt-completion format instead of conversational format.

Fix:

Training script now passes only:

```python
train_dataset = train_dataset.select_columns(["messages"])
eval_dataset = eval_dataset.select_columns(["messages"])
```

#### 3. Trackio invalid Space ID

Error:

```text
HFValidationError: Repo id ... 'nraptisss/'
```

Cause:

Invalid `TRACKIO_SPACE_ID=nraptisss/`.

Fix:

Added validation/sanitization for Trackio Space IDs and support for:

```bash
DISABLE_TRACKIO=1
```

#### 4. Deprecated warmup argument

Warning:

```text
warmup_ratio is deprecated
```

Fix:

Changed config/script to use:

```yaml
warmup_steps: 0
```

### Decision / next step

Restart training with fixed scripts and disabled Trackio to avoid external logging failures.

---

## 2026-05-01 / 2026-05-02 — Qwen3-8B QLoRA training run completed

### Goal

Train Qwen3-8B QLoRA on `train_sota`.

### Action

Started training under nohup with unique run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Trackio disabled:

```bash
DISABLE_TRACKIO=1
```

### Evidence / result

Training logs showed stable convergence.

Representative metrics:

Initial:

```text
loss: 1.212
mean_token_accuracy: 0.7922
```

After early training:

```text
loss: ~0.15
mean_token_accuracy: ~0.945-0.953
```

Validation loss over training:

```text
eval_loss: 0.1593 at epoch 0.1236
eval_loss: 0.1561 at epoch 0.2472
eval_loss: 0.1548 at epoch 0.3709
eval_loss: 0.1535 at epoch 0.8653
eval_loss: 0.1530 at epoch 1.607
eval_loss: 0.1532 at epoch 1.730
```

No observed:

- CUDA OOM,
- NaNs,
- divergence,
- gradient explosion.

### Interpretation

The run converged smoothly. Loss stabilized around 0.14–0.15 and validation loss plateaued near 0.153, indicating stable SFT convergence.

### Decision / next step

Evaluate the trained adapter across ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 — Evaluation speed issue and merged-model evaluation

### Goal

Evaluate the trained adapter on all splits.

### Issue

Initial evaluator used single-example 4-bit adapter generation with large `max_new_tokens`, causing very slow evaluation:

```text
test_in_distribution: 1455 examples in ~25h
test_template_ood: ~30-90s/example
```

### Action

Patched evaluator to support:

- batched generation,
- dynamic generation length based on target length + buffer,
- periodic save/resume,
- partial prediction reuse.

Also recommended merging adapter into base bf16 model for faster inference.

### Decision / next step

Use merged model evaluation and normalized metrics.

---

## 2026-05-04 — Raw evaluation results

### Goal

Measure raw JSON and field-level performance.

### Evidence / result

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

### Interpretation

The model learned JSON formatting and adversarial rejection very well. Raw exact-match is low for primary config layers, but raw exact match is likely too strict because many fields are volatile/generated (`id`, `href`, timestamps, descriptions, schema links).

### Decision / next step

Implement a normalized evaluator that removes volatile fields before scoring.

---

## 2026-05-04 — Normalized evaluator implemented and run

### Goal

Re-score existing predictions using metrics that better reflect structural/semantic configuration agreement.

### Action

Added:

```text
scripts/normalize_eval_metrics.py
```

Normalization removes/masks:

- IDs,
- hrefs,
- names/descriptions,
- timestamps,
- schema links,
- UUID/hash-like strings,
- generated request/policy/booking/intent IDs.

It computes:

- normalized exact match,
- normalized field precision/recall/F1,
- normalized key precision/recall/F1,
- stratified metrics.

### Evidence / result

Headline normalized metrics:

| Split | JSON parse | Raw field F1 | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.6868 | **0.7956** | **0.9811** | 0.0351 |
| `test_template_ood` | 1.0000 | 0.6790 | **0.7865** | **0.9801** | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.6825 | **0.7907** | **0.9805** | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.6610 | **0.7697** | **0.9818** | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | **0.9697** | **1.0000** | 0.9697 |

Strong layers:

- `tmf921`: normalized field F1 around **0.93–0.94**
- `camara`: normalized field F1 around **0.81–0.87**
- `intent_3gpp`: normalized field F1 around **0.80–0.82**
- `etsi_zsm`: normalized field F1 around **0.75–0.79**

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

### Interpretation

The model is much stronger than raw exact-match suggested. It reliably emits valid JSON and correct structural schemas (`norm_key_f1 ≈ 0.98`) across ID and OOD splits. Field-level value fidelity is moderate-to-strong overall, but weak for low-level O1 NRM values and monitoring/report lifecycle outputs.

### Decision / next step

Plan a second-stage weak-layer fine-tune focused on:

- `o1_nrm`,
- `a1_policy`,
- `tmf921_lifecycle_report`,
- `tmf921_lifecycle_monitor`,
- optionally `tmf921_lifecycle_scale`.

Use the current adapter as initialization, lower LR, and include replay from strong layers to prevent forgetting.
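
The normalization and the flat field/key F1 described in this entry can be illustrated with a minimal sketch. This is not the code in `scripts/normalize_eval_metrics.py`: the volatile key names, the UUID pattern, and the flattening scheme are assumptions chosen for illustration, and the example payloads are hypothetical.

```python
import json
import re

# Assumed volatile keys; the real evaluator's list is not shown in this journal.
VOLATILE_KEYS = {"id", "href", "name", "description", "@schemaLocation"}
UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize(obj):
    """Drop volatile keys and mask UUID-like strings, recursively."""
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items() if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    if isinstance(obj, str) and UUID_RE.match(obj):
        return "<ID>"
    return obj


def flatten(obj, prefix=""):
    """Flatten nested JSON into {path: leaf value}; empty containers are dropped."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten(v, f"{prefix}.{k}" if prefix else k))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{prefix}[{i}]"))
    else:
        items[prefix] = obj
    return items


def f1(pred: set, gold: set) -> float:
    """Set-based F1; two empty sets count as a perfect match."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def field_and_key_f1(pred, gold):
    """Field F1 compares (path, value) pairs; key F1 compares paths only."""
    p, g = flatten(normalize(pred)), flatten(normalize(gold))
    field = f1({(k, json.dumps(v)) for k, v in p.items()},
               {(k, json.dumps(v)) for k, v in g.items()})
    return field, f1(set(p), set(g))


# Hypothetical payloads: same structure, one wrong KPI value, different ids.
gold = {"id": "a1", "sliceType": "eMBB", "kpi": {"latencyMs": 10, "throughputMbps": 100}}
pred = {"id": "zz", "sliceType": "eMBB", "kpi": {"latencyMs": 20, "throughputMbps": 100}}
field_f1, key_f1 = field_and_key_f1(pred, gold)
```

In this toy case the key F1 is perfect while the field F1 is 2/3, which mirrors the pattern reported above: structure is reproduced reliably while individual values are where errors concentrate.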

---

## Current scientific status

### What can be claimed now

The Qwen3-8B QLoRA model trained on the TMF921 Research SOTA split achieves:

- near-perfect JSON validity,
- stable OOD generalization,
- excellent adversarial rejection,
- normalized structural key F1 around 98% across non-adversarial ID/OOD splits,
- normalized field F1 around 77–80% across ID/OOD splits.

### What should not be overclaimed

Do not claim production-grade standards compliance yet. Current evaluation is normalized JSON/field scoring, not official TMF921/3GPP/ETSI/CAMARA/O-RAN schema validation.

### Main weaknesses

- O1 NRM value fidelity is poor despite correct structure.
- Lifecycle report/monitor outputs need targeted improvement.
- Raw exact match remains low for primary create configs.

### Next planned experiment

Second-stage weak-layer adapter continuation:

- initialize from current Qwen3-8B TMF921 adapter,
- train on weak-layer examples plus replay buffer,
- lower LR: `5e-5` or `1e-4`,
- 1 epoch,
- same max length 2048,
- evaluate again with raw + normalized metrics.

---

## Open questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Should training use a weak-layer second stage or should dataset generation be improved first?

---

## Running log template

```markdown
## YYYY-MM-DD — Short title

### Goal

### Action

### Evidence / result

### Interpretation

### Decision / next step
```

---

## 2026-05-04 — Stage 2 weak-layer continuation plan implemented

### Goal

Improve weak target layers identified by normalized evaluation without degrading strong layers.

Weak layers from normalized evaluation:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
- `tmf921_lifecycle_scale`: mixed, included because lifecycle scaling still had noticeable errors

### Action

Added stage-2 tooling:

- `scripts/build_weak_layer_dataset.py`
- `scripts/train_continue_adapter.py`
- `configs/stage2_weak_layer_qwen3_8b.yaml`
- `scripts/nohup_stage2_weak.sh`

The weak-layer dataset builder creates a local parquet training set with:

1. all weak-layer rows from `train_sota`,
2. duplicated rare weak layers up to a minimum count,
3. a replay buffer from non-weak layers to reduce forgetting.

The continuation trainer loads:

1. Qwen3-8B base model in 4-bit NF4,
2. the existing LoRA adapter with `is_trainable=True`,
3. the local weak-layer replay dataset,
4. TRL `SFTTrainer` without a new `peft_config`, per PEFT/TRL continuation best practices.

Stage-2 default hyperparameters:

```yaml
learning_rate: 5e-5
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
```

### Interpretation

A lower learning rate and replay buffer should improve weak-layer value fidelity while reducing catastrophic forgetting on strong layers. This is a targeted continuation, not a replacement for Gen4 data generation or official schema validation.

### Decision / next step

Run stage-2 from the completed stage-1 adapter:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

After training, evaluate with the same raw + normalized OOD protocol and compare against stage-1 metrics.

---

## 2026-05-05 — Stage 2 weak-layer continuation run started

### Goal

Run the stage-2 weak-layer continuation experiment implemented on 2026-05-04.

The intended scientific question is:

> Can a short, low-learning-rate continuation on weak target layers improve low-performing layer-specific value fidelity while preserving the strong global JSON validity, key structure, and adversarial behavior from stage 1?

### Action

Started stage 2 with:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

Generated run:

```text
runs/stage2-weak-20260505-080040
```

Source adapter:

```text
runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
```

### Stage-2 dataset composition

The weak-layer dataset builder produced:

```json
{
  "rows_train_stage2": 13829,
  "rows_validation": 1547,
  "weak_rows_total_after_duplication": 10638,
  "replay_rows": 3191,
  "rare_min_per_layer": 1500,
  "replay_ratio": 0.3
}
```

Layer counts before/after rare-layer duplication:

| Target layer | Before | After |
|---|---:|---:|
| `o1_nrm` | 2,672 | 2,672 |
| `a1_policy` | 3,466 | 3,466 |
| `tmf921_lifecycle_report` | 596 | 1,500 |
| `tmf921_lifecycle_monitor` | 726 | 1,500 |
| `tmf921_lifecycle_scale` | 576 | 1,500 |

Replay buffer size:

- replay rows from non-weak layers: **3,191**
- purpose: reduce catastrophic forgetting on strong layers such as `tmf921`, `camara`, `intent_3gpp`, `etsi_zsm`, and adversarial rejection.
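
The composition numbers above are internally consistent, which can be checked with a few lines of arithmetic. The sketch assumes that rare layers are duplicated up to exactly `rare_min_per_layer` and that the replay size is the floor of `replay_ratio` times the weak-row total. Both assumptions reproduce the reported figures, but the builder's actual logic is not shown in this journal.

```python
import math

# Layer counts before duplication, copied from the table above.
before = {
    "o1_nrm": 2672,
    "a1_policy": 3466,
    "tmf921_lifecycle_report": 596,
    "tmf921_lifecycle_monitor": 726,
    "tmf921_lifecycle_scale": 576,
}
RARE_MIN_PER_LAYER = 1500
REPLAY_RATIO = 0.3

# Assumption: layers below the minimum are duplicated up to exactly the minimum.
after = {layer: max(n, RARE_MIN_PER_LAYER) for layer, n in before.items()}
weak_total = sum(after.values())

# Assumption: replay rows = floor(replay_ratio * weak rows).
replay_rows = math.floor(REPLAY_RATIO * weak_total)
train_rows = weak_total + replay_rows

print(weak_total, replay_rows, train_rows)
```

Under these assumptions the script prints the same three totals as the builder output: 10638 weak rows, 3191 replay rows, and 13829 training rows.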

Full target-layer composition in stage-2 train set:

| Target layer | Rows |
|---|---:|
| `a1_policy` | 3,466 |
| `o1_nrm` | 2,672 |
| `tmf921_lifecycle_monitor` | 1,500 |
| `tmf921_lifecycle_report` | 1,500 |
| `tmf921_lifecycle_scale` | 1,500 |
| `tmf921` replay | 902 |
| `intent_3gpp` replay | 630 |
| `camara` replay | 618 |
| `etsi_zsm` replay | 335 |
| adversarial replay and other lifecycle replay | remaining rows |

### Training configuration

Resolved stage-2 config:

```yaml
model_name_or_path: Qwen/Qwen3-8B
adapter_path: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
dataset_dir: runs/stage2-weak-20260505-080040/weak_layer_data
output_dir: runs/stage2-weak-20260505-080040/outputs/adapter
learning_rate: 5.0e-05
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
bf16: true
gradient_checkpointing: true
optim: paged_adamw_32bit
```

### Evidence that adapter continuation was configured correctly

Server log confirmed:

```text
Base model: Qwen/Qwen3-8B
Adapter: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
trainable params: 174,587,904 || all params: 8,365,323,264 || trainable%: 2.0870
TunerModelStatus(... active_adapters=['default'], requires_grad={'default': True}, devices={'default': ['cuda']})
```

Interpretation:

- The existing adapter was loaded.
- Adapter weights are trainable.
- Training is on CUDA.
- The base model is not being fully fine-tuned; only LoRA adapter parameters are updated.
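
The trainable-parameter count in the log can be cross-checked against the stage-1 recipe (LoRA rank 64 on `all-linear` targets). The sketch below counts LoRA parameters for the seven linear projections of each decoder layer, leaving out the output head. The model dimensions are not stated in this journal; they are assumed from the public Qwen3-8B configuration and should be verified against the model's `config.json`.

```python
# Assumed Qwen3-8B dimensions (not stated in this journal).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 64  # LoRA rank from the stage-1 recipe


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    return rank * (in_features + out_features)


q_out = NUM_HEADS * HEAD_DIM       # query/output projection width
kv_out = NUM_KV_HEADS * HEAD_DIM   # grouped-query key/value width

# (in_features, out_features) for each adapted projection in one decoder layer.
linear_shapes = {
    "q_proj": (HIDDEN, q_out),
    "k_proj": (HIDDEN, kv_out),
    "v_proj": (HIDDEN, kv_out),
    "o_proj": (q_out, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in linear_shapes.values())
trainable = per_layer * NUM_LAYERS

# "all params" is taken from the server log rather than recomputed.
trainable_pct = 100 * trainable / 8_365_323_264

print(f"{trainable:,} trainable params ({trainable_pct:.4f}%)")
```

Under these assumptions the count comes out to 174,587,904, the same figure as in the log, which supports the reading that the stage-1 rank-64 adapter was loaded unchanged and no additional adapter was created.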

### Early training evidence

Stage-2 training began normally after tokenization:

```text
Tokenizing train dataset: 13,829 / 13,829
Tokenizing eval dataset: 1,547 / 1,547
```

Representative early logs:

```text
loss: 0.1313, grad_norm: 0.0199, lr: 5e-05, mean_token_accuracy: 0.9572, epoch: 0.0012
loss: 0.1686, grad_norm: 0.0317, lr: 5e-05, mean_token_accuracy: 0.9435, epoch: 0.0116
loss: 0.1541, grad_norm: 0.0166, lr: 5e-05, mean_token_accuracy: 0.9463, epoch: 0.1157
```

Validation during stage 2:

```text
eval_loss: 0.1581 at epoch 0.1157
eval_loss: 0.1582 at epoch 0.2314
eval_loss: 0.1584 at epoch 0.3471
eval_loss: 0.1585 at epoch 0.4628
```

At approximately 50% completion:

```text
epoch: 0.4975 / 1.0
loss: 0.1366-0.1428 range near midpoint
grad_norm: generally <0.14
mean_token_accuracy: about 0.95
```

### Interpretation

The stage-2 run is healthy:

- no CUDA OOM,
- no NaN/Inf,
- no gradient explosion,
- GPU is active,
- adapter continuation is correctly configured.

Validation loss is slightly worse than the stage-1 plateau (~0.153), but this is expected because stage 2 intentionally shifts the training distribution toward harder weak layers. The decisive evaluation is not broad validation loss alone; it is the post-stage-2 OOD normalized weak-layer comparison.

### Decision / next step

Let stage 2 finish. After completion:

1. merge the stage-2 adapter,
2. run OOD evaluation,
3. run normalized evaluator,
4. compare against stage-1 baselines.

Commands planned after stage 2:

```bash
RUN_DIR="runs/stage2-weak-20260505-080040"

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter "$RUN_DIR/outputs/adapter" \
  --output_dir "$RUN_DIR/outputs/merged"

EVAL_BATCH_SIZE=8 \
bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"

python scripts/normalize_eval_metrics.py \
  --eval_dir "$RUN_DIR/eval"
```

### Success criteria

Stage 2 is successful if:

1. weak-layer normalized field F1 improves:
   - `o1_nrm` above stage-1 ~0.39-0.40,
   - `a1_policy` above stage-1 ~0.67-0.68,
   - `tmf921_lifecycle_report` above stage-1 ~0.15-0.18,
   - `tmf921_lifecycle_monitor` above stage-1 ~0.39-0.52;
2. global normalized field F1 does not regress substantially:
   - stage-1 ID: 0.7956,
   - stage-1 template OOD: 0.7865,
   - stage-1 use-case OOD: 0.7907,
   - stage-1 sector OOD: 0.7697;
3. JSON parse remains near 100%;
4. adversarial normalized exact remains close to 0.9697.

### Failure modes to watch

- Global regression from weak-layer overfitting.
- Adversarial degradation from insufficient replay.
- O1 NRM still weak, suggesting need for layer-specific semantic evaluator or improved data generation rather than more SFT.
- Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.

---

## 2026-05-05 — Stage 2 evaluation completed and decision made

### Goal

Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.

### Action

After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:

- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

The normalized evaluator was then run on the generated predictions:

```bash
python scripts/normalize_eval_metrics.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```

### Evidence / result

Global normalized comparison, stage 1 -> stage 2:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

JSON parse comparison:

| Split | Stage 1 parse | Stage 2 parse | Delta |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
| `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
| `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |

Weak-layer normalized field F1 comparison, stage 1 -> stage 2:

| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |

### Interpretation

Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.

Key observations:

1. Global normalized field F1 changed by at most about 0.12 percentage points on all non-adversarial splits. This is effectively flat.
2. Normalized key F1 regressed slightly across all splits.
3. Adversarial performance regressed meaningfully:
   - normalized field F1: **0.9697 -> 0.9596**
   - normalized key F1: **1.0000 -> 0.9697**
   - parse rate: **1.0000 -> 0.9697**
4. `o1_nrm` did not improve in any meaningful way. Changes are between about -0.004 and +0.003, which is noise-level.
5. `a1_policy` also did not improve meaningfully.
6. Lifecycle report/monitor/scale improved on some OOD splits, especially use-case and sector OOD, but not consistently enough to justify replacing the stage-1 model.

The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity.
The likely limitation is not lack of exposure alone, but either:

- insufficient semantic supervision in the data,
- inadequacy of flat field-F1 for some low-level configs,
- need for layer-specific validators and value extractors,
- or the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.

### Decision

Stage 2 should **not** replace the stage-1 model as the main model.

The stage-1 adapter remains the current primary model because it has:

- slightly better global normalized metrics,
- better adversarial robustness,
- no meaningful disadvantage on O1/A1 compared with stage 2.

Stage 2 is retained as a diagnostic experiment and may be useful only as evidence that weak-layer continuation alone is not sufficient.

### Next step

Do **not** run another blind weak-layer fine-tune yet.

The next scientifically sound step is to improve evaluation/data for weak layers:

1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values.
2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.

### Updated project status

Primary model: **stage 1 Qwen3-8B QLoRA adapter**

Stage 2 status: **diagnostic / not promoted**

Current best headline metrics remain the stage-1 normalized results:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
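
The promotion decision can also be expressed as a small regression check over the stage-1 and stage-2 headline metrics reported in this journal. The sketch below is illustrative only: the success criteria say "does not regress substantially" without a number, so the 0.005 tolerance is an assumption, and the helper names are hypothetical.

```python
from typing import NamedTuple


class SplitMetrics(NamedTuple):
    parse: float
    field_f1: float
    key_f1: float


# Values copied from the stage-1 / stage-2 comparison tables in this journal.
STAGE1 = {
    "test_in_distribution": SplitMetrics(1.0000, 0.7956, 0.9811),
    "test_template_ood": SplitMetrics(1.0000, 0.7865, 0.9801),
    "test_use_case_ood": SplitMetrics(0.9998, 0.7907, 0.9805),
    "test_sector_ood": SplitMetrics(1.0000, 0.7697, 0.9818),
    "test_adversarial": SplitMetrics(1.0000, 0.9697, 1.0000),
}
STAGE2 = {
    "test_in_distribution": SplitMetrics(0.9993, 0.7952, 0.9796),
    "test_template_ood": SplitMetrics(1.0000, 0.7855, 0.9786),
    "test_use_case_ood": SplitMetrics(0.9995, 0.7895, 0.9787),
    "test_sector_ood": SplitMetrics(1.0000, 0.7694, 0.9809),
    "test_adversarial": SplitMetrics(0.9697, 0.9596, 0.9697),
}

TOLERANCE = 0.005  # assumption: the journal does not quantify "substantially"


def regressions(baseline, candidate, tolerance=TOLERANCE):
    """List (split, metric, delta) for every metric that dropped by more than tolerance."""
    flagged = []
    for split, base in baseline.items():
        cand = candidate[split]
        for metric in SplitMetrics._fields:
            delta = getattr(cand, metric) - getattr(base, metric)
            if delta < -tolerance:
                flagged.append((split, metric, round(delta, 4)))
    return flagged


flagged = regressions(STAGE1, STAGE2)
for split, metric, delta in flagged:
    print(f"{split}: {metric} regressed by {delta}")
```

With this tolerance only the adversarial split is flagged, on all three metrics. That matches the qualitative reading above: globally flat, with a meaningful adversarial regression.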