# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Journal conventions

Each entry should include:

1. **Date/time**
2. **Goal**
3. **Action**
4. **Evidence / result**
5. **Interpretation**
6. **Decision / next step**

For research claims, prefer numeric evidence over qualitative statements.

---

## 2026-04-30 — Dataset cloned and audited

### Goal

Clone and scientifically audit `nraptisss/TMF921-intent-to-config-augmented` before training.

### Action

The dataset was cloned in the sandbox and a comprehensive audit was run over schema, missingness, ChatML formatting, JSON validity, duplicates/leakage, distribution balance, numeric KPI ranges, train/test similarity, and scientific validity.

### Evidence / result

Dataset size:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**

Quality checks:

- Missing values: **0**
- Duplicate IDs: **0**
- Duplicate full conversations: **0**
- Assistant JSON parse validity: **41,815 / 41,815 = 100%**
- Role sequence: `system -> user -> assistant` for all rows

Leakage / similarity findings:

- Exact train/test user-prompt overlap: **0**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high:
  - test prompts with char-ngram similarity >= 0.90 to train: **1,290 / 2,521 = 51.2%**
  - >= 0.95: **602 / 2,521 = 23.9%**
  - >= 0.98: **262 / 2,521 = 10.4%**
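
The audit does not record which similarity function produced these counts. As a minimal sketch, assuming Jaccard similarity over character 5-gram sets (the measure, `n=5`, and all function names are assumptions, not the audit's actual code), the per-threshold counts could be computed like this:

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Character n-grams of a lowercased, whitespace-normalized string."""
    norm = " ".join(text.lower().split())
    if len(norm) < n:
        return {norm} if norm else set()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_train_similarity(test_prompt: str, train_prompts: list[str], n: int = 5) -> float:
    """Highest similarity between one test prompt and any train prompt."""
    test_grams = char_ngrams(test_prompt, n)
    return max((jaccard(test_grams, char_ngrams(p, n)) for p in train_prompts), default=0.0)


def count_above(test_prompts, train_prompts, thresholds=(0.90, 0.95, 0.98), n=5):
    """Number of test prompts whose best train match reaches each threshold."""
    best = [max_train_similarity(t, train_prompts, n) for t in test_prompts]
    return {th: sum(score >= th for score in best) for th in thresholds}
```

This brute-force version is quadratic in the number of prompts, so a full 2,521 x 39,294 comparison would likely need caching of the train n-gram sets or a vectorized implementation.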

Distribution findings:

- `create` lifecycle operation: **40,090 / 41,815 = 95.9%**
- non-create lifecycle rows: **1,725 = 4.1%**
- adversarial rows: **166 = 0.397%**
- only **31 unique JSON structure signatures** across 41,815 rows
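
How a "JSON structure signature" was derived is not stated in this entry. One plausible definition, sketched below under that assumption, hashes the sorted set of key paths while ignoring values and list lengths; `key_paths` and `structure_signature` are hypothetical names.

```python
import hashlib
import json


def key_paths(node, prefix=""):
    """Yield structural paths of a JSON tree; list positions collapse to []."""
    if isinstance(node, dict) and node:
        for key in sorted(node):
            yield from key_paths(node[key], f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list) and node:
        for item in node:
            yield from key_paths(item, prefix + "[]")
    else:
        # Scalars and empty containers terminate a path.
        yield prefix


def structure_signature(completion: str) -> str:
    """Short hash of the set of key paths, independent of values and key order."""
    paths = sorted(set(key_paths(json.loads(completion))))
    return hashlib.sha1("\n".join(paths).encode("utf-8")).hexdigest()[:12]
```

Under this definition, two completions that differ only in values, key order, or the number of list items share one signature, which is the kind of template reuse the low signature count points to.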

### Interpretation

The source dataset is technically clean and suitable for SFT, but the original split is mainly an in-distribution/template-compliance split, not a strong OOD benchmark. JSON validity is excellent, but scientific benchmark validity requires OOD splits and normalized/semantic evaluation.

### Decision / next step

Create a research-grade derivative dataset with:

- OOD splits,
- train/eval provenance columns,
- token-length audit,
- validation flags,
- lifecycle/adversarial upsampling for training only,
- no fabricated continuous-KPI or cross-layer-paired examples without a validated generator.

---

## 2026-04-30 — Research SOTA dataset created

### Goal

Implement the audit recommendations while preserving scientific soundness.

### Action

Created `nraptisss/TMF921-intent-to-config-research-sota`.

Implemented:

- `train_base`
- `train_sota`
- `validation`
- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

Added columns:

- `system`, `prompt`, `completion`
- `prompt_template_id`
- `scenario_id`
- `json_structure_id`
- `json_root_family`
- `messages_format_valid`
- `assistant_is_valid_json`
- `slice_sst_valid`
- `kpi_profile_valid`
- `semantic_rule_valid_v1`
- `qwen3_chat_template_tokens`
- `fits_2048_qwen3`
- `fits_4096_qwen3`
- `sampling_weight_*`
- `is_augmented`, `augmentation_type`, `source_id`, `conversation_type`

### Evidence / result

Published dataset:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training split with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit:

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**
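
The statistics above can be recomputed from the per-row `qwen3_chat_template_tokens` column. The sketch below assumes nearest-rank percentiles (the audit's percentile method is not recorded) and takes plain integer token counts as input; tokenization itself is not shown.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100] over an ascending list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def length_summary(token_counts, limit=2048):
    """Mean, percentiles, max, and the fraction of rows that fit within `limit`."""
    values = sorted(token_counts)
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        "max": values[-1],
        "fit_fraction": sum(v <= limit for v in values) / len(values),
    }
```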

`train_sota` balancing:

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**

### Interpretation

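`max_length=2048` is justified for Qwen3-8B: the longest example is 1,316 tokens, so no row is truncated. `train_sota` improves rare-class exposure (non-create lifecycle rows at 15.97% and adversarial rows at 6.54%, versus 4.1% and 0.397% in the source dataset). The OOD splits allow scientifically meaningful generalization reporting.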

### Decision / next step

Build a training/evaluation repository for a single RTX 6000 Ada server using Qwen3-8B QLoRA.

---

## 2026-04-30 / 2026-05-01 — Training/evaluation repo created

### Goal

Create a reproducible repo for training and evaluation on RTX 6000 Ada 48/50GB.

### Action

Created `nraptisss/tmf921-intent-training` with:

- QLoRA SFT training script,
- evaluation script,
- merge script,
- RTX 6000 Ada install script,
- GPU preflight,
- nohup run scripts,
- resumable checkpoints,
- unique run directories.

Default recipe:

- model: `Qwen/Qwen3-8B`
- method: QLoRA NF4 + double quant
- LoRA target modules: `all-linear`
- LoRA rank: `64`
- LoRA alpha: `16`
- LoRA dropout: `0.05`
- LR: `2e-4`
- scheduler: constant
- max length: `2048`
- assistant-only loss: enabled
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
- eval split: `validation`

### Evidence / result

Repo:

- https://huggingface.co/nraptisss/tmf921-intent-training

### Interpretation

The training approach is consistent with QLoRA literature and fits the memory constraints of a 48/50GB RTX 6000 Ada GPU.

### Decision / next step

Run training under `nohup`, require CUDA preflight, and ensure unique output directories to avoid overwriting results.
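
The repo's run scripts are shell scripts and are not reproduced here. As an illustration only, a timestamped directory name in the pattern used by later runs (`<prefix>-YYYYMMDD-HHMMSS`) can be built and created without risk of overwriting like this; `run_dir_name` and `make_run_dir` are hypothetical helpers.

```python
from datetime import datetime
from pathlib import Path


def run_dir_name(prefix: str, now: datetime) -> str:
    """Build a name such as qwen3-8b-qlora-20260501-083834."""
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"


def make_run_dir(root: str, prefix: str = "qwen3-8b-qlora") -> Path:
    """Create a fresh run directory and fail if it already exists."""
    run_dir = Path(root) / run_dir_name(prefix, datetime.now())
    # exist_ok=False raises FileExistsError instead of reusing an old run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```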

---

## 2026-05-01 — Runtime issues fixed

### Goal

Resolve server-side training errors and ensure training uses GPU.

### Issues encountered and fixes

#### 1. CPU/GPU uncertainty

It was unclear whether training was actually running on the GPU rather than falling back to CPU.

Fix:

- Added `scripts/check_gpu.py`
- Added `scripts/install_rtx6000ada.sh`
- Added fail-fast CUDA checks to training/evaluation scripts.

Evidence from server logs:

```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```

Conclusion: GPU setup confirmed.

#### 2. TRL conversational dataset detection error

Error:

```text
ValueError: You set assistant_only_loss=True, but the dataset is not conversational.
```

Cause:

The dataset contains `messages` plus convenience `prompt`/`completion` columns. TRL inferred prompt-completion format instead of conversational format.

Fix:

Training script now passes only:

```python
train_dataset = train_dataset.select_columns(["messages"])
eval_dataset = eval_dataset.select_columns(["messages"])
```

#### 3. Trackio invalid Space ID

Error:

```text
HFValidationError: Repo id ... 'nraptisss/'
```

Cause:

Invalid `TRACKIO_SPACE_ID=nraptisss/`.

Fix:

Added validation/sanitization for Trackio Space IDs and support for:

```bash
DISABLE_TRACKIO=1
```
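
The validation logic itself is not shown in this entry. A minimal sketch, assuming a Space ID must have the form `owner/name` with a non-empty name (the regular expression is a simplified approximation of Hub naming rules, and `resolve_trackio_space_id` is a hypothetical helper):

```python
import os
import re

# Simplified approximation: owner and name are both non-empty and use
# letters, digits, '.', '_' or '-'.
_SPACE_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def resolve_trackio_space_id(env=None):
    """Return a usable Space ID, or None when tracking should be skipped."""
    env = os.environ if env is None else env
    if env.get("DISABLE_TRACKIO") == "1":
        return None
    space_id = (env.get("TRACKIO_SPACE_ID") or "").strip()
    if not _SPACE_ID.fullmatch(space_id):
        # Covers the failing case above: "nraptisss/" has no Space name.
        return None
    return space_id
```

Returning `None` instead of raising lets the training script continue without external logging when the variable is malformed.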

#### 4. Deprecated warmup argument

Warning:

```text
warmup_ratio is deprecated
```

Fix:

Changed config/script to use:

```yaml
warmup_steps: 0
```

### Decision / next step

Restart training with fixed scripts and disabled Trackio to avoid external logging failures.

---

## 2026-05-01 / 2026-05-02 — Qwen3-8B QLoRA training run completed

### Goal

Train Qwen3-8B QLoRA on `train_sota`.

### Action

Started training under nohup with unique run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Trackio disabled:

```bash
DISABLE_TRACKIO=1
```

### Evidence / result

Training logs showed stable convergence.

Representative metrics:

Initial:

```text
loss: 1.212
mean_token_accuracy: 0.7922
```

After early training:

```text
loss: ~0.15
mean_token_accuracy: ~0.945-0.953
```

Validation loss over training:

```text
eval_loss: 0.1593 at epoch 0.1236
eval_loss: 0.1561 at epoch 0.2472
eval_loss: 0.1548 at epoch 0.3709
eval_loss: 0.1535 at epoch 0.8653
eval_loss: 0.1530 at epoch 1.607
eval_loss: 0.1532 at epoch 1.730
```

No observed:

- CUDA OOM,
- NaNs,
- divergence,
- gradient explosion.

### Interpretation

The run converged smoothly. Loss stabilized around 0.14–0.15 and validation loss plateaued near 0.153, indicating stable SFT convergence.
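
The plateau claim can be checked directly against the logged validation losses. The sketch below uses only the checkpoints listed above; the window size and tolerance are assumed values, not settings from the training script.

```python
# (epoch, eval_loss) pairs copied from the validation log above.
EVAL_LOG = [
    (0.1236, 0.1593),
    (0.2472, 0.1561),
    (0.3709, 0.1548),
    (0.8653, 0.1535),
    (1.607, 0.1530),
    (1.730, 0.1532),
]


def has_plateaued(losses, window=3, tolerance=1e-3):
    """True when the last `window` losses span less than `tolerance`."""
    recent = losses[-window:]
    return len(recent) == window and max(recent) - min(recent) < tolerance


losses = [loss for _, loss in EVAL_LOG]
print(has_plateaued(losses))      # last three evals span 0.0005
print(has_plateaued(losses[:3]))  # first three evals span 0.0045
```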

### Decision / next step

Evaluate the trained adapter across ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 — Evaluation speed issue and merged-model evaluation

### Goal

Evaluate the trained adapter on all splits.

### Issue

The initial evaluator generated one example at a time through the 4-bit adapter model with a large `max_new_tokens`, which made evaluation very slow (roughly 60 s/example on `test_in_distribution`):

```text
test_in_distribution: 1455 examples in ~25h
test_template_ood: ~30-90s/example
```

### Action

Patched evaluator to support:

- batched generation,
- dynamic generation length based on target length + buffer,
- periodic save/resume,
- partial prediction reuse.

Also recommended merging adapter into base bf16 model for faster inference.
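
The batching, dynamic generation length, and resume logic can be sketched independently of the model call. In the sketch below the buffer of 64 tokens, the cap of 2048, and the `id` / `target_tokens` field names are assumptions; the actual evaluator may differ.

```python
def dynamic_max_new_tokens(target_token_counts, buffer=64, cap=2048):
    """Generation budget for a batch: longest reference length plus a buffer."""
    return min(max(target_token_counts) + buffer, cap)


def length_sorted_batches(examples, batch_size):
    """Group examples of similar target length so one budget suits the batch."""
    order = sorted(range(len(examples)), key=lambda i: examples[i]["target_tokens"])
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


def plan_generation(examples, batch_size=8, buffer=64, cap=2048, done_ids=frozenset()):
    """Return (ids, max_new_tokens) per batch, skipping already-saved predictions."""
    pending = [ex for ex in examples if ex["id"] not in done_ids]
    plan = []
    for batch in length_sorted_batches(pending, batch_size):
        budget = dynamic_max_new_tokens([ex["target_tokens"] for ex in batch], buffer, cap)
        plan.append(([ex["id"] for ex in batch], budget))
    return plan
```

Sorting by target length keeps short completions from inheriting the budget of a long one in the same batch, and `done_ids` is what allows a restarted job to reuse partial predictions.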

### Decision / next step

Use merged model evaluation and normalized metrics.

---

## 2026-05-04 — Raw evaluation results

### Goal

Measure raw JSON and field-level performance.

### Evidence / result

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

### Interpretation

The model learned JSON formatting and adversarial rejection very well. Raw exact match is low for the primary config layers, but it is likely too strict a metric because many fields are volatile or generated (`id`, `href`, timestamps, descriptions, schema links).

### Decision / next step

Implement a normalized evaluator that removes volatile fields before scoring.

---

## 2026-05-04 — Normalized evaluator implemented and run

### Goal

Re-score existing predictions using metrics that better reflect structural/semantic configuration agreement.

### Action

Added:

```text
scripts/normalize_eval_metrics.py
```

Normalization removes/masks:

- IDs,
- hrefs,
- names/descriptions,
- timestamps,
- schema links,
- UUID/hash-like strings,
- generated request/policy/booking/intent IDs.

It computes:

- normalized exact match,
- normalized field precision/recall/F1,
- normalized key precision/recall/F1,
- stratified metrics.
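
The following is a simplified sketch of this kind of scoring, not the contents of `scripts/normalize_eval_metrics.py`. The volatile key list, the masking patterns, and the example field names are assumptions chosen for illustration.

```python
import json
import re

VOLATILE_KEYS = {"id", "href", "name", "description", "@schemaLocation"}  # assumed list
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
ISO_TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}")


def flatten(node, prefix=""):
    """Map a JSON tree to {path: value}, dropping volatile keys and masking volatile values."""
    out = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key not in VOLATILE_KEYS:
                out.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            out.update(flatten(item, f"{prefix}[{i}]"))
    else:
        if isinstance(node, str) and (UUID_RE.fullmatch(node) or ISO_TS_RE.match(node)):
            node = "<MASK>"
        out[prefix] = json.dumps(node)  # keeps 1, true and "1" distinct
    return out


def f1(pred_items, gold_items):
    """F1 between two sets of items."""
    tp = len(pred_items & gold_items)
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(gold_items) if gold_items else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def score(pred_text, gold_text):
    """Normalized exact match, field F1 (path and value), and key F1 (path only)."""
    gold = flatten(json.loads(gold_text))
    try:
        pred = flatten(json.loads(pred_text))
    except json.JSONDecodeError:
        return {"json_parse": False, "norm_exact": False, "norm_field_f1": 0.0, "norm_key_f1": 0.0}
    return {
        "json_parse": True,
        "norm_exact": pred == gold,
        "norm_field_f1": f1(set(pred.items()), set(gold.items())),
        "norm_key_f1": f1(set(pred), set(gold)),
    }
```

A prediction that reproduces every key but gets one of three values wrong scores key F1 1.0 and field F1 about 0.667, which is the gap between structural agreement and value fidelity that these metrics are meant to separate.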

### Evidence / result

Headline normalized metrics:

| Split | JSON parse | Raw field F1 | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.6868 | **0.7956** | **0.9811** | 0.0351 |
| `test_template_ood` | 1.0000 | 0.6790 | **0.7865** | **0.9801** | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.6825 | **0.7907** | **0.9805** | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.6610 | **0.7697** | **0.9818** | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | **0.9697** | **1.0000** | 0.9697 |

Strong layers:

- `tmf921`: normalized field F1 around **0.93–0.94**
- `camara`: normalized field F1 around **0.81–0.87**
- `intent_3gpp`: normalized field F1 around **0.80–0.82**
- `etsi_zsm`: normalized field F1 around **0.75–0.79**

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

### Interpretation

The model is much stronger than raw exact-match suggested. It reliably emits valid JSON and correct structural schemas (`norm_key_f1 ≈ 0.98`) across ID and OOD splits. Field-level value fidelity is moderate-to-strong overall, but weak for low-level O1 NRM values and monitoring/report lifecycle outputs.

### Decision / next step

Plan a second-stage weak-layer fine-tune focused on:

- `o1_nrm`,
- `a1_policy`,
- `tmf921_lifecycle_report`,
- `tmf921_lifecycle_monitor`,
- optionally `tmf921_lifecycle_scale`.

Use the current adapter as initialization, lower LR, and include replay from strong layers to prevent forgetting.

---

## Current scientific status

### What can be claimed now

The Qwen3-8B QLoRA model trained on the TMF921 Research SOTA split achieves:

- near-perfect JSON validity,
- stable OOD generalization (normalized field F1 on every OOD split is within about 0.026 of the ID score),
- excellent adversarial rejection,
- normalized structural key F1 around 98% across non-adversarial ID/OOD splits,
- normalized field F1 around 77–80% across ID/OOD splits.

### What should not be overclaimed

Do not claim production-grade standards compliance yet. Current evaluation is normalized JSON/field scoring, not official TMF921/3GPP/ETSI/CAMARA/O-RAN schema validation.

### Main weaknesses

- O1 NRM value fidelity is poor despite correct structure.
- Lifecycle report/monitor outputs need targeted improvement.
- Raw exact match remains low for primary create configs.

### Next planned experiment

Second-stage weak-layer adapter continuation:

- initialize from current Qwen3-8B TMF921 adapter,
- train on weak-layer examples plus replay buffer,
- lower LR: `5e-5` or `1e-4`,
- 1 epoch,
- same max length 2048,
- evaluate again with raw + normalized metrics.

---

## Open questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Should training use a weak-layer second stage or should dataset generation be improved first?

---

## Running log template

```markdown
## YYYY-MM-DD — Short title

### Goal

### Action

### Evidence / result

### Interpretation

### Decision / next step
```

---

## 2026-05-04 — Stage 2 weak-layer continuation plan implemented

### Goal

Improve weak target layers identified by normalized evaluation without degrading strong layers.

Weak layers from normalized evaluation:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
- `tmf921_lifecycle_scale`: mixed, included because lifecycle scaling still had noticeable errors

### Action

Added stage-2 tooling:

- `scripts/build_weak_layer_dataset.py`
- `scripts/train_continue_adapter.py`
- `configs/stage2_weak_layer_qwen3_8b.yaml`
- `scripts/nohup_stage2_weak.sh`

The weak-layer dataset builder creates a local parquet training set with:

1. all weak-layer rows from `train_sota`,
2. duplicated rare weak layers up to a minimum count,
3. a replay buffer from non-weak layers to reduce forgetting.

The continuation trainer loads:

1. Qwen3-8B base model in 4-bit NF4,
2. the existing LoRA adapter with `is_trainable=True`,
3. the local weak-layer replay dataset,
4. TRL `SFTTrainer` without a new `peft_config`, so that the already-loaded adapter is the one being trained rather than a freshly initialized one.

Stage-2 default hyperparameters:

```yaml
learning_rate: 5e-5
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
```

### Interpretation

A lower learning rate and replay buffer should improve weak-layer value fidelity while reducing catastrophic forgetting on strong layers. This is a targeted continuation, not a replacement for Gen4 data generation or official schema validation.

### Decision / next step

Run stage-2 from the completed stage-1 adapter:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

After training, evaluate with the same raw + normalized OOD protocol and compare against stage-1 metrics.

---

## 2026-05-05 — Stage 2 weak-layer continuation run started

### Goal

Run the stage-2 weak-layer continuation experiment implemented on 2026-05-04.

The intended scientific question is:

> Can a short, low-learning-rate continuation on weak target layers improve low-performing layer-specific value fidelity while preserving the strong global JSON validity, key structure, and adversarial behavior from stage 1?

### Action

Started stage 2 with:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

Generated run:

```text
runs/stage2-weak-20260505-080040
```

Source adapter:

```text
runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
```

### Stage-2 dataset composition

The weak-layer dataset builder produced:

```json
{
  "rows_train_stage2": 13829,
  "rows_validation": 1547,
  "weak_rows_total_after_duplication": 10638,
  "replay_rows": 3191,
  "rare_min_per_layer": 1500,
  "replay_ratio": 0.3
}
```

Layer counts before/after rare-layer duplication:

| Target layer | Before | After |
|---|---:|---:|
| `o1_nrm` | 2,672 | 2,672 |
| `a1_policy` | 3,466 | 3,466 |
| `tmf921_lifecycle_report` | 596 | 1,500 |
| `tmf921_lifecycle_monitor` | 726 | 1,500 |
| `tmf921_lifecycle_scale` | 576 | 1,500 |

Replay buffer size:

- replay rows from non-weak layers: **3,191**
- purpose: reduce catastrophic forgetting on strong layers such as `tmf921`, `camara`, `intent_3gpp`, `etsi_zsm`, and adversarial rejection.
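
The reported composition is arithmetically consistent with a simple rule: raise each weak layer to at least `rare_min_per_layer` rows, then add replay rows equal to `replay_ratio` times the weak total, truncated to an integer. The sketch below assumes that rule; the builder's actual sampling code is not shown in this journal.

```python
# Weak-layer row counts before duplication, from the table above.
WEAK_COUNTS = {
    "o1_nrm": 2672,
    "a1_policy": 3466,
    "tmf921_lifecycle_report": 596,
    "tmf921_lifecycle_monitor": 726,
    "tmf921_lifecycle_scale": 576,
}


def stage2_composition(weak_counts, rare_min_per_layer, replay_ratio):
    """Row counts after rare-layer duplication plus a proportional replay buffer."""
    after = {layer: max(n, rare_min_per_layer) for layer, n in weak_counts.items()}
    weak_total = sum(after.values())
    replay_rows = int(weak_total * replay_ratio)  # assumed truncation
    return {
        "per_layer_after": after,
        "weak_rows_total_after_duplication": weak_total,
        "replay_rows": replay_rows,
        "rows_train_stage2": weak_total + replay_rows,
    }


print(stage2_composition(WEAK_COUNTS, rare_min_per_layer=1500, replay_ratio=0.3))
```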

Full target-layer composition in stage-2 train set:

| Target layer | Rows |
|---|---:|
| `a1_policy` | 3,466 |
| `o1_nrm` | 2,672 |
| `tmf921_lifecycle_monitor` | 1,500 |
| `tmf921_lifecycle_report` | 1,500 |
| `tmf921_lifecycle_scale` | 1,500 |
| `tmf921` replay | 902 |
| `intent_3gpp` replay | 630 |
| `camara` replay | 618 |
| `etsi_zsm` replay | 335 |
| adversarial replay and other lifecycle replay | remaining rows |

### Training configuration

Resolved stage-2 config:

```yaml
model_name_or_path: Qwen/Qwen3-8B
adapter_path: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
dataset_dir: runs/stage2-weak-20260505-080040/weak_layer_data
output_dir: runs/stage2-weak-20260505-080040/outputs/adapter
learning_rate: 5.0e-05
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
bf16: true
gradient_checkpointing: true
optim: paged_adamw_32bit
```

### Evidence that adapter continuation was configured correctly

Server log confirmed:

```text
Base model: Qwen/Qwen3-8B
Adapter: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
trainable params: 174,587,904 || all params: 8,365,323,264 || trainable%: 2.0870
TunerModelStatus(... active_adapters=['default'], requires_grad={'default': True}, devices={'default': ['cuda']})
```

Interpretation:

- The existing adapter was loaded.
- Adapter weights are trainable.
- Training is on CUDA.
- The base model is not being full-finetuned; only LoRA adapter parameters are updated.

### Early training evidence

Stage-2 training began normally after tokenization:

```text
Tokenizing train dataset: 13,829 / 13,829
Tokenizing eval dataset: 1,547 / 1,547
```

Representative early logs:

```text
loss: 0.1313, grad_norm: 0.0199, lr: 5e-05, mean_token_accuracy: 0.9572, epoch: 0.0012
loss: 0.1686, grad_norm: 0.0317, lr: 5e-05, mean_token_accuracy: 0.9435, epoch: 0.0116
loss: 0.1541, grad_norm: 0.0166, lr: 5e-05, mean_token_accuracy: 0.9463, epoch: 0.1157
```

Validation during stage 2:

```text
eval_loss: 0.1581 at epoch 0.1157
eval_loss: 0.1582 at epoch 0.2314
eval_loss: 0.1584 at epoch 0.3471
eval_loss: 0.1585 at epoch 0.4628
```

At approximately 50% completion:

```text
epoch: 0.4975 / 1.0
loss: 0.1366-0.1428 range near midpoint
grad_norm: generally <0.14
mean_token_accuracy: about 0.95
```

### Interpretation

The stage-2 run is healthy:

- no CUDA OOM,
- no NaN/Inf,
- no gradient explosion,
- GPU is active,
- adapter continuation is correctly configured.

Validation loss (~0.158) is slightly worse than the stage-1 plateau (~0.153), but this is expected because stage 2 intentionally shifts the training distribution toward harder weak layers. The decisive evaluation is not broad validation loss alone; it is the post-stage-2 OOD normalized weak-layer comparison.

### Decision / next step

Let stage 2 finish. After completion:

1. merge the stage-2 adapter,
2. run OOD evaluation,
3. run normalized evaluator,
4. compare against stage-1 baselines.

Commands planned after stage 2:

```bash
RUN_DIR="runs/stage2-weak-20260505-080040"

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter "$RUN_DIR/outputs/adapter" \
  --output_dir "$RUN_DIR/outputs/merged"

EVAL_BATCH_SIZE=8 \
bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"

python scripts/normalize_eval_metrics.py \
  --eval_dir "$RUN_DIR/eval"
```

### Success criteria

Stage 2 is successful if:

1. weak-layer normalized field F1 improves:
   - `o1_nrm` above stage-1 ~0.39-0.40,
   - `a1_policy` above stage-1 ~0.67-0.68,
   - `tmf921_lifecycle_report` above stage-1 ~0.15-0.18,
   - `tmf921_lifecycle_monitor` above stage-1 ~0.39-0.52;
2. global normalized field F1 does not regress substantially:
   - stage-1 ID: 0.7956,
   - stage-1 template OOD: 0.7865,
   - stage-1 use-case OOD: 0.7907,
   - stage-1 sector OOD: 0.7697;
3. JSON parse remains near 100%;
4. adversarial normalized exact remains close to 0.9697.

### Failure modes to watch

- Global regression from weak-layer overfitting.
- Adversarial degradation from insufficient replay.
- O1 NRM still weak, suggesting need for layer-specific semantic evaluator or improved data generation rather than more SFT.
- Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.


---

## 2026-05-05 — Stage 2 evaluation completed and decision made

### Goal

Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.

### Action

After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:

- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

The normalized evaluator was then run on the generated predictions:

```bash
python scripts/normalize_eval_metrics.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```

### Evidence / result

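Global normalized comparison, stage 1 -> stage 2 (some deltas differ by 0.0001 from the difference of the displayed values, presumably because they were computed before rounding):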

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

JSON parse comparison:

| Split | Stage 1 parse | Stage 2 parse | Delta |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
| `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
| `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |
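
Because parse rates are reported to four decimals and every test split has fewer than 5,000 rows, each rate maps back to a unique failure count. A minimal sketch using the split sizes from the dataset table (`implied_failures` is a hypothetical helper):

```python
# Rows per test split, from the research dataset split table.
SPLIT_ROWS = {
    "test_in_distribution": 1455,
    "test_template_ood": 3503,
    "test_use_case_ood": 4341,
    "test_sector_ood": 4579,
    "test_adversarial": 33,
}

# (stage 1, stage 2) JSON parse rates from the table above.
PARSE_RATES = {
    "test_in_distribution": (1.0000, 0.9993),
    "test_template_ood": (1.0000, 1.0000),
    "test_use_case_ood": (0.9998, 0.9995),
    "test_sector_ood": (1.0000, 1.0000),
    "test_adversarial": (1.0000, 0.9697),
}


def implied_failures(parse_rate, n_rows):
    """Number of unparseable outputs implied by a rounded parse rate."""
    return n_rows - round(parse_rate * n_rows)


for split, (stage1, stage2) in PARSE_RATES.items():
    n = SPLIT_ROWS[split]
    print(split, implied_failures(stage1, n), "->", implied_failures(stage2, n))
```

Read this way, the stage-2 parse regressions amount to one or two outputs per affected split; on `test_adversarial` a single failure out of 33 produces the full 0.0303 drop.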

Weak-layer normalized field F1 comparison, stage 1 -> stage 2:

| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |
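
To summarize the table per layer, the sketch below averages the stage-2 minus stage-1 difference across the splits in which each layer appears. It is an unweighted mean over the rounded table values, so it ignores how many rows each split contributes and should be read as a rough summary only.

```python
from collections import defaultdict

# (split, layer, stage-1 F1, stage-2 F1) copied from the table above.
ROWS = [
    ("ID", "o1_nrm", 0.3927, 0.3906),
    ("ID", "a1_policy", 0.6837, 0.6787),
    ("ID", "tmf921_lifecycle_report", 0.1667, 0.1889),
    ("ID", "tmf921_lifecycle_monitor", 0.5172, 0.4926),
    ("ID", "tmf921_lifecycle_scale", 0.9345, 0.9453),
    ("Template OOD", "o1_nrm", 0.3976, 0.3993),
    ("Template OOD", "a1_policy", 0.6763, 0.6758),
    ("Template OOD", "tmf921_lifecycle_report", 0.1799, 0.1905),
    ("Template OOD", "tmf921_lifecycle_scale", 0.5363, 0.5560),
    ("Use-case OOD", "o1_nrm", 0.3936, 0.3895),
    ("Use-case OOD", "a1_policy", 0.6808, 0.6786),
    ("Use-case OOD", "tmf921_lifecycle_report", 0.1531, 0.1981),
    ("Use-case OOD", "tmf921_lifecycle_monitor", 0.3875, 0.4187),
    ("Use-case OOD", "tmf921_lifecycle_scale", 0.6993, 0.7411),
    ("Sector OOD", "o1_nrm", 0.3858, 0.3888),
    ("Sector OOD", "a1_policy", 0.6740, 0.6763),
    ("Sector OOD", "tmf921_lifecycle_report", 0.1763, 0.1830),
    ("Sector OOD", "tmf921_lifecycle_monitor", 0.4310, 0.4696),
    ("Sector OOD", "tmf921_lifecycle_scale", 0.7279, 0.7437),
]


def mean_delta_by_layer(rows):
    """Unweighted mean of (stage 2 - stage 1) per layer across splits."""
    deltas = defaultdict(list)
    for _split, layer, stage1, stage2 in rows:
        deltas[layer].append(stage2 - stage1)
    return {layer: sum(values) / len(values) for layer, values in deltas.items()}


for layer, delta in sorted(mean_delta_by_layer(ROWS).items()):
    print(f"{layer}: {delta:+.4f}")
```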

### Interpretation

Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.

Key observations:

1. Global normalized field F1 changed by at most about 0.12 percentage points on the non-adversarial splits. This is effectively flat.
2. Normalized key F1 regressed slightly (by 0.0009 to 0.0018) on all non-adversarial splits.
3. Adversarial performance regressed more visibly, although this split has only 33 examples, so a single example is worth 0.0303:
   - normalized field F1: **0.9697 -> 0.9596**
   - normalized key F1: **1.0000 -> 0.9697**
   - parse rate: **1.0000 -> 0.9697** (one unparseable output)
4. `o1_nrm` did not improve in any meaningful way. Changes are between about -0.004 and +0.003, which is noise-level.
5. `a1_policy` also did not improve meaningfully.
6. Lifecycle report/monitor/scale improved on some OOD splits, especially use-case and sector OOD, but not consistently enough to justify replacing the stage-1 model.

The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity. The likely limitation is not lack of exposure alone, but either:

- insufficient semantic supervision in the data,
- inadequacy of flat field-F1 for some low-level configs,
- need for layer-specific validators and value extractors,
- or the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.

### Decision

Stage 2 should **not** replace the stage-1 model as the main model.

The stage-1 adapter remains the current primary model because it has:

- slightly better global normalized metrics,
- better adversarial robustness,
- no meaningful disadvantage on O1/A1 compared with stage 2.

Stage 2 is retained as a diagnostic experiment and may be useful only as evidence that weak-layer continuation alone is not sufficient.

### Next step

Do **not** run another blind weak-layer fine-tune yet. The next scientifically sound step is to improve evaluation/data for weak layers:

1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values.
2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.

### Updated project status

Primary model: **stage 1 Qwen3-8B QLoRA adapter**

Stage 2 status: **diagnostic / not promoted**

Current best headline metrics remain the stage-1 normalized results:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |