# Paper Tables

This file contains draft tables for the manuscript.

---

## Table 1: Research dataset splits

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | Unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | Training split with lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | Validation during training |
| `test_in_distribution` | 1,455 | In-distribution test |
| `test_template_ood` | 3,503 | Held-out prompt-template family |
| `test_use_case_ood` | 4,341 | Held-out use cases |
| `test_sector_ood` | 4,579 | Held-out sectors |
| `test_adversarial` | 33 | Held-out adversarial rejection examples |

---

## Table 2: Qwen3 token-length audit

| Statistic | Tokens |
|---|---:|
| Mean | 754.1 |
| p50 | 705 |
| p95 | 1293 |
| p99 | 1300 |
| Max | 1316 |
| Fit under 2048 | 100% |

Interpretation: `max_length=2048` is safe for Qwen3-8B on this dataset.

---

## Table 3: Stage-1 training configuration

| Item | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` |
| Training method | QLoRA SFT |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Target modules | `all-linear` |
| Max length | 2048 |
| Loss | Assistant-only SFT loss |
| Learning rate | 2e-4 |
| Scheduler | constant |
| Optimizer | paged AdamW 32-bit |
| Gradient checkpointing | enabled |
| Hardware | RTX 6000 Ada 48/50GB |
| Train split | `train_sota` |

---

## Table 4: Stage-1 raw metrics

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

---

## Table 5: Stage-1 normalized metrics

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

---

## Table 6: Stage-1 strong and weak target layers

| Target layer | Normalized field F1 range | Interpretation |
|---|---:|---|
| `tmf921` | 0.93–0.94 | Strong high-level intent object generation |
| `camara` | 0.81–0.87 | Strong after volatile-field normalization |
| `intent_3gpp` | 0.80–0.82 | Strong/moderate |
| `etsi_zsm` | 0.75–0.79 | Moderate/strong |
| `a1_policy` | 0.67–0.68 | Moderate, value fidelity remains limited |
| `o1_nrm` | 0.39–0.40 | Weak value fidelity despite correct structure |
| `tmf921_lifecycle_report` | 0.15–0.18 | Weak, likely measurement/simulation mismatch |
| `tmf921_lifecycle_monitor` | 0.39–0.52 | Weak/mixed |

---

## Table 7: Stage 1 vs Stage 2 global comparison

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

Decision: Stage 2 is diagnostic and not promoted.

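
The Stage 1 vs Stage 2 comparison in Table 7 can be re-derived with a few lines of Python. The sketch below copies the table values and applies a promotion rule; the rule itself (`should_promote`, with its `min_gain` threshold) is a hypothetical illustration, since the manuscript states only the decision and not the criterion behind it. Deltas recomputed from the rounded table entries can differ from the printed deltas in the last digit, presumably because the printed ones were computed before rounding.

```python
# Normalized field F1 and key F1 per split as (Stage 1, Stage 2), copied from Table 7.
TABLE_7 = {
    "test_in_distribution": {"field_f1": (0.7956, 0.7952), "key_f1": (0.9811, 0.9796)},
    "test_template_ood": {"field_f1": (0.7865, 0.7855), "key_f1": (0.9801, 0.9786)},
    "test_use_case_ood": {"field_f1": (0.7907, 0.7895), "key_f1": (0.9805, 0.9787)},
    "test_sector_ood": {"field_f1": (0.7697, 0.7694), "key_f1": (0.9818, 0.9809)},
    "test_adversarial": {"field_f1": (0.9697, 0.9596), "key_f1": (1.0000, 0.9697)},
}


def stage_deltas(table, metric):
    """Return Stage 2 minus Stage 1 for one metric, rounded to 4 decimals."""
    return {
        split: round(row[metric][1] - row[metric][0], 4)
        for split, row in table.items()
    }


def should_promote(deltas, min_gain=0.005):
    """Hypothetical rule: no split regresses and at least one gains min_gain."""
    values = list(deltas.values())
    return all(v >= 0 for v in values) and any(v >= min_gain for v in values)


field_deltas = stage_deltas(TABLE_7, "field_f1")
key_deltas = stage_deltas(TABLE_7, "key_f1")
promote = should_promote(field_deltas) and should_promote(key_deltas)

print(field_deltas)
print(key_deltas)
print("promote Stage 2:", promote)
```

Under this rule Stage 2 is not promoted, because every split regresses slightly on both metrics. A looser rule (for example, one that tolerates regressions below a noise threshold) would need per-split variance estimates that these tables do not report.
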

---

## Table 8: Stage 1 vs Stage 2 weak-layer comparison

| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |

---

## Table 9: Limitations summary

| Limitation | Impact | Mitigation / future work |
|---|---|---|
| Synthetic data | May not reflect real operator language | Add expert/human-authored validation subset |
| No official standard validators | Cannot claim production compliance | Add TMF921/CAMARA/OpenAPI/YANG validators |
| O1 NRM weak value fidelity | Low-level RAN configuration unreliable | Add semantic evaluator and canonical labels |
| A1 policy moderate fidelity | Policy values may be wrong | Add policy-specific extractor/scorer |
| Lifecycle report/monitor weak | Measurement fields may be hard to reproduce | Use tolerance/semantic scoring |
| Exact match low | Raw exact match over-penalizes volatile fields | Report normalized metrics alongside raw |
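
The gap between normalized key F1 and normalized field F1 reported above can be illustrated with a small scorer. This is a minimal sketch, not the evaluation code used for the tables: the `VOLATILE_KEYS` set, the path-flattening scheme, and the function names are all assumptions made for illustration, since the manuscript does not list which fields are treated as volatile or how nested objects are compared.

```python
import json

# Hypothetical volatile keys; the manuscript does not enumerate them.
VOLATILE_KEYS = {"id", "href", "creationDate", "lastUpdate"}


def flatten(obj, prefix=""):
    """Yield (path, leaf_value) pairs, skipping volatile keys."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key in VOLATILE_KEYS:
                continue
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj


def f1(pred, gold):
    """Set-based F1; two empty sets count as a perfect match."""
    if not pred and not gold:
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def normalized_scores(pred, gold):
    """Compute field F1 over (path, value) pairs and key F1 over paths."""
    # json.dumps keeps 1, 1.0 and true distinct when values are compared.
    pred_pairs = {(path, json.dumps(value)) for path, value in flatten(pred)}
    gold_pairs = {(path, json.dumps(value)) for path, value in flatten(gold)}
    return {
        "field_f1": f1(pred_pairs, gold_pairs),
        "key_f1": f1({p for p, _ in pred_pairs}, {p for p, _ in gold_pairs}),
    }


gold = {"id": "a1", "name": "slice", "kpi": {"latency_ms": 10, "throughput_mbps": 100}}
pred = {"id": "zz", "name": "slice", "kpi": {"latency_ms": 10, "throughput_mbps": 50}}
scores = normalized_scores(pred, gold)
print(scores)
```

In this toy example the prediction has the right structure but one wrong value, so key F1 is 1.0 while field F1 drops to 2/3. A tolerance-based variant, as suggested for the lifecycle layers, would replace the exact value comparison with a numeric closeness check.
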