Upload LEAKAGE_ANALYSIS.md
LEAKAGE_ANALYSIS.md
ADDED
@@ -0,0 +1,100 @@
# Train/Test Leakage Analysis

**Date:** 2026-08-05
**Analyst:** ML Intern automated audit
**Dataset:** nraptisss/TMF921-intent-to-config-research-sota

## Summary

The journal entry from 2026-04-30 reports that "near-duplicate prompt similarity was high", with 602 of 2,521 test prompts having ≥95% char-ngram similarity to train prompts. **This analysis finds that the similarity is structural (shared sentence templates), NOT data leakage.** The OOD splits are scientifically valid.

## Key Findings

### 1. Exact Prompt Overlap: 0% across ALL splits

| Split | Exact Prompt Overlap |
|---|---|
| test_in_distribution | 0 / 1,455 (0.0%) |
| test_template_ood | 0 / 3,503 (0.0%) |
| test_use_case_ood | 0 / 4,341 (0.0%) |
| test_sector_ood | 0 / 4,579 (0.0%) |
| test_adversarial | 0 / 33 (0.0%) |

**No test example has the exact same prompt text as any training example.**

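The exact-overlap check above can be sketched in a few lines. This is a minimal illustration, not the audit's actual script: the `prompt` field name, the `exact_overlap` helper, and the toy rows are all assumptions made for the example.

```python
def exact_overlap(train_rows, test_rows, field="prompt"):
    """Count test rows whose `field` value appears verbatim in train."""
    train_values = {row[field] for row in train_rows}
    hits = sum(1 for row in test_rows if row[field] in train_values)
    return hits, len(test_rows)


# Toy rows standing in for the real splits (illustrative values only).
train = [
    {"prompt": "Set up a network slice for telemedicine at Athens"},
    {"prompt": "Deploy a URLLC slice for robotics with 5 ms latency"},
]
test = [
    {"prompt": "Set up a network slice for telemedicine at Patras"},
    {"prompt": "Deploy a URLLC slice for drones with 5 ms latency"},
]

hits, total = exact_overlap(train, test)
print(f"{hits} / {total} ({100 * hits / total:.1f}%)")
```

Building a set of training values first keeps the check linear in the number of rows, which matters when the same comparison is repeated for every split.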
### 2. Template + Scenario Pairs: 0% overlap

| Split | Template+Scenario Pair Overlap |
|---|---|
| test_in_distribution | 0 / 1,455 (0.0%) |
| test_template_ood | 0 / 3,503 (0.0%) |
| test_use_case_ood | 0 / 4,341 (0.0%) |
| test_sector_ood | 0 / 4,579 (0.0%) |
| test_adversarial | 0 / 33 (0.0%) |

**No test example combines the same template AND scenario as any training example.** Where a test example reuses a training template, it is paired with a scenario that template was never paired with in training.

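A pair-level check differs from the single-field check in that a template and a scenario may each appear in training while their combination does not. The sketch below assumes rows carry `prompt_template_id` and `scenario_id` fields (the names used as OOD criteria in this report); the `pair_overlap` helper and the toy IDs are hypothetical.

```python
def pair_overlap(train_rows, test_rows,
                 keys=("prompt_template_id", "scenario_id")):
    """Count test rows whose (template, scenario) pair occurs in train."""
    train_pairs = {tuple(row[k] for k in keys) for row in train_rows}
    hits = sum(
        1 for row in test_rows
        if tuple(row[k] for k in keys) in train_pairs
    )
    return hits, len(test_rows)


# Toy IDs: template T1 and scenario S2 both occur in train,
# but never together, so the pair (T1, S2) is not an overlap.
train = [
    {"prompt_template_id": "T1", "scenario_id": "S1"},
    {"prompt_template_id": "T2", "scenario_id": "S2"},
]
test = [
    {"prompt_template_id": "T1", "scenario_id": "S2"},
]

print(pair_overlap(train, test))
```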
### 3. OOD Split Construction Validation

| Split | OOD Criterion | OOD % | In-Train % |
|---|---|---|---|
| test_template_ood | prompt_template_id | **100.0%** (65/65) | 0.0% |
| test_use_case_ood | scenario_id | **100.0%** (2,545/2,545) | 0.0% |
| test_sector_ood | scenario_id | **100.0%** (2,769/2,769) | 0.0% |
| test_adversarial | prompt_template_id | **100.0%** (33/33) | 0.0% |
| test_in_distribution | n/a | 40.3% OOD scenarios | 59.7% ID |

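One way to validate an OOD criterion is to compare the set of IDs in a test split against the set seen in training. The sketch below assumes the counts in the table are taken over unique IDs, which the report does not state explicitly; the `held_out_fraction` helper and the toy IDs are hypothetical.

```python
def held_out_fraction(train_rows, test_rows, id_field):
    """Return (unseen unique IDs, total unique IDs) for a test split."""
    train_ids = {row[id_field] for row in train_rows}
    test_ids = {row[id_field] for row in test_rows}
    unseen = test_ids - train_ids
    return len(unseen), len(test_ids)


# Toy scenario IDs (illustrative values only).
train = [{"scenario_id": "S1"}, {"scenario_id": "S2"}]
ood_test = [{"scenario_id": "S3"}, {"scenario_id": "S4"},
            {"scenario_id": "S3"}]
mixed_test = [{"scenario_id": "S1"}, {"scenario_id": "S5"}]

for name, split in [("ood_test", ood_test), ("mixed_test", mixed_test)]:
    unseen, total = held_out_fraction(train, split, "scenario_id")
    print(f"{name}: {100 * unseen / total:.1f}% ({unseen}/{total})")
```

A split built by holding out an ID should report 100% here; a random split, like `mixed_test` in the toy data, will report a mix of seen and unseen IDs.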
### 4. Completion Overlap Analysis

| Split | Completion Overlap | Explanation |
|---|---|---|
| test_in_distribution | 60.3% | Original random split; some lifecycle outputs are deterministic |
| test_template_ood | 44.6% | Same cause; this split contains fewer lifecycle operations |
| **test_use_case_ood** | **0.0%** | **No identical completions; genuinely OOD** |
| **test_sector_ood** | **0.0%** | **No identical completions; genuinely OOD** |
| test_adversarial | 100.0% | Expected: standardized rejection responses |

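Breaking completion overlap down by layer helps show whether identical completions come from deterministic outputs rather than from copied examples. The sketch below is an assumption-laden illustration: `target_layer` is the grouping field named in this report, while the `completion` field name, the layer values, and the toy rows are hypothetical.

```python
from collections import defaultdict


def completion_overlap_by_layer(train_rows, test_rows):
    """Map each target_layer to (identical completions, test rows)."""
    train_completions = {row["completion"] for row in train_rows}
    stats = defaultdict(lambda: [0, 0])  # layer -> [hits, total]
    for row in test_rows:
        bucket = stats[row["target_layer"]]
        bucket[1] += 1
        if row["completion"] in train_completions:
            bucket[0] += 1
    return {layer: tuple(counts) for layer, counts in stats.items()}


# Toy rows: the lifecycle output is deterministic, the config is not.
train = [
    {"target_layer": "lifecycle", "completion": '{"status": "terminated"}'},
    {"target_layer": "config", "completion": '{"slice": "A"}'},
]
test = [
    {"target_layer": "lifecycle", "completion": '{"status": "terminated"}'},
    {"target_layer": "config", "completion": '{"slice": "B"}'},
]

print(completion_overlap_by_layer(train, test))
```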
### 5. What "High Char-Ngram Similarity" Actually Means

The journal reports:
- ≥90% similarity: 1,290 / 2,521
- ≥95% similarity: 602 / 2,521
- ≥98% similarity: 262 / 2,521

**This measures structural similarity, not content duplication.**

All prompts follow templated patterns:
- *"Set up a network slice for [use_case] at [region]"*
- *"Deploy a [slice_type] slice for [use_case] with [latency] ms latency"*

The `prompt_normalized` column confirms this: it replaces variables with placeholders such as `<use_case>`, `<region>`, and `<num>`.

**High char-ngram similarity = same sentence structure with different values = EXPECTED and CORRECT for a templated dataset.**

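To see why templated prompts score high on character n-gram similarity without being duplicates, consider the sketch below. It uses Jaccard similarity over character trigrams as a stand-in, since the journal's exact metric is not specified here, and a simplified `normalize` that only handles the `<num>` placeholder; both helpers and the example prompts are assumptions.

```python
import re


def char_ngrams(text, n=3):
    """Set of lowercase character n-grams in `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def normalize(prompt):
    """Simplified normalization: replace numbers with <num> only."""
    return re.sub(r"\d+(?:\.\d+)?", "<num>", prompt)


a = "Deploy a URLLC slice for factory automation with 5 ms latency"
b = "Deploy a URLLC slice for factory automation with 8 ms latency"

print(f"similarity: {jaccard_similarity(a, b):.3f}")
print("exact match:", a == b)
print("same structure:", normalize(a) == normalize(b))
```

The two prompts differ in a single value, so they are not exact duplicates, yet they share almost all trigrams and normalize to the same string.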
## Conclusion

**There is NO data leakage.** The OOD splits are scientifically valid:

1. `test_template_ood` uses **100% held-out templates**
2. `test_use_case_ood` uses **100% held-out scenarios** (use cases)
3. `test_sector_ood` uses **100% held-out scenarios** (sectors)
4. Zero exact prompt duplication
5. Zero template+scenario pair duplication
6. Zero completion overlap for use-case and sector OOD splits

## Recommendation for Paper

Add the following to the methodology section:

> "While char-ngram similarity between train and test prompts appears high due to shared templated sentence structures, we confirm zero exact prompt duplication and zero template+scenario pair overlap. The OOD splits are constructed by holding out distinct prompt templates (test_template_ood), use cases (test_use_case_ood), and sectors (test_sector_ood); in each split, 100% of the held-out templates or scenarios are absent from training."

## Scripts Used

This analysis was performed with automated scripts that:
1. Loaded all splits from the dataset
2. Computed exact text overlap for prompts, completions, and normalized prompts
3. Checked template_id, scenario_id, and json_structure_id overlap
4. Verified template+scenario pair uniqueness
5. Analyzed completion overlap by target_layer

All computations are reproducible from the published dataset.