# Train/Test Leakage Analysis

**Date:** 2026-08-05
**Analyst:** ML Intern automated audit
**Dataset:** nraptisss/TMF921-intent-to-config-research-sota

## Summary

The journal entry from 2026-04-30 reports that "near-duplicate prompt similarity was high": 602 of 2,521 test prompts have ≥95% char-ngram similarity to a train prompt. **This analysis confirms that this is structural similarity from shared templates, NOT data leakage.** No test prompt and no template+scenario pair also appears in training, so the OOD splits are scientifically valid.

## Key Findings

### 1. Exact Prompt Overlap: 0% across ALL splits

| Split | Exact Prompt Overlap |
|---|---|
| test_in_distribution | 0 / 1,455 (0.0%) |
| test_template_ood | 0 / 3,503 (0.0%) |
| test_use_case_ood | 0 / 4,341 (0.0%) |
| test_sector_ood | 0 / 4,579 (0.0%) |
| test_adversarial | 0 / 33 (0.0%) |

**No test example has the exact same prompt text as any training example.**
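
A minimal sketch of how such an exact-overlap count can be computed is shown below. It is not the audit's actual script: the column name `prompt` and the toy rows are assumptions for illustration, and rows are modeled as plain dictionaries.

```python
def exact_overlap(train_rows, test_rows, field="prompt"):
    """Count test rows whose `field` value appears verbatim in any train row."""
    train_values = {row[field] for row in train_rows}
    hits = sum(1 for row in test_rows if row[field] in train_values)
    total = len(test_rows)
    return hits, total, (100.0 * hits / total if total else 0.0)


# Toy rows for illustration only; not taken from the dataset.
train = [{"prompt": "Set up a network slice for telemetry at site A"}]
test = [
    {"prompt": "Set up a network slice for telemetry at site B"},
    {"prompt": "Deploy a URLLC slice for robot control with 5 ms latency"},
]

hits, total, pct = exact_overlap(train, test)
print(f"{hits} / {total} ({pct:.1f}%)")  # 0 / 2 (0.0%)
```

Because the comparison uses a set of train values, the check scales linearly with the number of rows. It only detects verbatim matches, which is why the char-ngram analysis in section 5 is still needed.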

### 2. Template + Scenario Pairs: 0% overlap

| Split | Template+Scenario Pair Overlap |
|---|---|
| test_in_distribution | 0 / 1,455 (0.0%) |
| test_template_ood | 0 / 3,503 (0.0%) |
| test_use_case_ood | 0 / 4,341 (0.0%) |
| test_sector_ood | 0 / 4,579 (0.0%) |
| test_adversarial | 0 / 33 (0.0%) |

**No test example combines the same template AND scenario as any training example.** Even when a test example reuses a training template, it never pairs that template with a scenario the template was paired with in training.
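
The pair check can be sketched as follows, assuming each row carries the `prompt_template_id` and `scenario_id` columns named in this report. The IDs in the toy rows are made up. The example is built so that each test row reuses a training template but never in a pairing seen during training.

```python
def pair_overlap(train_rows, test_rows, keys=("prompt_template_id", "scenario_id")):
    """Count test rows whose (template, scenario) pair also occurs in train."""
    train_pairs = {tuple(row[k] for k in keys) for row in train_rows}
    hits = sum(1 for row in test_rows if tuple(row[k] for k in keys) in train_pairs)
    return hits, len(test_rows)


# Hypothetical IDs for illustration only.
train = [
    {"prompt_template_id": "T1", "scenario_id": "S1"},
    {"prompt_template_id": "T2", "scenario_id": "S2"},
]
test = [
    {"prompt_template_id": "T1", "scenario_id": "S2"},  # both IDs seen, never together
    {"prompt_template_id": "T2", "scenario_id": "S3"},  # template seen, scenario new
]

print(pair_overlap(train, test))  # (0, 2)
```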

### 3. OOD Split Construction Validation

| Split | OOD Criterion | OOD % (unseen IDs / IDs in split) | In-Train % |
|---|---|---|---|
| test_template_ood | prompt_template_id | **100.0%** (65/65) | 0.0% |
| test_use_case_ood | scenario_id | **100.0%** (2,545/2,545) | 0.0% |
| test_sector_ood | scenario_id | **100.0%** (2,769/2,769) | 0.0% |
| test_adversarial | prompt_template_id | **100.0%** (33/33) | 0.0% |
| test_in_distribution | none (random split) | 40.3% (scenario_id) | 59.7% |
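
One way to validate a hold-out criterion is to compare the sets of IDs rather than the rows. The sketch below assumes that the counts in the table refer to distinct ID values; the IDs themselves are hypothetical.

```python
def held_out_ids(train_rows, test_rows, field):
    """Return (unseen, total): distinct test IDs absent from train, and all distinct test IDs."""
    train_ids = {row[field] for row in train_rows}
    test_ids = {row[field] for row in test_rows}
    return len(test_ids - train_ids), len(test_ids)


# Hypothetical IDs for illustration only.
train = [{"scenario_id": "S1"}, {"scenario_id": "S2"}]
test = [{"scenario_id": "S3"}, {"scenario_id": "S3"}, {"scenario_id": "S4"}]

unseen, total = held_out_ids(train, test, "scenario_id")
print(f"{100.0 * unseen / total:.1f}% ({unseen}/{total})")  # 100.0% (2/2)
```

Counting distinct IDs instead of rows keeps the result independent of how many examples each template or scenario generates.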

### 4. Completion Overlap Analysis

| Split | Completion Overlap | Explanation |
|---|---|---|
| test_in_distribution | 60.3% | Original random split; some lifecycle outputs are deterministic |
| test_template_ood | 44.6% | Same cause; this split contains fewer lifecycle operations |
| **test_use_case_ood** | **0.0%** | **No identical completions: genuinely OOD** |
| **test_sector_ood** | **0.0%** | **No identical completions: genuinely OOD** |
| test_adversarial | 100.0% | Expected: rejection responses are standardized |
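
To see where identical completions concentrate, the overlap can be broken down by a grouping column. The sketch below groups by `target_layer`, the column named under Scripts Used. The column name `completion` and the group values `lifecycle` and `config` are assumptions for illustration.

```python
from collections import Counter


def overlap_by_group(train_rows, test_rows, field="completion", group="target_layer"):
    """Per group, return (test rows whose `field` matches a train value, test rows in group)."""
    train_values = {row[field] for row in train_rows}
    totals, hits = Counter(), Counter()
    for row in test_rows:
        totals[row[group]] += 1
        if row[field] in train_values:
            hits[row[group]] += 1
    return {g: (hits[g], totals[g]) for g in totals}


# Toy rows; completion strings and layer names are hypothetical.
train = [{"completion": "DELETE_OK", "target_layer": "lifecycle"}]
test = [
    {"completion": "DELETE_OK", "target_layer": "lifecycle"},
    {"completion": "cfg-A", "target_layer": "config"},
    {"completion": "cfg-B", "target_layer": "config"},
]

print(overlap_by_group(train, test))  # {'lifecycle': (1, 1), 'config': (0, 2)}
```

If the overlap sits almost entirely in groups with deterministic outputs, that supports the explanation given in the table.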

### 5. What "High Char-Ngram Similarity" Actually Means

The journal reports:
- ≥90% similarity: 1,290 / 2,521
- ≥95% similarity: 602 / 2,521
- ≥98% similarity: 262 / 2,521

**This measures structural similarity, not content duplication.**

All prompts follow templated patterns:
- *"Set up a network slice for [use_case] at [region]"*
- *"Deploy a [slice_type] slice for [use_case] with [latency] ms latency"*

The `prompt_normalized` column confirms this: it replaces variable values with placeholders such as `<use_case>`, `<region>`, and `<num>`.

**High char-ngram similarity = same sentence structure with different values = EXPECTED and CORRECT for a templated dataset.**
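
The effect can be illustrated with a small sketch. The journal's exact similarity metric is not specified here, so Jaccard similarity over character trigrams is used as a stand-in. The `normalize_numbers` helper is a hypothetical simplification that handles only the `<num>` placeholder, not the dataset's actual normalization.

```python
import re


def char_ngrams(text, n=3):
    """Set of lowercase character n-grams in `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def normalize_numbers(text):
    """Replace integers and decimals with a <num> placeholder (simplified)."""
    return re.sub(r"\d+(?:\.\d+)?", "<num>", text)


# Two toy prompts from the same template with different latency values.
a = "Deploy a URLLC slice for robot control with 5 ms latency"
b = "Deploy a URLLC slice for robot control with 8 ms latency"

print(f"similarity: {jaccard(a, b):.2f}")           # high, but below 1.0
print(a == b)                                       # False
print(normalize_numbers(a) == normalize_numbers(b)) # True
```

The two prompts are distinct examples with different values, yet they share almost all of their trigrams and collapse to the same string once the number is replaced.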

## Conclusion

**There is NO data leakage.** The OOD splits are scientifically valid:

1. `test_template_ood` uses **100% held-out templates**
2. `test_use_case_ood` uses **100% held-out scenarios** (use cases)
3. `test_sector_ood` uses **100% held-out scenarios** (sectors)
4. Zero exact prompt duplication
5. Zero template+scenario pair duplication
6. Zero completion overlap for use-case and sector OOD splits

## Recommendation for Paper

Add the following to the methodology section:

> "Although char-ngram similarity between train and test prompts is high because prompts share templated sentence structures, we confirm zero exact prompt duplication and zero template+scenario pair overlap. The OOD splits hold out distinct prompt templates (test_template_ood), use cases (test_use_case_ood), and sectors (test_sector_ood); in each split, 100% of the held-out templates or scenarios are absent from training."

## Scripts Used

This analysis was performed with automated scripts that:
1. Loaded all splits from the dataset
2. Computed exact text overlap for prompts, completions, and normalized prompts
3. Checked template_id, scenario_id, and json_structure_id overlap
4. Verified template+scenario pair uniqueness
5. Analyzed completion overlap by target_layer

All computations are reproducible from the published dataset.
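
For readers who want to rerun the checks, a driver along the following lines could tie the steps together. This is a hedged sketch rather than the scripts actually used. It assumes the splits are already loaded into a dictionary of row lists (for example with the Hugging Face `datasets` library), that the training split is named `train`, and that the text columns are `prompt` and `completion`.

```python
def audit(splits, train_name="train", fields=("prompt", "completion")):
    """Return one report entry per non-train split with exact-overlap counts per field."""
    train_rows = splits[train_name]
    seen = {f: {row[f] for row in train_rows} for f in fields}
    report = []
    for name, rows in splits.items():
        if name == train_name:
            continue
        entry = {"split": name, "n": len(rows)}
        for f in fields:
            entry[f] = sum(1 for row in rows if row[f] in seen[f])
        report.append(entry)
    return report


# Toy splits for illustration only; split names and values are hypothetical.
splits = {
    "train": [
        {"prompt": "p1", "completion": "c1"},
        {"prompt": "p2", "completion": "REJECT"},
    ],
    "test_a": [
        {"prompt": "p3", "completion": "c1"},
        {"prompt": "p4", "completion": "c9"},
    ],
    "test_b": [{"prompt": "p5", "completion": "REJECT"}],
}

for entry in audit(splits):
    print(f"| {entry['split']} | {entry['prompt']} / {entry['n']} | {entry['completion']} / {entry['n']} |")
```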