license: apache-2.0
tags:
  - qlora
  - sft
  - trl
  - peft
  - qwen3
  - tmf921
  - intent-based-networking
  - network-slicing
  - rtx-6000-ada
  - ml-intern
base_model:
  - Qwen/Qwen3-8B
datasets:
  - nraptisss/TMF921-intent-to-config-research-sota

TMF921 Intent-to-Config Training + Evaluation

Training and evaluation code for the nraptisss/TMF921-intent-to-config-research-sota dataset, targeting a single-GPU server with one 48GB-class NVIDIA RTX 6000 Ada.

The default recipe is Qwen3-8B + QLoRA NF4 + TRL SFTTrainer + PEFT LoRA.

Why this recipe

Dataset rows were audited with Qwen/Qwen3-8B chat-template tokenization.
The longest source row is 1,316 tokens and the 99th percentile is 1,300, so max_length=2048 leaves headroom and truncates nothing.
NF4 quantization with double quantization follows the QLoRA recipe for fitting large models on one 48GB-class GPU.
LoRA uses target_modules="all-linear", so adapters are attached to every linear layer, as recommended for QLoRA-style training.
assistant_only_loss=True computes the loss only on assistant tokens, i.e. the JSON/config response, not on the prompt.
Evaluation is reported separately for the in-distribution split and each OOD split; do not report only a single merged score.
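A minimal sketch of the kind of length audit described above, assuming the per-row token counts have already been produced by the chat-template tokenizer. The function name audit_lengths and the toy counts are hypothetical and are not taken from this repo.

```python
import math


def audit_lengths(token_counts, max_length):
    """Summarise tokenized row lengths and count rows that would be truncated."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    ordered = sorted(token_counts)
    # Nearest-rank 99th percentile: the smallest length that covers 99% of rows.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "max": ordered[-1],
        "p99": ordered[rank - 1],
        "truncated_rows": sum(1 for n in ordered if n > max_length),
    }


# Toy lengths for illustration only; a real audit covers every dataset row.
toy_counts = [400, 650, 900, 1300, 1316]
summary = audit_lengths(toy_counts, max_length=2048)
print(summary)
```

If truncated_rows is greater than zero, max_length is cutting off part of some responses and should be raised rather than accepted silently.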

Hardware target

Recommended server:

GPU: NVIDIA RTX 6000 Ada, 48GB/50GB VRAM
RAM: 64GB+
Disk: 200GB+ free
CUDA-compatible PyTorch

Default effective batch size:

per_device_train_batch_size = 2
gradient_accumulation_steps = 8
effective batch size = 16
max_length = 2048

If you run out of GPU memory, halve the per-device batch size and double gradient accumulation so the effective batch size stays at 16:

per_device_train_batch_size: 1
gradient_accumulation_steps: 16

Do not reduce max_length unless you intentionally want a different training task.
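The batch-size arithmetic above can be sketched as a small helper. The function name accumulation_steps and the num_gpus parameter are illustrative assumptions; the training script may compute or hard-code these values differently.

```python
def accumulation_steps(effective_batch_size, per_device_batch_size, num_gpus=1):
    """Return the gradient_accumulation_steps that yields the target effective batch."""
    per_step = per_device_batch_size * num_gpus
    if per_step <= 0 or effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a positive multiple of the per-step batch")
    return effective_batch_size // per_step


default_steps = accumulation_steps(16, 2)   # default recipe
fallback_steps = accumulation_steps(16, 1)  # out-of-memory fallback
print(default_steps, fallback_steps)
```

Both settings give the optimizer the same number of examples per update, so the learning rate does not need to be retuned when switching between them.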

Quick start with nohup, unique run dirs, and resumable checkpoints

git clone https://huggingface.co/nraptisss/tmf921-intent-training
cd tmf921-intent-training

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
bash scripts/install_rtx6000ada.sh
python scripts/check_gpu.py

export HF_TOKEN=hf_...
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="$PWD/src"
export TOKENIZERS_PARALLELISM=false

# Optional Trackio dashboard
# export TRACKIO_SPACE_ID=nraptisss/tmf921-trackio

bash scripts/nohup_new_run.sh

The helper creates a fresh run directory every time:

runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS/
  configs/config.yaml
  logs/train.log
  outputs/adapter/checkpoint-*/
  eval/
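The helper itself is a shell script; the layout it produces can be sketched in Python as follows. The function make_run_dir is hypothetical and only mirrors the directory tree shown above, under the assumption that the timestamp uses local time.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(root, prefix="qwen3-8b-qlora", now=None):
    """Create a timestamped run directory with the sub-folders listed above."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    run_dir = Path(root) / f"{prefix}-{stamp}"
    # exist_ok=False makes a name collision fail loudly instead of reusing a run.
    run_dir.mkdir(parents=True, exist_ok=False)
    for sub in ("configs", "logs", "outputs/adapter", "eval"):
        (run_dir / sub).mkdir(parents=True)
    return run_dir
```

Calling make_run_dir("runs") twice more than a second apart yields two separate directories, so an earlier run's checkpoints are never overwritten.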

Monitor:

RUN_DIR=runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
bash scripts/status_run.sh "$RUN_DIR"
tail -f "$RUN_DIR/logs/train.log"
watch -n 2 nvidia-smi

Resume after crash/reboot:

cd tmf921-intent-training
source .venv/bin/activate
export HF_TOKEN=hf_...
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="$PWD/src"

bash scripts/nohup_resume.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS

Evaluate after training:

bash scripts/nohup_eval.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS

Manual training command, if you do not want nohup:

python scripts/train_qlora.py \
  --config configs/rtx6000ada_qwen3_8b_qlora.yaml

Optional Trackio monitoring

The training script uses the native Transformers/TRL Trackio integration when project is set in the config.

Set a Trackio Space if desired:

export TRACKIO_SPACE_ID=nraptisss/tmf921-trackio

Or edit:

project: tmf921-intent-sft
trackio_space_id: nraptisss/tmf921-trackio

The trainer logs plain-text loss lines with:

disable_tqdm=True
logging_strategy="steps"
logging_first_step=True
report_to="trackio"

A callback emits Trackio alerts for NaN/Inf loss, high gradient norm, and high eval loss.
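A minimal sketch of the checks such a callback might run on each logged step, assuming the log dictionary uses the keys loss, grad_norm, and eval_loss. The function find_alerts and both thresholds are hypothetical; the repo's actual callback and limits may differ, and the Trackio call itself is left out.

```python
import math


def find_alerts(logs, max_grad_norm=10.0, max_eval_loss=2.0):
    """Return human-readable alert messages for one step's log dictionary."""
    alerts = []
    loss = logs.get("loss")
    if loss is not None and not math.isfinite(loss):
        alerts.append("non-finite training loss")
    grad_norm = logs.get("grad_norm")
    # A NaN gradient norm never compares greater than a limit, so test it explicitly.
    if grad_norm is not None and (not math.isfinite(grad_norm) or grad_norm > max_grad_norm):
        alerts.append("high gradient norm")
    eval_loss = logs.get("eval_loss")
    if eval_loss is not None and (not math.isfinite(eval_loss) or eval_loss > max_eval_loss):
        alerts.append("high eval loss")
    return alerts


print(find_alerts({"loss": float("nan"), "grad_norm": 50.0}))
```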

Configs

model_name_or_path: Qwen/Qwen3-8B
dataset_name: nraptisss/TMF921-intent-to-config-research-sota
train_split: train_sota
eval_split: validation
max_length: 2048
assistant_only_loss: true
load_in_4bit: true
lora_r: 64
lora_alpha: 16
lora_target_modules: all-linear
learning_rate: 0.0002
optim: paged_adamw_32bit
bf16: true
push_to_hub: true
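To make the two LoRA settings concrete: standard LoRA scales the low-rank update by lora_alpha / lora_r and adds lora_r * (d_in + d_out) trainable parameters per adapted linear layer. The sketch below works through that arithmetic; the 4096 x 4096 layer shape is an illustrative assumption, not a statement about any specific layer in Qwen3-8B.

```python
def lora_scaling(lora_alpha, lora_r):
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / lora_r


def lora_params(d_in, d_out, lora_r):
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return lora_r * (d_in + d_out)


scaling = lora_scaling(lora_alpha=16, lora_r=64)
extra = lora_params(d_in=4096, d_out=4096, lora_r=64)  # illustrative layer shape
print(scaling, extra)
```

With lora_r: 64 and lora_alpha: 16 the update is scaled by 0.25, which is worth keeping in mind when comparing the learning rate against recipes that use a different alpha-to-rank ratio.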

Experimental 14B

configs/rtx6000ada_qwen3_14b_qlora_experimental.yaml

Try this only after a successful 8B run, and expect a tighter memory budget.

Evaluation

After training adapters:

python scripts/evaluate_model.py \
  --model Qwen/Qwen3-8B \
  --adapter outputs/qwen3-8b-tmf921-qlora \
  --dataset nraptisss/TMF921-intent-to-config-research-sota \
  --output_dir outputs/qwen3-8b-tmf921-qlora/eval \
  --load_in_4bit

Splits evaluated by default:

test_in_distribution
test_template_ood
test_use_case_ood
test_sector_ood
test_adversarial

Metrics:

JSON parse rate
canonical JSON exact match
field precision / recall / F1
slice/SST diagnostic pass
KPI text-presence diagnostic pass
adversarial status pass
stratified metrics by target_layer, slice_type, and lifecycle_operation
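The first three metrics can be sketched as below. This is one plausible reading, not the repo's implementation: canonical form is assumed to mean key-sorted JSON, and field precision/recall are assumed to be computed over flattened (path, value) leaves. The field names in the example (sst, latency_ms) are toy placeholders.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into a set of (path, serialized leaf value) pairs."""
    items = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            items |= flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items |= flatten(value, f"{prefix}[{index}]")
    else:
        # Serializing the leaf keeps 1, "1", and true distinct.
        items.add((prefix, json.dumps(obj)))
    return items


def score(prediction_text, gold_text):
    """Score one prediction: parse success, canonical exact match, field P/R/F1."""
    gold = json.loads(gold_text)
    try:
        pred = json.loads(prediction_text)
    except (json.JSONDecodeError, TypeError):
        return {"parsed": False, "exact_match": False,
                "precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Sorting keys makes the comparison independent of key order.
    exact = json.dumps(pred, sort_keys=True) == json.dumps(gold, sort_keys=True)
    p_items, g_items = flatten(pred), flatten(gold)
    hits = len(p_items & g_items)
    precision = hits / len(p_items) if p_items else 0.0
    recall = hits / len(g_items) if g_items else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"parsed": True, "exact_match": exact,
            "precision": precision, "recall": recall, "f1": f1}


print(score('{"latency_ms": 10, "sst": 2}', '{"sst": 1, "latency_ms": 10}'))
```

An unparseable prediction scores zero on every field metric, so the JSON parse rate acts as an upper bound on exact match.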

Outputs:

outputs/.../eval/all_metrics.json
outputs/.../eval/<split>/metrics.json
outputs/.../eval/<split>/predictions.json

Merge adapter for deployment

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter outputs/qwen3-8b-tmf921-qlora \
  --output_dir outputs/qwen3-8b-tmf921-merged

Push merged model:

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter nraptisss/Qwen3-8B-TMF921-Intent-QLoRA-ResearchSOTA \
  --output_dir outputs/merged \
  --push_to_hub \
  --hub_model_id nraptisss/Qwen3-8B-TMF921-Intent-Merged

Scientific reporting protocol

For research papers/reports, report at least:

validation loss,
test_in_distribution metrics,
test_template_ood metrics,
test_use_case_ood metrics,
test_sector_ood metrics,
test_adversarial metrics,
per-target-layer field F1,
JSON parse rate,
exact-match rate,
rare-class metrics for lifecycle operations and adversarial categories.
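A small sketch of a per-split report, assuming all_metrics.json maps each split name to a dictionary of metric values. That shape, the function per_split_table, and the metric key field_f1 are assumptions, and the numbers in the example are placeholders, not results.

```python
def per_split_table(all_metrics, metric_names):
    """Render one tab-separated row per split so no split is averaged away."""
    lines = ["split\t" + "\t".join(metric_names)]
    for split in sorted(all_metrics):
        row = [split]
        for name in metric_names:
            value = all_metrics[split].get(name)
            row.append("n/a" if value is None else f"{value:.3f}")
        lines.append("\t".join(row))
    return "\n".join(lines)


# Placeholder values for illustration only.
placeholder = {
    "test_in_distribution": {"field_f1": 1.0},
    "test_sector_ood": {"field_f1": 0.5},
}
table = per_split_table(placeholder, ["field_f1"])
print(table)
```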

Do not claim production standards compliance from JSON validity alone. Official TMF921/3GPP/ETSI/CAMARA/O-RAN validators are still needed for schema-level certification.

Files

configs/
  rtx6000ada_qwen3_8b_qlora.yaml
  rtx6000ada_qwen3_14b_qlora_experimental.yaml
scripts/
  train_qlora.py
  evaluate_model.py
  merge_adapter.py
  run_rtx6000ada.sh
  nohup_new_run.sh
  nohup_resume.sh
  nohup_eval.sh
  status_run.sh
src/tmf921_train/
  utils.py
requirements.txt

References

QLoRA: https://huggingface.co/papers/2305.14314
LoRA: https://huggingface.co/papers/2106.09685
TRL SFTTrainer docs: https://huggingface.co/docs/trl/sft_trainer
TRL PEFT integration: https://huggingface.co/docs/trl/peft_integration
Source dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub repository to load from
model_id = 'nraptisss/tmf921-intent-training'
# Download (or read from the local cache) the tokenizer and the causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.