---
language:
- nl
tags:
- phi-4
- long-context
- 128k
- n3
- lora
- adapter
base_model: microsoft/Phi-4-mini-instruct
datasets:
- UWV/wim-instruct-wiki-to-jsonld-agent-steps
license: apache-2.0
library_name: peft
---

# Phi-4-mini N3 Transform to Knowledge Graph Fine-tune

This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for transforming entity and schema information into JSON-LD format, trained as part of the WIM (Wikipedia to Knowledge Graph) pipeline.

## Model Details

### Model Description

- **Developed by:** UWV InnovatieHub
- **Model type:** Causal Language Model with LoRA fine-tuning
- **Language(s):** Dutch (nl)
- **License:** MIT
- **Finetuned from:** microsoft/Phi-4-mini-instruct (3.82B parameters)
- **Training Framework:** Unsloth (optimized training for extreme context lengths)

### Training Details

- **Dataset:** [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps)
- **Dataset Size:** 10,593 N3-specific examples (JSON-LD transformation tasks)
- **Training Duration:** 41 hours 54 minutes
- **Hardware:** NVIDIA A100 80GB
- **Context Length:** 131,072 tokens (128K)
- **Steps:** 1,000
- **Training Metrics:**
  - Final Training Loss: 0.11
  - Final Eval Loss: 0.119
  - Trainable Parameters: ~178M (4.4% of model)

### LoRA Configuration

```python
{
    "r": 320,             # Rank (Microsoft's recommended config)
    "lora_alpha": 320,    # Alpha (1:1 ratio for Phi-4)
    "lora_dropout": 0.0,  # No dropout
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
}
```

### Training Configuration

```python
{
    "model": "phi4-mini",
    "max_seq_length": 131072,  # 128K context
    "batch_size": 1,
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 8,
    "learning_rate": 1e-5,
    "warmup_steps": 20,
    "max_grad_norm": 1.0,
    "lr_scheduler": "linear",
    "optimizer": "paged_adamw_8bit",
    "bf16": True,
    "gradient_checkpointing": True,
    "seed": 42
}
```

## Intended Uses & Limitations

### Intended Uses

- **JSON-LD Generation**: Transform entity and schema information into valid JSON-LD format
- **Knowledge Graph Construction**: Third step (N3) in the WIM pipeline
- **Structured Data Creation**: Convert unstructured entity descriptions to Schema.org-compliant JSON-LD
- **Long Context Processing**: Handle extremely long input sequences (up to 128K tokens)

### Limitations

- Requires extensive context (average input ~40K tokens)
- Memory intensive due to long sequences
- Best performance with Phi-4's specific prompt format
- May require post-processing validation (N4 step)

## How to Use

### Option 1: Using the Merged Model (Recommended)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import json

# Load the merged model (ready to use)
model = AutoModelForCausalLM.from_pretrained(
    "UWV/wim-n3-phi4-mini-merged",  # Update with actual repo
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n3-phi4-mini-merged")

# Prepare input (typically very long with entity and schema information)
entities = [
    {"name": "Amsterdam", "type": "City"},
    {"name": "Netherlands", "type": "Country"}
]
schemas = {
    "City": "https://schema.org/City",
    "Country": "https://schema.org/Country"
}

messages = [
    {
        "role": "system",
        "content": "You are an expert in creating JSON-LD representations using Schema.org vocabulary."
    },
    {
        "role": "user",
        "content": f"""Transform the following entities into JSON-LD format using Schema.org:

Entities: {json.dumps(entities, ensure_ascii=False)}

Schemas: {json.dumps(schemas, ensure_ascii=False)}

Create a complete JSON-LD representation with proper @context and @type declarations."""
    }
]

# Apply chat template and generate
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=131072)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,  # JSON-LD can be long
        temperature=0.1,      # Low temperature for valid JSON
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens. Slicing off the prompt avoids
# relying on a role marker such as "assistant:" surviving decoding.
prompt_length = inputs["input_ids"].shape[1]
json_ld = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(json_ld)
```

### Option 2: Using the LoRA Adapter

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load adapter
model = PeftModel.from_pretrained(
    base_model,
    "UWV/wim-n3-phi4-mini-adapter"  # Update with actual repo
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n3-phi4-mini-adapter")

# Use same inference code as above...
```

## Expected Output Format

The model outputs valid JSON-LD with Schema.org vocabulary:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "City",
      "@id": "_:amsterdam",
      "name": "Amsterdam",
      "containedInPlace": {
        "@id": "_:netherlands"
      }
    },
    {
      "@type": "Country",
      "@id": "_:netherlands",
      "name": "Netherlands"
    }
  ]
}
```

## Dataset Information

The model was trained on the [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps) dataset, which contains:

- **Source**: Dutch Wikipedia articles processed through N1 and N2 steps
- **Processing**: Multi-agent pipeline converting text to JSON-LD
- **N3 Examples**: 10,593 transformation tasks
- **Average Token Length**: ~40,388 tokens (extremely long sequences)
- **Max Token Length**: 520,575 tokens
- **Format**: ChatML-formatted instruction-following examples
- **Task**: Transform entity and schema information into valid JSON-LD

## Training Results

Training and evaluation loss finished close together, which suggests limited overfitting:

- **Final Loss**: 0.11
- **Eval Loss**: 0.119
- **Loss Ratio**: 0.92 (training loss divided by eval loss)

These results were obtained despite the extreme context lengths and the complexity of the transformation task.

## Model Versions

- **Merged Model**: `UWV/wim-n3-phi4-mini-merged` (681MB adapter + base model)
  - Ready to use without adapter loading
  - Recommended for production inference
- **LoRA Adapter**: `UWV/wim-n3-phi4-mini-adapter` (681MB)
  - Requires base Phi-4-mini-instruct model
  - More flexible for further fine-tuning

## Pipeline Context

This model is part of the WIM (Wikipedia to Knowledge Graph) pipeline:

1. **N1**: Entity Extraction
2. **N2**: Schema.org Type Selection
3. **N3 (This Model)**: Transform to JSON-LD
4. **N4**: Validation
5. **N5**: Add Human-Readable Labels

N3 is the most computationally intensive step, handling the complex transformation from structured entity information to valid JSON-LD format.

## Technical Notes

- **Memory Requirements**: ~53GB VRAM for 128K context inference
- **Optimization**: Uses Unsloth's custom kernels for efficient long-context processing
- **Special Configuration**: Requires `TORCH_COMPILE_DISABLE=1` for Phi-4 compatibility
- **Context Handling**: Can process full Wikipedia articles with extensive entity information

## Citation

If you use this model, please cite:

```bibtex
@misc{wim-n3-phi4-mini,
  author = {UWV InnovatieHub},
  title = {Phi-4-mini N3 Transform to JSON-LD Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/UWV/wim-n3-phi4-mini-merged}
}
```
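
## Output Sanity Check (Sketch)

The Limitations section notes that generated output may need post-processing validation. The following is a minimal sketch of a lightweight check that could run before that step, assuming the model returns a document shaped like the example under Expected Output Format. The helper name `check_json_ld` and the specific rules it applies are hypothetical; it is not the pipeline's N4 validator and does not perform full JSON-LD or Schema.org validation.

```python
import json


def check_json_ld(text):
    """Parse model output and return (document, problems).

    Hypothetical helper: structural checks only, no Schema.org validation.
    """
    # Keep only the outermost {...} span in case the model adds surrounding text
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None, ["no JSON object found"]
    try:
        doc = json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]

    problems = []
    if "@context" not in doc:
        problems.append("missing @context")

    # Nodes live under @graph; otherwise treat the document as a single node
    nodes = doc.get("@graph", [doc])
    if not isinstance(nodes, list):
        nodes = [nodes]
    ids = {
        n["@id"] for n in nodes
        if isinstance(n, dict) and isinstance(n.get("@id"), str)
    }

    for i, node in enumerate(nodes):
        if not isinstance(node, dict):
            problems.append(f"node {i} is not an object")
            continue
        if "@type" not in node:
            problems.append(f"node {i} missing @type")
        # Flag blank-node references ({"@id": "_:x"}) that point at no node
        for key, value in node.items():
            if isinstance(value, dict) and set(value) == {"@id"}:
                target = value["@id"]
                if (isinstance(target, str) and target.startswith("_:")
                        and target not in ids):
                    problems.append(
                        f"node {i}: {key} references unknown @id {target}"
                    )
    return doc, problems


sample = (
    'Result: {"@context": "https://schema.org", "@graph": ['
    '{"@type": "City", "@id": "_:amsterdam", "name": "Amsterdam", '
    '"containedInPlace": {"@id": "_:netherlands"}}, '
    '{"@type": "Country", "@id": "_:netherlands", "name": "Netherlands"}]}'
)
doc, problems = check_json_ld(sample)
print(problems)
```

An empty `problems` list only means the output parsed and passed these structural checks. Only blank-node references (`_:` prefix) are checked for dangling targets, because an `@id` holding a full IRI may legitimately point outside the document.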