---
language:
- nl
tags:
- phi-4
- long-context
- 128k
- n3
- lora
- adapter
base_model: microsoft/Phi-4-mini-instruct
datasets:
- UWV/wim-instruct-wiki-to-jsonld-agent-steps
license: apache-2.0
library_name: peft
---

# Phi-4-mini N3 Transform to Knowledge Graph Fine-tune

This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) that transforms entity and schema information into JSON-LD. It was trained as the N3 step of the WIM (Wikipedia to Knowledge Graph) pipeline.

## Model Details

### Model Description

- **Developed by:** UWV InnovatieHub
- **Model type:** Causal Language Model with LoRA fine-tuning
- **Language(s):** Dutch (nl)
- **License:** MIT
- **Finetuned from:** microsoft/Phi-4-mini-instruct (3.82B parameters)
- **Training Framework:** Unsloth (optimized training for extreme context lengths)

### Training Details

- **Dataset:** [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps)
- **Dataset Size:** 10,593 N3-specific examples (JSON-LD transformation tasks)
- **Training Duration:** 41 hours 54 minutes
- **Hardware:** NVIDIA A100 80GB
- **Context Length:** 131,072 tokens (128K)
- **Steps:** 1,000
- **Training Metrics:**
  - Final Training Loss: 0.11
  - Final Eval Loss: 0.119
  - Trainable Parameters: ~178M (4.4% of model)

### LoRA Configuration

```python
{
    "r": 320,                    # Rank (Microsoft's recommended config)
    "lora_alpha": 320,          # Alpha (1:1 ratio for Phi-4)
    "lora_dropout": 0.0,        # No dropout
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
}
```
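As a rough sanity check on the trainable-parameter figure, the sketch below counts LoRA parameters for a given rank. A LoRA adapter on a linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The layer dimensions used here (hidden size 3072, MLP intermediate size 8192, 32 layers) are assumptions about Phi-4-mini and are not stated in this card, and `lora_params` is a hypothetical helper.

```python
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed Phi-4-mini dimensions (not taken from this card).
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 32
RANK = 320  # "r" from the LoRA configuration

# Count adapters on the attention output and MLP down projections only.
per_layer = (
    lora_params(RANK, HIDDEN, HIDDEN)          # o_proj
    + lora_params(RANK, INTERMEDIATE, HIDDEN)  # down_proj
)
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters")
```

Under these assumptions, `o_proj` and `down_proj` alone account for roughly 178M parameters, which is in line with the reported figure. Whether the other names in `target_modules` matched any layers depends on how the checkpoint names its projections, so treat this as arithmetic and not as a description of the actual training run.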

### Training Configuration

```python
{
    "model": "phi4-mini",
    "max_seq_length": 131072,    # 128K context
    "batch_size": 1,
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 8,
    "learning_rate": 1e-5,
    "warmup_steps": 20,
    "max_grad_norm": 1.0,
    "lr_scheduler": "linear",
    "optimizer": "paged_adamw_8bit",
    "bf16": True,
    "gradient_checkpointing": True,
    "seed": 42
}
```
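The numbers above can be combined into a back-of-the-envelope training budget. This sketch assumes a single GPU, that every optimizer step consumes one full effective batch, and that all 10,593 N3 examples were available for training (the card does not state the train/eval split).

```python
batch_size = 1
gradient_accumulation_steps = 8
steps = 1_000                            # from Training Details
dataset_size = 10_593                    # N3 examples in the dataset
training_seconds = 41 * 3600 + 54 * 60   # 41 hours 54 minutes

effective_batch_size = batch_size * gradient_accumulation_steps
examples_seen = steps * effective_batch_size
epochs = examples_seen / dataset_size
seconds_per_step = training_seconds / steps

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen:        {examples_seen}")
print(f"approx. epochs:       {epochs:.2f}")
print(f"seconds per step:     {seconds_per_step:.1f}")
```

Under these assumptions the run covers about three quarters of one pass over the N3 examples, at roughly two and a half minutes per optimizer step.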

## Intended Uses & Limitations

### Intended Uses

- **JSON-LD Generation**: Transform entity and schema information into valid JSON-LD format
- **Knowledge Graph Construction**: Third step (N3) in the WIM pipeline
- **Structured Data Creation**: Convert unstructured entity descriptions to Schema.org-compliant JSON-LD
- **Long Context Processing**: Handle extremely long input sequences (up to 128K tokens)

### Limitations

- Requires extensive context (average input ~40K tokens)
- Memory intensive due to long sequences
- Best performance with Phi-4's specific prompt format
- May require post-processing validation (N4 step)
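Because the prompt and the generated tokens share the same context window, it can help to check the token budget before calling the model. The following is a minimal sketch; `max_prompt_tokens` and `fits_in_context` are hypothetical helper names, and the token counts are assumed to come from the model's tokenizer.

```python
CONTEXT_LIMIT = 131_072  # 128K tokens, as stated in this card


def max_prompt_tokens(max_new_tokens: int, context_limit: int = CONTEXT_LIMIT) -> int:
    """Largest prompt length that still leaves room for generation."""
    if not 0 <= max_new_tokens < context_limit:
        raise ValueError("max_new_tokens must be non-negative and below the context limit")
    return context_limit - max_new_tokens


def fits_in_context(
    prompt_tokens: int, max_new_tokens: int, context_limit: int = CONTEXT_LIMIT
) -> bool:
    """True if prompt plus generated tokens stay within the context window."""
    return prompt_tokens + max_new_tokens <= context_limit


print(max_prompt_tokens(4096))         # room left for the prompt
print(fits_in_context(40_388, 4096))   # average-sized input
print(fits_in_context(520_575, 4096))  # largest input in the dataset
```

Inputs that fail this check would need to be shortened or split before inference; how to do that is outside the scope of this sketch.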

## How to Use

### Option 1: Using the Merged Model (Recommended)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import json

# Load the merged model (ready to use)
model = AutoModelForCausalLM.from_pretrained(
    "UWV/wim-n3-phi4-mini-merged",  # Update with actual repo
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n3-phi4-mini-merged")

# Prepare input (typically very long with entity and schema information)
entities = [
    {"name": "Amsterdam", "type": "City"},
    {"name": "Netherlands", "type": "Country"}
]
schemas = {
    "City": "https://schema.org/City",
    "Country": "https://schema.org/Country"
}

messages = [
    {
        "role": "system", 
        "content": "You are an expert in creating JSON-LD representations using Schema.org vocabulary."
    },
    {
        "role": "user", 
        "content": f"""Transform the following entities into JSON-LD format using Schema.org:

Entities: {json.dumps(entities, ensure_ascii=False)}
Schemas: {json.dumps(schemas, ensure_ascii=False)}

Create a complete JSON-LD representation with proper @context and @type declarations."""
    }
]

# Apply chat template and generate
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=131072)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,  # JSON-LD can be long
        temperature=0.1,      # Low temperature for valid JSON
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

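# Decode only the newly generated tokens (everything after the prompt).
# Splitting the full text on "assistant:" is unreliable because the chat
# template markers are removed by skip_special_tokens=True.
prompt_length = inputs["input_ids"].shape[1]
json_ld = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

print(json_ld)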
```

### Option 2: Using the LoRA Adapter

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load adapter
model = PeftModel.from_pretrained(
    base_model,
    "UWV/wim-n3-phi4-mini-adapter"  # Update with actual repo
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n3-phi4-mini-adapter")

# Use same inference code as above...
```

## Expected Output Format

The model outputs valid JSON-LD with Schema.org vocabulary:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "City",
      "@id": "_:amsterdam",
      "name": "Amsterdam",
      "containedInPlace": {
        "@id": "_:netherlands"
      }
    },
    {
      "@type": "Country",
      "@id": "_:netherlands",
      "name": "Netherlands"
    }
  ]
}
```
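Generated text is not guaranteed to parse, so a lightweight structural check before passing output downstream can be useful. The sketch below uses only the standard library and checks for valid JSON, an `@context`, and an `@type` on each node. `check_json_ld` is a hypothetical helper; it is not the pipeline's N4 validation step and does not verify Schema.org semantics.

```python
import json


def check_json_ld(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if "@context" not in doc:
        problems.append("missing @context")

    # Treat a document without @graph as a single node.
    nodes = doc.get("@graph", [doc])
    if not isinstance(nodes, list):
        return problems + ["@graph is not a list"]
    for i, node in enumerate(nodes):
        if not isinstance(node, dict) or "@type" not in node:
            problems.append(f"node {i} has no @type")
    return problems


sample = '{"@context": "https://schema.org", "@graph": [{"@type": "City", "name": "Amsterdam"}]}'
print(check_json_ld(sample))
print(check_json_ld('{"@graph": [{"name": "x"}]}'))
```

Returning a list of problems instead of raising makes it easy to log every issue for a given output and decide later whether to retry generation.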

## Dataset Information

The model was trained on the [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps) dataset, which contains:

- **Source**: Dutch Wikipedia articles processed through N1 and N2 steps
- **Processing**: Multi-agent pipeline converting text to JSON-LD
- **N3 Examples**: 10,593 transformation tasks
- **Average Token Length**: ~40,388 tokens (extremely long sequences)
- **Max Token Length**: 520,575 tokens
- **Format**: ChatML-formatted instruction-following examples
- **Task**: Transform entity and schema information into valid JSON-LD

## Training Results

Training and evaluation loss finished close together, which suggests limited overfitting:
- **Final Training Loss**: 0.11
- **Final Eval Loss**: 0.119
- **Loss Ratio (training / eval)**: 0.92

These values were reached despite the extreme context lengths and the complexity of the transformation task.

## Model Versions

- **Merged Model**: `UWV/wim-n3-phi4-mini-merged` (base model with the 681MB adapter merged in)
  - Ready to use without adapter loading
  - Recommended for production inference
  
- **LoRA Adapter**: `UWV/wim-n3-phi4-mini-adapter` (681MB)
  - Requires base Phi-4-mini-instruct model
  - More flexible for further fine-tuning

## Pipeline Context

This model is part of the WIM (Wikipedia to Knowledge Graph) pipeline:

1. **N1**: Entity Extraction
2. **N2**: Schema.org Type Selection
3. **N3 (This Model)**: Transform to JSON-LD
4. **N4**: Validation
5. **N5**: Add Human-Readable Labels

N3 is the most computationally intensive step, handling the complex transformation from structured entity information to valid JSON-LD format.

## Technical Notes

- **Memory Requirements**: ~53GB VRAM for 128K context inference
- **Optimization**: Uses Unsloth's custom kernels for efficient long-context processing
- **Special Configuration**: Requires `TORCH_COMPILE_DISABLE=1` for Phi-4 compatibility
- **Context Handling**: Can process full Wikipedia articles with extensive entity information
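One way to apply the `TORCH_COMPILE_DISABLE=1` setting from inside a Python script is sketched below. It assumes the variable should be set before PyTorch is first imported; `disable_torch_compile` is a hypothetical helper name.

```python
import os


def disable_torch_compile() -> None:
    """Set TORCH_COMPILE_DISABLE=1 unless the caller already chose a value."""
    os.environ.setdefault("TORCH_COMPILE_DISABLE", "1")


# Call this before the first `import torch` or `import transformers`.
disable_torch_compile()
print(os.environ["TORCH_COMPILE_DISABLE"])
```

Setting the variable in the shell when launching the script (for example, prefixing the command with `TORCH_COMPILE_DISABLE=1`) has the same effect and avoids import-order concerns.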

## Citation

If you use this model, please cite:

```bibtex
@misc{wim-n3-phi4-mini,
  author = {UWV InnovatieHub},
  title = {Phi-4-mini N3 Transform to JSON-LD Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/UWV/wim-n3-phi4-mini-merged}
}
```