---
base_model: LiquidAI/LFM2-350M-Extract
license: apache-2.0
language:
- en
tags:
- text-generation
- instruction-tuning
- structured-output
- toon
- lfm2
- unsloth
- lora
- transformers
datasets:
- yasserrmd/TOON-Unstructured-Structured
model-index:
- name: yasserrmd/LFM2-350M-Extract-TOON
  results:
  - task:
      name: TOON conversion (schema-driven extraction)
      type: text-generation
    dataset:
      name: yasserrmd/TOON-Unstructured-Structured
      type: text
    metrics:
    - name: Final Training Loss
      type: loss
      value: 0.2178
    - name: Lowest Loss
      type: loss
      value: 0.2043
    - name: Total Steps
      type: steps
      value: 430
---

# yasserrmd/LFM2-350M-Extract-TOON

`yasserrmd/LFM2-350M-Extract-TOON` is a **fine-tuned variant of LiquidAI’s LFM2-350M-Extract**, built using the **Unsloth AI** framework and the dataset [`yasserrmd/TOON-Unstructured-Structured`](https://huggingface.co/datasets/yasserrmd/TOON-Unstructured-Structured).

This model specializes in **schema-driven conversion of natural-language text into valid TOON (Token-Oriented Object Notation)** format — a compact, token-efficient alternative to JSON designed for large language models.

---

## Model Overview

| Property | Description |
|-----------|-------------|
| **Base Model** | LiquidAI/LFM2-350M-Extract |
| **Architecture** | LFM2-350M (decoder-only hybrid of short-convolution and grouped-query attention blocks) |
| **Fine-tuning Method** | LoRA (via Unsloth AI) |
| **Objective** | Structured extraction in TOON format |
| **Dataset** | yasserrmd/TOON-Unstructured-Structured |
| **Languages** | English |
| **Frameworks** | Transformers, Unsloth, PyTorch |
| **License** | LFM License v1.0 |
| **Final Loss** | 0.2178 (Step 430) |

---

## What is TOON?

**TOON (Token-Oriented Object Notation)** is a serialization format optimized for LLMs.  
It represents structured data with minimal tokens using a **header + rows** pattern:

```
users[2]{id,name,role}:
1,Alice,admin
2,Bob,user
```

Compared to JSON, TOON reduces token count by up to 60% and is easier for LLMs to generate deterministically.
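
As a rough illustration of the header + rows pattern, the sketch below serializes a list of flat records into the layout shown above. The helper name `to_toon` is hypothetical and not part of any TOON library; it assumes every record has the same keys and that no value contains a comma or newline. It compares character counts only, which is a crude stand-in for token counts.

```python
import json


def to_toon(label, records):
    """Serialize flat dicts that share the same keys into a TOON block."""
    if not records:
        raise ValueError("records must not be empty")
    fields = list(records[0].keys())
    for rec in records:
        if list(rec.keys()) != fields:
            raise ValueError("all records must share the same fields in the same order")
    # Header: label, row count in brackets, field names in braces, then a colon.
    header = f"{label}[{len(records)}]{{{','.join(fields)}}}:"
    # One comma-separated line per record (no quoting or escaping in this sketch).
    rows = [",".join(str(rec[f]) for f in fields) for rec in records]
    return "\n".join([header, *rows])


users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]
toon_text = to_toon("users", users)
json_text = json.dumps({"users": users})
print(toon_text)
print(len(toon_text), "characters as TOON vs", len(json_text), "as JSON")
```

The saving comes from stating field names once in the header instead of repeating them in every record.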

---

## Training Summary

The model was trained for 430 steps with the following key trends:

- **Initial loss:** 1.3793  
- **Final loss:** 0.2178  
- **Lowest recorded loss:** 0.2043  
- **Steady convergence** after step 250, with loss staying consistently below 0.3.  
- **Training method:** Unsloth LoRA (rank 16, alpha 32, learning rate 2e-4, batch size 64).  
- **Hardware:** 1x NVIDIA L4 (24 GB VRAM).  
- **Duration:** 1.5 hours.

Training was stable and the loss converged smoothly to below 0.25, which suggests the base model adapted well to the TOON structure.
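
The reported numbers can be sanity-checked with a few lines of arithmetic. The snippet below uses only the loss values and LoRA hyperparameters listed above; the scaling formula `alpha / rank` is the standard LoRA convention and is an assumption here, since the card does not say whether a rank-stabilized variant was used.

```python
# Loss values reported in the training summary.
initial_loss = 1.3793
final_loss = 0.2178
lowest_loss = 0.2043

# How many times smaller the final loss is than the initial loss.
reduction_factor = initial_loss / final_loss
# Relative drop in loss, as a percentage of the initial value.
percent_drop = 100 * (1 - final_loss / initial_loss)

# Assumption: plain LoRA scaling (alpha / rank), not a rank-stabilized variant.
lora_rank, lora_alpha = 16, 32
lora_scale = lora_alpha / lora_rank

print(f"reduction factor: {reduction_factor:.2f}x")
print(f"relative drop: {percent_drop:.1f}%")
print(f"LoRA scale: {lora_scale}")
```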

---

## 🧰 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yasserrmd/LFM2-350M-Extract-TOON"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

schema = "animal{name,action,location}"
text = "The cat sat on the mat."

system = (
    "You are a precise extractor that outputs TOON format only. "
    "Header must be <label>[1]{fields}: followed by a single row of comma-separated values. "
    "No commentary."
)
user = f'Use schema: {schema}\nText: "{text}"'

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user}
]

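# Tokenize via the chat template directly, which avoids adding special tokens a second time.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
# Greedy decoding (do_sample=False) gives deterministic output; temperature=0 is not needed.
outputs = model.generate(**inputs, max_new_tokens=80, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))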
```

**Expected Output:**

```
animal[1]{name,action,location}:
cat,sat,mat
```
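
Because a small model can occasionally emit a malformed header or the wrong number of values, it may help to validate the output before using it. The parser below is a minimal sketch, not part of this model or of any TOON library: `parse_toon` and `HEADER_RE` are hypothetical names, and it assumes labels are word characters and values contain no commas.

```python
import re

# Matches headers such as animal[1]{name,action,location}:
HEADER_RE = re.compile(r"^(?P<label>\w+)\[(?P<count>\d+)\]\{(?P<fields>[^{}]*)\}:$")


def parse_toon(text):
    """Parse a TOON block into (label, list of row dicts); raise ValueError if malformed."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty output")
    match = HEADER_RE.match(lines[0])
    if match is None:
        raise ValueError(f"bad header: {lines[0]!r}")
    fields = [f.strip() for f in match["fields"].split(",")]
    rows = lines[1:]
    # The bracketed count in the header must equal the number of data rows.
    if len(rows) != int(match["count"]):
        raise ValueError("row count does not match header")
    records = []
    for row in rows:
        values = [v.strip() for v in row.split(",")]
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} values, got {len(values)}: {row!r}")
        records.append(dict(zip(fields, values)))
    return match["label"], records


label, records = parse_toon("animal[1]{name,action,location}:\ncat,sat,mat")
print(label, records)
```

A `ValueError` from this check is a reasonable signal to retry generation or to fall back to another extraction path.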

---

## 📈 Evaluation (Fine-tune Metrics)

| Metric              | Value                     |
| ------------------- | ------------------------- |
| Final Training Loss | **0.2178**                |
| Lowest Loss         | **0.2043**                |
| Total Steps         | **430**                   |
| Stability           | Excellent (no divergence) |

---

## 🚀 Intended Use

* **Structured data extraction** from unstructured text.
* **Compact schema-based representations** for LLM pipelines.
* **Dataset generation** for downstream tasks (e.g., CSV, SQL, knowledge graph).
* Works best with short or medium-length text requiring structured outputs.
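
For the dataset-generation use case, TOON output maps naturally onto CSV because both are a header followed by comma-separated rows. The converter below is an illustrative sketch using only the Python standard library; `toon_to_csv` is a hypothetical helper and assumes well-formed input whose values contain no commas.

```python
import csv
import io


def toon_to_csv(toon_text):
    """Convert a TOON block (header line plus comma-separated rows) to CSV text."""
    lines = [ln.strip() for ln in toon_text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Field names sit between the braces of the header, e.g. users[2]{id,name,role}:
    fields = header[header.index("{") + 1 : header.rindex("}")].split(",")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        writer.writerow(row.split(","))
    return buffer.getvalue()


csv_text = toon_to_csv("users[2]{id,name,role}:\n1,Alice,admin\n2,Bob,user")
print(csv_text)
```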

---

## Limitations

* Schema must be explicit; generic prompts reduce accuracy.
* English-only alignment (no multilingual fine-tuning yet).
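
One way to keep the schema explicit is to build the user message from a label and a field list rather than writing it by hand. The helper below is a hypothetical sketch that reproduces the prompt layout from the usage example; the function name and its validation rule are assumptions, not part of the model's interface.

```python
def build_user_prompt(label, fields, text):
    """Build the user message with an explicit schema, mirroring the usage example."""
    if not label or not fields:
        raise ValueError("an explicit label and at least one field are required")
    # Schema layout: label followed by comma-separated field names in braces.
    schema = f"{label}{{{','.join(fields)}}}"
    return f'Use schema: {schema}\nText: "{text}"'


user_prompt = build_user_prompt(
    "animal", ["name", "action", "location"], "The cat sat on the mat."
)
print(user_prompt)
```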

---

## Future Work

* Fine-tune on multi-row (`[n]`) TOON conversions.
* Expand coverage to other domains (e.g., medical, legal, environmental).
* Evaluate zero-shot generalization on unseen schemas.
* Explore quantized (GGUF) release for CPU/edge inference.

---

## Citation

```bibtex
@misc{yasserrmd2025lfm2toon,
  title        = {LFM2-350M-Extract-TOON: Schema-driven TOON Output Model},
  author       = {Mohamed Yasser},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/yasserrmd/LFM2-350M-Extract-TOON}}
}
```

---

## 🙏 Acknowledgements

* **Base model:** LiquidAI team for LFM2-350M-Extract
* **Fine-tuning framework:** Unsloth AI
* **Dataset:** yasserrmd/TOON-Unstructured-Structured
* **Concept:** Token-Oriented Object Notation (TOON)

---

## 📜 Version History

| Version | Date       | Changes                                  |
| ------- | ---------- | ---------------------------------------- |
| v1.0    | 2025-11-11 | Initial release (Unsloth LoRA fine-tune) |
| v1.1    | TBD        | Planned quantized GGUF release           |

---

**Model performance summary:**
Training loss fell from **1.38 → 0.22** over 430 steps, a roughly 6.3× reduction.
With greedy decoding and the specified system instruction, the model is intended to produce deterministic, schema-accurate TOON output, which makes it a compact structured-extraction model for lightweight and edge deployments.

---