yasserrmd committed · Commit 55e3b89 · verified · 1 Parent(s): b405684

Update README.md

Files changed (1): README.md (+195 -11)
@@ -1,21 +1,205 @@
  ---
  base_model: LiquidAI/LFM2-350M-Extract
- tags:
- - text-generation-inference
- - transformers
- - unsloth
- - lfm2
  license: apache-2.0
  language:
  - en
  ---

- # Uploaded finetuned model

- - **Developed by:** yasserrmd
- - **License:** apache-2.0
- - **Finetuned from model :** LiquidAI/LFM2-350M-Extract

- This lfm2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
 
  ---
  base_model: LiquidAI/LFM2-350M-Extract
  license: apache-2.0
  language:
  - en
+ tags:
+ - text-generation
+ - instruction-tuning
+ - structured-output
+ - toon
+ - lfm2
+ - unsloth
+ - lora
+ - transformers
+ datasets:
+ - yasserrmd/TOON-Unstructured-Structured
+ model-index:
+ - name: yasserrmd/LFM2-350M-Extract-TOON
+   results:
+   - task:
+       name: TOON conversion (schema-driven extraction)
+       type: text-generation
+     dataset:
+       name: yasserrmd/TOON-Unstructured-Structured
+       type: text
+     metrics:
+     - name: Final Training Loss
+       type: loss
+       value: 0.2178
+     - name: Lowest Loss
+       type: loss
+       value: 0.2043
+     - name: Total Steps
+       type: steps
+       value: 430
+ ---
+
+ # yasserrmd/LFM2-350M-Extract-TOON
+
+ `yasserrmd/LFM2-350M-Extract-TOON` is a **fine-tuned variant of LiquidAI’s LFM2-350M-Extract**, built using the **Unsloth AI** framework and the dataset [`yasserrmd/TOON-Unstructured-Structured`](https://huggingface.co/datasets/yasserrmd/TOON-Unstructured-Structured).
+
+ This model specializes in **schema-driven conversion of natural-language text into valid TOON (Token-Oriented Object Notation)** format, a compact, token-efficient alternative to JSON designed for large language models.
+
+ ---
+
+ ## Model Overview
+
+ | Property | Description |
+ |-----------|-------------|
+ | **Base Model** | LiquidAI/LFM2-350M-Extract |
+ | **Architecture** | LFM2-350M (Decoder-only Transformer) |
+ | **Fine-tuning Method** | LoRA (via Unsloth AI) |
+ | **Objective** | Structured extraction in TOON format |
+ | **Dataset** | yasserrmd/TOON-Unstructured-Structured |
+ | **Languages** | English |
+ | **Frameworks** | Transformers, Unsloth, PyTorch |
+ | **License** | LFM License v1.0 |
+ | **Final Loss** | 0.2178 (Step 430) |
+
+ ---
+
+ ## What is TOON?
+
+ **TOON (Token-Oriented Object Notation)** is a serialization format optimized for LLMs.
+ It represents structured data with minimal tokens using a **header + rows** pattern:
+
+ ```
+ users[2]{id,name,role}:
+ 1,Alice,admin
+ 2,Bob,user
+ ```
+
+ Compared to JSON, TOON reduces token count by up to 60% and is easier for LLMs to generate deterministically.
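
To make the header + rows pattern concrete, here is a minimal Python sketch that serializes a list of flat records into the layout shown above and compares its length with the equivalent JSON. The helper name `to_toon` is hypothetical and not part of any TOON library. The sketch assumes every record has the same keys and that no value contains a comma or newline; the full TOON format may differ in details such as row indentation and quoting.

```python
import json


def to_toon(label, rows):
    """Serialize a list of flat dicts as a header line plus one comma-separated row each.

    Assumes all rows share the same keys and that values contain no commas
    or newlines; quoting and escaping are not handled in this sketch.
    """
    fields = list(rows[0].keys())
    # Header: <label>[<row count>]{<field names>}:
    header = label + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    body = [",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *body])


users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]

toon_text = to_toon("users", users)
json_text = json.dumps({"users": users})

print(toon_text)
# Character counts are only a rough proxy for token counts.
print(len(toon_text), len(json_text))
```

The saving comes from stating the field names once in the header instead of repeating them in every record, so it grows with the number of rows.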
+
+ ---
+
+ ## Training Summary
+
+ The model was trained for 430 steps with the following key trends:
+
+ - **Initial loss:** 1.3793
+ - **Final loss:** 0.2178
+ - **Lowest recorded loss:** 0.2043
+ - **Steady convergence** after step 250, with the loss declining consistently below 0.3.
+ - **Training method:** Unsloth LoRA (rank 16, alpha 32, learning rate 2e-4, batch size 64).
+ - **Hardware:** 1x NVIDIA L4 (24 GB VRAM).
+ - **Duration:** 1.5 hours.
+
+ Training was stable and converged smoothly towards a sub-0.25 loss, indicating that the base model adapted well to the TOON structure.
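
The figures above can be turned into a few derived quantities with simple arithmetic. This sketch assumes that the reported batch size of 64 is the effective batch size per optimizer step and that the 1.5 hours cover all 430 steps; neither detail is stated explicitly in the summary.

```python
# Figures reported in the training summary.
initial_loss = 1.3793
final_loss = 0.2178
steps = 430
batch_size = 64        # assumed to be the effective batch size per optimizer step
duration_hours = 1.5   # assumed to cover all 430 steps

loss_ratio = initial_loss / final_loss             # how many times smaller the final loss is
examples_seen = steps * batch_size                 # examples processed, counting repeats
seconds_per_step = duration_hours * 3600 / steps   # average wall-clock time per step

print(f"loss reduction: {loss_ratio:.2f}x")
print(f"examples processed: {examples_seen}")
print(f"seconds per step: {seconds_per_step:.1f}")
```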
+
+ ---
+
+ ## 🧰 Usage Example
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "yasserrmd/LFM2-350M-Extract-TOON"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+
+ schema = "animal{name,action,location}"
+ text = "The cat sat on the mat."
+
+ # The system prompt pins the output to a single-row TOON block.
+ system = (
+     "You are a precise extractor that outputs TOON format only. "
+     "Header must be <label>[1]{fields}: followed by a single row of comma-separated values. "
+     "No commentary."
+ )
+ user = f'Use schema: {schema}\nText: "{text}"'
+
+ messages = [
+     {"role": "system", "content": system},
+     {"role": "user", "content": user},
+ ]
+
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ # Greedy decoding for deterministic output (temperature=0 is not a valid sampling temperature).
+ outputs = model.generate(**inputs, max_new_tokens=80, do_sample=False)
+ # Decode only the newly generated tokens so the prompt is not echoed.
+ new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+ print(tokenizer.decode(new_tokens, skip_special_tokens=True))
+ ```
+
+ **Expected Output:**
+
+ ```
+ animal[1]{name,action,location}:
+ cat,sat,mat
+ ```
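
Generated text is worth checking before it enters a pipeline. The following sketch parses a TOON block into records and rejects output whose header or rows do not match. The function name `parse_toon` and the regular expression are illustrative assumptions, not part of the model or of any TOON library, and values containing commas are not supported.

```python
import re

# Header shape assumed from the examples in this README: <label>[<count>]{<fields>}:
HEADER_RE = re.compile(r"^(?P<label>\w+)\[(?P<count>\d+)\]\{(?P<fields>[^}]*)\}:$")


def parse_toon(block):
    """Parse a TOON block into (label, list of row dicts).

    Raises ValueError when the header is malformed, the declared row count
    is wrong, or a row has the wrong number of values.
    """
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty output")
    match = HEADER_RE.match(lines[0])
    if match is None:
        raise ValueError(f"malformed header: {lines[0]!r}")
    fields = match.group("fields").split(",")
    rows = lines[1:]
    if len(rows) != int(match.group("count")):
        raise ValueError("row count does not match header")
    records = []
    for row in rows:
        values = [v.strip() for v in row.split(",")]
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} values, got {len(values)}")
        records.append(dict(zip(fields, values)))
    return match.group("label"), records


label, records = parse_toon("animal[1]{name,action,location}:\ncat,sat,mat")
print(label, records)
```

Comparing the parsed field names with the schema passed in the prompt is a cheap way to catch outputs that drift from the requested structure.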
+
+ ---
+
+ ## 📈 Evaluation (Fine-tune Metrics)
+
+ | Metric | Value |
+ | ------------------- | ------------------------- |
+ | Final Training Loss | **0.2178** |
+ | Lowest Loss | **0.2043** |
+ | Total Steps | **430** |
+ | Stability | Excellent (no divergence) |
+
+ ---
+
+ ## 🚀 Intended Use
+
+ * **Structured data extraction** from unstructured text.
+ * **Compact schema-based representations** for LLM pipelines.
+ * **Dataset generation** for downstream tasks (e.g., CSV, SQL, knowledge graph).
+ * Works best with short or medium-length text requiring structured outputs.
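
As one illustration of the dataset-generation use case, a TOON block can be rewritten as CSV with the Python standard library. This is a sketch under the assumption that the output follows the unquoted header + rows layout used in this README. The helper name `toon_to_csv` is hypothetical, and the label and declared row count are ignored.

```python
import csv
import io


def toon_to_csv(block):
    """Convert a TOON block to CSV text, using the header fields as column names.

    Assumes unquoted values without embedded commas; the label and the
    declared row count in the header are not checked.
    """
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # Field names sit between the braces of the header line.
    fields = header[header.index("{") + 1 : header.index("}")].split(",")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        writer.writerow(row.split(","))
    return buffer.getvalue()


csv_text = toon_to_csv("users[2]{id,name,role}:\n1,Alice,admin\n2,Bob,user")
print(csv_text)
```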
+
+ ---
+
+ ## Limitations
+
+ * Schema must be explicit; generic prompts reduce accuracy.
+ * English-only alignment (no multilingual fine-tuning yet).
+
  ---

+ ## Future Work

+ * Fine-tune on multi-row (`[n]`) TOON conversions.
+ * Expand coverage to other domains (e.g., medical, legal, environmental).
+ * Evaluate zero-shot generalization on unseen schemas.
+ * Explore quantized (GGUF) release for CPU/edge inference.
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{yasserrmd2025lfm2toon,
+   title = {LFM2-350M-Extract-TOON: Schema-driven TOON Output Model},
+   author = {Mohamed Yasser},
+   year = {2025},
+   howpublished = {\url{https://huggingface.co/yasserrmd/LFM2-350M-Extract-TOON}}
+ }
+ ```
+
+ ---
+
+ ## 🙏 Acknowledgements
+
+ * **Base model:** LiquidAI team for LFM2-350M-Extract
+ * **Fine-tuning framework:** Unsloth AI
+ * **Dataset:** yasserrmd/TOON-Unstructured-Structured
+ * **Concept:** Token-Oriented Object Notation (TOON)
+
+ ---
+
+ ## 📜 Version History
+
+ | Version | Date | Changes |
+ | ------- | ---------- | ---------------------------------------- |
+ | v1.0 | 2025-11-11 | Initial release (Unsloth LoRA fine-tune) |
+ | v1.1 | TBD | Planned quantized GGUF release |
+
+ ---

+ **Model performance summary:**
+ The training loss fell from **1.38 → 0.22** over 430 steps, a reduction of roughly 6×.
+ It produces deterministic, schema-accurate TOON outputs under the specified system instruction, making it an efficient structured extraction model for lightweight and edge deployments.

+ ---