mohanprakash462 committed on
Commit
17f9530
·
verified ·
1 Parent(s): f1c1fe5

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +117 -13
README.md CHANGED
@@ -1,21 +1,125 @@
  ---
- base_model: unsloth/qwen2.5-7b-unsloth-bnb-4bit
- tags:
- - text-generation-inference
- - transformers
- - unsloth
- - qwen2
- license: apache-2.0
  language:
  - en
  ---

- # Uploaded finetuned model

- - **Developed by:** Tamil-ai
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/qwen2.5-7b-unsloth-bnb-4bit

- This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  ---
  language:
+ - ta
  - en
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-7B-Instruct
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - tamil
+ - qwen2
+ - qlora
+ - instruction-tuning
+ - morphology
+ - dravidian
+ datasets:
+ - Tamil-ai/samacheer-kalvi-tamil
+ model-index:
+ - name: Tamil-Qwen2.5-7B-Instruct
+   results: []
  ---

+ # Tamil-Qwen2.5-7B-Instruct
+
+ A Tamil-specialized, instruction-tuned LLM built on Qwen2.5-7B-Instruct and fine-tuned with QLoRA on 150K deduplicated Tamil instruction-response pairs.
+
+ **Paper:** *"A Thousand Language Problem: Morphological Understanding in Linguistic AI"*
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | Base model | [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
+ | Parameters | 7.6B |
+ | Method | QLoRA (r=64, alpha=128, dropout=0.05) |
+ | Training data | 150K deduplicated Tamil instruction-response pairs |
+ | Tokenizer efficiency | 4.62x Tamil token ratio (lowest among the tested models; lower is better) |
+ | Compute | RunPod RTX 5090, ~$5 total cost |
+ | Sequence length | 1024 tokens |
+ | Batch size | 32 (effective) |
+ | Epochs | 1 |
+
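The dataset size, effective batch size, and epoch count in the table determine roughly how many optimizer updates the run performed. A minimal sketch of that arithmetic follows. It assumes that all 150K pairs were used for training (no held-out split) and that the final partial batch was kept. The split of the effective batch into a per-device batch of 4 with gradient accumulation is hypothetical, since the card states only the effective value of 32.

```python
import math

# Values stated in the Model Details table.
NUM_EXAMPLES = 150_000
EFFECTIVE_BATCH_SIZE = 32
EPOCHS = 1

# Hypothetical split of the effective batch; the card does not state it.
PER_DEVICE_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = EFFECTIVE_BATCH_SIZE // PER_DEVICE_BATCH_SIZE


def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer updates, keeping the final partial batch of each epoch."""
    return math.ceil(num_examples / batch_size) * epochs


steps = optimizer_steps(NUM_EXAMPLES, EFFECTIVE_BATCH_SIZE, EPOCHS)
print(f"gradient accumulation steps: {GRAD_ACCUM_STEPS}")
print(f"optimizer steps: {steps}")
```

Under these assumptions the run comes to 4,688 optimizer steps, which is consistent with a short single-GPU job.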
+ ## Training Data
+
+ 150,000 deduplicated instruction-response pairs from 5 Tamil datasets:
+ - Tamil Alpaca
+ - Tamil Orca
+ - Tamil Dolly
+ - Tamil-ai/samacheer-kalvi-tamil (morphological drills + grammar QA)
+ - Additional Tamil instruction sets
+
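The card does not describe how the pairs were deduplicated. One common approach when merging overlapping instruction sets is exact-match deduplication on normalized text, sketched below. The field names `instruction` and `response`, the helper names, and the NFC-plus-whitespace normalization are all assumptions for illustration, not the pipeline used for this model.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedupe_pairs(pairs):
    """Keep the first occurrence of each (instruction, response) pair."""
    seen = set()
    unique = []
    for pair in pairs:
        # Join the two fields with a separator that cannot occur after normalization.
        key_text = normalize(pair["instruction"]) + "\x1f" + normalize(pair["response"])
        key = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


sample = [
    {"instruction": "Translate 'house' into Tamil.", "response": "வீடு"},
    {"instruction": "Translate  'house' into Tamil. ", "response": "வீடு"},  # whitespace variant
    {"instruction": "Translate 'word' into Tamil.", "response": "சொல்"},
]
print(len(dedupe_pairs(sample)))
```

Unicode normalization matters for Tamil in particular, because some vowel signs can be encoded either as one code point or as a sequence, and two visually identical strings would otherwise hash differently.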
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "Tamil-ai/tamil-qwen25-7b-instruct"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",  # use the dtype stored in the checkpoint
+     device_map="auto",   # place weights on the available devices (needs accelerate)
+ )
+
+ # User prompt: "State the case forms of the word வீடு (house)."
+ messages = [
+     {"role": "system", "content": "You are a helpful Tamil language assistant."},
+     {"role": "user", "content": "வீடு என்ற சொல்லின் வேற்றுமை வடிவங்களைக் கூறுக."},
+ ]
+
+ # Render the conversation with the chat template and open the assistant turn.
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=256)
+
+ # Decode only the newly generated tokens so the prompt is not echoed back.
+ new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+ print(tokenizer.decode(new_tokens, skip_special_tokens=True))
+ ```
+
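+ ### 4-bit Quantized (for limited VRAM)
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+ model_id = "Tamil-ai/tamil-qwen25-7b-instruct"
+
+ # Load the weights in 4-bit precision; requires the bitsandbytes package.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
+     device_map="auto",
+ )
+ # The tokenizer is unaffected by quantization and loads as usual.
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ ```
+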
+ ## Why Qwen2.5?
+
+ A tokenizer analysis across 6 base models showed that Qwen2.5 has the best Tamil tokenization efficiency (the table lists five of them):
+
+ | Model | Tamil Token Ratio | Verdict |
+ |-------|------------------|---------|
+ | **Qwen2.5** | **4.62x** | Best for Tamil |
+ | Llama 3.1 | 5.8x | |
+ | Gemma 2 | 6.1x | |
+ | Mistral | 7.2x | |
+ | Falcon | 10.5x | Worst |
+
+ A lower ratio means fewer tokens per Tamil word, which makes training and inference more efficient.
+
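The card does not define how the token ratio was measured. Given the phrase "fewer tokens per Tamil word", one plausible reading is total tokens divided by whitespace-separated words. The sketch below computes that quantity for any tokenizer passed in as a callable. The function name `tokens_per_word` and the toy character-level tokenizer are hypothetical stand-ins, so the printed value is not comparable to the figures in the table.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer: one token per non-space code point (a pessimistic stand-in)."""
    return [ch for ch in text if not ch.isspace()]


corpus = ["வீடு என்ற சொல்"]
print(tokens_per_word(char_tokenize, corpus))
```

To measure a real model, `tokenize` could be replaced with something like `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]` for a loaded Hugging Face tokenizer, run over a representative Tamil corpus rather than a single phrase.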
+ ## Intended Use
+
+ - Tamil question answering and instruction following
+ - Tamil morphological analysis
+ - Tamil grammar and linguistics tasks
+ - Research on low-resource language LLMs
+
+ ## Limitations

+ - Trained primarily on instructional Tamil; may underperform on colloquial Tamil and slang
+ - Morphological accuracy varies by category (see benchmark results)
+ - English capabilities may be weaker than those of the base Qwen2.5-7B-Instruct model

+ ## Citation

+ ```bibtex
+ @misc{tamilai2026,
+   title={A Thousand Language Problem: Morphological Understanding in Linguistic AI},
+   author={Tamil-AI},
+   year={2026},
+   publisher={HuggingFace},
+   url={https://huggingface.co/Tamil-ai/tamil-qwen25-7b-instruct}
+ }
+ ```