Commit 80110ac by oralunal · 1 Parent(s): b8ea2c1
README.md ADDED
@@ -0,0 +1,250 @@
+ ---
+ base_model: distilbert/distilbert-base-multilingual-cased
+ language:
+ - en
+ - zh
+ - es
+ - hi
+ - ar
+ - bn
+ - pt
+ - ru
+ - ja
+ - de
+ - ms
+ - te
+ - vi
+ - ko
+ - fr
+ - tr
+ - it
+ - pl
+ - uk
+ - tl
+ - nl
+ - gsw
+ - sw
+ library_name: transformers
+ license: cc-by-nc-4.0
+ pipeline_tag: text-classification
+ tags:
+ - text-classification
+ - sentiment-analysis
+ - sentiment
+ - synthetic data
+ - multi-class
+ - social-media-analysis
+ - customer-feedback
+ - product-reviews
+ - brand-monitoring
+ - multilingual
+ - 🇪🇺
+ - region:eu
+ - synthetic
+ datasets:
+ - tabularisai/swahili_sentiment_dataset
+ ---
+
+
+ # 🚀 Multilingual Sentiment Classification Model (23 Languages)
+
+ <!-- TRY IT HERE: `coming soon`
+ -->
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/sznxwdqBXj)
+
+
+ # NEWS!
+ - 2025/8: Major model update with one new language: **Swahili**! Also, general improvements across all languages.
+
+ - 2025/8: Free DEMO API for our model! See the Hosted DEMO API section below.
+
+ - 2025/7: We’ve just released ModernFinBERT, a model we’ve been working on for a while. It’s built on the ModernBERT architecture and trained on a mix of real and synthetic data, with LLM-based label correction applied to public datasets to fix human annotation errors.
+ It’s performing well across a range of benchmarks, in some cases improving accuracy by up to 48% over existing models like FinBERT.
+ You can check it out here on Hugging Face:
+ 👉 https://huggingface.co/tabularisai/ModernFinBERT
+
+ - 2024/12: We are excited to introduce a multilingual sentiment model! You can now analyze sentiment across multiple languages.
+
+
+ ## 🔌 Hosted DEMO API
+
+ We provide a hosted inference API that accepts a POST request with a JSON body:
+
+ **Example request:**
+
+ ```bash
+ curl -X POST https://api.tabularis.ai/ \
+   -H "Content-Type: application/json" \
+   -d '{"text":"I love the design","return_all_scores":false}'
+
+ ```
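The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the endpoint and the two JSON fields come from the curl example above, while the function names `build_request` and `classify` are hypothetical and the response schema is not documented in this README, so the sketch returns whatever JSON the server sends back.

```python
import json
import urllib.request

API_URL = "https://api.tabularis.ai/"  # endpoint taken from the curl example


def build_request(text, return_all_scores=False):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"text": text, "return_all_scores": return_all_scores}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, return_all_scores=False, timeout=10):
    """Send the request and decode the JSON reply (schema is an unknown here)."""
    request = build_request(text, return_all_scores)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request without touching the network.
request = build_request("I love the design")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `classify("I love the design")` would perform the actual network call; it is left out of the top-level code so the sketch runs offline.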
+
+ ## Model Details
+ - `Model Name:` tabularisai/multilingual-sentiment-analysis
+ - `Base Model:` distilbert/distilbert-base-multilingual-cased
+ - `Task:` Text Classification (Sentiment Analysis)
+ - `Languages:` Supports English plus Chinese (中文), Spanish (Español), Hindi (हिन्दी), Arabic (العربية), Bengali (বাংলা), Portuguese (Português), Russian (Русский), Japanese (日本語), German (Deutsch), Malay (Bahasa Melayu), Telugu (తెలుగు), Vietnamese (Tiếng Việt), Korean (한국어), French (Français), Turkish (Türkçe), Italian (Italiano), Polish (Polski), Ukrainian (Українська), Tagalog, Dutch (Nederlands), Swiss German (Schweizerdeutsch), and Swahili.
+ - `Number of Classes:` 5 (*Very Negative, Negative, Neutral, Positive, Very Positive*)
+ - `Usage:`
+   - Social media analysis
+   - Customer feedback analysis
+   - Product reviews classification
+   - Brand monitoring
+   - Market research
+   - Customer service optimization
+   - Competitive intelligence
+
+ > If you wish to use this model for commercial purposes, please obtain a license by contacting: info@tabularis.ai
+
+
+ ## Model Description
+
+ This model is a fine-tuned version of `distilbert/distilbert-base-multilingual-cased` for multilingual sentiment analysis. It is trained on synthetic data from multiple sources, with the aim of robust performance across different languages and cultural contexts.
+
+ ### Training Data
+
+ Trained exclusively on synthetic multilingual data generated by advanced LLMs, covering sentiment expressions from a wide range of languages.
+
+ ### Training Procedure
+
+ - Fine-tuned for 3.5 epochs.
+ - Achieved an off-by-one accuracy (`train_acc_off_by_one`) of approximately 0.93 on the validation dataset.
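The README does not define `train_acc_off_by_one`. A common reading for ordinal labels, assumed here, is the fraction of predictions that land within one class of the true label (for example, "Positive" predicted for a "Very Positive" text still counts as a hit). The sketch below illustrates that assumed definition on made-up labels; `off_by_one_accuracy` is a hypothetical helper, not the authors' evaluation code.

```python
def off_by_one_accuracy(y_true, y_pred):
    """Fraction of predictions within one class of the true label (assumed definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)


# Made-up class ids on the 0..4 scale (0 = Very Negative, 4 = Very Positive).
labels = [0, 1, 2, 3, 4]
preds = [0, 2, 4, 3, 3]

score = off_by_one_accuracy(labels, preds)
exact = sum(t == p for t, p in zip(labels, preds)) / len(labels)
print(f"off-by-one accuracy: {score:.2f}, exact accuracy: {exact:.2f}")
```

Under this reading the metric is always at least as high as exact accuracy, so the 0.93 figure should not be compared directly with exact-match accuracies reported for other models.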
+
+ ## Intended Use
+
+ Ideal for:
+ - Multilingual social media monitoring
+ - International customer feedback analysis
+ - Global product review sentiment classification
+ - Worldwide brand sentiment tracking
+
+ ## How to Use
+
+ With the `pipeline` API, inference takes only a few lines:
+
+ ```python
+ from transformers import pipeline
+
+ # Load the classification pipeline with the specified model
+ pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")
+
+ # Classify a new sentence
+ sentence = "I love this product! It's amazing and works perfectly."
+ result = pipe(sentence)
+
+ # Print the result (a list with one dict holding 'label' and 'score')
+ print(result)
+ ```
+
+ Below is a Python example of how to use the multilingual sentiment model without pipelines:
+
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_name = "tabularisai/multilingual-sentiment-analysis"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ def predict_sentiment(texts):
+     inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
+     with torch.no_grad():
+         outputs = model(**inputs)
+     probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+     sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
+     return [sentiment_map[p] for p in torch.argmax(probabilities, dim=-1).tolist()]
+
+ texts = [
+     # English
+     "I absolutely love the new design of this app!", "The customer service was disappointing.", "The weather is fine, nothing special.",
+     # Chinese
+     "这家餐厅的菜味道非常棒!", "我对他的回答很失望。", "天气今天一般。",
+     # Spanish
+     "¡Me encanta cómo quedó la decoración!", "El servicio fue terrible y muy lento.", "El libro estuvo más o menos.",
+     # Arabic
+     "الخدمة في هذا الفندق رائعة جدًا!", "لم يعجبني الطعام في هذا المطعم.", "كانت الرحلة عادية.",
+     # Ukrainian
+     "Мені дуже сподобалася ця вистава!", "Обслуговування було жахливим.", "Книга була посередньою.",
+     # Hindi
+     "यह जगह सच में अद्भुत है!", "यह अनुभव बहुत खराब था।", "फिल्म ठीक-ठाक थी।",
+     # Bengali
+     "এখানকার পরিবেশ অসাধারণ!", "সেবার মান একেবারেই খারাপ।", "খাবারটা মোটামুটি ছিল।",
+     # Portuguese
+     "Este livro é fantástico! Eu aprendi muitas coisas novas e inspiradoras.",
+     "Não gostei do produto, veio quebrado.", "O filme foi ok, nada de especial.",
+     # Japanese
+     "このレストランの料理は本当に美味しいです!", "このホテルのサービスはがっかりしました。", "天気はまあまあです。",
+     # Russian
+     "Я в восторге от этого нового гаджета!", "Этот сервис оставил у меня только разочарование.", "Встреча была обычной, ничего особенного.",
+     # French
+     "J'adore ce restaurant, c'est excellent !", "L'attente était trop longue et frustrante.", "Le film était moyen, sans plus.",
+     # Turkish
+     "Bu otelin manzarasına bayıldım!", "Ürün tam bir hayal kırıklığıydı.", "Konser fena değildi, ortalamaydı.",
+     # Italian
+     "Adoro questo posto, è fantastico!", "Il servizio clienti è stato pessimo.", "La cena era nella media.",
+     # Polish
+     "Uwielbiam tę restaurację, jedzenie jest świetne!", "Obsługa klienta była rozczarowująca.", "Pogoda jest w porządku, nic szczególnego.",
+     # Tagalog
+     "Ang ganda ng lugar na ito, sobrang aliwalas!", "Hindi maganda ang serbisyo nila dito.", "Maayos lang ang palabas, walang espesyal.",
+     # Dutch
+     "Ik ben echt blij met mijn nieuwe aankoop!", "De klantenservice was echt slecht.", "De presentatie was gewoon oké, niet bijzonder.",
+     # Malay
+     "Saya suka makanan di sini, sangat sedap!", "Pengalaman ini sangat mengecewakan.", "Hari ini cuacanya biasa sahaja.",
+     # Korean
+     "이 가게의 케이크는 정말 맛있어요!", "서비스가 너무 별로였어요.", "날씨가 그저 그렇네요.",
+     # Swiss German
+     "Ich find dä Service i de Beiz mega guet!", "Däs Esä het mir nöd gfalle.", "D Wätter hüt isch so naja."
+ ]
+
+ for text, sentiment in zip(texts, predict_sentiment(texts)):
+     print(f"Text: {text}\nSentiment: {sentiment}\n")
+ ```
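`predict_sentiment` above returns only the top label. When a confidence value is also needed, the softmax and argmax step can be kept and the winning probability returned alongside the label. The sketch below shows that step in dependency-free Python so it runs without downloading the model; the logits are made-up numbers, and `softmax` and `label_with_score` are hypothetical helper names.

```python
import math

# Class ids and names as listed in the README and config.json.
ID2LABEL = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_score(logits):
    """Return the most probable label together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits standing in for one row of `outputs.logits`.
example_logits = [-2.0, -1.0, 0.5, 3.0, 1.0]
label, score = label_with_score(example_logits)
print(f"{label} ({score:.3f})")
```

With the real model, the same information is available from the `probabilities` tensor in the example above, for instance via `probabilities.max(dim=-1)`.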
+
+ ## Ethical Considerations
+
+ Synthetic training data may reduce some sources of bias, but validating the model on real-world data for your use case is advised.
+
+ ## Citation
+ ```bib
+ @misc{tabularisai_2025,
+     author = { tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber },
+     title = { multilingual-sentiment-analysis (Revision 69afb83) },
+     year = 2025,
+     url = { https://huggingface.co/tabularisai/multilingual-sentiment-analysis },
+     doi = { 10.57967/hf/5968 },
+     publisher = { Hugging Face }
+ }
+ ```
+
+ ## Contact
+
+ For inquiries, data, private APIs, or better models, contact info@tabularis.ai
+
+ tabularis.ai
+
+
+ <table align="center">
+   <tr>
+     <td align="center">
+       <a href="https://www.linkedin.com/company/tabularis-ai/">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/linkedin.svg" alt="LinkedIn" width="30" height="30">
+       </a>
+     </td>
+     <td align="center">
+       <a href="https://x.com/tabularis_ai">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/x.svg" alt="X" width="30" height="30">
+       </a>
+     </td>
+     <td align="center">
+       <a href="https://github.com/tabularis-ai">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/github.svg" alt="GitHub" width="30" height="30">
+       </a>
+     </td>
+     <td align="center">
+       <a href="https://tabularis.ai">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/internetarchive.svg" alt="Website" width="30" height="30">
+       </a>
+     </td>
+   </tr>
+ </table>
config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "Very Negative",
+     "1": "Negative",
+     "2": "Neutral",
+     "3": "Positive",
+     "4": "Very Positive"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "Negative": 1,
+     "Neutral": 2,
+     "Positive": 3,
+     "Very Negative": 0,
+     "Very Positive": 4
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "output_past": true,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.0",
+   "vocab_size": 119547
+ }
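One detail in this config is easy to trip over: JSON object keys are always strings, so `id2label` maps `"0"` rather than `0`. The sketch below, which embeds only the two label maps from the file above, converts the keys to integers and checks that `id2label` and `label2id` are inverses of each other. It is a standalone illustration; the variable names are arbitrary.

```python
import json

# Excerpt of config.json: only the two label maps.
CONFIG_EXCERPT = """
{
  "id2label": {
    "0": "Very Negative",
    "1": "Negative",
    "2": "Neutral",
    "3": "Positive",
    "4": "Very Positive"
  },
  "label2id": {
    "Negative": 1,
    "Neutral": 2,
    "Positive": 3,
    "Very Negative": 0,
    "Very Positive": 4
  }
}
"""

config = json.loads(CONFIG_EXCERPT)

# JSON keys are strings; convert them so the map can be indexed by class id.
id2label = {int(idx): label for idx, label in config["id2label"].items()}
label2id = config["label2id"]

# The two maps should be exact inverses of each other.
is_consistent = len(id2label) == len(label2id) and all(
    label2id[label] == idx for idx, label in id2label.items()
)

# Labels ordered from most negative to most positive.
ordered_labels = [id2label[idx] for idx in sorted(id2label)]
print(is_consistent, ordered_labels)
```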
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3ab3cecb8605da0a240e5b4e18d969704d44e27c6ea48533ef6693d31dbb926a
+ size 541326604
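The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below parses the pointer and derives a rough parameter count by assuming every byte stores float32 weights (the `torch_dtype` in config.json). That is an approximation and slightly high, because a safetensors file also contains a metadata header. `parse_lfs_pointer` is a hypothetical helper.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:3ab3cecb8605da0a240e5b4e18d969704d44e27c6ea48533ef6693d31dbb926a
size 541326604
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
size_mib = pointer["size"] / 2**20

# Rough estimate: 4 bytes per float32 weight, ignoring the file header.
approx_params = pointer["size"] // 4

print(f"{size_mib:.1f} MiB, about {approx_params / 1e6:.0f}M float32 values")
```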
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
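The special-token ids in this file (0, 100, 101, 102, 103) and `model_max_length` determine how a text is framed before it reaches the model. The sketch below is a simplified model of BERT-style single-sequence encoding, not the tokenizer's actual implementation: it wraps already-tokenized ids in `[CLS]` and `[SEP]`, truncates to the maximum length, and pads a batch with `[PAD]`. The word-piece ids (2001, 3001, and so on) are made up, and the function names are hypothetical.

```python
# Special-token ids from added_tokens_decoder above.
PAD_ID, UNK_ID, CLS_ID, SEP_ID, MASK_ID = 0, 100, 101, 102, 103
MODEL_MAX_LENGTH = 512  # "model_max_length" in tokenizer_config.json


def add_special_tokens(token_ids, max_length=MODEL_MAX_LENGTH):
    """Truncate so that [CLS] + tokens + [SEP] fits into max_length."""
    body = token_ids[: max_length - 2]
    return [CLS_ID] + body + [SEP_ID]


def pad_batch(sequences):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Made-up word-piece ids standing in for two tokenized texts.
batch = [add_special_tokens([2001, 2002, 2003]), add_special_tokens([3001])]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

In practice, `tokenizer(texts, truncation=True, padding=True, max_length=512)` from the README example performs these steps, along with the actual word-piece tokenization.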
vocab.txt ADDED
The diff for this file is too large to render. See raw diff