init

Files changed (7)

README.md +250 -0
config.json +39 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +55 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,250 @@

+---
+base_model: distilbert/distilbert-base-multilingual-cased
+language:
+- en
+- zh
+- es
+- hi
+- ar
+- bn
+- pt
+- ru
+- ja
+- de
+- ms
+- te
+- vi
+- ko
+- fr
+- tr
+- it
+- pl
+- uk
+- tl
+- nl
+- gsw
+- sw
+library_name: transformers
+license: cc-by-nc-4.0
+pipeline_tag: text-classification
+tags:
+- text-classification
+- sentiment-analysis
+- sentiment
+- synthetic data
+- multi-class
+- social-media-analysis
+- customer-feedback
+- product-reviews
+- brand-monitoring
+- multilingual
+- 🇪🇺
+- region:eu
+- synthetic
+datasets:
+- tabularisai/swahili_sentiment_dataset
+---
+# 🚀 Multilingual Sentiment Classification Model (23 Languages)
+<!-- TRY IT HERE: `coming soon`
+ -->
+[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/sznxwdqBXj)
+# NEWS!
+- 2025/8: Major model update with one new language: **Swahili**, plus general improvements across all languages.
+- 2025/8: Free DEMO API for our model! Please see below!
+- 2025/7: We’ve just released ModernFinBERT, a model we’ve been working on for a while. It’s built on the ModernBERT architecture and trained on a mix of real and synthetic data, with LLM-based label correction applied to public datasets to fix human annotation errors.
+It’s performing well across a range of benchmarks — in some cases improving accuracy by up to 48% over existing models like FinBERT.
+You can check it out here on Hugging Face:
+👉 https://huggingface.co/tabularisai/ModernFinBERT
+- 2024/12: We are excited to introduce a multilingual sentiment model! Now you can analyze sentiment across multiple languages, enhancing your global reach.
+## 🔌 Hosted DEMO API
+We provide a hosted inference API:
+**Example request:**
+```bash
+curl -X POST https://api.tabularis.ai/ \
+     -H "Content-Type: application/json" \
+     -d '{"text":"I love the design","return_all_scores":false}'
+```
+## Model Details
+- `Model Name:` tabularisai/multilingual-sentiment-analysis
+- `Base Model:` distilbert/distilbert-base-multilingual-cased
+- `Task:` Text Classification (Sentiment Analysis)
+- `Languages:` Supports English plus Chinese (中文), Spanish (Español), Hindi (हिन्दी), Arabic (العربية), Bengali (বাংলা), Portuguese (Português), Russian (Русский), Japanese (日本語), German (Deutsch), Malay (Bahasa Melayu), Telugu (తెలుగు), Vietnamese (Tiếng Việt), Korean (한국어), French (Français), Turkish (Türkçe), Italian (Italiano), Polish (Polski), Ukrainian (Українська), Tagalog, Dutch (Nederlands), Swiss German (Schweizerdeutsch), and Swahili.
+- `Number of Classes:` 5 (*Very Negative, Negative, Neutral, Positive, Very Positive*)
+- `Usage:`
+  - Social media analysis
+  - Customer feedback analysis
+  - Product reviews classification
+  - Brand monitoring
+  - Market research
+  - Customer service optimization
+  - Competitive intelligence
+> If you wish to use this model for commercial purposes, please obtain a license by contacting: info@tabularis.ai
+## Model Description
+This model is a fine-tuned version of `distilbert/distilbert-base-multilingual-cased` for multilingual sentiment analysis. It leverages synthetic data from multiple sources to achieve robust performance across different languages and cultural contexts.
+### Training Data
+The model was trained exclusively on synthetic multilingual data generated by LLMs, with the aim of covering sentiment expressions across the supported languages.
+### Training Procedure
+- Fine-tuned for 3.5 epochs.
+- Achieved an off-by-one accuracy (`train_acc_off_by_one`) of approximately 0.93 on the validation dataset.
+## Intended Use
+Ideal for:
+- Multilingual social media monitoring
+- International customer feedback analysis
+- Global product review sentiment classification
+- Worldwide brand sentiment tracking
+## How to Use
+Using pipelines, it takes only 4 lines:
+```python
+from transformers import pipeline
+# Load the classification pipeline with the specified model
+pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")
+# Classify a new sentence
+sentence = "I love this product! It's amazing and works perfectly."
+result = pipe(sentence)
+# Print the result
+print(result)
+```
+Below is a Python example of how to use the multilingual sentiment model without pipelines:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+model_name = "tabularisai/multilingual-sentiment-analysis"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+def predict_sentiment(texts):
+    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
+    with torch.no_grad():
+        outputs = model(**inputs)
+    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
+    return [sentiment_map[p] for p in torch.argmax(probabilities, dim=-1).tolist()]
+texts = [
+    # English
+    "I absolutely love the new design of this app!", "The customer service was disappointing.", "The weather is fine, nothing special.",
+    # Chinese
+    "这家餐厅的菜味道非常棒！", "我对他的回答很失望。", "天气今天一般。",
+    # Spanish
+    "¡Me encanta cómo quedó la decoración!", "El servicio fue terrible y muy lento.", "El libro estuvo más o menos.",
+    # Arabic
+    "الخدمة في هذا الفندق رائعة جدًا!", "لم يعجبني الطعام في هذا المطعم.", "كانت الرحلة عادية.",
+    # Ukrainian
+    "Мені дуже сподобалася ця вистава!", "Обслуговування було жахливим.", "Книга була посередньою.",
+    # Hindi
+    "यह जगह सच में अद्भुत है!", "यह अनुभव बहुत खराब था।", "फिल्म ठीक-ठाक थी।",
+    # Bengali
+    "এখানকার পরিবেশ অসাধারণ!", "সেবার মান একেবারেই খারাপ।", "খাবারটা মোটামুটি ছিল।",
+    # Portuguese
+    "Este livro é fantástico! Eu aprendi muitas coisas novas e inspiradoras.",
+    "Não gostei do produto, veio quebrado.", "O filme foi ok, nada de especial.",
+    # Japanese
+    "このレストランの料理は本当に美味しいです！", "このホテルのサービスはがっかりしました。", "天気はまあまあです。",
+    # Russian
+    "Я в восторге от этого нового гаджета!", "Этот сервис оставил у меня только разочарование.", "Встреча была обычной, ничего особенного.",
+    # French
+    "J'adore ce restaurant, c'est excellent !", "L'attente était trop longue et frustrante.", "Le film était moyen, sans plus.",
+    # Turkish
+    "Bu otelin manzarasına bayıldım!", "Ürün tam bir hayal kırıklığıydı.", "Konser fena değildi, ortalamaydı.",
+    # Italian
+    "Adoro questo posto, è fantastico!", "Il servizio clienti è stato pessimo.", "La cena era nella media.",
+    # Polish
+    "Uwielbiam tę restaurację, jedzenie jest świetne!", "Obsługa klienta była rozczarowująca.", "Pogoda jest w porządku, nic szczególnego.",
+    # Tagalog
+    "Ang ganda ng lugar na ito, sobrang aliwalas!", "Hindi maganda ang serbisyo nila dito.", "Maayos lang ang palabas, walang espesyal.",
+    # Dutch
+    "Ik ben echt blij met mijn nieuwe aankoop!", "De klantenservice was echt slecht.", "De presentatie was gewoon oké, niet bijzonder.",
+    # Malay
+    "Saya suka makanan di sini, sangat sedap!", "Pengalaman ini sangat mengecewakan.", "Hari ini cuacanya biasa sahaja.",
+    # Korean
+    "이 가게의 케이크는 정말 맛있어요!", "서비스가 너무 별로였어요.", "날씨가 그저 그렇네요.",
+    # Swiss German
+    "Ich find dä Service i de Beiz mega guet!", "Däs Esä het mir nöd gfalle.", "D Wätter hüt isch so naja."
+]
+for text, sentiment in zip(texts, predict_sentiment(texts)):
+    print(f"Text: {text}\nSentiment: {sentiment}\n")
+```
+## Ethical Considerations
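+Training on synthetic data may reduce some of the biases found in scraped corpora, but it can also carry over biases from the LLMs that generated it. Validate the model on real-world data before relying on it.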
+## Citation
+```bib
+@misc{tabularisai_2025,
+	author       = { tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber },
+	title        = { multilingual-sentiment-analysis (Revision 69afb83) },
+	year         = 2025,
+	url          = { https://huggingface.co/tabularisai/multilingual-sentiment-analysis },
+	doi          = { 10.57967/hf/5968 },
+	publisher    = { Hugging Face }
+}
+```
+## Contact
+For inquiries about data, private APIs, or better models, contact info@tabularis.ai
+tabularis.ai
+<table align="center">
+  <tr>
+    <td align="center">
+      <a href="https://www.linkedin.com/company/tabularis-ai/">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/linkedin.svg" alt="LinkedIn" width="30" height="30">
+      </a>
+    </td>
+    <td align="center">
+      <a href="https://x.com/tabularis_ai">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/x.svg" alt="X" width="30" height="30">
+      </a>
+    </td>
+    <td align="center">
+      <a href="https://github.com/tabularis-ai">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/github.svg" alt="GitHub" width="30" height="30">
+      </a>
+    </td>
+    <td align="center">
+      <a href="https://tabularis.ai">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/internetarchive.svg" alt="Website" width="30" height="30">
+      </a>
+    </td>
+  </tr>
+</table>
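
Note on the README's training metric: `train_acc_off_by_one` is reported but not defined. The sketch below assumes it counts a prediction as correct when it lands within one class of the true label on the 0 to 4 scale. The function name and the sample labels are hypothetical and are not taken from the training code.

```python
def off_by_one_accuracy(y_true, y_pred):
    """Fraction of predictions within one ordinal class of the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    hits = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical labels on the 0 (Very Negative) .. 4 (Very Positive) scale.
y_true = [0, 1, 2, 3, 4, 4]
y_pred = [1, 1, 4, 3, 3, 0]
score = off_by_one_accuracy(y_true, y_pred)
print(score)
```

Under this reading, a "Positive" prediction for a "Very Positive" text still counts as a hit, so the figure is more lenient than exact-match accuracy.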

config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "Very Negative",
+    "1": "Negative",
+    "2": "Neutral",
+    "3": "Positive",
+    "4": "Very Positive"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "Negative": 1,
+    "Neutral": 2,
+    "Positive": 3,
+    "Very Negative": 0,
+    "Very Positive": 4
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "output_past": true,
+  "pad_token_id": 0,
+  "problem_type": "single_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.55.0",
+  "vocab_size": 119547
+}
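
Note on the label maps: `id2label` and `label2id` in this config are meant to be inverses of each other. The sketch below copies both maps from the diff, checks that they agree, and decodes one vector of logits by taking the index of the largest value. The logits are made up for illustration.

```python
import json

# The two label maps copied from config.json above.
config = json.loads("""
{
  "id2label": {"0": "Very Negative", "1": "Negative", "2": "Neutral",
               "3": "Positive", "4": "Very Positive"},
  "label2id": {"Negative": 1, "Neutral": 2, "Positive": 3,
               "Very Negative": 0, "Very Positive": 4}
}
""")

# JSON object keys are always strings, so convert ids to int before use.
id2label = {int(k): v for k, v in config["id2label"].items()}
label2id = config["label2id"]

# The two maps should be exact inverses of each other.
maps_consistent = {v: k for k, v in id2label.items()} == label2id


def decode(logits):
    """Return the label whose logit is largest (argmax over the classes)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


# Hypothetical logits for one input; index 3 holds the largest value.
example_label = decode([-1.2, -0.3, 0.4, 2.1, 0.9])
print(maps_consistent, example_label)
```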

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3ab3cecb8605da0a240e5b4e18d969704d44e27c6ea48533ef6693d31dbb926a
+size 541326604
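
Note on the three lines above: they are a Git LFS pointer, not the weights themselves. The sketch below parses the pointer and derives a rough upper bound on the parameter count, assuming every weight is stored as float32 (4 bytes), as `torch_dtype` in config.json suggests. The helper name is hypothetical.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3ab3cecb8605da0a240e5b4e18d969704d44e27c6ea48533ef6693d31dbb926a
size 541326604
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER)
size_bytes = int(pointer["size"])
hash_algo, _, digest = pointer["oid"].partition(":")

# Rough upper bound: float32 weights take 4 bytes each. The safetensors
# header also occupies part of the file, so the true count is slightly lower.
approx_params = size_bytes // 4
print(hash_algo, len(digest), size_bytes, approx_params)
```

The estimate comes out at roughly 135 million parameters.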

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "unk_token": "[UNK]"
+}
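
Note on the special tokens: the ids in `added_tokens_decoder` determine how a BERT-style tokenizer frames and pads each input. The sketch below is a simplified illustration of that framing using the ids and `model_max_length` from this file. The `frame` helper and the word-piece ids are hypothetical; the real tokenizer handles this internally.

```python
# Special-token ids from added_tokens_decoder in tokenizer_config.json.
PAD_ID, UNK_ID, CLS_ID, SEP_ID, MASK_ID = 0, 100, 101, 102, 103
MODEL_MAX_LENGTH = 512


def frame(token_ids, pad_to):
    """Wrap ids as [CLS] ... [SEP], truncate to the model limit, then pad."""
    pad_to = min(pad_to, MODEL_MAX_LENGTH)
    body = token_ids[: pad_to - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = pad_to - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding


# Hypothetical word-piece ids; real ids come from vocab.txt.
input_ids, attention_mask = frame([1234, 5678, 9012], pad_to=8)
print(input_ids)
print(attention_mask)
```

Padding positions get an attention mask of 0, so the model ignores them.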

vocab.txt ADDED

The diff for this file is too large to render. See raw diff