mmbert-is-political
mmbert-is-political is a fine-tuned ModernBERT sequence classification model for detecting whether a text is political or non-political.
The model predicts one of two labels:
- non_political
- political
Model Details
- Model type: ModernBERT for sequence classification
- Architecture: ModernBertForSequenceClassification
- Task: Binary text classification
- Labels: 0: non_political, 1: political
- Maximum sequence length: 8192 tokens
- Pooling: Mean pooling
- Problem type: Single-label classification
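In a single-label setup with the two labels listed above, the classifier's two output logits are typically turned into probabilities with a softmax, and the highest-scoring index is looked up in the label map. A minimal sketch in plain Python, assuming the mapping given above (0: non_political, 1: political); the logit values and the helper name `logits_to_label` are made up for illustration and are not taken from the model:

```python
import math

# Label mapping as listed in the model details above.
ID2LABEL = {0: "non_political", 1: "political"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for one text: index 0 = non_political, index 1 = political.
label, prob = logits_to_label([-1.2, 2.3])
print(label, round(prob, 3))
```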
Intended Use
This model is intended to classify social media posts and news-style texts as political or non-political.
A text is considered political if it discusses political actors or institutions, elections, public policy, governance, macroeconomic issues, or international/geopolitical affairs. Examples include texts about politicians, parties, immigration policy, healthcare reform, inflation, NATO, the EU, or the war in Ukraine.
A text is considered non-political if it focuses on topics unrelated to politics or public policy. Examples include entertainment, sports, lifestyle, travel, food, technology products, weather, nature, or personal well-being.
Training Data
The model was trained on texts from multiple source types:
- Social media posts from politicians on Instagram, X, and Facebook
- Newspaper articles from German, British, and US outlets
The political actors and outlets represented in the training data come from Germany, the United Kingdom, and the United States.
The training labels are synthetic annotations generated with Llama 3 70B.
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
repo_id = "Sami92/mmbert-is-political"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
)
text = "The government announced a new immigration policy today."
result = classifier(text)
print(result)
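The pipeline returns a list with one dictionary per input text, each holding the predicted `label` and its `score`. A minimal sketch of turning that output into a boolean flag, assuming the label strings match those listed in the model details; the helper `is_political`, the default threshold of 0.5, and the sample scores are illustrative and not part of the model or the library:

```python
def is_political(prediction, threshold=0.5):
    """Map one pipeline prediction dict to True/False.

    `prediction` is expected to look like {"label": "political", "score": 0.97}.
    The score belongs to the predicted label, so for the binary case the
    probability of "political" is either the score itself or 1 - score.
    """
    if prediction["label"] == "political":
        p_political = prediction["score"]
    elif prediction["label"] == "non_political":
        p_political = 1.0 - prediction["score"]
    else:
        raise ValueError(f"Unexpected label: {prediction['label']!r}")
    return p_political >= threshold


# Hypothetical pipeline output for two texts (scores are invented).
sample_output = [
    {"label": "political", "score": 0.97},
    {"label": "non_political", "score": 0.91},
]
flags = [is_political(p) for p in sample_output]
print(flags)
```

Raising the threshold trades recall for precision on the political class; 0.5 simply reproduces the pipeline's own decision.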
Evaluation
Metrics
- Accuracy
Results
The model was tested on 100 texts (news articles and social media posts from the UK, the US, and Germany), which were labeled by two annotators.
- Overall Accuracy: 0.85
- Accuracy on news: 0.88
- Accuracy on posts: 0.82
- Accuracy on English texts (EN): 0.88
- Accuracy on German texts (DE): 0.80
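With only 100 test texts, the reported accuracies carry noticeable sampling uncertainty. As a rough illustration that is not part of the original evaluation, the sketch below computes a 95% Wilson score interval for the overall accuracy of 0.85 on 100 texts. The helper name `wilson_interval` is hypothetical, and the subgroup sizes are not reported, so no intervals are computed for them.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (accuracy + z2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(accuracy * (1 - accuracy) / n + z2 / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width


# Overall accuracy and test-set size as reported above.
low, high = wilson_interval(0.85, 100)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

For the overall figure this gives an interval of roughly 0.77 to 0.91, so differences of a few points between subgroups should be read with caution.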
Limitations
The model was trained on synthetic labels rather than manually verified annotations. As a result, predictions may reflect labeling errors, ambiguities, or biases from the annotation process.
The training data focuses on German, British, and US political and media contexts. Performance may differ for texts from other countries, languages, political systems, or media environments.