---
library_name: transformers
datasets:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8
base_model:
- answerdotai/ModernBERT-base
---
# wissamantoun/WebOrganizer-TopicClassifier-ModernBERT

[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

*All credit goes to the original authors of the model and dataset. This is a retraining of the original model with a different base model.*

The TopicClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with 149M parameters fine-tuned on the following training data:
1. [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
#### All Domain Classifiers
- [wissamantoun/WebOrganizer-FormatClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-FormatClassifier-ModernBERT)
- [wissamantoun/WebOrganizer-TopicClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-TopicClassifier-ModernBERT) *← you are here!*
## Usage

This classifier expects input in the following format:
```
{url}
{text}
```
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "wissamantoun/WebOrganizer-TopicClassifier-ModernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# The input is the page URL followed by the page text.
web_page = """http://www.example.com
How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Softmax over the last dimension gives one probability per topic.
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 5 ("Hardware" topic)
```
You can apply a softmax to the model's `logits` to obtain a probability distribution over the following 24 categories. The list is in label order and label ids are zero-based, so list item 1 corresponds to id 0 (see also `id2label` and `label2id` in the model config):
1. Adult
2. Art & Design
3. Software Dev.
4. Crime & Law
5. Education & Jobs
6. Hardware
7. Entertainment
8. Social Life
9. Fashion & Beauty
10. Finance & Business
11. Food & Dining
12. Games
13. Health
14. History
15. Home & Hobbies
16. Industrial
17. Literature
18. Politics
19. Religion
20. Science & Tech.
21. Software
22. Sports & Fitness
23. Transportation
24. Travel

The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/topics.yaml).
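As a rough illustration of the mapping described above, the sketch below applies a softmax to a vector of 24 logits and returns the most probable topic names. The `TOPICS` list is copied from this card rather than read from the model, and `softmax`, `top_topics`, and the example logits are hypothetical names and values chosen for illustration; in real code, prefer `model.config.id2label`.

```python
import math

# Topic names in label order (id 0 = "Adult"), copied from the list above.
# Assumption: this mirrors `id2label` in the model config; check the config to be sure.
TOPICS = [
    "Adult", "Art & Design", "Software Dev.", "Crime & Law",
    "Education & Jobs", "Hardware", "Entertainment", "Social Life",
    "Fashion & Beauty", "Finance & Business", "Food & Dining", "Games",
    "Health", "History", "Home & Hobbies", "Industrial",
    "Literature", "Politics", "Religion", "Science & Tech.",
    "Software", "Sports & Fitness", "Transportation", "Travel",
]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_topics(logits, k=3):
    """Return the k most probable (topic, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(TOPICS[i], p) for i, p in ranked[:k]]


# Made-up logits for illustration: label 5 scores highest, label 20 second.
example_logits = [0.0] * len(TOPICS)
example_logits[5] = 4.0
example_logits[20] = 2.0

for name, prob in top_topics(example_logits, k=2):
    print(f"{name}: {prob:.3f}")
```

With the real model, `outputs.logits[0].tolist()` from the example in the Usage section could be passed as `logits`.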
## Scores
```
***** pred metrics *****
test_accuracy = 0.8585
test_accuracy__0 = 0.9346
test_accuracy__1 = 0.7317
test_accuracy__10 = 0.9148
test_accuracy__11 = 0.8927
test_accuracy__12 = 0.8687
test_accuracy__13 = 0.814
test_accuracy__14 = 0.8616
test_accuracy__15 = 0.7179
test_accuracy__16 = 0.855
test_accuracy__17 = 0.8246
test_accuracy__18 = 0.907
test_accuracy__19 = 0.8333
test_accuracy__2 = 0.866
test_accuracy__20 = 0.8294
test_accuracy__21 = 0.9441
test_accuracy__22 = 0.8788
test_accuracy__23 = 0.9
test_accuracy__3 = 0.847
test_accuracy__4 = 0.8442
test_accuracy__5 = 0.8189
test_accuracy__6 = 0.8997
test_accuracy__7 = 0.7295
test_accuracy__8 = 0.8937
test_accuracy__9 = 0.8665
test_accuracy_conf50 = 0.8674
test_accuracy_conf50__0 = 0.9434
test_accuracy_conf50__1 = 0.7453
test_accuracy_conf50__10 = 0.93
test_accuracy_conf50__11 = 0.8958
test_accuracy_conf50__12 = 0.8768
test_accuracy_conf50__13 = 0.8193
test_accuracy_conf50__14 = 0.8691
test_accuracy_conf50__15 = 0.7237
test_accuracy_conf50__16 = 0.864
test_accuracy_conf50__17 = 0.8358
test_accuracy_conf50__18 = 0.91
test_accuracy_conf50__19 = 0.8481
test_accuracy_conf50__2 = 0.8768
test_accuracy_conf50__20 = 0.8434
test_accuracy_conf50__21 = 0.9505
test_accuracy_conf50__22 = 0.8844
test_accuracy_conf50__23 = 0.9028
test_accuracy_conf50__3 = 0.8571
test_accuracy_conf50__4 = 0.851
test_accuracy_conf50__5 = 0.8206
test_accuracy_conf50__6 = 0.9071
test_accuracy_conf50__7 = 0.7442
test_accuracy_conf50__8 = 0.9006
test_accuracy_conf50__9 = 0.8761
test_accuracy_conf75 = 0.9178 <--- Metric from the paper
test_accuracy_conf75__0 = 0.95
test_accuracy_conf75__1 = 0.8413
test_accuracy_conf75__10 = 0.9556
test_accuracy_conf75__11 = 0.9298
test_accuracy_conf75__12 = 0.9299
test_accuracy_conf75__13 = 0.8788
test_accuracy_conf75__14 = 0.9126
test_accuracy_conf75__15 = 0.8253
test_accuracy_conf75__16 = 0.8885
test_accuracy_conf75__17 = 0.8968
test_accuracy_conf75__18 = 0.938
test_accuracy_conf75__19 = 0.9113
test_accuracy_conf75__2 = 0.9029
test_accuracy_conf75__20 = 0.8966
test_accuracy_conf75__21 = 0.968
test_accuracy_conf75__22 = 0.9225
test_accuracy_conf75__23 = 0.9444
test_accuracy_conf75__3 = 0.9319
test_accuracy_conf75__4 = 0.8976
test_accuracy_conf75__5 = 0.9167
test_accuracy_conf75__6 = 0.9483
test_accuracy_conf75__7 = 0.804
test_accuracy_conf75__8 = 0.9448
test_accuracy_conf75__9 = 0.932
test_accuracy_label_average = 0.8531
test_accuracy_label_average_conf50 = 0.8615
test_accuracy_label_average_conf75 = 0.9111
test_accuracy_label_min = 0.7179
test_accuracy_label_min_conf50 = 0.7237
test_accuracy_label_min_conf75 = 0.804 <--- Metric from the paper
test_loss = 0.4694
test_proportion_conf50 = 0.9811
test_proportion_conf75 = 0.8535
test_runtime = 0:00:08.39
test_samples_per_second = 1191.144
test_steps_per_second = 37.283
```
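The aggregate rows appear to follow directly from the per-label rows: `test_accuracy_label_average` matches the unweighted mean of the 24 per-label accuracies, and `test_accuracy_label_min` matches their minimum. The sketch below reproduces both from the values reported above. The variable names are hypothetical, and this is a consistency check on the reported numbers, not the original evaluation code.

```python
# Per-label test accuracies copied from the `test_accuracy__<id>` rows above,
# reordered numerically so that the list index equals the label id.
per_label_accuracy = [
    0.9346, 0.7317, 0.866, 0.847, 0.8442, 0.8189,    # ids 0-5
    0.8997, 0.7295, 0.8937, 0.8665, 0.9148, 0.8927,  # ids 6-11
    0.8687, 0.814, 0.8616, 0.7179, 0.855, 0.8246,    # ids 12-17
    0.907, 0.8333, 0.8294, 0.9441, 0.8788, 0.9,      # ids 18-23
]

# Unweighted (macro) average: every label counts equally, however rare it is.
label_average = sum(per_label_accuracy) / len(per_label_accuracy)

# Worst-case label and its id.
label_min = min(per_label_accuracy)
worst_label = per_label_accuracy.index(label_min)

print(f"label_average = {label_average:.4f}")  # 0.8531
print(f"label_min     = {label_min:.4f} (label id {worst_label})")  # 0.7179 (label id 15)
```

The overall `test_accuracy` (0.8585) is slightly higher than the label average, as would be expected if it is computed over all test examples rather than averaged per label.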
## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```