---
language:
- en
- pt
license: cc-by-nc-nd-4.0
tags:
- named-entity-recognition
- metadata-extraction
- bert
- meeting-minutes
- municipal-documents
library_name: transformers
base_model:
- xlm-roberta-large
metrics:
- f1
- recall
- precision
pipeline_tag: token-classification
---

# XLMR-large-metadata-council-en: Metadata Extraction for Municipal Meeting Minutes

## Model Description

**XLMR-large-metadata-council-en** is a fine-tuned XLM-RoBERTa model for **Named Entity Recognition (NER)**, trained to automatically extract metadata such as meeting number, date, location, participants, and time expressions from **Portuguese municipal meeting minutes** translated into English. It was developed as part of a study on information extraction and indexing of administrative documents.

### Key Features

- 🏛️ **Domain-Specific**: Trained on translated municipal meeting minutes
- 🧠 **Entity-Level Extraction**: Identifies key metadata (minute ID, date, location, start/end times, participants)
- ⚙️ **Transformer-based Architecture**: XLM-RoBERTa-large backbone fine-tuned for token classification
- 📈 **Strong NER Performance**: Achieves an F1-score above 0.90 on the English and Portuguese datasets

## Model Details

- **Base Model**: `xlm-roberta-large`
- **Architecture**: XLM-RoBERTa for token classification (NER)
- **Parameters**: 559M
- **Max Sequence Length**: 512 tokens
- **Fine-tuning Dataset**: Municipal meeting minutes (20 minutes from each of 6 Portuguese municipalities, 120 documents in total)
- **Entity Types**: `minute_id`, `date`, `meeting_type`, `location`, `begin_time`, `end_time`, `participant`
- **Training Framework**: PyTorch + Hugging Face Transformers
- **Evaluation Metric**: Micro F1-score, Precision and Recall (seqeval)

## How It Works

The model assigns a label to each token in the input sequence, using the BIO scheme (Begin–Inside–Outside). It can recognize and extract structured metadata from free-form text, even when expressed with stylistic variation.
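To make the BIO scheme concrete, here is a minimal sketch that groups a hand-written tag sequence into entities. The tokens and tags are illustrative assumptions, not actual model output, and `group_bio` is a hypothetical helper name; only the entity type names are taken from the list above.

```python
def group_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity being built, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-" always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # "I-" continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag with no matching open entity.
            flush()
    flush()
    return entities


# Hand-written example (assumed tags, for illustration only).
tokens = ["MINUTE", "NO.", "01", "ORDINARY", "MEETING", "01/03/2024"]
tags = ["O", "O", "B-minute_id", "B-meeting_type", "I-meeting_type", "B-date"]
print(group_bio(tokens, tags))
# [('minute_id', '01'), ('meeting_type', 'ORDINARY MEETING'), ('date', '01/03/2024')]
```

In this sketch an `I-` tag that does not continue an open entity of the same type is discarded rather than promoted to a new entity; that is one possible convention, chosen here for simplicity.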
The following example loads the model, runs it on a sample minute, and reconstructs entity spans from the token-level predictions:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
MODEL_NAME = "anonymous13542/XLMR-large-metadata-council-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

# Example text
text = (
    "MINUTE NO. 01 \n"
    "ORDINARY MEETING 01/03/2024 \n"
    "MUNICIPALITY OF ALANDROAL \n"
    "Mr. João Maria Aranha Grilo, Mayor of Alandroal, presided.\n"
    "Councillors João Carlos Camões Roma Balsante\n"
    "Paulo Jorge da Silva Gonçalves\n"
    "Fernanda Manuela Brites Romão\n"
    "Elisabete de Jesus dos Passos Galhardas\n"
    "He was the secretary of the meeting ************************************************\n"
    "In the Headquarters Building of the Municipality of Alandroal, the Mayor, "
    "João Maria Aranha Grilo, declared the meeting open, it was 10.35 am. \n"
    "\n"
    "1.INFORMATION\n"
)

# Tokenize with offset mapping (return_tensors must be "pt" for PyTorch)
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    return_offsets_mapping=True,
)
# The model does not accept offsets, so remove them before the forward pass
offsets = inputs.pop("offset_mapping")[0]

# Predict one label per token
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)[0]
labels = [model.config.id2label[p.item()] for p in predictions]

# Extract entities using character spans
entities = []
current = None
for (start, end), label in zip(offsets.tolist(), labels):
    # "O" labels and special tokens (empty spans) close any open entity
    if label == "O" or start == end:
        if current:
            entities.append(current)
            current = None
        continue
    if label.startswith("B-"):
        # Start a new entity
        if current:
            entities.append(current)
        current = {"label": label[2:], "start": start, "end": end}
    elif label.startswith("I-") and current and label[2:] == current["label"]:
        # Extend the open entity of the same type
        current["end"] = end
    else:
        # "I-" tag that does not continue the open entity: close it and skip
        if current:
            entities.append(current)
        current = None
if current:
    entities.append(current)

# Print results
print("\nDetected Entities:")
for ent in entities:
    span = text[ent["start"]:ent["end"]]
    print(f"- {ent['label']}: {span}")
```

## Evaluation Results

### Municipal Meeting Minutes Test Set

| Metric | Score |
|--------|-------|
| **F1 score** | 0.93 |
| **Precision** | 0.92 |
| **Recall** | 0.94 |

## Limitations

- **Domain Specificity**: Performs best on administrative/governmental meeting minutes
- **Language**: Optimized for English; also performs well on Portuguese
- **Context Window**: Limited to 512 tokens; longer inputs are truncated

## License

This model is released under the cc-by-nc-nd-4.0 license.
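
## Handling Longer Minutes

The 512-token context window listed under Limitations means that metadata near the end of a long minute (an `end_time`, for example) can be lost to truncation. One common workaround is to run the model over overlapping windows and merge the results. The sketch below shows only the windowing and de-duplication logic in plain Python. The helper names `sliding_windows` and `merge_entities` and the default stride of 256 are assumptions made for illustration, not part of this model's released code.

```python
def sliding_windows(n_tokens, window=512, stride=256):
    """Return (start, end) token ranges that cover n_tokens with overlap."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def merge_entities(entities):
    """Drop exact duplicates produced by overlapping windows.

    Each entity is a dict with "label", "start" and "end" character
    offsets into the full text, as in the usage example above.
    """
    seen = set()
    merged = []
    for ent in sorted(entities, key=lambda e: (e["start"], e["end"])):
        key = (ent["label"], ent["start"], ent["end"])
        if key not in seen:
            seen.add(key)
            merged.append(ent)
    return merged


print(sliding_windows(1000))
# [(0, 512), (256, 768), (512, 1000)]
```

This sketch ignores the special tokens the tokenizer adds to each window, so the usable window is slightly smaller than 512 in practice. It also removes only exact duplicates; an entity cut off at a window boundary may still appear with two different spans and would need extra handling. Hugging Face fast tokenizers can produce overlapping windows directly through the `stride` and `return_overflowing_tokens=True` arguments, which may be a simpler starting point than slicing tokens by hand.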