---
language:
- en
- pt
license: cc-by-nc-nd-4.0
tags:
- named-entity-recognition
- metadata-extraction
- bert
- meeting-minutes
- municipal-documents
library_name: transformers
base_model:
- xlm-roberta-large
metrics:
- f1
- recall
- precision
pipeline_tag: token-classification
---

# XLMR-large-metadata-council-en: Metadata Extraction for Municipal Meeting Minutes

## Model Description

**XLMR-large-metadata-council-en** is a fine-tuned XLM-RoBERTa-large model for **Named Entity Recognition (NER)**, trained to automatically extract metadata such as meeting number, date, location, participants, and time expressions from **Portuguese municipal meeting minutes translated into English**.  
It was developed as part of a study on information extraction and indexing of administrative documents.

### Key Features

- 🏛️ **Domain-Specific**: Trained on translated municipal meeting minutes
- 🧠 **Entity-Level Extraction**: Identifies key metadata (minute ID, date, location, start/end times, participants)
- ⚙️ **Transformer-based Architecture**: XLM-RoBERTa-large backbone fine-tuned for token classification
- 📈 **Strong NER Performance**: Achieves an F1-score above 0.90 on the English and Portuguese data

## Model Details

- **Base Model**: `xlm-roberta-large`
- **Architecture**: XLM-RoBERTa for token classification (NER)
- **Parameters**: 559M
- **Max Sequence Length**: 512 tokens
- **Fine-tuning Dataset**: Municipal meeting minutes (120 documents: 20 minutes from each of 6 Portuguese municipalities)
- **Entity Types**: `minute_id`, `date`, `meeting_type`, `location`, `begin_time`, `end_time`, `participant`
- **Training Framework**: PyTorch + Hugging Face Transformers
- **Evaluation Metric**: Micro F1-score, Precision and Recall (seqeval)

## How It Works

The model assigns a label to each token in the input sequence, using the BIO scheme (Begin–Inside–Outside).  
It can recognize and extract structured metadata from free-form text, even when expressed with stylistic variation.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
MODEL_NAME = "anonymous13542/XLMR-large-metadata-council-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

# Example text
text = "MINUTE NO. 01 \nORDINARY MEETING 01/03/2024 \nMUNICIPALITY OF ALANDROAL \nMr. João Maria Aranha Grilo, Mayor of Alandroal, presided.\nCouncillors João Carlos Camões Roma Balsante\nPaulo Jorge da Silva Gonçalves\nFernanda Manuela Brites Romão\nElisabete de Jesus dos Passos Galhardas\nHe was the secretary of the meeting ************************************************\nIn the Headquarters Building of the Municipality of Alandroal, the Mayor, João Maria Aranha Grilo, declared the meeting open, it was 10.35 am. \n\n1.INFORMATION\n"

# Tokenize with offset mapping
inputs = tokenizer(
    text,
    return_tensors="pt",  # return PyTorch tensors
    truncation=True,
    max_length=512,
    return_offsets_mapping=True
)
offsets = inputs.pop("offset_mapping")[0]

# Predict
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)[0]
labels = [model.config.id2label[p.item()] for p in predictions]

# Extract entities using character spans
entities = []
current = None

for (start, end), label in zip(offsets.tolist(), labels):
    if label == "O" or start == end:
        if current:
            entities.append(current)
            current = None
        continue

    if label.startswith("B-"):
        if current:
            entities.append(current)
        current = {"label": label[2:], "start": start, "end": end}
    elif label.startswith("I-") and current and label[2:] == current["label"]:
        current["end"] = end
    else:
        if current:
            entities.append(current)
        current = None

if current:
    entities.append(current)

# Print results
print("\nDetected Entities:")
for ent in entities:
    span = text[ent["start"]:ent["end"]]
    print(f"- {ent['label']}: {span}")
```
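
The span-merging loop above follows the BIO scheme. As a minimal sketch of the same idea without loading the model, the snippet below groups hand-written labels on a whitespace-tokenized sentence. The `decode_bio` helper, the `B-<type>`/`I-<type>` label strings, and the example labels are illustrative assumptions, not model output; the actual label names are those in `model.config.id2label`.

```python
def decode_bio(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Hypothetical helper for illustration; a stray I- tag with no
    matching open entity is dropped, as in the loop above.
    """
    entities = []
    current = None  # (entity_type, [tokens]) of the entity being built

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new entity
            if current:
                entities.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and label[2:] == current[0]:
            # An I- tag of the same type extends the open entity
            current[1].append(token)
        else:
            # "O" or an inconsistent I- tag closes any open entity
            if current:
                entities.append(current)
            current = None

    if current:
        entities.append(current)

    return [(etype, " ".join(words)) for etype, words in entities]


# Hand-labelled toy input (assumed labels, not predictions)
tokens = ["ORDINARY", "MEETING", "01/03/2024", "opened", "at", "10.35", "am"]
labels = ["B-meeting_type", "I-meeting_type", "B-date", "O", "O",
          "B-begin_time", "I-begin_time"]

for etype, text in decode_bio(tokens, labels):
    print(f"- {etype}: {text}")
```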

## Evaluation Results

### Municipal Meeting Minutes Test Set

| Metric | Score |
|--------|-------|
| **F1 score** | 0.93 |
| **Precision** | 0.92 |
| **Recall** | 0.94 |
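
For readers unfamiliar with entity-level micro-averaging, the following is a simplified sketch of how precision, recall, and F1 can be computed from exact-match entity spans pooled over all documents. It is not the evaluation script used for the table above (which relies on seqeval over BIO tag sequences); the `micro_prf` helper and the toy spans are hypothetical.

```python
def micro_prf(gold_docs, pred_docs):
    """Micro-averaged precision, recall and F1 over exact-match spans.

    gold_docs / pred_docs: one set of (label, start, end) tuples per document.
    Hypothetical helper for illustration only.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # spans predicted with correct label and boundaries
        fp += len(pred - gold)   # predicted spans with no exact gold match
        fn += len(gold - pred)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data: the second minute_id prediction has a wrong end offset,
# so it counts as one false positive and one false negative.
gold = [{("minute_id", 0, 13), ("date", 30, 40)}, {("location", 5, 20)}]
pred = [{("minute_id", 0, 10), ("date", 30, 40)}, {("location", 5, 20)}]

precision, recall, f1 = micro_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```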


## Limitations

- **Domain Specificity**: Best performance on administrative/governmental meeting minutes
- **Language**: Optimized for English; also performs well on Portuguese
- **Context Window**: Limited to 512 tokens
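
One common way to work around the context window for longer minutes is to split the token sequence into overlapping windows and run the model on each. The sketch below only computes the window boundaries; it is an assumption about how one might handle long inputs, not a procedure described for this model. The `sliding_windows` helper, the default of 510 content tokens (leaving room for the two special tokens), and the overlap of 128 are all illustrative choices.

```python
def sliding_windows(n_tokens, max_len=510, overlap=128):
    """Return (start, end) token index ranges that cover n_tokens.

    Consecutive windows share `overlap` tokens so that an entity cut
    by one window boundary appears whole in the neighbouring window.
    Hypothetical helper for illustration only.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")

    windows = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        # Step back by `overlap` tokens before opening the next window
        start = end - overlap
    return windows


# A document of 1000 content tokens needs three windows
windows = sliding_windows(1000)
print(windows)
```

Predictions from overlapping regions then need to be reconciled, for example by keeping each entity from the window in which it lies furthest from a boundary.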

## License

This model is released under the cc-by-nc-nd-4.0 license.