---
language: pt
license: other
license_name: custom-agplv3-dual-license
license_link: https://huggingface.co/inesctec/Citilink-BERTimbau-large-Vote-Identification-pt-baseline/blob/main/LICENSE
tags:
- token-classification
- named-entity-recognition
- portuguese
- voting
- municipal-councils
- bert
- bertimbau
datasets:
- Citilink-Minutes
metrics:
- precision
- recall
- f1
model-index:
- name: Baseline_BERTimbau-large_Vote_Identification-Council-PT
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.8870
      name: Entity F1
    - type: precision
      value: 0.8375
      name: Entity Precision
    - type: recall
      value: 0.9497
      name: Entity Recall
---

# Baseline_BERTimbau-large_Vote_Identification-Council-PT

## Model Description

This model is a Named Entity Recognition (NER) system specifically designed for extracting voting information from Portuguese municipal council meeting minutes (atas de câmara municipal). It identifies and classifies entities related to voting processes, including subjects being voted on, counting results, and voter participation patterns.

The model is built on top of [BERTimbau Large](https://huggingface.co/neuralmind/bert-large-portuguese-cased), a BERT-based language model pre-trained on Portuguese text, with a custom linear classification head optimized for voting information extraction.

### Key Features

- Extracts structured voting information from Portuguese text
- Identifies 8 distinct entity types with BIO tagging
- Optimized for municipal council meeting minutes
- High recall (94.97%), so most voting entities are captured
- Built on BERTimbau Large, a BERT model pre-trained on Portuguese text

## Model Details

- **Architecture**: BertimbauLinearVotIE (Custom NER architecture)
- **Base Model**: [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased)
- **Model Type**: Token Classification (NER)
- **Parameters**: ~335M
  - 24 transformer layers
  - 1024 hidden dimensions
  - 16 attention heads
  - 4096 intermediate size
- **Max Sequence Length**: 512 tokens
- **Learning Rate**: 5e-5
- **Warmup**: 0.1
- **Batch Size**: 16
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **Number of Labels**: 17 (a B- and an I- tag for each of the 8 entity types, plus O)
- **Framework**: PyTorch + Hugging Face Transformers

## Intended Uses

### Primary Use Cases

- Extracting structured voting data from Portuguese municipal council minutes
- Automating analysis of voting patterns in Portuguese governmental documents
- Creating datasets of voting records from unstructured text
- Supporting transparency and civic engagement initiatives

### Out-of-Scope Uses

- General-purpose NER for Portuguese (use domain-general models instead)
- Voting information extraction in other languages
- Real-time classification (model is large and may be slow)
- Legal decision-making without human review

## Labels

The model predicts 17 labels using the BIO (Begin-Inside-Outside) tagging scheme:

### Entity Types

| Label | Description | Example |
|-------|-------------|---------|
| **VOTING** | The voting action itself | "votação", "aprovação por votação" |
| **SUBJECT** | What is being voted on | "proposta de orçamento", "regulamento municipal" |
| **COUNTING-MAJORITY** | Majority-based vote counting | "maioria", "por maioria" |
| **COUNTING-UNANIMITY** | Unanimous vote counting | "unanimidade", "por unanimidade" |
| **VOTER-FAVOR** | Voters in favor | "a favor", "votos favoráveis" |
| **VOTER-AGAINST** | Voters against | "contra", "votos contra" |
| **VOTER-ABSTENTION** | Voters abstaining | "abstenção", "absteve-se" |
| **VOTER-ABSENT** | Absent voters | "ausente", "falta" |
| **O** | Outside any entity | - |

### BIO Tagging

- **B-{ENTITY}**: Beginning of an entity
- **I-{ENTITY}**: Inside/continuation of an entity
- **O**: Outside any entity (not part of a voting-related span)
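
The label count follows directly from the scheme: 8 entity types, each with a `B-` and an `I-` tag, plus a single `O`. The sketch below reproduces that arithmetic. The ordering and the `label2id` mapping are illustrative assumptions; the authoritative mapping is whatever `model.config.id2label` holds in the released checkpoint.

```python
# Entity types as listed in the Entity Types table.
ENTITY_TYPES = [
    "VOTING", "SUBJECT", "COUNTING-MAJORITY", "COUNTING-UNANIMITY",
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
]

# One B- and one I- tag per entity type, plus a single O tag.
labels = ["O"] + [
    f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")
]

# Hypothetical id mapping, for illustration only; the real order may differ.
label2id = {label: idx for idx, label in enumerate(labels)}

print(len(labels))  # 17
```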

## Performance

Evaluated on a test set of 529 examples from Portuguese municipal council minutes:

### Entity-Level Metrics (Strict Matching)

| Metric | Value |
|--------|-------|
| Precision | 83.75% |
| Recall | 94.97% |
| F1 Score | **88.70%** |
| Accuracy | 98.87% |
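
Under strict matching, a predicted entity counts as correct only if its type and both boundaries agree with the gold annotation. The following is a minimal pure-Python sketch of that idea, not the evaluation script used for this card. The helper names `extract_spans` and `strict_prf` are hypothetical, and stray `I-` tags without a preceding `B-` are simply ignored, which is a simplifying assumption.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from a BIO tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            spans.add((current, start, i))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def strict_prf(gold_tags, pred_tags):
    """Precision, recall and F1 where a span must match exactly to count."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy sequences: the predicted SUBJECT span is one token too short,
# so it counts as both a false positive and a false negative.
gold = ["B-SUBJECT", "I-SUBJECT", "O", "B-VOTING", "O", "B-COUNTING-MAJORITY"]
pred = ["B-SUBJECT", "O", "O", "B-VOTING", "O", "B-COUNTING-MAJORITY"]
print(strict_prf(gold, pred))
```

A truncated span is penalized twice under this criterion, which is one reason boundary-sensitive types such as SUBJECT tend to score lower than short, fixed expressions.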

### Per-Entity Type Performance

| Entity Type | Precision | Recall | F1 Score | Support |
|-------------|-----------|--------|----------|---------|
| VOTING | 94.09% | 98.93% | **96.45%** | 467 |
| VOTER-FAVOR | 92.43% | 97.67% | **94.98%** | 300 |
| VOTER-ABSTENTION | 95.56% | 96.99% | **96.27%** | 133 |
| COUNTING-MAJORITY | 92.54% | 100.00% | **96.12%** | 62 |
| COUNTING-UNANIMITY | 92.92% | 99.02% | **95.87%** | 305 |
| VOTER-AGAINST | 81.82% | 100.00% | **90.00%** | 36 |
| SUBJECT | 65.06% | 83.81% | **73.26%** | 420 |
| VOTER-ABSENT | 55.56% | 83.33% | **66.67%** | 18 |


## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer (repository id taken from this card's license link)
model_name = "inesctec/Citilink-BERTimbau-large-Vote-Identification-pt-baseline"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example text
text = "A proposta foi aprovada por maioria, com 5 votos a favor e 2 contra."

# Tokenize and predict (inference only, so gradient tracking is disabled)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

# Print non-O predictions, one line per WordPiece token
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
```
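
The loop above prints one line per WordPiece token, so a single word can appear as several `##`-prefixed pieces. A minimal sketch of merging pieces back into words is shown below, keeping the label of each word's first piece. The helper `merge_wordpieces` is hypothetical, and the example tokenization is hand-written for illustration rather than produced by the real tokenizer.

```python
def merge_wordpieces(tokens, labels, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge BERT WordPiece tokens into words, keeping each word's first-piece label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue  # special tokens carry no text
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word, keep its label.
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Hand-written example; the actual split of "unanimidade" may differ.
tokens = ["[CLS]", "por", "unanim", "##idade", "[SEP]"]
labels = ["O", "O", "B-COUNTING-UNANIMITY", "I-COUNTING-UNANIMITY", "O"]
print(merge_wordpieces(tokens, labels))
```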

### Advanced Usage with Pipeline

```python
from transformers import pipeline

# Create NER pipeline (reuses `model` and `tokenizer` loaded in Basic Usage)
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Groups B- and I- tags
)

# Extract entities
text = "A câmara deliberou por unanimidade aprovar o regulamento municipal."
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")
```
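
With an aggregation strategy set, the pipeline returns a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys. One way to turn that list into a per-vote record is sketched below. The helper `group_entities` and the `min_score` threshold of 0.5 are assumptions for illustration, and the sample list is a hand-written stand-in, not real model output.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group pipeline output by entity type, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output; values are illustrative only.
sample = [
    {"entity_group": "COUNTING-UNANIMITY", "score": 0.95,
     "word": "unanimidade", "start": 23, "end": 34},
    {"entity_group": "SUBJECT", "score": 0.40,
     "word": "regulamento municipal", "start": 45, "end": 66},
]
print(group_entities(sample))
```

Given the lower precision reported for SUBJECT, a per-type threshold may be more appropriate than a single global one.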

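### Output Example

Pipeline output has the following form; the entities shown correspond to the sentence used in Basic Usage, and the values are for illustration: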

```
proposta: SUBJECT (score: 0.9234)
aprovada: VOTING (score: 0.9678)
maioria: COUNTING-MAJORITY (score: 0.9123)
votos a favor: VOTER-FAVOR (score: 0.8956)
contra: VOTER-AGAINST (score: 0.9345)
```

## Training Procedure

### Hyperparameters

- **Base Model**: neuralmind/bert-large-portuguese-cased
- **Architecture**: Linear classification layer on top of BERT embeddings
- **Best Checkpoint**: Step 2100
- **Evaluation Examples**: 529

### Computational Resources

- **Model Size**: ~647 MB (safetensors format)
- **Precision**: float32

## Limitations and Biases

### Limitations

1. **Domain-Specific**: Optimized for municipal council minutes; may not generalize well to other document types or voting contexts
2. **Language Variant**: Trained on European Portuguese; performance on Brazilian Portuguese may vary
3. **Model Size**: 647 MB model may be too large for resource-constrained environments
4. **Context Length**: Limited to 512 tokens (BERT constraint)
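
Minutes longer than 512 tokens have to be split before inference. A common workaround is overlapping windows, sketched below on plain token-id lists. The function `sliding_windows` and its defaults are assumptions: 510 leaves room for `[CLS]` and `[SEP]`, and `stride` here means the number of tokens shared between consecutive windows. Hugging Face tokenizers offer comparable behavior through `return_overflowing_tokens=True` together with `stride`.

```python
def sliding_windows(token_ids, max_len=510, stride=128):
    """Split token ids into overlapping windows of at most max_len tokens."""
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride  # how far each window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end
    return windows


# Stand-in for a tokenized document of 1000 tokens.
ids = list(range(1000))
chunks = sliding_windows(ids)
print([len(c) for c in chunks])
```

Entities predicted in the overlapping regions then need to be deduplicated, for example by keeping the prediction from the window in which the span lies farther from the edge.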

### Known Issues

- **SUBJECT**: Moderate precision (65.06%) suggests that predicted subject boundaries are often imprecise; extracted spans may include extraneous context or miss part of the subject description
- **VOTER-ABSENT**: Lowest precision (55.56%), likely related to the scarcity of examples (only 18 in the test set), which also makes this estimate less reliable

### Potential Biases

- Training data reflects Portuguese municipal governance language and may not capture regional variations
- May reflect biases present in governmental documents
- Performance may vary across different municipalities or time periods


## License

This project uses a custom dual-license based on AGPL v3.

See the full license terms here: [LICENSE](./LICENSE)

## Acknowledgments

- Built on [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) by NeuralMind
- Developed for improving transparency in Portuguese municipal governance