---
language: pt
license: other
license_name: custom-agplv3-dual-license
license_link: https://huggingface.co/inesctec/Citilink-BERTimbau-large-Vote-Identification-pt-baseline/blob/main/LICENSE
tags:
- token-classification
- named-entity-recognition
- portuguese
- voting
- municipal-councils
- bert
- bertimbau
datasets:
- Citilink-Minutes
metrics:
- precision
- recall
- f1
model-index:
- name: Baseline_BERTimbau-large_Vote_Identification-Council-PT
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.8870
      name: Entity F1
    - type: precision
      value: 0.8375
      name: Entity Precision
    - type: recall
      value: 0.9497
      name: Entity Recall
---

# Baseline_BERTimbau-large_Vote_Identification-Council-PT

## Model Description

This model is a Named Entity Recognition (NER) system designed to extract voting information from Portuguese municipal council meeting minutes (atas de câmara municipal). It identifies and classifies entities related to voting processes, including subjects being voted on, counting results, and voter participation patterns.

The model is built on top of [BERTimbau Large](https://huggingface.co/neuralmind/bert-large-portuguese-cased), a BERT-based language model pre-trained on Portuguese text, with a custom linear classification head optimized for voting information extraction.
### Key Features

- Extracts structured voting information from Portuguese text
- Identifies 8 distinct entity types with BIO tagging
- Optimized for municipal council meeting minutes
- High recall (94.97%), so most voting entities are captured
- Built on BERTimbau Large, a BERT model pre-trained on Portuguese text

## Model Details

- **Architecture**: BertimbauLinearVotIE (Custom NER architecture)
- **Base Model**: [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased)
- **Model Type**: Token Classification (NER)
- **Parameters**: ~335M
  - 24 transformer layers
  - 1024 hidden dimensions
  - 16 attention heads
  - 4096 intermediate size
- **Max Sequence Length**: 512 tokens
- **Learning Rate**: 5e-5
- **Warmup**: 0.1
- **Batch Size**: 16
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **Number of Labels**: 17 (B- and I- tags for each of the 8 entity types, plus O)
- **Framework**: PyTorch + Hugging Face Transformers

## Intended Uses

### Primary Use Cases

- Extracting structured voting data from Portuguese municipal council minutes
- Automating analysis of voting patterns in Portuguese governmental documents
- Creating datasets of voting records from unstructured text
- Supporting transparency and civic engagement initiatives

### Out-of-Scope Uses

- General-purpose NER for Portuguese (use domain-general models instead)
- Voting information extraction in other languages
- Real-time classification (the model is large and may be slow)
- Legal decision-making without human review

## Labels

The model predicts 17 labels using the BIO (Begin-Inside-Outside) tagging scheme:

### Entity Types

| Label | Description | Example |
|-------|-------------|---------|
| **VOTING** | The voting action itself | "votação", "aprovação por votação" |
| **SUBJECT** | What is being voted on | "proposta de orçamento", "regulamento municipal" |
| **COUNTING-MAJORITY** | Majority-based vote counting | "maioria", "por maioria" |
| **COUNTING-UNANIMITY** | Unanimous vote counting | "unanimidade", "por unanimidade" |
| **VOTER-FAVOR** | Voters in favor | "a favor", "votos favoráveis" |
| **VOTER-AGAINST** | Voters against | "contra", "votos contra" |
| **VOTER-ABSTENTION** | Voters abstaining | "abstenção", "absteve-se" |
| **VOTER-ABSENT** | Absent voters | "ausente", "falta" |
| **O** | Outside any entity | - |

### BIO Tagging

- **B-{ENTITY}**: Beginning of an entity
- **I-{ENTITY}**: Inside/continuation of an entity
- **O**: Outside any entity (not part of a voting-related span)

## Performance

Evaluated on a test set of 529 examples from Portuguese municipal council minutes:

### Entity-Level Metrics (Strict Matching)

Precision, recall, and F1 below match the unweighted (macro) mean of the per-entity scores in the next table.

| Metric | Value |
|--------|-------|
| Precision | 83.75% |
| Recall | 94.97% |
| F1 Score | **88.70%** |
| Accuracy | 98.87% |

### Per-Entity Type Performance

| Entity Type | Precision | Recall | F1 Score | Support |
|-------------|-----------|--------|----------|---------|
| VOTING | 94.09% | 98.93% | **96.45%** | 467 |
| VOTER-FAVOR | 92.43% | 97.67% | **94.98%** | 300 |
| VOTER-ABSTENTION | 95.56% | 96.99% | **96.27%** | 133 |
| COUNTING-MAJORITY | 92.54% | 100.00% | **96.12%** | 62 |
| COUNTING-UNANIMITY | 92.92% | 99.02% | **95.87%** | 305 |
| VOTER-AGAINST | 81.82% | 100.00% | **90.00%** | 36 |
| SUBJECT | 65.06% | 83.81% | **73.26%** | 420 |
| VOTER-ABSENT | 55.56% | 83.33% | **66.67%** | 18 |

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
# (placeholder id: replace with this model's Hub repository id)
model_name = "your-username/bertimbau-large-vote-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Example text
text = "A proposta foi aprovada por maioria, com 5 votos a favor e 2 contra."

# Tokenize and predict (no gradients are needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions: one label per subword token, including [CLS] and [SEP]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

# Print only the tokens tagged as part of an entity
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
```

### Advanced Usage with Pipeline

```python
from transformers import pipeline

# Create NER pipeline
# (reuses the `model` and `tokenizer` objects loaded in Basic Usage)
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Groups B- and I- tags
)

# Extract entities
text = "A câmara deliberou por unanimidade aprovar o regulamento municipal."
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")
```

### Output Example

Output in the pipeline format above, for the sentence used in Basic Usage:

```
proposta: SUBJECT (score: 0.9234)
aprovada: VOTING (score: 0.9678)
maioria: COUNTING-MAJORITY (score: 0.9123)
votos a favor: VOTER-FAVOR (score: 0.8956)
contra: VOTER-AGAINST (score: 0.9345)
```

## Training Procedure

### Hyperparameters

- **Base Model**: neuralmind/bert-large-portuguese-cased
- **Architecture**: Linear classification layer on top of BERT embeddings
- **Best Checkpoint**: Step 2100
- **Evaluation Examples**: 529

### Computational Resources

- **Model Size**: ~647 MB (safetensors format)
- **Precision**: float32

## Limitations and Biases

### Limitations

1. **Domain-Specific**: Optimized for municipal council minutes; may not generalize well to other document types or voting contexts
2. **Language Variant**: Trained on European Portuguese; performance on Brazilian Portuguese may vary
3. **Model Size**: The 647 MB model may be too large for resource-constrained environments
4. **Context Length**: Limited to 512 tokens (BERT constraint)

### Known Issues

- **SUBJECT entities**: Moderate precision (65.06%) suggests subject boundaries can be imprecise
  - Subject extraction may include extraneous context or miss complete subject descriptions
- **VOTER-ABSENT**: Lower precision (55.56%), likely due to the small sample size (support of 18 in the test set)

### Potential Biases

- Training data reflects Portuguese municipal governance language and may not capture regional variations
- May reflect biases present in governmental documents
- Performance may vary across different municipalities or time periods

## License

This project uses a custom dual license based on AGPL v3. See the full license terms here: [LICENSE](./LICENSE)

## Acknowledgments

- Built on [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) by NeuralMind
- Developed for improving transparency in Portuguese municipal governance
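
## Worked Example: Checking the Aggregate Metrics

The aggregate precision, recall, and F1 in the Performance section can be cross-checked against the per-entity table. The sketch below is not the original evaluation code; it only copies the per-entity percentages from this card and takes their unweighted mean. The names `per_entity` and `macro_average` are illustrative.

```python
# Per-entity (precision, recall, F1) in percent, copied from the
# "Per-Entity Type Performance" table of this card.
per_entity = {
    "VOTING": (94.09, 98.93, 96.45),
    "VOTER-FAVOR": (92.43, 97.67, 94.98),
    "VOTER-ABSTENTION": (95.56, 96.99, 96.27),
    "COUNTING-MAJORITY": (92.54, 100.00, 96.12),
    "COUNTING-UNANIMITY": (92.92, 99.02, 95.87),
    "VOTER-AGAINST": (81.82, 100.00, 90.00),
    "SUBJECT": (65.06, 83.81, 73.26),
    "VOTER-ABSENT": (55.56, 83.33, 66.67),
}


def macro_average(scores):
    """Return the unweighted mean of each metric column, rounded to 2 decimals."""
    n = len(scores)
    columns = zip(*scores.values())  # precision column, recall column, F1 column
    return tuple(round(sum(column) / n, 2) for column in columns)


macro_precision, macro_recall, macro_f1 = macro_average(per_entity)
print(macro_precision, macro_recall, macro_f1)  # 83.75 94.97 88.7
```

Because a macro average weights every entity type equally, the low-support types (VOTER-ABSENT with 18 examples, VOTER-AGAINST with 36) influence the aggregate as much as VOTING with 467.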
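
## Worked Example: BIO Label Set and Span Grouping

The Labels section states that 8 entity types with BIO tagging yield 17 labels. The sketch below shows how that count arises and how a sequence of BIO tags can be grouped into entity spans. It is a minimal illustration, not the model's own code: the label order here is an assumption (the real mapping is whatever `model.config.id2label` contains), and `build_labels` and `bio_to_spans` are hypothetical helper names.

```python
# Entity types as listed in the "Entity Types" table of this card.
ENTITY_TYPES = [
    "VOTING", "SUBJECT", "COUNTING-MAJORITY", "COUNTING-UNANIMITY",
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
]


def build_labels(entity_types):
    """Build the BIO label list: "O" plus a B- and an I- tag per entity type.

    The ordering is an assumption made for illustration only.
    """
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    return labels


def bio_to_spans(tags):
    """Group BIO tags into (entity_type, start, end) spans, end exclusive.

    An I- tag that does not continue an open span of the same type is
    treated like "O" in this sketch.
    """
    spans = []
    current = None  # (entity_type, start_index) of the span being built
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append((current[0], current[1], i))
            current = (tag[2:], i)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            continue  # still inside the open span
        else:
            if current:
                spans.append((current[0], current[1], i))
            current = None
    if current:
        spans.append((current[0], current[1], len(tags)))
    return spans


LABELS = build_labels(ENTITY_TYPES)
print(len(LABELS))  # 17

example_tags = ["O", "B-SUBJECT", "I-SUBJECT", "O", "B-VOTING", "O",
                "B-COUNTING-MAJORITY", "I-COUNTING-MAJORITY"]
example_spans = bio_to_spans(example_tags)
print(example_spans)
```

With the `transformers` pipeline, `aggregation_strategy="simple"` performs a comparable grouping, so a helper like this is only needed when working from raw per-token predictions.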