--- library_name: transformers license: apache-2.0 base_model: Qwen/Qwen2.5-7B-Instruct tags: - generated_from_trainer metrics: - rouge model-index: - name: EpiQwen2.5-7B results: [] language: - en --- # EpiQwen2.5-7B: Fine-tuned Qwen for Epidemiological Information Extraction ## Model Description EpiQwen2.5-7B is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) specialized for extracting structured epidemiological information from unstructured disease outbreak reports. The model was trained on the WHO Disease Outbreak News (DONs) curated database [(Carlson et al., 2023)](https://doi.org/10.1371/journal.pgph.0001083) to automatically extract key epidemiological features including disease classification, geographical locations, case counts, temporal information, and outbreak characteristics. ### Model Details - **Base Model**: Qwen/Qwen2.5-7B-Instruct - **Base Model License**: [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0) - **Model Type**: Causal Language Model (Decoder-only Transformer) - **Fine-tuning Method**: Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation) - **Adapter Weights License**: CC0-1.0 (Public Domain Dedication) - **Note**: Only the LoRA adapter weights are released under CC0. The base model weights remain under Apache 2.0. - **Training Data**: WHO Disease Outbreak News curated database (3,338 records through 2019) - **Language**: English - **Application Domain**: Public health surveillance, epidemic intelligence, epidemiological information extraction ## License ### Licensing Information This repository contains **LoRA adapter weights only**, not the full model weights. - **Base Model (Qwen 2.5 7B)**: Licensed under [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0) - Copyright (c) Alibaba Cloud - Permissive open-source license allowing commercial use, modification, and distribution - Requires preservation of copyright and license notices - **LoRA Adapter Weights**: Released under [CC0 1.0 Universal (Public Domain Dedication)](https://creativecommons.org/publicdomain/zero/1.0/) - The adapter weights can be used without restriction - To use these adapters, you must have access to the base Qwen 2.5 7B model (available under Apache 2.0) **Attribution**: When using this model, please include appropriate attribution: ``` Qwen 2.5 is licensed under the Apache License 2.0, Copyright (c) Alibaba Cloud. EpiQwen2.5-7B LoRA adapter weights are released under CC0 1.0 Universal (Public Domain). ``` ### Distribution Notes - This repository distributes only the fine-tuned LoRA adapter parameters - Base model weights are unchanged and must be obtained separately from Alibaba Cloud/Hugging Face - Users benefit from Apache 2.0's permissive terms for the base model - The LoRA adapters are applied on top of the base model weights at inference time ## Performance The model achieved the following results on the evaluation set: | Metric | Score | |:-------|:------| | Rouge-1 | 0.918 ± 0.055 | | Rouge-2 | 0.864 ± 0.063 | | Rouge-L | 0.908 ± 0.056 | | Rouge-Lsum | 0.908 ± 0.055 | These scores represent overall performance across 5-fold stratified cross-validation, demonstrating **high accuracy** in extracting structured epidemiological information from unstructured outbreak reports. ### Training Summary - **Best Training Step**: 5,330 (final) - **Final Training Loss**: 0.5623 - **Initial Training Loss**: 1.7273 - **Total Improvement**: 1.1650 - **Total Training Steps**: 5,330 - **Training Epochs**: 10 ## Intended Uses & Limitations ### Intended Uses This model is designed for: - Automated extraction of epidemiological information from disease outbreak reports - Public health surveillance systems requiring structured data from unstructured sources - Epidemic intelligence pipelines for rapid outbreak detection and monitoring - Research purposes in computational epidemiology and public health informatics ### Limitations - The model is trained specifically on [WHO DONs](https://www.who.int/emergencies/disease-outbreak-news) format and may require adaptation for other report formats - Performance on diseases not well-represented in the training data may vary - The model extracts information present in the text and does not generate or infer missing data - Designed for English-language outbreak reports only - Should be used as a decision-support tool, with human verification for critical public health decisions ## Extracted Features The model extracts the following structured epidemiological information: **Disease Information:** - DiseaseLevel1 (primary disease classification) - DiseaseLevel2 (disease subtype/variant) **Geographical Information:** - Country - ISO country code - OutbreakEpicenter (specific location within country) **Case Counts:** - CasesTotal - CasesSuspected - CasesProbable - CasesConfirmed - Deaths **Temporal Information:** - Outbreak start date (year, month, day) - Outbreak detection date (year, month, day) - Outbreak verification date (year, month, day) - Outbreak end date and status ## Training Procedure ### Training Data The model was trained on the WHO Disease Outbreak News curated database [(Carlson et al., 2023)](https://doi.org/10.1371/journal.pgph.0001083), which contains: - **3,338 structured records** of disease outbreaks (data through 2019) - Curated epidemiological information manually extracted from WHO DONs reports - Standardized format for disease classifications, geographical locations, case counts, and temporal data ### Training Approach The training followed an instruction-tuning paradigm where unstructured outbreak report text is paired with structured JSON output containing extracted epidemiological features. The prompt format used was: ``` Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Extract disease outbreak information from the given text and format it as JSON. Return a list containing one JSON object per outbreak mentioned. Use "None" for missing information. Never invent or guess data. ### Input: [Outbreak report text] ### Response: [Extracted JSON with epidemiological features] ``` ### Fine-tuning Configuration **LoRA (Low-Rank Adaptation) Parameters:** - Rank (r): 16 - Alpha (α): 16 - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj - Dropout: 0.05 - Task type: CAUSAL_LM **Training Hyperparameters:** - Learning rate: 1e-5 (with linear decay) - Optimizer: AdamW (8-bit paged) - Training batch size: 4 per device (8 GPUs) - Gradient accumulation steps: 4 - Number of epochs: 10 - Warmup steps: Adaptive (10% of training steps, max 10) - FP16 mixed precision training - Weight decay: 0.01 - LR scheduler: Linear - Seed: 41 **Evaluation Strategy:** - 5-fold stratified cross-validation - Evaluation metric: Training loss (model selection based on lowest training loss) - Early stopping: After 6 consecutive evaluations without improvement - Logging steps: 10 - Save steps: Adaptive (10% of training steps) **Hardware:** - Infrastructure: [JRC Big Data Analytics Platform](https://jeodpp.jrc.ec.europa.eu/bdap/) - System: Linux cluster, Ubuntu 22.04.5 LTS - CPU: Intel Xeon Platinum 8470 (208 CPUs) - RAM: 1TB - GPUs: 8x NVIDIA H100 - Training time: ~20-22 hours per fold ### Quantization The model uses 8-bit quantization with LoRA during training: - Load in 8-bit: True - Quantization type: Standard 8-bit - Compute dtype: bfloat16 ## Usage ### Installation ```bash pip install transformers==4.52.4 pip install torch==2.3.1 pip install peft==0.12.0 pip install accelerate==1.7.0 pip install bitsandbytes==0.43.3 ``` ### Basic Usage **Note**: You need access to the base Qwen 2.5 7B model (available under Apache 2.0) to use these adapter weights. ```python import torch from transformers import AutoTokenizer, AutoModelForCausalLM from peft import PeftModel # Load base model (Apache 2.0 licensed) base_model_id = "Qwen/Qwen2.5-7B-Instruct" adapter_model_id = "jrc-ai/EpiQwen2.5-7B" # LoRA adapters device = "cuda" if torch.cuda.is_available() else "cpu" # Load tokenizer from base model tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True) # Load base model base_model = AutoModelForCausalLM.from_pretrained( base_model_id, device_map="auto", torch_dtype=torch.bfloat16, ) # Load and apply LoRA adapters model = PeftModel.from_pretrained(base_model, adapter_model_id) # Example outbreak report outbreak_text = """ WHO has reported 3 suspected cases of yellow fever in Maryland county, in the south-eastern part of the country. One case with disease onset on 1 August has been confirmed (IgM positive) by the Institut Pasteur in Abidjan, Côte d'Ivoire. All three cases have died. """ # Format prompt prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Extract disease outbreak information from the given text and format it as JSON. Return a list containing one JSON object per outbreak mentioned. Always return a list of JSON objects, even for single outbreaks. Use "None" for missing information. If no outbreak information is found, return an empty list []. Never invent or guess data. ### Input: {outbreak_text} ### Response: """ # Tokenize and generate inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device) with torch.no_grad(): outputs = model.generate( **inputs, max_new_tokens=600, temperature=0.1, do_sample=False, pad_token_id=tokenizer.eos_token_id ) # Decode output extracted_info = tokenizer.decode(outputs[0], skip_special_tokens=True) print(extracted_info) ``` ### Expected Output Format ```json [{ "DiseaseLevel1": "Yellow fever", "DiseaseLevel2": "", "Country": "Liberia", "ISO": "LBR", "OutbreakEpicenter": "Maryland county", "CasesTotal": 3, "CasesSuspected": 2, "CasesProbable": null, "CasesConfirmed": 1, "Deaths": 3, "OutbreakStartYear": 2001, "OutbreakStartMonth": 8, "OutbreakStartDay": 1, "OutbreakDetectionYear": null, "OutbreakDetectionMonth": null, "OutbreakDetectionDay": null, "OutbreakVerificationYear": null, "OutbreakVerificationMonth": null, "OutbreakVerificationDay": null, "OutbreakEnd": null, "OutbreakEndYear": null, "OutbreakEndMonth": null, "OutbreakEndDay": null }] ``` ## Comparison with Other Approaches ### In-Context Learning vs Fine-Tuning This fine-tuned model significantly outperforms in-context learning (iCL) approaches: | Approach | Rouge-1 | Rouge-2 | Rouge-L | Rouge-Lsum | |:---------|:--------|:--------|:--------|:-----------| | **EpiQwen 2.5-7B (fine-tuned)** | **0.918** | **0.864** | **0.908** | **0.908** | | Qwen 2.5-7B (16-shot iCL) | 0.819 | 0.682 | 0.801 | 0.819 | | LLaMA 3.3-70B (16-shot iCL) | 0.840 | 0.698 | 0.824 | 0.841 | **Performance gain from fine-tuning: ~10 percentage points** across all ROUGE metrics. ### Comparison with Other Fine-tuned Models | Model | Parameters | Rouge-1 | Rouge-2 | Rouge-L | |:------|:-----------|:--------|:--------|:--------| | EpiLLaMA 3.3-70B | 70B | **0.937** | **0.896** | **0.928** | | **EpiQwen 2.5-7B** | **7B** | **0.918** | **0.864** | **0.908** | | EpiMistral-7B | 7B | 0.899 | 0.853 | 0.889 | All pairwise comparisons are statistically significant (p < 0.001, Nemenyi post-hoc test with Bonferroni correction). **Key Advantage:** EpiQwen 2.5-7B achieves performance within 0.019-0.032 points of the much larger EpiLLaMA 3.3-70B (70B parameters) while being 10× smaller, making it ideal for resource-constrained deployment scenarios. ## Citation If you use this model in your research, please cite: ```bibtex @article{consoli2026generative, title={Generative AI for Structured Epidemiological Information Extraction: Comparing In-Context Learning and Fine-Tuning Approaches}, author={Consoli, Sergio and Bertolini, Lorenzo and Stefanovitch, Nicolas and Spagnolo, Luigi and Espinosa, Laura and Stilianakis, Nikolaos I.}, journal={PLoS Digital Health}, volume={submitted, currently under revision}, year={2026} } ``` Please also acknowledge the base model: ```bibtex @article{qwen2.5, title={Qwen2.5: A Party of Foundation Models}, author={Qwen Team}, year={2024}, url={https://qwenlm.github.io/blog/qwen2.5/} } ``` ## Ethical Considerations & Dual-Use Implications Upon evaluation, we identified no dual-use implications for this model. The model is designed specifically for public health surveillance and epidemic intelligence applications to support global health initiatives. **Important Notes:** - The model should be used as a decision-support tool with appropriate human oversight - Extracted information should be verified by public health professionals before making critical decisions - The model does not replace human expertise in epidemiological analysis - Privacy and data protection regulations should be followed when processing outbreak reports ## Acknowledgments We acknowledge: - **Alibaba Cloud** for developing and releasing Qwen 2.5 7B under the Apache License 2.0 - The GPT@JRC initiative for providing access to LLMs - The JRC Big Data Analytics Platform for computational infrastructure - The [WHO Epidemic Intelligence from Open Sources (EIOS)](https://www.who.int/initiatives/eios) initiative for support - Colleagues at the European Commission Joint Research Centre (JRC) and the European Centre for Disease Prevention and Control (ECDC) ## Framework Versions - Transformers: 4.52.4 - PyTorch: 2.3.1 - PEFT: 0.12.0 - Accelerate: 1.7.0 - BitsAndBytes: 0.43.3 - Datasets: 2.20.0 --- **Disclaimer**: The views expressed are purely those of the authors and may not in any circumstance be regarded as stating an official position of the European Commission.