---
license: mit
language:
- en
metrics:
- f1
base_model:
- google/flan-t5-large
pipeline_tag: text2text-generation
tags:
- medication
- ner
- drug
- seq2seq
- twitter
---

# Flan-T5-Large Fine-Tuned for Medication Mention Extraction

## Model Description

This is a fine-tuned version of the [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) model for automatically extracting medication mentions from social media text, specifically tweets. The model reformulates named entity recognition (NER) as a sequence-to-sequence generation task: given a tweet, it directly generates a comma-separated list of the medications mentioned in it.

## Training Data

The model was fine-tuned on the publicly available datasets from:

* [BioCreative VII Shared Task 3](https://academic.oup.com/database/article/doi/10.1093/database/baac108/7025388)
* [\#SMM4H 2018 Task 1](https://healthlanguageprocessing.org/smm4h18/social-media-mining-for-health-applications-smm4h-workshop-shared-task/) (with span-level annotations)

In total, the training set included 98,610 tweets, approximately 5% of which contain medication mentions.

## Intended Use

* Extraction of medication mentions from social media data (primarily Twitter).
* Suitable for applications in digital epidemiology, pharmacovigilance, and large-scale health-related analysis of social media data.

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("guilopgar/flan-t5-large-medication-ner")
model = AutoModelForSeq2SeqLM.from_pretrained("guilopgar/flan-t5-large-medication-ner")

# Instruction prompt followed by the tweet and the question
input_text = ("You are given a tweet followed by a specific question asking about the content of the tweet. "
              "Your objective is to identify and list any drug names, medications, or dietary supplements mentioned "
              "in the tweet. If one or more are mentioned, list each distinctly, separated by a comma. "
              "If none are mentioned, return an empty list [].\n\n"
              "Input: Tweet: Benadryl, bedtime snack, and New Girl. The party is getting real.\n"
              "Question: What are the drugs, medications or dietary supplements mentioned in the tweet?\n"
              "Output:")

# Tokenize the prompt, generate the answer, and decode it back to text
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model, please cite our work:

```bibtex
@article{Lopez-Garcia2025.05.16.25327791,
  author = {Lopez-Garcia, Guillermo and Xu, Dongfang and Gonzalez-Hernandez, Graciela},
  title = {Detecting Medication Mentions in Social Media Data Using Large Language Models},
  year = {2025},
  doi = {10.1101/2025.05.16.25327791},
  publisher = {Cold Spring Harbor Laboratory Press},
  URL = {https://www.medrxiv.org/content/early/2025/05/18/2025.05.16.25327791},
  journal = {medRxiv}
}
```