---
base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
tags:
- llama-3
- ner
- bionlp
- text-generation
- unsloth
- qlora
license: apache-2.0
language:
- en
datasets:
- tner/bionlp2004
---

# LLaMA-3-8B Fine-tuned for BioNLP Named Entity Recognition

This is a fine-tuned version of `meta-llama/Meta-Llama-3-8B-Instruct` specifically adapted for Named Entity Recognition (NER) in the biomedical domain. The model was trained using parameter-efficient fine-tuning (PEFT) with QLoRA on the `tner/bionlp2004` dataset. The entire training process was accelerated and memory-optimized using **Unsloth**.

# Model Description

This model takes a medical or biological text as input and identifies and extracts the following five entity types:

* `DNA`
* `RNA`
* `protein`
* `cell_type`
* `cell_line`

The output is a clean, machine-readable Python list of tuples.

## Intended Use

This model is intended for researchers, bioinformaticians, and developers working on applications that require the parsing of biomedical literature. It can be used as a foundation for information extraction systems, knowledge graph population, and data analysis pipelines.

**⚠️ Disclaimer:** This model is a research tool and should **not** be used for clinical diagnosis or any real-world medical decision-making.

## How to Use

This model was trained with Unsloth, and using it for inference is highly recommended for optimal performance.

First, install the necessary libraries:

```bash
pip install "unsloth[kaggle-torch] @ git+https://github.com/unslothai/unsloth.git"
pip install "trl>=0.8.6" "peft>=0.10.0" "accelerate>=0.28.0"
```

Next, use the following Python code to run inference:

```python
from unsloth import FastLanguageModel
from transformers import pipeline
import torch

# Load the fine-tuned model from the Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Arnic/llama-3-8b-bionlp-ner",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Configure the model for inference
FastLanguageModel.for_inference(model)

# The Alpaca prompt template used during training
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# The instruction for the NER task
instruction = "You are an expert in medical text analysis. Your task is to identify and extract specific biological entities from the given text. The entity types to extract are: DNA, RNA, protein, cell_type, and cell_line."

# Your input text
input_text = "Interactions between the N-terminal domains of p53 and the human papillomavirus E6 protein."

# Format the prompt (the response slot is left empty for the model to fill)
prompt = alpaca_prompt.format(instruction, input_text, "")

# Use the text-generation pipeline
fast_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define terminators to stop generation cleanly
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Get the model's response (greedy decoding)
outputs = fast_pipe(
    prompt,
    max_new_tokens=128,
    do_sample=False,
    eos_token_id=terminators,
)

# The generated text includes the prompt, so keep only what follows "### Response:"
print(outputs[0]['generated_text'].split("### Response:")[1].strip())
# Expected output: [('protein', 'p53'), ('protein', 'human papillomavirus E6 protein')]
```

# Evaluation

This model has not been formally evaluated on a held-out test set, so no quantitative metrics are reported. Qualitative analysis on examples from the `bionlp2004` test set shows a strong ability to correctly identify and format the target entities. For a formal evaluation, one could run predictions on the test set and score them with a library like `seqeval`.

# Limitations and Bias

- Domain Specificity: The model is highly specialized for the `bionlp2004` dataset. Its performance may degrade on biomedical texts from different sub-domains (e.g., clinical patient notes).
- Limited Entity Scope: The model can only identify the five entity types it was trained on. It will not recognize other common medical entities like "Disease" or "Symptom."
- Hallucination: Like all LLMs, this model can make mistakes or hallucinate entities, especially on ambiguous or out-of-domain text. All outputs should be validated by a human expert if used in a critical workflow.

# Uploaded model

- **Developed by:** Arnic
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3-8b-Instruct-bnb-4bit

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
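
# Parsing the Output

Because the response is text that looks like a Python list of `(entity_type, entity_text)` tuples, downstream code needs to turn it back into Python objects. The following is a minimal sketch of one way to do that with the standard library's `ast.literal_eval`, which evaluates literals only and does not execute code. It assumes the response follows the format shown in the expected output under "How to Use"; `parse_entities` and `ALLOWED_TYPES` are hypothetical names introduced here for illustration and are not part of the model, Unsloth, or Transformers.

```python
import ast

# The five entity types listed in the model description.
ALLOWED_TYPES = {"DNA", "RNA", "protein", "cell_type", "cell_line"}


def parse_entities(response: str) -> list[tuple[str, str]]:
    """Parse a model response into (entity_type, entity_text) tuples.

    Returns an empty list when the response is not a valid Python literal,
    and drops items that are malformed or use an unknown entity type.
    """
    try:
        parsed = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return []

    if not isinstance(parsed, (list, tuple)):
        return []

    entities = []
    for item in parsed:
        if (
            isinstance(item, (list, tuple))
            and len(item) == 2
            and all(isinstance(part, str) for part in item)
            and item[0] in ALLOWED_TYPES
        ):
            entities.append((item[0], item[1]))
    return entities


example = "[('protein', 'p53'), ('protein', 'human papillomavirus E6 protein')]"
print(parse_entities(example))
```

Filtering against the known entity types is a design choice in this sketch: it discards labels the model was not trained to produce, which is one cheap guard against the hallucination risk noted under "Limitations and Bias". It does not check that the extracted text actually occurs in the input.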