---
base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
tags:
- llama-3
- ner
- bionlp
- text-generation
- unsloth
- qlora
license: apache-2.0
language:
- en
datasets:
- tner/bionlp2004
---

# LLaMA-3-8B Fine-tuned for BioNLP Named Entity Recognition

This is a fine-tuned version of `meta-llama/Meta-Llama-3-8B-Instruct`, loaded from the 4-bit checkpoint `unsloth/llama-3-8b-Instruct-bnb-4bit` and adapted for Named Entity Recognition (NER) in the biomedical domain.

The model was trained using parameter-efficient fine-tuning (PEFT) with QLoRA on the `tner/bionlp2004` dataset. The entire training process was accelerated and memory-optimized using **Unsloth**.

## Model Description

Given a medical or biological text, the model extracts entities of the following five types:
* `DNA`
* `RNA`
* `protein`
* `cell_type`
* `cell_line`

The output is a machine-readable Python list of `(entity_type, entity_text)` tuples, for example `[('protein', 'p53')]`.
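
Because the model returns its answer as text, downstream code has to turn that text back into Python objects. The following is a minimal sketch of one way to do this with the standard library's `ast.literal_eval`. The helper name `parse_entities` is hypothetical and not part of this model or of Unsloth, and the sketch assumes the `(entity_type, entity_text)` ordering shown in the usage example of this card.

```python
import ast


def parse_entities(raw_output: str) -> list[tuple[str, str]]:
    """Parse the model's text output into (entity_type, entity_text) tuples.

    Hypothetical helper: returns an empty list when the output is not a
    well-formed Python list literal, and skips malformed items.
    """
    try:
        # literal_eval only evaluates literals, so it does not execute code.
        parsed = ast.literal_eval(raw_output.strip())
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []

    entities = []
    for item in parsed:
        is_pair = isinstance(item, tuple) and len(item) == 2
        if is_pair and all(isinstance(part, str) for part in item):
            entities.append(item)
    return entities


example = "[('protein', 'p53'), ('protein', 'human papillomavirus E6 protein')]"
print(parse_entities(example))
```

Using `ast.literal_eval` rather than `eval` is a deliberate choice here: generated text should be treated as untrusted input.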

## Intended Use

This model is intended for researchers, bioinformaticians, and developers working on applications that require the parsing of biomedical literature. It can be used as a foundation for information extraction systems, knowledge graph population, and data analysis pipelines.

**⚠️ Disclaimer:** This model is a research tool and should **not** be used for clinical diagnosis or any real-world medical decision-making.

## How to Use

This model was trained with Unsloth, and running inference through Unsloth as well is recommended for best performance.

First, install the necessary libraries:
```bash
pip install "unsloth[kaggle-torch] @ git+https://github.com/unslothai/unsloth.git"
pip install "trl>=0.8.6" "peft>=0.10.0" "accelerate>=0.28.0"
```

Next, use the following Python code to run inference:

```python
from unsloth import FastLanguageModel
from transformers import pipeline
import torch

# Load the fine-tuned model from the Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Arnic/llama-3-8b-bionlp-ner", 
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Configure the model for inference
FastLanguageModel.for_inference(model)

# The Alpaca prompt template used during training
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# The instruction for the NER task
instruction = "You are an expert in medical text analysis. Your task is to identify and extract specific biological entities from the given text. The entity types to extract are: DNA, RNA, protein, cell_type, and cell_line."

# Your input text
input_text = "Interactions between the N-terminal domains of p53 and the human papillomavirus E6 protein."

# Format the prompt
prompt = alpaca_prompt.format(instruction, input_text, "")

# Use the text-generation pipeline
fast_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define terminators to stop generation cleanly
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Get the model's response
outputs = fast_pipe(
    prompt,
    max_new_tokens=128,
    do_sample=False,
    eos_token_id=terminators,
)

# Print the clean response
print(outputs[0]['generated_text'].split("### Response:")[1].strip())
# Expected output: [('protein', 'p53'), ('protein', 'human papillomavirus E6 protein')]

```

## Evaluation
This model has not been formally evaluated: no quantitative metrics have been computed on a held-out test set. Qualitative analysis of examples from the `tner/bionlp2004` test set shows a strong ability to correctly identify and format the target entities.

For a formal evaluation, one could run predictions on the test set and score them with a library such as `seqeval`.
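
As a rough illustration of what such a scoring step might look like, the sketch below computes micro-averaged precision, recall, and F1 directly over `(entity_type, entity_text)` pairs. This is a simplified stand-in, not `seqeval` itself: `seqeval` scores BIO tag sequences by span position, whereas this hypothetical `entity_prf` function matches entities by type and surface string only. The example data is made up for illustration.

```python
from collections import Counter


def entity_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over (type, text) pairs.

    `gold` and `pred` are lists with one entry per sentence; each entry is a
    list of (entity_type, entity_text) tuples. Hypothetical helper that
    matches on exact type and exact string, not on token positions.
    """
    true_pos = n_pred = n_gold = 0
    for gold_ents, pred_ents in zip(gold, pred):
        gold_count, pred_count = Counter(gold_ents), Counter(pred_ents)
        # Multiset intersection counts each matched entity once.
        true_pos += sum((gold_count & pred_count).values())
        n_pred += sum(pred_count.values())
        n_gold += sum(gold_count.values())

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative (made-up) gold and predicted entities for two sentences.
gold = [[("protein", "p53"), ("protein", "E6")], [("DNA", "x")]]
pred = [[("protein", "p53")], [("DNA", "x"), ("RNA", "y")]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

String-level matching is easier to apply to generative output, but it cannot tell apart two mentions of the same string at different positions, so the resulting numbers are not directly comparable to span-based scores reported on this dataset.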

## Limitations and Bias
- **Domain Specificity:** The model is highly specialized for the `tner/bionlp2004` dataset. Its performance may degrade on biomedical texts from different sub-domains (e.g., clinical patient notes).

- **Limited Entity Scope:** The model can only identify the five entity types it was trained on. It will not recognize other common medical entities like "Disease" or "Symptom."

- **Hallucination:** Like all LLMs, this model can make mistakes or hallucinate entities, especially on ambiguous or out-of-domain text. All outputs should be validated by a human expert if used in a critical workflow.
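
Some of these failure modes can be caught automatically before human review. The sketch below is a hypothetical post-processing filter (`filter_grounded` is not part of this model or of Unsloth) that rejects any extracted entity whose type is outside the five trained types or whose text does not appear verbatim in the input. It is a cheap sanity check under those assumptions, not a replacement for expert validation.

```python
ENTITY_TYPES = {"DNA", "RNA", "protein", "cell_type", "cell_line"}


def filter_grounded(entities, source_text):
    """Split entities into (kept, rejected) lists.

    An entity is kept only if its type is one of the five trained types and
    its text occurs verbatim (case-sensitive) in the source text.
    """
    kept, rejected = [], []
    for entity_type, entity_text in entities:
        grounded = bool(entity_text) and entity_text in source_text
        if entity_type in ENTITY_TYPES and grounded:
            kept.append((entity_type, entity_text))
        else:
            rejected.append((entity_type, entity_text))
    return kept, rejected


text = "Interactions between the N-terminal domains of p53 and the human papillomavirus E6 protein."
candidates = [
    ("protein", "p53"),
    ("protein", "BRCA1"),   # not present in the text
    ("Disease", "cancer"),  # type outside the trained set
]
kept, rejected = filter_grounded(candidates, text)
print("kept:", kept)
print("rejected:", rejected)
```

A verbatim substring check will also reject correct entities that the model has normalized or re-cased, so rejected items are better routed to review than silently dropped.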


## Uploaded model

- **Developed by:** Arnic
- **License:** apache-2.0
- **Fine-tuned from model:** unsloth/llama-3-8b-Instruct-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)