# Data Files

This folder contains the JSONL files used by the Wolof LoRA demo.

## Training Format

Each training row must be a single JSON object on its own line:

```json
{"instruction": "question or task", "input": "education", "output": "Wolof answer"}
```

Required fields:

- `instruction`: user question or task.
- `input`: category/context.
- `output`: expected Wolof answer.

Supported categories:

- `education`
- `agriculture`
- `sante`
- `transport`
- `culture`
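
The rules above can be sketched as a small standalone check. This is a minimal illustration, not the project's `load_instruction_examples`: the function name `check_training_row` is hypothetical, and treating `input` as restricted to the supported categories is an assumption based on the list above.

```python
import json

# Field names and categories are taken from this README.
REQUIRED_FIELDS = ("instruction", "input", "output")
SUPPORTED_CATEGORIES = {"education", "agriculture", "sante", "transport", "culture"}


def check_training_row(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # Assumption: `input` must be one of the supported categories.
    category = row.get("input")
    if isinstance(category, str) and category.strip() and category not in SUPPORTED_CATEGORIES:
        problems.append(f"unsupported category: {category}")
    return problems


if __name__ == "__main__":
    good = '{"instruction": "question or task", "input": "education", "output": "Wolof answer"}'
    bad = '{"instruction": "question or task", "input": "sport"}'
    print(check_training_row(good))  # []
    print(check_training_row(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a generated file in a single pass.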

## Main Files

- `wolof_instruction_data.jsonl`: main training dataset.
- `wolof_instruction_sample.jsonl`: small sample for pipeline tests.
- `wolof_eval_examples.jsonl`: evaluation set with references and predictions.

Additional generated files may appear here, for example:

- `wolof_culture_salutations_1000.jsonl`
- `1000_wol_instruct_data.jsonl`
- `273_wol_instruct_data.jsonl`

Validate a training file from the project root. The command prints the number of examples loaded:

```bash
python -c 'from src.data_utils import load_instruction_examples; print(len(load_instruction_examples("data/wolof_instruction_data.jsonl")))'
```

## Evaluation Format

Each evaluation row is a JSON object with the following fields:

```json
{"instruction": "question", "input": "education", "reference": "expected answer", "prediction": "model answer"}
```

Run evaluation:

```bash
python evaluation.py --data data/wolof_eval_examples.jsonl
```

To generate predictions with the adapter before evaluating, add `--generate`:

```bash
python evaluation.py --data data/wolof_eval_examples.jsonl --generate
```
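
Before choosing between the two commands, it can help to know how many rows still lack a prediction. The sketch below counts them. It is independent of `evaluation.py`, the helper name `count_missing_predictions` is hypothetical, and treating an absent or empty `prediction` as "missing" is an assumption.

```python
import json
from pathlib import Path


def count_missing_predictions(path: str) -> tuple[int, int]:
    """Return (total_rows, rows_without_prediction) for an evaluation JSONL file."""
    total = 0
    missing = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            total += 1
            prediction = row.get("prediction")
            # Assumption: an absent or whitespace-only prediction counts as missing.
            if not isinstance(prediction, str) or not prediction.strip():
                missing += 1
    return total, missing


if __name__ == "__main__":
    total, missing = count_missing_predictions("data/wolof_eval_examples.jsonl")
    print(f"{missing} of {total} rows have no prediction")
```

If the count is non-zero, running with `--generate` first is probably the safer choice.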

## Appending New Data

After generating new JSONL rows, validate the file first:

```bash
python -c 'from src.data_utils import load_instruction_examples; print(len(load_instruction_examples("data/wolof_culture_salutations_1000.jsonl")))'
```

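Then append it to the main training dataset. The main file must already end with a newline, otherwise the first appended row is joined to its last row: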

```bash
cat data/wolof_culture_salutations_1000.jsonl >> data/wolof_instruction_data.jsonl
```
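
As an alternative to `cat`, the append can be done in Python so that a missing trailing newline in the main file does not fuse two rows. This is a minimal sketch: `append_jsonl` is a hypothetical helper, not part of `src.data_utils`, and it only checks that each row parses as JSON.

```python
import json
from pathlib import Path


def append_jsonl(source: str, target: str) -> int:
    """Append the non-blank rows of `source` to `target`; return the number appended."""
    rows = []
    with Path(source).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                json.loads(line)  # fail early on a malformed row
            except json.JSONDecodeError as exc:
                raise ValueError(f"{source}:{number}: invalid JSON ({exc.msg})") from exc
            rows.append(line.rstrip("\n"))

    target_path = Path(target)
    # Check whether the existing file ends with a newline.
    needs_newline = False
    if target_path.exists() and target_path.stat().st_size > 0:
        with target_path.open("rb") as handle:
            handle.seek(-1, 2)  # move to the last byte
            needs_newline = handle.read(1) != b"\n"

    with target_path.open("a", encoding="utf-8", newline="\n") as handle:
        if needs_newline:
            handle.write("\n")
        for row in rows:
            handle.write(row + "\n")
    return len(rows)


if __name__ == "__main__":
    added = append_jsonl(
        "data/wolof_culture_salutations_1000.jsonl",
        "data/wolof_instruction_data.jsonl",
    )
    print(f"appended {added} rows")
```

All source rows are read and checked before the target is opened for writing, so a malformed row aborts the operation without leaving a partial append.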