Data Files

This folder contains the JSONL files used by the Wolof LoRA demo.

Training Format

Each training row must be a single JSON object on its own line:

{"instruction": "question or task", "input": "education", "output": "Wolof answer"}

Required fields:

  • instruction: user question or task.
  • input: category or context (see the supported categories below).
  • output: expected Wolof answer.

Supported categories:

  • education
  • agriculture
  • sante (French for "health")
  • transport
  • culture
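The format rules above can be sketched as a small standalone check. This is only an illustration, not the repository's validator (that is `load_instruction_examples` in `src.data_utils`, whose exact checks are not described here). The function name `check_training_row` is hypothetical, and the sketch assumes that `input` must be one of the supported categories, which this README implies but does not state outright.

```python
import json
from typing import List

REQUIRED_FIELDS = ("instruction", "input", "output")
SUPPORTED_CATEGORIES = {"education", "agriculture", "sante", "transport", "culture"}


def check_training_row(line: str) -> List[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    # Every required field must be present and be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # Assumption: "input" is restricted to the supported categories.
    category = row.get("input")
    if isinstance(category, str) and category.strip():
        if category not in SUPPORTED_CATEGORIES:
            problems.append(f"unsupported category: {category}")
    return problems


example = '{"instruction": "question or task", "input": "education", "output": "Wolof answer"}'
print(check_training_row(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a generated file in a single pass.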

Main Files

  • wolof_instruction_data.jsonl: main training dataset.
  • wolof_instruction_sample.jsonl: small sample for pipeline tests.
  • wolof_eval_examples.jsonl: evaluation set with references and predictions.

Additional generated files may appear here, for example:

  • wolof_culture_salutations_1000.jsonl
  • 1000_wol_instruct_data.jsonl
  • 273_wol_instruct_data.jsonl

Validate a training file from the repository root; the command prints the number of examples loaded:

python -c 'from src.data_utils import load_instruction_examples; print(len(load_instruction_examples("data/wolof_instruction_data.jsonl")))'

Evaluation Format

Each evaluation row contains:

{"instruction": "question", "input": "education", "reference": "expected answer", "prediction": "model answer"}

Run evaluation:

python evaluation.py --data data/wolof_eval_examples.jsonl

Generate predictions with the adapter before evaluating:

python evaluation.py --data data/wolof_eval_examples.jsonl --generate
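To decide whether `--generate` is needed, it can help to see which evaluation rows still lack a prediction. The sketch below is a rough, standalone check; the function name `rows_missing_prediction` is hypothetical, and it assumes that an absent or empty `prediction` field means the row has not been generated yet. How `evaluation.py` itself treats such rows is not documented here.

```python
import json
from typing import Iterable, List


def rows_missing_prediction(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers of evaluation rows with no usable prediction."""
    missing = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        prediction = row.get("prediction")
        # Assumption: a missing or empty string means "not generated yet".
        if not isinstance(prediction, str) or not prediction.strip():
            missing.append(number)
    return missing


sample = [
    '{"instruction": "question", "input": "education", "reference": "expected answer", "prediction": "model answer"}',
    '{"instruction": "question", "input": "education", "reference": "expected answer", "prediction": ""}',
]
print(rows_missing_prediction(sample))  # prints [2]
```

For a real file, pass an open file handle instead of the `sample` list, for example `open("data/wolof_eval_examples.jsonl", encoding="utf-8")`.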

Appending New Data

After generating new JSONL rows, validate the file first:

python -c 'from src.data_utils import load_instruction_examples; print(len(load_instruction_examples("data/wolof_culture_salutations_1000.jsonl")))'

Then append the new rows to the main training dataset:

cat data/wolof_culture_salutations_1000.jsonl >> data/wolof_instruction_data.jsonl
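One caveat with `cat ... >>`: if the target file does not end with a newline, the first appended row is glued onto the last existing row and both become invalid JSON. A minimal Python alternative is sketched below. The helper name `append_jsonl` is hypothetical and not part of the repository; it only checks that each line parses as JSON, so it complements rather than replaces `load_instruction_examples`.

```python
import json
import os
from pathlib import Path


def append_jsonl(source: Path, target: Path) -> int:
    """Append every non-blank line of source to target; return rows appended."""
    rows = []
    with source.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{source}:{number}: invalid JSON ({exc.msg})") from exc
            rows.append(line)

    # Check whether the existing target ends with a newline.
    needs_newline = False
    if target.exists() and target.stat().st_size > 0:
        with target.open("rb") as handle:
            handle.seek(-1, os.SEEK_END)
            needs_newline = handle.read(1) != b"\n"

    # Nothing is written until every source line has parsed successfully.
    with target.open("a", encoding="utf-8", newline="\n") as handle:
        if needs_newline:
            handle.write("\n")
        for line in rows:
            handle.write(line + "\n")
    return len(rows)


# Example call (paths taken from the commands above):
# append_jsonl(Path("data/wolof_culture_salutations_1000.jsonl"),
#              Path("data/wolof_instruction_data.jsonl"))
```

Because all source lines are parsed before the target is opened for writing, a malformed row raises `ValueError` and leaves the main dataset untouched.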