---
license: mit
language:
- en
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
tags:
- documents
- distilbert
- document-classifier
---
# DistilBERT-document-classifier

## Summary

This model is a fine-tuned version of `distilbert/distilbert-base-uncased`, trained on a dataset of approximately 8,000 synthetic, randomized document samples.
Randomization was introduced through token shuffling and the Python `Faker` library.

## Dataset

The dataset was generated using LangChain's wrapper around GPT-4o-mini, with additional randomization performed by GPT-4.5. The goal was a dataset that is 90% clean, with the remaining 10% of samples intentionally carrying OCR-like noise and artifacts. These imperfections are characterized by:

- Excessive spacing between words (e.g., three or more spaces instead of one),
- Erratic line breaks,
- Common OCR misreads (e.g., the number "1" in place of a capital "I", or "3" in place of "E").
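
The three kinds of noise listed above can be approximated with a few lines of Python. The following is a minimal sketch only, not the pipeline actually used to build the dataset (that noise was produced with the LLMs mentioned above). The function names `add_ocr_noise` and `maybe_corrupt`, the `OCR_MISREADS` table, and all probabilities are hypothetical and chosen for illustration.

```python
import random

# Hypothetical misread table built from the two examples above;
# the real dataset may contain other substitutions.
OCR_MISREADS = {"I": "1", "E": "3"}


def add_ocr_noise(text, rng, char_p=0.3, space_p=0.2, break_p=0.1):
    """Return a noisy copy of `text` (illustrative probabilities)."""
    # 1. Replace some characters with common OCR misreads.
    chars = [
        OCR_MISREADS[c] if c in OCR_MISREADS and rng.random() < char_p else c
        for c in text
    ]
    words = "".join(chars).split(" ")

    # 2. Rejoin words with a single space, a run of 3-6 spaces,
    #    or an erratic line break.
    out = [words[0]]
    for word in words[1:]:
        roll = rng.random()
        if roll < break_p:
            sep = "\n"
        elif roll < break_p + space_p:
            sep = " " * rng.randint(3, 6)
        else:
            sep = " "
        out.append(sep + word)
    return "".join(out)


def maybe_corrupt(text, rng, noisy_fraction=0.10):
    """Corrupt roughly `noisy_fraction` of samples and leave the rest clean."""
    return add_ocr_noise(text, rng) if rng.random() < noisy_fraction else text
```

Passing a seeded `random.Random` instance as `rng` keeps the corruption reproducible, which makes it easier to check how the classifier behaves on clean versus noisy copies of the same document.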

## Supported Document Types

This model is designed to work exclusively with the following document types:

- Invoices  
- UK Driving Licenses  
- US Driving Licenses  
- Contracts  
- Passports (all nationalities)

## Supported Languages

Currently, the model supports documents written in English only.

## Prediction Output

This model's prediction is a numeric label. To convert it to its string equivalent, use the following mapping in your code. Note that UK and US driving licenses share the single `driving_license` label:
```python
id2label = {
    0: 'invoice',
    1: 'driving_license',
    2: 'contract',
    3: 'passport',
}
```
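
As a rough illustration of how the mapping might be applied, the sketch below takes the class scores for one document, picks the highest one, and looks up its label. The helper names `decode_prediction` and `classify` are hypothetical, and `model_id` is a placeholder for this model's Hub repository id or a local path, which this card does not state. The `classify` helper assumes `torch` and `transformers` are installed and that the checkpoint loads with `AutoModelForSequenceClassification`.

```python
ID2LABEL = {0: "invoice", 1: "driving_license", 2: "contract", 3: "passport"}


def decode_prediction(logits):
    """Map one row of class scores to its string label via argmax."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best]


def classify(text, model_id):
    """Classify one document. `model_id` is a placeholder supplied by the caller."""
    # Imported lazily so decode_prediction works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Truncate long documents to the model's maximum input length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return decode_prediction(logits.tolist())
```

Keeping the lookup in `decode_prediction` separate from model loading means the mapping can be reused with any inference backend that returns one score per class.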