---
license: mit
language:
- en
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
tags:
- documents
- distilbert
- document-classifier
---

# DistilBERT-document-classifier

## Summary

This model is `distilbert/distilbert-base-uncased` fine-tuned on a dataset of approximately 8,000 synthetic, randomized document samples. Randomization was introduced through token shuffling and the Python `Faker` library.

## Dataset

The dataset was generated using LangChain's wrapper around GPT-4o-mini, with additional randomization performed by GPT-4.5. The goal was a dataset that is 90% clean, with the remaining 10% of samples intentionally carrying OCR-like noise and artifacts. These imperfections include:

- Excessive spacing between words (e.g., three or more spaces instead of one)
- Erratic line breaks
- Common OCR misreads (e.g., the number "1" in place of a capital "I", or "3" in place of "E")

## Supported Document Types

This model is designed to work exclusively with the following document types:

- Invoices
- UK Driving Licenses
- US Driving Licenses
- Contracts
- Passports (all nationalities)

## Supported Languages

Currently, the model supports documents written in English only.

## Prediction Output

The model predicts a numeric label, which you can map to its string equivalent by adding the following mapping to your code:

```python
{
    0: 'invoice',
    1: 'driving_license',
    2: 'contract',
    3: 'passport'
}
```
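
As a minimal sketch of how the mapping above might be applied, the snippet below turns a list of raw class scores into a label string. The helper name `label_from_logits` and the example scores are hypothetical, and the sketch assumes the model emits one score per class in the same order as the mapping (index 0 for `invoice` through index 3 for `passport`).

```python
from typing import Sequence

# Label mapping as listed in this model card.
ID2LABEL = {0: "invoice", 1: "driving_license", 2: "contract", 3: "passport"}


def label_from_logits(logits: Sequence[float]) -> str:
    """Hypothetical helper: return the label name of the highest-scoring class."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(
            f"expected {len(ID2LABEL)} scores, got {len(logits)}"
        )
    # Index of the largest score (argmax) is the predicted numeric label.
    predicted_id = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[predicted_id]


# Made-up scores for illustration only; not real model output.
example_logits = [-1.2, 0.3, 2.7, -0.5]
print(label_from_logits(example_logits))
```

If the model is loaded through the Transformers library, the scores would typically come from the `logits` field of the model output, converted to a plain list for a single document before being passed to a helper like this one.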