BroLaurens
/

finer-distilbert

Token Classification

BroLaurens committed on Apr 1, 2024

Commit

f867e3f

·

verified ·

1 Parent(s): 9e7248c

Update readme

Files changed (1)

README.md +62 -3

@@ -1,3 +1,62 @@
----
-license: apache-2.0
----

+# finer-distilbert
+## Model description
+**finer-distilbert** is a DistilBERT model fine-tuned for **Named Entity Recognition**. It is a proof-of-concept model trained to recognize the top 4 entity types in the nlpaueb/finer-139 dataset. Due to limited time, no hyperparameter tuning was performed. The model's output follows the **IOB2** annotation scheme of the original training dataset. The label ids are as follows:
+```
+0: O
+1: B-DebtInstrumentBasisSpreadOnVariableRate1
+2: B-DebtInstrumentFaceAmount
+3: I-DebtInstrumentFaceAmount
+4: I-LineOfCreditFacilityMaximumBorrowingCapacity
+5: B-DebtInstrumentInterestRateStatedPercentage
+6: I-DebtInstrumentBasisSpreadOnVariableRate1
+7: I-DebtInstrumentInterestRateStatedPercentage
+8: B-LineOfCreditFacilityMaximumBorrowingCapacity
+```
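As a quick sanity check on the list above, the following minimal sketch derives the entity types from the label ids and confirms that each one has both a `B-` and an `I-` tag, as IOB2 expects. The helper name `entity_types` is hypothetical and not part of the model or of any library.

```python
# Label ids copied from the list above.
id2label = {
    0: "O",
    1: "B-DebtInstrumentBasisSpreadOnVariableRate1",
    2: "B-DebtInstrumentFaceAmount",
    3: "I-DebtInstrumentFaceAmount",
    4: "I-LineOfCreditFacilityMaximumBorrowingCapacity",
    5: "B-DebtInstrumentInterestRateStatedPercentage",
    6: "I-DebtInstrumentBasisSpreadOnVariableRate1",
    7: "I-DebtInstrumentInterestRateStatedPercentage",
    8: "B-LineOfCreditFacilityMaximumBorrowingCapacity",
}


def entity_types(id2label):
    """Return the sorted entity types that have both a B- and an I- tag."""
    labels = id2label.values()
    begins = {label[2:] for label in labels if label.startswith("B-")}
    insides = {label[2:] for label in labels if label.startswith("I-")}
    return sorted(begins & insides)


types = entity_types(id2label)
print(len(types), types)
```

Note that the ids are not grouped by entity type (for example, ids 2 and 3 belong to the same type, while ids 1 and 6 do), so always go through the mapping rather than assuming an ordering.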
+## Running the model
+A basic example of how to run the model and obtain the predicted label for each token of each text:
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+# Preparing labels for reference
+int2str = {
+  0: 'O',
+  1: 'B-DebtInstrumentBasisSpreadOnVariableRate1',
+  2: 'B-DebtInstrumentFaceAmount',
+  3: 'I-DebtInstrumentFaceAmount',
+  4: 'I-LineOfCreditFacilityMaximumBorrowingCapacity',
+  5: 'B-DebtInstrumentInterestRateStatedPercentage',
+  6: 'I-DebtInstrumentBasisSpreadOnVariableRate1',
+  7: 'I-DebtInstrumentInterestRateStatedPercentage',
+  8: 'B-LineOfCreditFacilityMaximumBorrowingCapacity',
+}
+str2int = {v: k for k, v in int2str.items()}
+# Load model dependencies
+model = AutoModelForTokenClassification.from_pretrained(
+    "brolaurens/finer-distilbert", num_labels=len(int2str), id2label=int2str, label2id=str2int
+)
+tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", model_max_length=512)
+# Text
+texts = [
+  "Of the amount drawn, $ 3,721,583 was used to pay the principal amount of $ 3,700,000 and accrued interest of $ 21,583 due under the Company 's Loan Agreement with Capital Preservation Solutions, LLC entered into on September 4, 2015."
+]
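+# Tokenize input (padding and truncation keep a batch of several texts rectangular and within 512 tokens)
+model_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+# Obtain model output: logits have shape (batch, sequence_length, num_labels)
+predictions = model(**model_input).logits
+# Pick the highest-scoring label id for every token
+predictions = predictions.argmax(dim=-1)
+# Map ids to label strings; the result also has entries for special and padding tokens
+predicted_labels = [[int2str[x] for x in t] for t in predictions.tolist()]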
+```
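The per-token labels produced above are often easier to use once consecutive IOB2 tags are grouped into entity spans. The following is a minimal sketch of that grouping, not part of the model itself. The function name `iob2_to_spans` is hypothetical, and the example tokens and labels are hand-written for illustration rather than actual model output.

```python
def iob2_to_spans(labels):
    """Group IOB2 tags into (entity_type, start, end) spans; `end` is exclusive.

    Lenient choice: an I- tag that does not continue an open span of the
    same type starts a new span instead of being discarded.
    """
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and ent_type == current_type:
            continue  # still inside the open span
        if current_type is not None:
            # Any other tag closes the span that was open.
            spans.append((current_type, start, i))
            current_type, start = None, None
        if prefix in ("B", "I"):
            current_type, start = ent_type, i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Illustrative, hand-written tokens and labels (not real predictions).
tokens = ["principal", "amount", "of", "$", "3", ",", "700", ",", "000"]
labels = ["O", "O", "O", "O"] + ["B-DebtInstrumentFaceAmount"] + ["I-DebtInstrumentFaceAmount"] * 4

for ent_type, start, end in iob2_to_spans(labels):
    print(ent_type, tokens[start:end])
```

With real model output, the span indices refer to tokenizer positions, so special tokens and subword pieces still need to be mapped back to the original text.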