oneryalcin/financial-filings-sparse-encoder-v1

Feature Extraction

sentence-transformers

sparse-retrieval

financial-filings


oneryalcin committed on May 13

Commit e025ace · verified · Parent: 16cbe15

Clarify query and document encoder usage

Files changed (1)

README.md +11 -0


@@ -22,6 +22,8 @@ This is a Sentence Transformers `SparseEncoder` / SPLADE-style model fine-tuned
 The practical recommendation from the experiments below is to index document vectors after **top-128 pruning**. In the current proxy retrieval benchmark, top-128 keeps almost all unpruned quality while reducing each document to about 126 active sparse terms.
+Naming note: the model is not document-only. It is an asymmetric query/document sparse encoder, following the OpenSearch model-family convention where the heavier document-side path is emphasized. Use `encode_query(...)` for online queries and `encode_document(...)` for offline document/chunk indexing.
 ## TL;DR
 | setting | value |
@@ -64,6 +66,15 @@ opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
 Reason: it already has separate query/document sparse encoding behavior and is aligned with OpenSearch neural sparse retrieval. Starting here means fine-tuning adapts a serving-compatible sparse model rather than building a new retrieval stack from scratch.
+The `doc` wording in the base model name does not mean queries are encoded with the document encoder. This model should be used with the routed Sentence Transformers sparse API:
+```python
+query_vectors = model.encode_query(queries)
+document_vectors = model.encode_document(documents)
+```
+In practice, query encoding is the lightweight online side, while document encoding is the heavier offline/indexing side. The top-k pruning recommendation applies to document vectors before indexing.
 ## Dataset
 Dataset:
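
Illustration (not part of the commit): the README text in this diff says that top-k pruning applies to document vectors before indexing. A minimal plain-Python sketch of that idea follows. The function name `prune_top_k`, the `{term: weight}` dictionary representation, and the sample weights are assumptions made for illustration only; the model itself returns sparse tensors, and the author's actual pruning code does not appear in this diff.

```python
import heapq


def prune_top_k(term_weights, k=128):
    """Keep the k highest-weight terms of a sparse vector.

    `term_weights` is a hypothetical {term: weight} mapping standing in
    for one document's sparse vector.
    """
    if k <= 0:
        return {}
    # Terms with zero (or negative) weight are treated as inactive.
    active = {term: w for term, w in term_weights.items() if w > 0}
    if len(active) <= k:
        # Fewer than k active terms: nothing to prune.
        return active
    # Select the k largest weights without sorting the whole vector.
    top = heapq.nlargest(k, active.items(), key=lambda item: item[1])
    return dict(top)


# Made-up weights for a single document chunk.
doc_vector = {
    "revenue": 2.1,
    "fiscal": 1.4,
    "the": 0.05,
    "impairment": 1.9,
    "of": 0.0,
}
pruned = prune_top_k(doc_vector, k=3)
```

In this sketch only document vectors would pass through `prune_top_k` before indexing; query vectors are left untouched. A document with fewer than `k` active terms passes through unchanged, which is one way an average below the cap (such as the roughly 126 terms reported for top-128) could arise.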