Next Steps: Production-Readiness Roadmap
This file captures follow-up experiments for oneryalcin/financial-filings-sparse-encoder-v1 so the work can resume from the Hub repo without relying on chat history.
Current status
The v1 model is a useful open experiment and a candidate learned-sparse baseline for financial filing retrieval.
Current recommendation:
- Use `encode_query(...)` for online queries.
- Use `encode_document(...)` offline for document/chunk indexing.
- Prune document vectors to top-128 terms before indexing.
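The top-128 pruning step can be sketched as follows. This is a minimal illustration that assumes each document vector has already been converted to a plain `{term: weight}` dict; `prune_top_k` is a hypothetical helper, not part of the model or library API.

```python
from typing import Dict


def prune_top_k(term_weights: Dict[str, float], k: int = 128) -> Dict[str, float]:
    """Keep the k highest-weight terms of one sparse vector (hypothetical helper)."""
    if k <= 0:
        return {}
    # Sort by weight, highest first; break ties by term so output is deterministic.
    ranked = sorted(term_weights.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


# Toy vector for illustration only; real vectors come from encode_document(...).
doc_vector = {"revenue": 2.1, "risk": 1.4, "apple": 0.9, "the": 0.05}
pruned = prune_top_k(doc_vector, k=2)
print(pruned)  # {'revenue': 2.1, 'risk': 1.4}
```

Pruning only the document side keeps query vectors untouched, which matches the recommendation above.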
Current best proxy result:
| model | pruning | Recall@1 | Recall@10 | MRR@10 | nDCG@10 |
|---|---|---|---|---|---|
| fine-tuned sparse | top-128 | 38.6% | 67.2% | 0.473 | 0.521 |
| base sparse | top-128 | 31.4% | 62.7% | 0.422 | 0.472 |
| BM25 | lexical | 24.0% | 64.1% | 0.397 | 0.457 |
Caveat: this is a 1,000-query in-memory proxy over 1,912 held-out candidate chunks, not a production OpenSearch/Elasticsearch benchmark.
Phase 1: evaluate v1 harder, no retraining
Run these before changing the training data or model recipe.
Full held-out retrieval proxy
Run the in-memory retrieval proxy on the full test split, not only the first 1,000 usable rows.
Report:
- Recall@1
- Recall@5
- Recall@10
- Recall@20
- MRR@10
- nDCG@10
- median rank
- mean rank
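The metrics above can all be derived from the rank of the labeled positive per query. The sketch below assumes exactly one positive per query and 1-based ranks, with `None` for a positive that was not retrieved; `summarize_ranks` is a hypothetical helper. Median and mean rank are computed over retrieved positives only.

```python
import math
from statistics import mean, median
from typing import Dict, List, Optional, Sequence


def summarize_ranks(
    ranks: List[Optional[int]], ks: Sequence[int] = (1, 5, 10, 20)
) -> Dict[str, float]:
    """Summarize 1-based ranks of the single labeled positive per query."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    out: Dict[str, float] = {}
    for k in ks:
        out[f"recall@{k}"] = sum(1 for r in ranks if r is not None and r <= k) / n
    top10 = [r for r in ranks if r is not None and r <= 10]
    out["mrr@10"] = sum(1.0 / r for r in top10) / n
    # With one relevant document the ideal DCG is 1, so nDCG@10 = 1 / log2(1 + rank).
    out["ndcg@10"] = sum(1.0 / math.log2(1 + r) for r in top10) / n
    found = [r for r in ranks if r is not None]
    out["median_rank"] = median(found) if found else float("nan")
    out["mean_rank"] = mean(found) if found else float("nan")
    return out


print(summarize_ranks([1, 2, None, 11]))
```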
Slice analysis
Break metrics down by available metadata:
- `query_type`
- `company`
- `doc_type`
- filing year, if available
- section, if available
- query length bucket
- document length bucket
Goal: identify whether the model is genuinely broad or only strong on dominant slices.
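A slice breakdown can be as simple as grouping per-query ranks by a metadata field. The sketch below assumes each evaluation row is a dict with a `rank` entry plus optional metadata; the row layout, `recall_at_k_by_slice`, and `length_bucket` are illustrative assumptions, not the existing proxy script.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def recall_at_k_by_slice(rows: List[dict], slice_key: str, k: int = 10) -> Dict[str, dict]:
    """Group rows by a metadata field and report Recall@k with the slice size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(slice_key, "unknown")].append(row["rank"])
    report = {}
    for name, ranks in groups.items():
        hits = sum(1 for r in ranks if r is not None and r <= k)
        report[name] = {"n": len(ranks), f"recall@{k}": hits / len(ranks)}
    return report


def length_bucket(text: str, edges: Sequence[int] = (8, 16, 32)) -> str:
    """Bucket a text by whitespace token count (bucket edges are arbitrary)."""
    n = len(text.split())
    for edge in edges:
        if n <= edge:
            return f"<={edge}"
    return f">{edges[-1]}"


rows = [
    {"query_type": "numeric", "rank": 1},
    {"query_type": "numeric", "rank": 15},
    {"query_type": "risk", "rank": 3},
]
print(recall_at_k_by_slice(rows, "query_type"))
```

Reporting `n` next to each metric makes it easier to spot slices that look strong only because they are tiny.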
Error analysis
Sample failures from:
- top-1 wrong results
- positive rank greater than 10
- cases where BM25 wins and sparse loses
- cases where sparse wins and BM25 loses
For each failure, classify likely cause:
- false negative or incomplete label
- query too broad
- chunk lacks enough context
- same-company wrong section
- same-topic wrong company
- lexical miss
- semantic overmatch
Encoding-speed benchmark
Run a warm-cache benchmark for:
- query encoding latency
- document encoding throughput
- top-128 pruning overhead
- serialization overhead
Do not compare cold Hub load times. Download/cache models first, run warmup batches, then measure.
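A warm-cache timing harness might look like the sketch below. `benchmark_encode` is a hypothetical helper and `fake_encode` is a stand-in; in a real run the callable would wrap the model's query or document encoding after the weights are already downloaded and loaded.

```python
import math
import time
from statistics import median
from typing import Callable, Dict, List


def benchmark_encode(
    encode_fn: Callable[[List[str]], object], batches: List[List[str]], warmup: int = 3
) -> Dict[str, float]:
    """Time encode_fn per batch after discarding a few warmup calls."""
    if not batches:
        raise ValueError("batches must not be empty")
    for batch in batches[:warmup]:
        encode_fn(batch)  # warmup: results and timings are thrown away
    latencies = []
    items = 0
    for batch in batches:
        start = time.perf_counter()
        encode_fn(batch)
        latencies.append(time.perf_counter() - start)
        items += len(batch)
    latencies.sort()
    total = sum(latencies)
    # Nearest-rank p95 over per-batch latencies.
    p95_index = min(len(latencies) - 1, max(0, math.ceil(0.95 * len(latencies)) - 1))
    return {
        "batches": len(latencies),
        "items": items,
        "p50_latency_s": median(latencies),
        "p95_latency_s": latencies[p95_index],
        "items_per_s": items / total if total > 0 else float("inf"),
    }


def fake_encode(batch: List[str]) -> List[int]:
    """Stand-in encoder for illustration only."""
    return [len(text) for text in batch]


print(benchmark_encode(fake_encode, [["query one", "query two"]] * 20))
```

Pruning and serialization overhead can be measured the same way by passing a callable that wraps those steps.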
Index-size estimate
Estimate sparse index footprint for:
- unpruned
- top-256
- top-128
- top-64
Report average active terms and approximate postings count.
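Average active terms and postings count follow directly from the document vectors. The sketch below again assumes `{term: weight}` dicts and a hypothetical `index_footprint` helper; it counts postings only and makes no claim about on-disk bytes, which depend on the search engine.

```python
from typing import Dict, List, Optional


def index_footprint(
    doc_vectors: List[Dict[str, float]], top_k: Optional[int] = None
) -> Dict[str, float]:
    """Count postings (one per active term per document) under optional top-k pruning."""
    postings = 0
    for vec in doc_vectors:
        active = sum(1 for weight in vec.values() if weight > 0)
        postings += active if top_k is None else min(active, top_k)
    n = len(doc_vectors)
    return {
        "docs": n,
        "postings": postings,
        "avg_active_terms": postings / n if n else 0.0,
    }


docs = [{"a": 1.0, "b": 0.5, "c": 0.2}, {"a": 0.3}]
for k in (None, 256, 128, 64):
    print(k, index_footprint(docs, top_k=k))
```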
Phase 2: OpenSearch/Elasticsearch benchmark
Build a real sparse retrieval benchmark using OpenSearch or Elasticsearch.
Variants to compare:
| variant | purpose |
|---|---|
| BM25 only | lexical baseline |
| learned sparse top-128 only | current recommended model path |
| learned sparse top-64 only | compact model path |
| BM25 + learned sparse RRF | likely robust hybrid baseline |
| BM25 + learned sparse weighted score | tuneable hybrid candidate |
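The two hybrid variants can be prototyped outside the search engine before wiring them into a query. The sketch below shows reciprocal rank fusion with the conventional constant `k=60` and a weighted sum over min-max normalized scores. The helper names, the normalization choice, and the default `alpha` are assumptions for illustration, not a tuned recipe.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each ranking adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; a constant score list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_fuse(
    bm25: Dict[str, float], sparse: Dict[str, float], alpha: float = 0.5
) -> List[str]:
    """Weighted sum of normalized scores; alpha is the weight on learned sparse."""
    b, s = _min_max(bm25), _min_max(sparse)
    docs = set(b) | set(s)
    fused = {d: (1 - alpha) * b.get(d, 0.0) + alpha * s.get(d, 0.0) for d in docs}
    return sorted(fused, key=lambda d: (-fused[d], d))


print(rrf_fuse([["a", "b", "c"], ["b", "c", "a"]]))  # ['b', 'a', 'c']
```

RRF needs only ranks, so it sidesteps the score-scale mismatch between BM25 and sparse dot products; the weighted variant needs `alpha` tuned on a dev set.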
Report:
- Recall@1
- Recall@5
- Recall@10
- Recall@20
- MRR@10
- nDCG@10
- p50 query latency
- p95 query latency
- index size
- average sparse terms per document
- query encoding latency
- total query latency including model encoding
Important: validate that the OpenSearch/Elasticsearch query DSL is doing the same sparse dot-product style scoring expected by the in-memory proxy.
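One way to run that validation is to score a handful of queries in memory with a plain sparse dot product and compare the resulting top-k ids with what the search engine returns. The sketch below covers only the in-memory side; the engine results would be fetched separately, and all helper names are hypothetical.

```python
from typing import Dict, List


def sparse_dot(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Dot product over the terms shared by two sparse vectors."""
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(weight * large[term] for term, weight in small.items() if term in large)


def rank_in_memory(query: Dict[str, float], docs: Dict[str, Dict[str, float]], k: int = 10) -> List[str]:
    """Reference ranking: highest dot product first, ties broken by doc id."""
    scored = [(sparse_dot(query, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def top_k_overlap(expected: List[str], observed: List[str]) -> float:
    """Fraction of the reference top-k that the engine also returned."""
    if not expected:
        return 1.0
    return len(set(expected) & set(observed)) / len(expected)


query = {"revenue": 1.0, "growth": 0.5}
docs = {
    "d1": {"revenue": 2.0, "risk": 1.0},
    "d2": {"growth": 3.0},
    "d3": {"weather": 1.0},
}
print(rank_in_memory(query, docs))  # ['d1', 'd2']
```

A low overlap on identical vectors suggests the engine is applying a different scoring function, weight quantization, or an unexpected query-side transformation.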
Phase 3: dataset v2
Create a new dataset revision or new dataset repo. Do not overwrite the v1 training data silently.
Candidate data improvements:
Metadata prefixes
Add available structured context to documents:
```
Company: Apple Inc.
Ticker: AAPL
Form: 10-K
Filing year: 2024
Section: Risk Factors

[chunk text]
```

Expected benefit: better company/form/year/section matching without changing tokenizer vocabulary.
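A prefix builder for this format could look like the sketch below. The metadata keys `ticker`, `filing_year`, and `section` are assumed names (only `company` and `doc_type` appear in the dataset row shape), and missing fields are simply skipped.

```python
from typing import Mapping

# (label shown in the prefix, assumed metadata key)
PREFIX_FIELDS = [
    ("Company", "company"),
    ("Ticker", "ticker"),
    ("Form", "doc_type"),
    ("Filing year", "filing_year"),
    ("Section", "section"),
]


def add_metadata_prefix(chunk_text: str, metadata: Mapping[str, object]) -> str:
    """Prepend available metadata as 'Label: value' lines, then a blank line, then the chunk."""
    lines = [
        f"{label}: {metadata[key]}"
        for label, key in PREFIX_FIELDS
        if metadata.get(key) not in (None, "")
    ]
    if not lines:
        return chunk_text
    return "\n".join(lines) + "\n\n" + chunk_text


print(add_metadata_prefix(
    "Our business is subject to risks.",
    {"company": "Apple Inc.", "ticker": "AAPL", "doc_type": "10-K",
     "filing_year": 2024, "section": "Risk Factors"},
))
```

The same function should be applied at indexing time and when building v2 training rows so the model sees one consistent document format.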
Hard-negative mining
For each query, mine candidate negatives from:
- BM25
- base sparse model
- v1 fine-tuned sparse model
- future hybrid retriever
Useful negative types:
- same company, wrong section
- same company, wrong filing year
- same form type, different company
- same financial concept, wrong context
- lexical match but semantically wrong
- semantic match but unsupported by the chunk
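Mining from several retrievers can be sketched as a merge over ranked candidate lists. The example assumes each retriever has already produced a ranked list of chunk ids per query; `mine_hard_negatives` and the `per_source` cap are hypothetical, and the mined ids still need false-negative filtering before use.

```python
from typing import Dict, List, Set, Tuple


def mine_hard_negatives(
    ranked_by_source: Dict[str, List[str]], positive_ids: Set[str], per_source: int = 2
) -> List[Tuple[str, str]]:
    """Take the top non-positive candidates from each retriever, without duplicates.

    Returns (doc_id, source) pairs so the origin of each negative stays traceable.
    """
    mined: List[Tuple[str, str]] = []
    seen = set(positive_ids)
    for source, ranking in ranked_by_source.items():
        taken = 0
        for doc_id in ranking:
            if taken >= per_source:
                break
            if doc_id in seen:
                continue  # skip positives and candidates another source already supplied
            seen.add(doc_id)
            mined.append((doc_id, source))
            taken += 1
    return mined


candidates = {
    "bm25": ["p", "a", "b", "c"],
    "base_sparse": ["a", "p", "d"],
    "v1_sparse": ["p", "e"],
}
print(mine_hard_negatives(candidates, positive_ids={"p"}))
```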
False-negative filtering
Avoid training against chunks that are actually relevant.
Flag suspicious negatives using:
- near-duplicate detection against the positive
- high lexical overlap with the positive
- high sparse score against the query
- same company and same section
- optional reranker/LLM adjudication
Suspicious negatives should be dropped or downgraded, not blindly used as hard negatives.
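Two of these signals, lexical overlap with the positive and same company plus same section, are cheap enough to sketch directly. The example below uses token-set Jaccard similarity with an arbitrary 0.8 threshold; the threshold, field names, and `flag_negative` helper are assumptions, and the sparse-score and reranker checks are not covered.

```python
import re
from typing import List, Mapping


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def flag_negative(
    negative: Mapping[str, str], positive: Mapping[str, str], overlap_threshold: float = 0.8
) -> List[str]:
    """Return the reasons a mined negative looks suspicious (empty list means keep)."""
    reasons = []
    if jaccard(negative["text"], positive["text"]) >= overlap_threshold:
        reasons.append("high_lexical_overlap")
    same_company = negative.get("company") is not None and negative.get("company") == positive.get("company")
    same_section = negative.get("section") is not None and negative.get("section") == positive.get("section")
    if same_company and same_section:
        reasons.append("same_company_and_section")
    return reasons


positive = {"text": "Net sales increased 5% in 2024.", "company": "Apple Inc.", "section": "MD&A"}
negative = {"text": "Net sales increased 5% in 2024", "company": "Apple Inc.", "section": "MD&A"}
print(flag_negative(negative, positive))
```

Returning reasons rather than a boolean leaves room to drop some flagged negatives and merely downweight others.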
Multiple negatives per query
Preserve multiple hard negatives per query instead of only using the first non-empty negative.
Target row shape:
```json
{
  "query": "...",
  "positive": "Company: ...\n\n...",
  "negatives": [
    "Company: ... wrong but plausible ...",
    "Company: ... lexical hard negative ...",
    "Company: ... semantic hard negative ..."
  ],
  "metadata": {
    "company": "...",
    "doc_type": "...",
    "query_type": "..."
  }
}
```

Slice balancing
Inspect and optionally rebalance by:
- query type
- company
- document type
- filing year
- section
- query length
- document length
Goal: prevent the model from becoming strong only on the most frequent companies/forms/query patterns.
Frozen eval split
Before training v2, freeze a dev/test set that is not massaged by the training-data pipeline.
Preferred split strategies:
- hold out companies entirely
- hold out later filing dates
- hold out specific query types
- maintain a fixed human-reviewed eval set
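Holding out companies entirely can be made reproducible by hashing the company name instead of sampling at random. The sketch below is one possible approach; the salt, the 10% test fraction, and `split_for_company` are assumptions. A date-based or human-reviewed split would need different logic.

```python
import hashlib


def split_for_company(company: str, test_fraction: float = 0.1, salt: str = "v2-eval") -> str:
    """Deterministically assign every row of a company to the same split."""
    digest = hashlib.sha256(f"{salt}:{company}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


for name in ["Apple Inc.", "Microsoft Corporation", "Ford Motor Company"]:
    print(name, split_for_company(name))
```

Because the assignment depends only on the company name and the salt, rerunning the data pipeline cannot move a company from the eval set into training.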
Phase 4: stronger sparse training
Run controlled v2 sweeps after dataset improvements.
Keep the current v1 model as the baseline.
Candidate sweep dimensions:
| variable | candidates |
|---|---|
| training rows | v1 raw, v2 metadata, v2 mined negatives |
| steps | 1500, 3000 |
| document regularization | 5e-3, 1e-2, 2e-2 |
| query regularization | start with 1e-4; adjust only if query vectors become too dense |
| document top-k | 64, 128, 256 |
| negatives per query | 1, 2, 4+ |
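The sweep table can be expanded into explicit run configurations as sketched below. The dictionary keys are illustrative names rather than real training arguments, and `4` stands in for the open-ended "4+" entry. The full grid is large, so a controlled sweep would likely vary one or two dimensions at a time against the v1 baseline.

```python
from itertools import product
from typing import Dict, List

# Candidate values copied from the sweep table; key names are hypothetical.
SWEEP: Dict[str, list] = {
    "training_rows": ["v1 raw", "v2 metadata", "v2 mined negatives"],
    "steps": [1500, 3000],
    "document_regularization": [5e-3, 1e-2, 2e-2],
    "query_regularization": [1e-4],
    "document_top_k": [64, 128, 256],
    "negatives_per_query": [1, 2, 4],
}


def expand_grid(space: Dict[str, list]) -> List[dict]:
    """Cartesian product of all candidate values, one dict per run."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = expand_grid(SWEEP)
print(len(configs))  # 162
print(configs[0])
```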
The primary selection metric should not be triplet accuracy alone.
Use:
- retrieval proxy nDCG@10
- retrieval proxy MRR@10
- Recall@1
- Recall@10
- OpenSearch hybrid benchmark, once available
- document active dimensions / index footprint
Phase 5: reranker and distillation
A CrossEncoder reranker can improve both training data and production retrieval.
Use cases:
- Rerank top candidates after BM25 + sparse retrieval.
- Judge whether mined negatives are truly negative.
- Create better hard negatives for sparse training.
- Act as a teacher for distillation back into the sparse encoder.
Candidate workflow:
- Retrieve candidate pools using BM25, base sparse, and v1 sparse.
- Score query-document pairs with a stronger reranker or LLM-based judge.
- Keep high-confidence positives and high-confidence hard negatives.
- Retrain the sparse model on cleaner supervision.
- Compare v2 sparse against v1 sparse, BM25, and hybrid retrieval.
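The filtering step in this workflow can be sketched as a threshold split over teacher scores. Here `score_fn` is a stand-in for whatever reranker or judge is used, scores are assumed to lie in [0, 1], and both thresholds are placeholder values that would need calibration.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (query, document text)


def split_by_teacher_score(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    pos_threshold: float = 0.9,
    neg_threshold: float = 0.1,
) -> Dict[str, List[Pair]]:
    """Keep only pairs the teacher is confident about; park the rest as ambiguous."""
    kept: Dict[str, List[Pair]] = {"positives": [], "hard_negatives": [], "ambiguous": []}
    for query, doc in pairs:
        score = score_fn(query, doc)
        if score >= pos_threshold:
            kept["positives"].append((query, doc))
        elif score <= neg_threshold:
            kept["hard_negatives"].append((query, doc))
        else:
            kept["ambiguous"].append((query, doc))
    return kept


# Fake teacher scores for illustration only.
fake_scores = {("q", "relevant chunk"): 0.97, ("q", "unrelated chunk"): 0.02, ("q", "unclear chunk"): 0.5}
result = split_by_teacher_score(list(fake_scores), lambda q, d: fake_scores[(q, d)])
print({name: len(items) for name, items in result.items()})
```

Ambiguous pairs are kept in their own bucket rather than discarded so they can be reviewed or relabeled later.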
Relevant Sentence Transformers / HF skill levers
The current v1 work used the SparseEncoder training path. The following skill capabilities are relevant for the next iterations:
| capability | use here |
|---|---|
| SparseEncoder training | Continue v2 learned-sparse experiments. |
| Hard-negative mining | Build stronger training negatives from BM25/base/v1 retrieval. |
| Evaluator selection | Add more official retrieval evaluators where useful. |
| CrossEncoder training | Train or use a reranker for second-stage ranking and data cleaning. |
| Distillation | Distill a reranker/teacher into the sparse encoder. |
| HF Datasets tooling | Inspect slices, parquet metadata, and dataset revisions. |
| HF Jobs | Run larger sweeps off local hardware if needed. |
| Trackio | Track metrics across sweeps instead of relying on ad hoc logs. |
Less relevant right now:
- Matryoshka: mostly useful for dense embeddings, not the main SPLADE-style sparse path.
- Multilingual training: not needed unless the filing/query workload becomes multilingual.
- Vocabulary changes: not recommended as a first lever; metadata prefixes and data improvements are cheaper and safer.
Recommended immediate next command sequence
- Run full-test retrieval proxy for v1.
- Add slice analysis to the retrieval proxy script.
- Run BM25 and base sparse on the same full-test proxy.
- Build a small OpenSearch benchmark with top-128 vectors.
- Only then start v2 data massaging and retraining.
The main principle: prove where v1 fails before changing both the data and the model at the same time.