Next Steps: Production-Readiness Roadmap

This file captures follow-up experiments for oneryalcin/financial-filings-sparse-encoder-v1 so the work can resume from the Hub repo without relying on chat history.

Current status

The v1 model is a useful open experiment and a candidate learned-sparse baseline for financial filing retrieval.

Current recommendation:

- Use `encode_query(...)` for online queries.
- Use `encode_document(...)` offline for document/chunk indexing.
- Prune document vectors to the top-128 terms before indexing.
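
A minimal sketch of the pruning step, assuming a sparse vector is available as a plain `{term: weight}` dict. `prune_top_k` and the toy weights are hypothetical and not part of the Sentence Transformers API; real vectors would come from `encode_document(...)`.

```python
from heapq import nlargest


def prune_top_k(term_weights: dict[str, float], k: int = 128) -> dict[str, float]:
    """Keep the k highest-weight terms of one sparse vector (hypothetical helper)."""
    if k <= 0:
        return {}
    # nlargest avoids a full sort when the vector has far more than k active terms.
    top = nlargest(k, term_weights.items(), key=lambda item: item[1])
    return dict(top)


# Toy vector standing in for one encoded chunk.
doc_vector = {"revenue": 2.1, "risk": 1.4, "apple": 0.9, "the": 0.05}
pruned = prune_top_k(doc_vector, k=2)
```

Vectors that already have fewer than `k` active terms pass through unchanged.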

Current best proxy result:

model	pruning	Recall@1	Recall@10	MRR@10	nDCG@10
fine-tuned sparse	top-128	38.6%	67.2%	0.473	0.521
base sparse	top-128	31.4%	62.7%	0.422	0.472
BM25	lexical	24.0%	64.1%	0.397	0.457

Caveat: this is a 1,000-query in-memory proxy over 1,912 held-out candidate chunks, not a production OpenSearch/Elasticsearch benchmark.

Phase 1: evaluate v1 harder, no retraining

Run these before changing the training data or model recipe.

Full held-out retrieval proxy

Run the in-memory retrieval proxy on the full test split, not only the first 1,000 usable rows.

Report:
- Recall@1
- Recall@5
- Recall@10
- Recall@20
- MRR@10
- nDCG@10
- median rank
- mean rank
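
The metrics above can all be derived from the rank of each query's positive. The sketch below assumes exactly one labeled positive per query and uses `None` for a positive that was not retrieved; `summarize_ranks` is an illustrative name.

```python
import math
from statistics import mean, median
from typing import Optional


def summarize_ranks(ranks: list[Optional[int]], ks=(1, 5, 10, 20)) -> dict[str, float]:
    """Summarize retrieval quality from the 1-based rank of each query's positive."""
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    report = {f"recall@{k}": sum(r <= k for r in found) / n for k in ks}
    report["mrr@10"] = sum(1.0 / r for r in found if r <= 10) / n
    # With a single positive the ideal DCG is 1, so nDCG@10 is 1 / log2(rank + 1).
    report["ndcg@10"] = sum(1.0 / math.log2(r + 1) for r in found if r <= 10) / n
    # Rank statistics cover retrieved positives only (needs at least one).
    report["median_rank"] = median(found)
    report["mean_rank"] = mean(found)
    return report


report = summarize_ranks([1, 3, 12, 2])
```

If some positives are never retrieved, report that count next to the median and mean rank, since those two statistics skip them.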

Slice analysis

Break metrics down by available metadata:
- query_type
- company
- doc_type
- filing year, if available
- section, if available
- query length bucket
- document length bucket

Goal: identify whether the model is genuinely broad or only strong on dominant slices.
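
One way to produce the per-slice breakdown, assuming each evaluated query is stored as a dict holding the positive's rank plus metadata. The field names and `recall_at_k_by_slice` are illustrative.

```python
from collections import defaultdict


def recall_at_k_by_slice(rows, slice_key, k=10):
    """Group per-query results by a metadata field and compute Recall@k per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = row.get(slice_key, "unknown")
        totals[group] += 1
        rank = row["rank"]  # 1-based rank of the positive, None if not retrieved
        if rank is not None and rank <= k:
            hits[group] += 1
    # Include the group size so small slices are not over-interpreted.
    return {g: {"n": totals[g], f"recall@{k}": hits[g] / totals[g]} for g in totals}


rows = [
    {"rank": 1, "doc_type": "10-K"},
    {"rank": 14, "doc_type": "10-K"},
    {"rank": 3, "doc_type": "8-K"},
    {"rank": None, "doc_type": "8-K"},
]
by_doc_type = recall_at_k_by_slice(rows, "doc_type")
```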

Error analysis

Sample failures from:
- top-1 wrong results
- positive rank greater than 10
- cases where BM25 wins and sparse loses
- cases where sparse wins and BM25 loses

For each failure, classify the likely cause:
- false negative or incomplete label
- query too broad
- chunk lacks enough context
- same-company wrong section
- same-topic wrong company
- lexical miss
- semantic overmatch
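
The sampling pools can be built mechanically before any manual labeling. This sketch assumes per-query ranks for both systems are available as dicts. Treating a strictly better rank as a "win" is an assumption, and `failure_buckets` is a hypothetical helper.

```python
def failure_buckets(sparse_ranks, bm25_ranks, k=10):
    """Assign query ids to the sampling pools used for manual error analysis.

    Both arguments map query id -> 1-based rank of the positive (None if missing).
    """
    def rank_or_inf(rank):
        return float("inf") if rank is None else rank

    buckets = {"top1_wrong": [], "rank_gt_k": [], "bm25_wins": [], "sparse_wins": []}
    for qid, sparse_rank in sparse_ranks.items():
        s = rank_or_inf(sparse_rank)
        b = rank_or_inf(bm25_ranks.get(qid))
        if s != 1:
            buckets["top1_wrong"].append(qid)
        if s > k:
            buckets["rank_gt_k"].append(qid)
        if b < s:
            buckets["bm25_wins"].append(qid)
        elif s < b:
            buckets["sparse_wins"].append(qid)
    return buckets


buckets = failure_buckets({"q1": 1, "q2": 15, "q3": 4}, {"q1": 3, "q2": 2, "q3": 4})
```

The cause categories still need a human or a reranker. The buckets only decide which queries to look at.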

Encoding-speed benchmark

Run a warm-cache benchmark for:
- query encoding latency
- document encoding throughput
- top-128 pruning overhead
- serialization overhead

Do not compare cold Hub load times: download and cache the models first, run warmup batches, then measure.
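
A rough timing harness for the warm-cache rule, with warmup calls excluded from the measurements. `fake_encode` is a stand-in for a call such as `model.encode_query(batch)` on an already-loaded model. The warmup and repeat counts are arbitrary placeholders.

```python
import time
from statistics import median


def benchmark(fn, batches, warmup=3, repeats=10):
    """Time fn(batch) after warmup calls; returns per-call latencies in milliseconds."""
    for batch in batches[:warmup]:
        fn(batch)  # warmup runs are executed but not timed
    latencies = []
    for _ in range(repeats):
        for batch in batches:
            start = time.perf_counter()
            fn(batch)
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def fake_encode(batch):
    # Placeholder workload; swap in the real encoder call when benchmarking.
    return [len(text.split()) for text in batch]


latencies = benchmark(fake_encode, [["what is net revenue"], ["risk factors"]])
p50 = median(latencies)
```

On a GPU, the timed function must also wait for the device to finish. Otherwise the numbers mostly measure kernel launch time.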

Index-size estimate

Estimate sparse index footprint for:
- unpruned
- top-256
- top-128
- top-64

Report the average number of active terms per document and the approximate postings count for each setting.
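
A back-of-the-envelope sketch for these numbers, assuming the number of active terms per chunk has already been collected from the encoded vectors. One posting is counted as one (term, document, weight) entry. Bytes per posting depend on the engine and are not modeled here. `footprint` and the toy counts are illustrative.

```python
def footprint(active_term_counts, top_k=None):
    """Estimate postings for a corpus given each document's active-term count."""
    # Pruning caps every document at top_k terms; shorter vectors are unchanged.
    capped = [n if top_k is None else min(n, top_k) for n in active_term_counts]
    postings = sum(capped)
    return {"avg_active_terms": postings / len(capped), "postings": postings}


# Toy active-term counts for three chunks.
counts = [300, 90, 150]
settings = [("unpruned", None), ("top-256", 256), ("top-128", 128), ("top-64", 64)]
estimates = {name: footprint(counts, k) for name, k in settings}
```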

Phase 2: OpenSearch/Elasticsearch benchmark

Build a real sparse retrieval benchmark using OpenSearch or Elasticsearch.

Variants to compare:

variant	purpose
BM25 only	lexical baseline
learned sparse top-128 only	current recommended model path
learned sparse top-64 only	compact model path
BM25 + learned sparse RRF	likely robust hybrid baseline
BM25 + learned sparse weighted score	tuneable hybrid candidate
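
For the RRF variant in the table above, a reference implementation outside the search engine is useful as a sanity check. This is a generic reciprocal rank fusion sketch. The constant `k=60` is a commonly used default and is assumed here, not taken from this project.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_top = ["d3", "d1", "d2"]
sparse_top = ["d1", "d4", "d3"]
fused = rrf_fuse([bm25_top, sparse_top])
```

RRF only uses ranks, so it needs no score normalization. The weighted-score variant does, because BM25 and sparse dot-product scores are on different scales.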

Report:

- Recall@1
- Recall@5
- Recall@10
- Recall@20
- MRR@10
- nDCG@10
- p50 query latency
- p95 query latency
- index size
- average sparse terms per document
- query encoding latency
- total query latency including model encoding

Important: confirm that the OpenSearch/Elasticsearch query DSL applies the same sparse dot-product-style scoring that the in-memory proxy uses.
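
One way to run that check is to score a handful of query/document pairs with a plain reference dot product and compare against the scores the engine returns. The tolerance is a placeholder, since engines may quantize weights or apply boosts. `sparse_dot` and `scores_match` are hypothetical helpers.

```python
import math


def sparse_dot(query_vec, doc_vec):
    """Dot product over the terms shared by two sparse {term: weight} vectors."""
    # Iterate over the smaller vector; absent terms contribute zero.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(w * large[t] for t, w in small.items() if t in large)


def scores_match(reference, engine, rel_tol=1e-3):
    """Check engine scores against reference scores for the same doc ids."""
    return reference.keys() == engine.keys() and all(
        math.isclose(reference[d], engine[d], rel_tol=rel_tol) for d in reference
    )


query = {"revenue": 1.2, "growth": 0.8}
docs = {"d1": {"revenue": 2.0, "risk": 1.0}, "d2": {"growth": 0.5, "revenue": 0.5}}
reference = {doc_id: sparse_dot(query, vec) for doc_id, vec in docs.items()}
```

If absolute scores differ by a constant factor, comparing the resulting rank order is a reasonable fallback.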

Phase 3: dataset v2

Create a new dataset revision or new dataset repo. Do not overwrite the v1 training data silently.

Candidate data improvements:

Metadata prefixes

Add available structured context to documents:
```
Company: Apple Inc.
Ticker: AAPL
Form: 10-K
Filing year: 2024
Section: Risk Factors

[chunk text]
```
Expected benefit: better company/form/year/section matching without changing tokenizer vocabulary.
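
A small sketch of how the prefix could be built, skipping fields that are missing for a chunk. The metadata key names (`company`, `ticker`, `doc_type`, `filing_year`, `section`) are assumptions about the dataset schema, and `add_metadata_prefix` is a hypothetical helper.

```python
def add_metadata_prefix(chunk_text, metadata):
    """Prepend available metadata fields as 'Label: value' lines before the chunk."""
    # (label, metadata key) pairs; the key names are assumed, not confirmed.
    fields = [
        ("Company", "company"),
        ("Ticker", "ticker"),
        ("Form", "doc_type"),
        ("Filing year", "filing_year"),
        ("Section", "section"),
    ]
    lines = [f"{label}: {metadata[key]}" for label, key in fields if metadata.get(key)]
    if not lines:
        return chunk_text
    return "\n".join(lines) + "\n\n" + chunk_text


text = add_metadata_prefix(
    "The Company faces supply chain risks.",
    {"company": "Apple Inc.", "ticker": "AAPL", "doc_type": "10-K", "section": None},
)
```

The same function has to run at training time and at indexing time, otherwise documents in the index will not match what the model saw.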

Hard-negative mining

For each query, mine candidate negatives from:
- BM25
- base sparse model
- v1 fine-tuned sparse model
- future hybrid retriever

Useful negative types:
- same company, wrong section
- same company, wrong filing year
- same form type, different company
- same financial concept, wrong context
- lexical match but semantically wrong
- semantic match but unsupported by the chunk
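
Pooling candidates across retrievers can be sketched as below, assuming each retriever returns a ranked list of document ids for the query. Recording which retrievers surfaced a candidate helps later when labeling it as a lexical or semantic negative. `merge_negative_candidates` is an illustrative name.

```python
def merge_negative_candidates(positive_id, ranked_lists, per_source=5):
    """Pool the top results of several retrievers as hard-negative candidates.

    ranked_lists maps a retriever name to its ranked doc ids for one query.
    Returns doc id -> list of retrievers that surfaced it, excluding the positive.
    """
    candidates = {}
    for source, ranking in ranked_lists.items():
        kept = [doc_id for doc_id in ranking if doc_id != positive_id][:per_source]
        for doc_id in kept:
            candidates.setdefault(doc_id, []).append(source)
    return candidates


candidates = merge_negative_candidates(
    "d1",
    {
        "bm25": ["d1", "d7", "d3"],
        "base_sparse": ["d3", "d1", "d9"],
        "v1_sparse": ["d1", "d3", "d5"],
    },
    per_source=2,
)
```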

False-negative filtering

Avoid training against chunks that are actually relevant.

Flag suspicious negatives using:
- near-duplicate detection against the positive
- high lexical overlap with the positive
- high sparse score against the query
- same company and same section
- optional reranker/LLM adjudication

Suspicious negatives should be dropped or downgraded, not blindly used as hard negatives.
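
Two of the cheaper signals (lexical overlap with the positive, same company and section) can be sketched with the standard library alone. The whitespace tokenizer, the 0.8 threshold, and the dict keys are untuned assumptions. The sparse-score and reranker checks are not covered here.

```python
def jaccard(a, b):
    """Token-set Jaccard overlap between two texts (whitespace tokenization)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def is_suspicious(negative, positive, overlap_threshold=0.8):
    """Flag a negative that may actually be relevant (hypothetical heuristic)."""
    if jaccard(negative["text"], positive["text"]) >= overlap_threshold:
        return True
    company, section = negative.get("company"), negative.get("section")
    # Missing metadata never counts as a match.
    return (
        company is not None
        and section is not None
        and company == positive.get("company")
        and section == positive.get("section")
    )


positive = {"text": "net sales increased 5 percent", "company": "A", "section": "MD&A"}
near_duplicate = {"text": "Net sales increased 5 percent", "company": "B", "section": "MD&A"}
other_section = {"text": "litigation is pending", "company": "A", "section": "Legal"}
same_section = {"text": "gross margin declined", "company": "A", "section": "MD&A"}
```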

Multiple negatives per query

Preserve multiple hard negatives per query instead of only using the first non-empty negative.

Target row shape:

```
{
  "query": "...",
  "positive": "Company: ...\n\n...",
  "negatives": [
    "Company: ... wrong but plausible ...",
    "Company: ... lexical hard negative ...",
    "Company: ... semantic hard negative ..."
  ],
  "metadata": {
    "company": "...",
    "doc_type": "...",
    "query_type": "..."
  }
}
```
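
If a training setup expects one negative per example, a row in this shape can be flattened into triplets. The sketch below is one hypothetical way to do that. It drops empty negatives and can cap how many are used per query.

```python
def expand_to_triplets(row, max_negatives=None):
    """Turn one multi-negative row into (query, positive, negative) triplets."""
    negatives = [n for n in row.get("negatives", []) if n]  # drop empty strings
    if max_negatives is not None:
        negatives = negatives[:max_negatives]
    return [(row["query"], row["positive"], negative) for negative in negatives]


row = {"query": "q", "positive": "p", "negatives": ["n1", "", "n2", "n3"]}
triplets = expand_to_triplets(row, max_negatives=2)
```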

Slice balancing

Inspect and optionally rebalance by:
- query type
- company
- document type
- filing year
- section
- query length
- document length

Goal: prevent the model from becoming strong only on the most frequent companies/forms/query patterns.
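
A simple rebalancing option is to cap the number of rows per slice. The sketch assumes rows follow the target row shape with a `metadata` dict. The cap value and `cap_per_slice` are placeholders.

```python
import random
from collections import defaultdict


def cap_per_slice(rows, slice_key, max_per_slice, seed=0):
    """Downsample over-represented slices to at most max_per_slice rows each."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["metadata"].get(slice_key, "unknown")].append(row)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group_rows in groups.values():
        if len(group_rows) > max_per_slice:
            group_rows = rng.sample(group_rows, max_per_slice)
        balanced.extend(group_rows)
    return balanced


rows = [{"query": f"q{i}", "metadata": {"company": "AAPL"}} for i in range(5)]
rows.append({"query": "q5", "metadata": {"company": "MSFT"}})
balanced = cap_per_slice(rows, "company", max_per_slice=2)
```

Capping discards data. Inspecting slice sizes first shows whether that trade-off is acceptable.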

Frozen eval split

Before training v2, freeze a dev/test set that is not massaged by the training-data pipeline.

Preferred split strategies:
- hold out companies entirely
- hold out later filing dates
- hold out specific query types
- maintain a fixed human-reviewed eval set
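
The company hold-out strategy can be made deterministic by hashing the company name, so re-running the pipeline never moves a company between splits. This is a sketch under the assumption that each row carries a `company` field. The 20% test fraction is a placeholder.

```python
import hashlib


def split_by_company(rows, test_fraction=0.2):
    """Assign every row of a company to the same split using a stable hash."""
    train, test = [], []
    for row in rows:
        digest = hashlib.sha256(row["company"].encode("utf-8")).hexdigest()
        # Map the first 32 bits of the hash to [0, 1).
        bucket = int(digest[:8], 16) / 0x100000000
        (test if bucket < test_fraction else train).append(row)
    return train, test


names = ["Apple Inc.", "Microsoft", "Apple Inc.", "Tesla", "Microsoft"]
rows = [{"company": name, "text": f"chunk {i}"} for i, name in enumerate(names)]
train, test = split_by_company(rows, test_fraction=0.5)
```

With few companies the realized test fraction can be far from the target, so check the split sizes after running it.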

Phase 4: stronger sparse training

Run controlled v2 sweeps after dataset improvements.

Keep the current v1 model as the baseline.

Candidate sweep dimensions:

variable	candidates
training rows	v1 raw, v2 metadata, v2 mined negatives
steps	1500, 3000
document regularization	`5e-3`, `1e-2`, `2e-2`
query regularization	start with `1e-4`; adjust only if query vectors become too dense
document top-k	64, 128, 256
negatives per query	1, 2, 4+
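
Enumerating the sweep explicitly makes its size visible before any compute is spent. The values are copied from the table above, with `4+` represented as 4. The dictionary keys are illustrative names, not trainer arguments.

```python
from itertools import product

grid = {
    "dataset": ["v1_raw", "v2_metadata", "v2_mined_negatives"],
    "max_steps": [1500, 3000],
    "document_regularization": [5e-3, 1e-2, 2e-2],
    "query_regularization": [1e-4],
    "document_top_k": [64, 128, 256],
    "negatives_per_query": [1, 2, 4],
}

# One dict per run, covering every combination of the values above.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

The full cross product is 162 runs. Varying one dimension at a time around the v1 settings is a cheaper starting point.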

The primary selection metric should not be triplet accuracy alone.

Use:

- retrieval proxy nDCG@10
- retrieval proxy MRR@10
- Recall@1
- Recall@10
- OpenSearch hybrid benchmark, once available
- document active dimensions / index footprint

Phase 5: reranker and distillation

A CrossEncoder reranker can improve both training data and production retrieval.

Use cases:

- Rerank top candidates after BM25 + sparse retrieval.
- Judge whether mined negatives are truly negative.
- Create better hard negatives for sparse training.
- Act as a teacher for distillation back into the sparse encoder.

Candidate workflow:

1. Retrieve candidate pools using BM25, base sparse, and v1 sparse.
2. Score query-document pairs with a stronger reranker or LLM-based judge.
3. Keep high-confidence positives and high-confidence hard negatives.
4. Retrain the sparse model on cleaner supervision.
5. Compare v2 sparse against v1 sparse, BM25, and hybrid retrieval.
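
The filtering step of this workflow might look like the sketch below. `score_fn` stands for any teacher that returns a relevance score in [0, 1]. `toy_score` is a token-overlap placeholder so the example runs without a model, and both thresholds are untuned assumptions.

```python
def filter_by_teacher(pairs, score_fn, pos_threshold=0.9, neg_threshold=0.1):
    """Split (query, doc) pairs into confident positives, confident negatives, and the rest."""
    positives, negatives, uncertain = [], [], []
    for query, doc in pairs:
        score = score_fn(query, doc)
        if score >= pos_threshold:
            positives.append((query, doc))
        elif score <= neg_threshold:
            negatives.append((query, doc))
        else:
            uncertain.append((query, doc))
    return positives, negatives, uncertain


def toy_score(query, doc):
    # Stand-in for a CrossEncoder or LLM judge: fraction of query tokens found in doc.
    q_tokens = query.lower().split()
    doc_tokens = set(doc.lower().split())
    return sum(t in doc_tokens for t in q_tokens) / len(q_tokens)


pairs = [
    ("net revenue", "net revenue rose"),
    ("net revenue", "net income fell"),
    ("net revenue", "litigation pending"),
]
positives, negatives, uncertain = filter_by_teacher(pairs, toy_score)
```

Pairs in the uncertain band are the ones worth human review or exclusion, not automatic labeling.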

Relevant Sentence Transformers / HF skill levers

The current v1 work used the SparseEncoder training path. The following skill capabilities are relevant for the next iterations:

capability	use here
SparseEncoder training	Continue v2 learned-sparse experiments.
Hard-negative mining	Build stronger training negatives from BM25/base/v1 retrieval.
Evaluator selection	Add more official retrieval evaluators where useful.
CrossEncoder training	Train or use a reranker for second-stage ranking and data cleaning.
Distillation	Distill a reranker/teacher into the sparse encoder.
HF Datasets tooling	Inspect slices, parquet metadata, and dataset revisions.
HF Jobs	Run larger sweeps off local hardware if needed.
Trackio	Track metrics across sweeps instead of relying on ad hoc logs.

Less relevant right now:

- Matryoshka: mostly useful for dense embeddings, not the main SPLADE-style sparse path.
- Multilingual training: not needed unless the filing/query workload becomes multilingual.
- Vocabulary changes: not recommended as a first lever; metadata prefixes and data improvements are cheaper and safer.

Recommended immediate next command sequence

1. Run the full-test retrieval proxy for v1.
2. Add slice analysis to the retrieval proxy script.
3. Run BM25 and base sparse on the same full-test proxy.
4. Build a small OpenSearch benchmark with top-128 vectors.
5. Only then start v2 data massaging and retraining.

The main principle: prove where v1 fails before changing both the data and the model at the same time.