# Next Steps: Production-Readiness Roadmap

This file captures follow-up experiments for `oneryalcin/financial-filings-sparse-encoder-v1` so the work can resume from the Hub repo without relying on chat history.

## Current status

The v1 model is a useful open experiment and candidate learned-sparse baseline for financial filing retrieval.

Current recommendation:

```text
Use encode_query(...) for online queries.
Use encode_document(...) offline for document/chunk indexing.
Prune document vectors to top-128 terms before indexing.
```

Current best proxy result:

| model | pruning | Recall@1 | Recall@10 | MRR@10 | nDCG@10 |
|---|---|---:|---:|---:|---:|
| fine-tuned sparse | top-128 | 38.6% | 67.2% | 0.473 | 0.521 |
| base sparse | top-128 | 31.4% | 62.7% | 0.422 | 0.472 |
| BM25 | lexical | 24.0% | 64.1% | 0.397 | 0.457 |

Caveat: this is a 1,000-query in-memory proxy over 1,912 held-out candidate chunks, not a production OpenSearch/Elasticsearch benchmark.

## Phase 1: evaluate v1 harder, no retraining

Run these before changing the training data or model recipe.

1. Full held-out retrieval proxy

   Run the in-memory retrieval proxy on the full test split, not only the first 1,000 usable rows.

   Report:

   - Recall@1
   - Recall@5
   - Recall@10
   - Recall@20
   - MRR@10
   - nDCG@10
   - median rank
   - mean rank

2. Slice analysis

   Break metrics down by available metadata:

   - `query_type`
   - `company`
   - `doc_type`
   - filing year, if available
   - section, if available
   - query length bucket
   - document length bucket

   Goal: identify whether the model is genuinely broad or only strong on dominant slices.

3. Error analysis

   Sample failures from:

   - top-1 wrong results
   - positive rank greater than 10
   - cases where BM25 wins and sparse loses
   - cases where sparse wins and BM25 loses

   For each failure, classify likely cause:

   - false negative or incomplete label
   - query too broad
   - chunk lacks enough context
   - same-company wrong section
   - same-topic wrong company
   - lexical miss
   - semantic overmatch

4. Encoding-speed benchmark

   Run a warm-cache benchmark for:

   - query encoding latency
   - document encoding throughput
   - top-128 pruning overhead
   - serialization overhead

   Do not compare cold Hub load times. Download/cache models first, run warmup batches, then measure.

5. Index-size estimate

   Estimate sparse index footprint for:

   - unpruned
   - top-256
   - top-128
   - top-64

   Report average active terms and approximate postings count.

## Phase 2: OpenSearch/Elasticsearch benchmark

Build a real sparse retrieval benchmark using OpenSearch or Elasticsearch.

Variants to compare:

| variant | purpose |
|---|---|
| BM25 only | lexical baseline |
| learned sparse top-128 only | current recommended model path |
| learned sparse top-64 only | compact model path |
| BM25 + learned sparse RRF | likely robust hybrid baseline |
| BM25 + learned sparse weighted score | tuneable hybrid candidate |

Report:

- Recall@1
- Recall@5
- Recall@10
- Recall@20
- MRR@10
- nDCG@10
- p50 query latency
- p95 query latency
- index size
- average sparse terms per document
- query encoding latency
- total query latency including model encoding

Important: validate that the OpenSearch/Elasticsearch query DSL is doing the same sparse dot-product style scoring expected by the in-memory proxy.

## Phase 3: dataset v2

Create a new dataset revision or new dataset repo. Do not overwrite the v1 training data silently.

Candidate data improvements:

1. Metadata prefixes

   Add available structured context to documents:

   ```text
   Company: Apple Inc.
   Ticker: AAPL
   Form: 10-K
   Filing year: 2024
   Section: Risk Factors

   [chunk text]
   ```

   Expected benefit: better company/form/year/section matching without changing tokenizer vocabulary.

2. Hard-negative mining

   For each query, mine candidate negatives from:

   - BM25
   - base sparse model
   - v1 fine-tuned sparse model
   - future hybrid retriever

   Useful negative types:

   - same company, wrong section
   - same company, wrong filing year
   - same form type, different company
   - same financial concept, wrong context
   - lexical match but semantically wrong
   - semantic match but unsupported by the chunk

3. False-negative filtering

   Avoid training against chunks that are actually relevant.

   Flag suspicious negatives using:

   - near-duplicate detection against the positive
   - high lexical overlap with the positive
   - high sparse score against the query
   - same company and same section
   - optional reranker/LLM adjudication

   Suspicious negatives should be dropped or downgraded, not blindly used as hard negatives.

4. Multiple negatives per query

   Preserve multiple hard negatives per query instead of only using the first non-empty negative.

   Target row shape:

   ```json
   {
     "query": "...",
     "positive": "Company: ...\n\n...",
     "negatives": [
       "Company: ... wrong but plausible ...",
       "Company: ... lexical hard negative ...",
       "Company: ... semantic hard negative ..."
     ],
     "metadata": {
       "company": "...",
       "doc_type": "...",
       "query_type": "..."
     }
   }
   ```

5. Slice balancing

   Inspect and optionally rebalance by:

   - query type
   - company
   - document type
   - filing year
   - section
   - query length
   - document length

   Goal: prevent the model from becoming strong only on the most frequent companies/forms/query patterns.

6. Frozen eval split

   Before training v2, freeze a dev/test set that is not massaged by the training-data pipeline.

   Preferred split strategies:

   - hold out companies entirely
   - hold out later filing dates
   - hold out specific query types
   - maintain a fixed human-reviewed eval set

## Phase 4: stronger sparse training

Run controlled v2 sweeps after dataset improvements. Keep the current v1 model as the baseline.

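The v2 sweeps assume rows built with the Phase 3 changes (metadata prefixes, multiple negatives, suspicious negatives dropped). A minimal sketch of that row construction follows, assuming plain Python dicts and using token-level Jaccard overlap as a cheap stand-in for near-duplicate detection. The helper names `add_metadata_prefix`, `token_jaccard`, and `build_row`, and the `max_overlap` and `max_negatives` defaults, are hypothetical and not part of the v1 pipeline.

```python
import re


def add_metadata_prefix(chunk_text, company=None, ticker=None, form=None,
                        year=None, section=None):
    """Prepend whichever structured fields are available; skip missing ones."""
    fields = [
        ("Company", company),
        ("Ticker", ticker),
        ("Form", form),
        ("Filing year", year),
        ("Section", section),
    ]
    header = "\n".join(f"{name}: {value}" for name, value in fields
                       if value is not None)
    return f"{header}\n\n{chunk_text}" if header else chunk_text


def token_jaccard(a, b):
    """Jaccard overlap of lowercased word sets (assumed lexical-overlap proxy)."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_row(query, positive, candidate_negatives, metadata,
              max_overlap=0.8, max_negatives=4):
    """Keep several negatives per query, dropping likely false negatives."""
    kept = []
    for negative in candidate_negatives:
        if negative == positive or negative in kept:
            continue  # exact duplicate of the positive or of a kept negative
        if token_jaccard(negative, positive) >= max_overlap:
            continue  # too close to the positive: possible false negative
        kept.append(negative)
        if len(kept) == max_negatives:
            break
    return {
        "query": query,
        "positive": positive,
        "negatives": kept,
        "metadata": metadata,
    }
```

The overlap threshold is a placeholder; in practice it would be tuned alongside the other flags listed under false-negative filtering, such as sparse score against the query or reranker adjudication.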
Candidate sweep dimensions:

| variable | candidates |
|---|---|
| training rows | v1 raw, v2 metadata, v2 mined negatives |
| steps | 1500, 3000 |
| document regularization | `5e-3`, `1e-2`, `2e-2` |
| query regularization | start with `1e-4`; adjust only if query vectors become too dense |
| document top-k | 64, 128, 256 |
| negatives per query | 1, 2, 4+ |

Primary selection metric should not be triplet accuracy alone. Use:

- retrieval proxy nDCG@10
- retrieval proxy MRR@10
- Recall@1
- Recall@10
- OpenSearch hybrid benchmark, once available
- document active dimensions / index footprint

## Phase 5: reranker and distillation

A CrossEncoder reranker can improve both training data and production retrieval.

Use cases:

1. Rerank top candidates after BM25 + sparse retrieval.
2. Judge whether mined negatives are truly negative.
3. Create better hard negatives for sparse training.
4. Act as a teacher for distillation back into the sparse encoder.

Candidate workflow:

1. Retrieve candidate pools using BM25, base sparse, and v1 sparse.
2. Score query-document pairs with a stronger reranker or LLM-based judge.
3. Keep high-confidence positives and high-confidence hard negatives.
4. Retrain the sparse model on cleaner supervision.
5. Compare v2 sparse against v1 sparse, BM25, and hybrid retrieval.

## Relevant Sentence Transformers / HF skill levers

The current v1 work used the SparseEncoder training path. The following skill capabilities are relevant for the next iterations:

| capability | use here |
|---|---|
| SparseEncoder training | Continue v2 learned-sparse experiments. |
| Hard-negative mining | Build stronger training negatives from BM25/base/v1 retrieval. |
| Evaluator selection | Add more official retrieval evaluators where useful. |
| CrossEncoder training | Train or use a reranker for second-stage ranking and data cleaning. |
| Distillation | Distill a reranker/teacher into the sparse encoder. |
| HF Datasets tooling | Inspect slices, parquet metadata, and dataset revisions. |
| HF Jobs | Run larger sweeps off local hardware if needed. |
| Trackio | Track metrics across sweeps instead of relying on ad hoc logs. |

Less relevant right now:

- Matryoshka: mostly useful for dense embeddings, not the main SPLADE-style sparse path.
- Multilingual training: not needed unless the filing/query workload becomes multilingual.
- Vocabulary changes: not recommended as a first lever; metadata prefixes and data improvements are cheaper and safer.

## Recommended immediate next command sequence

1. Run full-test retrieval proxy for v1.
2. Add slice analysis to the retrieval proxy script.
3. Run BM25 and base sparse on the same full-test proxy.
4. Build a small OpenSearch benchmark with top-128 vectors.
5. Only then start v2 data massaging and retraining.

The main principle: prove where v1 fails before changing both the data and the model at the same time.
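
Steps 1 to 3 of the sequence above all report the same rank metrics, so computing them in one place keeps v1, base sparse, and BM25 comparable. A minimal sketch, assuming exactly one labeled positive per query and that each query's 1-based positive rank has already been computed (`None` when the positive was not retrieved). `retrieval_metrics` is a hypothetical helper, not the existing proxy script.

```python
import math
from statistics import mean, median


def retrieval_metrics(ranks, recall_ks=(1, 5, 10, 20), cutoff=10):
    """Aggregate rank metrics for single-positive retrieval.

    ranks: one entry per query, the 1-based rank of its positive chunk,
           or None if the positive never appeared in the ranked list.
    """
    n = len(ranks)
    if n == 0:
        raise ValueError("ranks must not be empty")

    metrics = {}
    for k in recall_ks:
        hits = sum(1 for r in ranks if r is not None and r <= k)
        metrics[f"recall@{k}"] = hits / n

    # Reciprocal rank counts only positives found within the cutoff.
    metrics[f"mrr@{cutoff}"] = sum(
        1.0 / r for r in ranks if r is not None and r <= cutoff
    ) / n

    # With one relevant document the ideal DCG is 1, so nDCG reduces to
    # 1 / log2(rank + 1) for positives inside the cutoff.
    metrics[f"ndcg@{cutoff}"] = sum(
        1.0 / math.log2(r + 1) for r in ranks if r is not None and r <= cutoff
    ) / n

    # Median and mean rank are taken over retrieved positives only (assumption).
    found = [r for r in ranks if r is not None]
    metrics["median_rank"] = median(found) if found else None
    metrics["mean_rank"] = mean(found) if found else None
    return metrics
```

Calling the same function on the rank list from each retriever, or on a subset of ranks grouped by `query_type` or `company`, gives the slice breakdown from Phase 1 without a second metric implementation.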