
Next Steps: Production-Readiness Roadmap

This file captures follow-up experiments for oneryalcin/financial-filings-sparse-encoder-v1 so the work can resume from the Hub repo without relying on chat history.

Current status

The v1 model is a useful open experiment and candidate learned-sparse baseline for financial filing retrieval.

Current recommendation:

  • Use encode_query(...) for online queries.
  • Use encode_document(...) offline for document/chunk indexing.
  • Prune document vectors to top-128 terms before indexing.
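The pruning step can be illustrated independently of the model. The following is a minimal sketch, assuming a document vector has already been decoded into a term-to-weight dictionary; the helper name `prune_top_k` and the sample weights are hypothetical and not part of the v1 scripts.

```python
import heapq

def prune_top_k(vector, k=128):
    """Keep the k highest-weight terms of a sparse vector (term -> weight)."""
    if k <= 0:
        return {}
    # Non-positive weights carry no signal in a postings list, so drop them first.
    active = ((term, w) for term, w in vector.items() if w > 0.0)
    return dict(heapq.nlargest(k, active, key=lambda item: item[1]))

# Toy vector; real document vectors would come from encode_document(...).
doc_vector = {"revenue": 2.1, "risk": 1.4, "apple": 0.9, "the": 0.05, "zero": 0.0}
pruned = prune_top_k(doc_vector, k=3)
print(pruned)
```

Pruning is applied to documents only in the recommendation above; whether query vectors also benefit from pruning is left open here.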

Current best proxy result:

| model | pruning | Recall@1 | Recall@10 | MRR@10 | nDCG@10 |
|---|---|---|---|---|---|
| fine-tuned sparse | top-128 | 38.6% | 67.2% | 0.473 | 0.521 |
| base sparse | top-128 | 31.4% | 62.7% | 0.422 | 0.472 |
| BM25 | lexical | 24.0% | 64.1% | 0.397 | 0.457 |

Caveat: this is a 1,000-query in-memory proxy over 1,912 held-out candidate chunks, not a production OpenSearch/Elasticsearch benchmark.
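For reference, metrics of this kind can be derived from the rank of the labeled positive for each query. The sketch below assumes exactly one labeled positive per query (an assumption about the proxy, not something confirmed here), in which case the ideal DCG is 1 and nDCG@10 reduces to 1/log2(1 + rank). The function name `rank_metrics` is hypothetical.

```python
import math
from statistics import mean, median

def rank_metrics(ranks, ks=(1, 5, 10, 20), cutoff=10):
    """Aggregate metrics from the 1-based rank of each query's labeled positive."""
    n = len(ranks)
    out = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # Reciprocal rank counts only when the positive lands inside the cutoff.
    out[f"mrr@{cutoff}"] = sum(1.0 / r for r in ranks if r <= cutoff) / n
    # Single-positive assumption: ideal DCG is 1, so nDCG is the discounted gain.
    out[f"ndcg@{cutoff}"] = sum(1.0 / math.log2(1 + r) for r in ranks if r <= cutoff) / n
    out["median_rank"] = median(ranks)
    out["mean_rank"] = mean(ranks)
    return out

# Toy ranks for four queries; real ranks would come from the retrieval proxy.
metrics = rank_metrics([1, 3, 12, 2])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

If a positive is never retrieved, one option is to assign it a rank larger than the candidate pool so that it counts as a miss at every cutoff.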

Phase 1: evaluate v1 harder, no retraining

Run these before changing the training data or model recipe.

  1. Full held-out retrieval proxy

    Run the in-memory retrieval proxy on the full test split, not only the first 1,000 usable rows.

    Report:

    • Recall@1
    • Recall@5
    • Recall@10
    • Recall@20
    • MRR@10
    • nDCG@10
    • median rank
    • mean rank
  2. Slice analysis

    Break metrics down by available metadata:

    • query_type
    • company
    • doc_type
    • filing year, if available
    • section, if available
    • query length bucket
    • document length bucket

    Goal: identify whether the model is genuinely broad or only strong on dominant slices.

  3. Error analysis

    Sample failures from:

    • top-1 wrong results
    • positive rank greater than 10
    • cases where BM25 wins and sparse loses
    • cases where sparse wins and BM25 loses

    For each failure, classify likely cause:

    • false negative or incomplete label
    • query too broad
    • chunk lacks enough context
    • same-company wrong section
    • same-topic wrong company
    • lexical miss
    • semantic overmatch
  4. Encoding-speed benchmark

    Run a warm-cache benchmark for:

    • query encoding latency
    • document encoding throughput
    • top-128 pruning overhead
    • serialization overhead

    Do not compare cold Hub load times. Download/cache models first, run warmup batches, then measure.

  5. Index-size estimate

    Estimate sparse index footprint for:

    • unpruned
    • top-256
    • top-128
    • top-64

    Report average active terms and approximate postings count.
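The index-size estimate in item 5 can be approximated before any search engine is involved. This is a rough sketch, assuming each document vector is available as a term-to-weight dictionary and that one posting is stored per active term per document; `postings_estimate` and the synthetic vectors are hypothetical.

```python
from statistics import mean

def postings_estimate(doc_vectors, top_k=None):
    """Return (average active terms per document, total postings) after optional top-k pruning."""
    counts = []
    for vec in doc_vectors:
        active = sum(1 for w in vec.values() if w > 0.0)
        # Pruning can only reduce the count; short vectors are unaffected.
        counts.append(active if top_k is None else min(active, top_k))
    return mean(counts), sum(counts)

# Synthetic stand-ins: one dense-ish vector (300 terms) and one short one (100 terms).
docs = [
    {f"t{i}": 1.0 for i in range(300)},
    {f"t{i}": 1.0 for i in range(100)},
]
report = {}
for k in (None, 256, 128, 64):
    label = "unpruned" if k is None else f"top-{k}"
    report[label] = postings_estimate(docs, k)
    print(label, report[label])
```

The postings count is only a proxy for on-disk size; actual footprint depends on the engine's encoding and compression, which this sketch does not model.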

Phase 2: OpenSearch/Elasticsearch benchmark

Build a real sparse retrieval benchmark using OpenSearch or Elasticsearch.

Variants to compare:

| variant | purpose |
|---|---|
| BM25 only | lexical baseline |
| learned sparse top-128 only | current recommended model path |
| learned sparse top-64 only | compact model path |
| BM25 + learned sparse RRF | likely robust hybrid baseline |
| BM25 + learned sparse weighted score | tuneable hybrid candidate |
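The two hybrid variants can be prototyped outside the search engine first. The sketch below shows reciprocal rank fusion (RRF) with the commonly used constant k=60, and a weighted sum over min-max normalized scores. The normalization choice, the `alpha` parameter, and the function names are assumptions for illustration; an engine's built-in hybrid scoring may behave differently.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def min_max(scores):
    """Scale scores to [0, 1]; a constant score set maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

def weighted_fuse(bm25_scores, sparse_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight on the sparse side."""
    nb, ns = min_max(bm25_scores), min_max(sparse_scores)
    fused = {d: (1 - alpha) * nb.get(d, 0.0) + alpha * ns.get(d, 0.0)
             for d in set(nb) | set(ns)}
    return sorted(fused, key=lambda d: (-fused[d], d))

# Toy inputs: ranked ids for RRF, raw scores for the weighted variant.
bm25_ranking, sparse_ranking = ["d1", "d2", "d3"], ["d3", "d1", "d4"]
bm25_scores = {"d1": 12.0, "d2": 9.0, "d3": 6.0}
sparse_scores = {"d3": 30.0, "d1": 20.0, "d4": 10.0}

rrf_order = rrf_fuse([bm25_ranking, sparse_ranking])
weighted_order = weighted_fuse(bm25_scores, sparse_scores, alpha=0.9)
print(rrf_order, weighted_order)
```

RRF needs only ranks, which makes it robust to the very different score scales of BM25 and learned sparse retrieval; the weighted variant adds a tunable parameter that should be chosen on the dev set, not the test set.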

Report:

  • Recall@1
  • Recall@5
  • Recall@10
  • Recall@20
  • MRR@10
  • nDCG@10
  • p50 query latency
  • p95 query latency
  • index size
  • average sparse terms per document
  • query encoding latency
  • total query latency including model encoding
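The latency figures in this list can be collected with a small harness. This is a sketch under stated assumptions: `benchmark` and `fake_encode` are hypothetical names, the stand-in function replaces a real encode-plus-search call, and percentiles use the standard library's inclusive method.

```python
import time
from statistics import quantiles

def benchmark(fn, inputs, warmup=5):
    """Time fn(x) per input after unmeasured warm-up calls; report p50/p95 in milliseconds."""
    for x in inputs[:warmup]:
        fn(x)  # warm caches and lazy initialization; not measured
    samples = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points; index 49 is the median, index 94 the 95th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "n": len(samples)}

def fake_encode(query):
    """Stand-in workload; replace with query encoding plus the search request."""
    return sum(i * i for i in range(2000))

result = benchmark(fake_encode, [f"query {i}" for i in range(50)])
print(result)
```

Measuring encoding and search separately as well as end to end makes it easier to see which side dominates total query latency.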

Important: validate that the OpenSearch/Elasticsearch query DSL computes the same sparse dot-product score as the in-memory proxy; otherwise the two benchmarks are not comparable.
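One way to run this check is to score a handful of queries with a plain reference implementation and compare the resulting order with what the engine returns. The sketch below assumes query and document vectors are term-to-weight dictionaries; `sparse_dot`, `rank_by_dot`, and `same_top_k` are hypothetical helpers, and `engine_ids` stands for document ids parsed from the engine's response.

```python
def sparse_dot(query, doc):
    """Dot product over the terms shared by two sparse vectors."""
    # Iterate over the smaller vector to keep the lookup count low.
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(w * large[t] for t, w in small.items() if t in large)

def rank_by_dot(query, docs):
    """Return (doc_id, score) pairs sorted by descending score, ties broken by id."""
    scored = [(doc_id, sparse_dot(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

def same_top_k(reference, engine_ids, k=10):
    """True when the engine's top-k ids match the reference order exactly."""
    return [doc_id for doc_id, _ in reference[:k]] == list(engine_ids[:k])

# Toy vectors; real ones would be the pruned vectors that were indexed.
query = {"revenue": 1.5, "growth": 0.5}
docs = {
    "a": {"revenue": 2.0, "risk": 1.0},
    "b": {"growth": 3.0},
    "c": {"risk": 4.0},
}
reference = rank_by_dot(query, docs)
print(reference)
print(same_top_k(reference, ["a", "b", "c"], k=3))
```

Small differences can still be harmless: engines may quantize stored weights or break ties differently, so comparing scores within a tolerance is often more informative than requiring identical order.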

Phase 3: dataset v2

Create a new dataset revision or new dataset repo. Do not overwrite the v1 training data silently.

Candidate data improvements:

  1. Metadata prefixes

    Add available structured context to documents:

    Company: Apple Inc.
    Ticker: AAPL
    Form: 10-K
    Filing year: 2024
    Section: Risk Factors
    
    [chunk text]
    

    Expected benefit: better company/form/year/section matching without changing tokenizer vocabulary.

  2. Hard-negative mining

    For each query, mine candidate negatives from:

    • BM25
    • base sparse model
    • v1 fine-tuned sparse model
    • future hybrid retriever

    Useful negative types:

    • same company, wrong section
    • same company, wrong filing year
    • same form type, different company
    • same financial concept, wrong context
    • lexical match but semantically wrong
    • semantic match but unsupported by the chunk
  3. False-negative filtering

    Avoid training against chunks that are actually relevant.

    Flag suspicious negatives using:

    • near-duplicate detection against the positive
    • high lexical overlap with the positive
    • high sparse score against the query
    • same company and same section
    • optional reranker/LLM adjudication

    Suspicious negatives should be dropped or downgraded, not blindly used as hard negatives.

  4. Multiple negatives per query

    Preserve multiple hard negatives per query instead of only using the first non-empty negative.

    Target row shape:

    {
      "query": "...",
      "positive": "Company: ...\n\n...",
      "negatives": [
        "Company: ... wrong but plausible ...",
        "Company: ... lexical hard negative ...",
        "Company: ... semantic hard negative ..."
      ],
      "metadata": {
        "company": "...",
        "doc_type": "...",
        "query_type": "..."
      }
    }
    
  5. Slice balancing

    Inspect and optionally rebalance by:

    • query type
    • company
    • document type
    • filing year
    • section
    • query length
    • document length

    Goal: prevent the model from becoming strong only on the most frequent companies/forms/query patterns.

  6. Frozen eval split

    Before training v2, freeze a dev/test set that the training-data pipeline does not touch or alter.

    Preferred split strategies:

    • hold out companies entirely
    • hold out later filing dates
    • hold out specific query types
    • maintain a fixed human-reviewed eval set
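Items 1, 3, and 4 above can be combined into a single row builder. The sketch below is illustrative only: the helper names, the metadata keys `ticker`, `filing_year`, and `section`, and the 0.8 token-overlap threshold are assumptions, and token overlap is just one of the suspicious-negative signals listed in item 3.

```python
PREFIX_FIELDS = [
    ("Company", "company"), ("Ticker", "ticker"), ("Form", "doc_type"),
    ("Filing year", "filing_year"), ("Section", "section"),
]

def add_metadata_prefix(chunk, meta):
    """Prepend available metadata as 'Label: value' lines, then a blank line."""
    lines = [f"{label}: {meta[key]}" for label, key in PREFIX_FIELDS if meta.get(key)]
    return "\n".join(lines) + "\n\n" + chunk if lines else chunk

def token_overlap(a, b):
    """Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_row(query, query_type, positive, candidates, max_overlap=0.8, max_negatives=4):
    """Build a v2-shaped row, dropping negatives that look like near-duplicates of the positive."""
    kept = [c for c in candidates
            if token_overlap(c["text"], positive["text"]) < max_overlap][:max_negatives]
    return {
        "query": query,
        "positive": add_metadata_prefix(positive["text"], positive["meta"]),
        "negatives": [add_metadata_prefix(c["text"], c["meta"]) for c in kept],
        "metadata": {
            "company": positive["meta"].get("company"),
            "doc_type": positive["meta"].get("doc_type"),
            "query_type": query_type,
        },
    }

meta = {"company": "Apple Inc.", "ticker": "AAPL", "doc_type": "10-K",
        "filing_year": 2024, "section": "Risk Factors"}
positive = {"text": "Net sales increased 6 percent driven by services revenue", "meta": meta}
candidates = [
    # Near-duplicate of the positive: likely a false negative, so it is dropped.
    {"text": "Net sales increased 6 percent driven by services revenue growth", "meta": meta},
    {"text": "Supply chain disruptions could adversely affect gross margin", "meta": meta},
]
row = build_row("what drove net sales growth", "factual", positive, candidates)
print(row["positive"])
print(len(row["negatives"]))
```

Dropped candidates are worth logging instead of discarding silently, since they are natural inputs for the optional reranker/LLM adjudication step.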

Phase 4: stronger sparse training

Run controlled v2 sweeps after dataset improvements.

Keep the current v1 model as the baseline.

Candidate sweep dimensions:

| variable | candidates |
|---|---|
| training rows | v1 raw, v2 metadata, v2 mined negatives |
| steps | 1500, 3000 |
| document regularization | 5e-3, 1e-2, 2e-2 |
| query regularization | start with 1e-4; adjust only if query vectors become too dense |
| document top-k | 64, 128, 256 |
| negatives per query | 1, 2, 4+ |
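To see how large this sweep space is, the dimensions can be expanded into explicit run configurations. This is a sketch with hypothetical key names; it takes 4 as the representative value for "4+" and keeps query regularization fixed at its starting value.

```python
from itertools import product

SWEEP = {
    "training_rows": ["v1_raw", "v2_metadata", "v2_mined_negatives"],
    "steps": [1500, 3000],
    "document_regularization": [5e-3, 1e-2, 2e-2],
    "query_regularization": [1e-4],  # starting value only
    "document_top_k": [64, 128, 256],
    "negatives_per_query": [1, 2, 4],  # 4 stands in for "4+"
}

def expand(grid):
    """Return one config dict per combination of sweep values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

configs = expand(SWEEP)
print(len(configs))
print(configs[0])
```

The full cross product comes to 162 runs, so in practice it is probably more economical to vary one or two dimensions at a time around the v1 settings than to run every combination.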

Primary selection metric should not be triplet accuracy alone.

Use:

  • retrieval proxy nDCG@10
  • retrieval proxy MRR@10
  • Recall@1
  • Recall@10
  • OpenSearch hybrid benchmark, once available
  • document active dimensions / index footprint

Phase 5: reranker and distillation

A CrossEncoder reranker can improve both training data and production retrieval.

Use cases:

  1. Rerank top candidates after BM25 + sparse retrieval.
  2. Judge whether mined negatives are truly negative.
  3. Create better hard negatives for sparse training.
  4. Act as a teacher for distillation back into the sparse encoder.

Candidate workflow:

  1. Retrieve candidate pools using BM25, base sparse, and v1 sparse.
  2. Score query-document pairs with a stronger reranker or LLM-based judge.
  3. Keep high-confidence positives and high-confidence hard negatives.
  4. Retrain the sparse model on cleaner supervision.
  5. Compare v2 sparse against v1 sparse, BM25, and hybrid retrieval.
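Steps 2 and 3 of this workflow amount to thresholding teacher scores. The sketch below assumes a teacher that returns a relevance score in [0, 1]; `filter_with_teacher`, `toy_score`, and both thresholds are hypothetical. In practice `score_fn` could wrap a reranker such as a Sentence Transformers CrossEncoder, whose score range depends on the model.

```python
def filter_with_teacher(query, positive, candidates, score_fn, pos_min=0.7, neg_max=0.3):
    """Keep a row only if the teacher confirms the positive; keep only clearly negative candidates."""
    if score_fn(query, positive) < pos_min:
        return None  # the label itself looks doubtful, so skip the row
    # Candidates scoring between neg_max and pos_min are ambiguous and are dropped.
    negatives = [c for c in candidates if score_fn(query, c) <= neg_max]
    return {"query": query, "positive": positive, "negatives": negatives}

def toy_score(query, doc):
    """Stand-in teacher: fraction of query tokens present in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

query = "apple services revenue growth"
positive = "Apple services revenue growth accelerated"
candidates = [
    "Apple services revenue declined",  # scores 0.75: possibly relevant, dropped
    "Lease obligations increased",      # scores 0.0: kept as a confident negative
]
row = filter_with_teacher(query, positive, candidates, toy_score)
print(row["negatives"])
```

Thresholds like these need calibration against a small human-reviewed sample, since a teacher that is too strict removes exactly the hard negatives the sparse model would learn most from.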

Relevant Sentence Transformers / HF skill levers

The current v1 work used the SparseEncoder training path. The following skill capabilities are relevant for the next iterations:

| capability | use here |
|---|---|
| SparseEncoder training | Continue v2 learned-sparse experiments. |
| Hard-negative mining | Build stronger training negatives from BM25/base/v1 retrieval. |
| Evaluator selection | Add more official retrieval evaluators where useful. |
| CrossEncoder training | Train or use a reranker for second-stage ranking and data cleaning. |
| Distillation | Distill a reranker/teacher into the sparse encoder. |
| HF Datasets tooling | Inspect slices, parquet metadata, and dataset revisions. |
| HF Jobs | Run larger sweeps off local hardware if needed. |
| Trackio | Track metrics across sweeps instead of relying on ad hoc logs. |

Less relevant right now:

  • Matryoshka: mostly useful for dense embeddings, not the main SPLADE-style sparse path.
  • Multilingual training: not needed unless the filing/query workload becomes multilingual.
  • Vocabulary changes: not recommended as a first lever; metadata prefixes and data improvements are cheaper and safer.

Recommended immediate next command sequence

  1. Run full-test retrieval proxy for v1.
  2. Add slice analysis to the retrieval proxy script.
  3. Run BM25 and base sparse on the same full-test proxy.
  4. Build a small OpenSearch benchmark with top-128 vectors.
  5. Only then start v2 data massaging and retraining.
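Step 2 can start small. The following sketch assumes the proxy script emits one record per query with the positive's 1-based rank and whatever metadata is available; `recall_by_slice` and the sample rows are hypothetical.

```python
from collections import defaultdict

def recall_by_slice(rows, key, k=10):
    """Group per-query results by a metadata field and report Recall@k with the slice size."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        name = row.get(key) or "unknown"  # missing metadata gets its own bucket
        totals[name] += 1
        hits[name] += row["rank"] <= k
    return {name: {"n": totals[name], f"recall@{k}": hits[name] / totals[name]}
            for name in totals}

rows = [
    {"rank": 1, "query_type": "numeric", "doc_type": "10-K"},
    {"rank": 4, "query_type": "numeric", "doc_type": "10-Q"},
    {"rank": 15, "query_type": "risk", "doc_type": "10-K"},
    {"rank": 2, "doc_type": "10-K"},
]
by_query_type = recall_by_slice(rows, "query_type")
for name, stats in sorted(by_query_type.items()):
    print(name, stats)
```

Reporting the slice size next to each metric helps avoid over-reading results from slices with only a handful of queries.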

The main principle: prove where v1 fails before changing both the data and the model at the same time.