
Retrieval Contract -- Stage 2 (Retrieval Grounding / Candidate Generation)

Stage 2 performs retrieval grounding over a closed vocabulary of canonical e621-style tags. It does not "tag images", and it does not do free-form generation. Its job is to produce a high-recall, inspectable candidate pool for downstream closed-set selection.


Inputs

  • rewrite_phrases: list[str]

    • Output of Stage 1 query rewriting (comma-separated "tag-shaped" phrases).
    • These are not canonical tags and are not underscore-joined. High recall is preferred.
  • allow_nsfw_tags: bool

    • If false, filter out tags in the project's nsfw_tags set.
  • verbose: bool

    • If true, return per-phrase debug reports.

Normalization and phrase expansion

  1. Normalize rewrite phrases for internal processing:

    • lowercase
    • strip leading/trailing whitespace
    • collapse internal whitespace to a single space
  2. Treat the phrase list as a set (dedupe after normalization).

  3. Head-noun expansion:

    • For each multi-token phrase, add its head noun (last token) as an additional phrase.
    • Apply the same set semantics so duplicates are processed once.

Example:

  • Input phrases: ["big shirt", "grey shirt"]
  • Final phrase set: {"big shirt", "grey shirt", "shirt"}
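
The normalization and head-noun steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name expand_phrases is hypothetical, and dropping phrases that are empty after normalization is an assumption the contract does not state.

```python
import re


def expand_phrases(rewrite_phrases):
    """Normalize, dedupe, and add head nouns (hypothetical helper name)."""
    normalized = set()
    for phrase in rewrite_phrases:
        # lowercase, strip, and collapse internal whitespace to a single space
        p = re.sub(r"\s+", " ", phrase.strip().lower())
        if p:  # assumption: phrases that normalize to "" are dropped
            normalized.add(p)

    expanded = set(normalized)
    for p in normalized:
        tokens = p.split(" ")
        if len(tokens) > 1:
            expanded.add(tokens[-1])  # head noun = last token
    return expanded


print(sorted(expand_phrases(["Big  Shirt", "grey shirt ", "big shirt"])))
# ['big shirt', 'grey shirt', 'shirt']
```

Because the result is a set, "shirt" is processed once even though two phrases contribute it.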

Candidate generation per phrase (FastText neighbors + canonicalization)

For each phrase p in the final phrase set:

  1. Convert to lookup form:

    • lookup = p.replace(" ", "_")
  2. Retrieve neighbors using FastText:

    • neighbors = fasttext.most_similar(lookup, topn=per_phrase_k)
    • Note: FastText neighbors may include alias tokens and other non-canonical strings.
  3. Project neighbor tokens to canonical tags (alias -> canonical):

    • If a neighbor token is already a canonical tag (token is in tag_counts OR token has a TF-IDF row in tag_to_row_index), it maps to itself.
    • Else if it is an alias, map it via alias2tags[token] (may map to multiple canonical tags).
    • Else, drop it (not in closed vocabulary).
  4. Deduplicate by canonical tag within this phrase:

    • Keep the canonical tag with the highest FastText similarity among all tokens that mapped to it.
    • Record the token that achieved that max similarity as alias_token for verbose reporting ("best token wins").
  5. Exact-match injection:

    • Project the phrase's own lookup through the same projection logic.
    • For each canonical tag produced by that projection, inject it into the candidate set with:
      • score_fasttext = 1.0
      • alias_token = lookup
    • This ensures the phrase's own canonical tag appears even though most_similar() often does not return the query token itself.
  6. Apply NSFW filtering (if allow_nsfw_tags=False):

    • Drop candidate canonical tags that are present in nsfw_tags.

Result: for each phrase, we have a set of canonical candidate tags with:

  • score_fasttext
  • alias_token (token that produced the best FastText score for that canonical tag)
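
A sketch of steps 3 to 6 on toy data may help make the "best token wins" and exact-match rules concrete. It assumes the neighbors arrive as (token, similarity) pairs; the function names and all toy values are hypothetical, and no FastText model is loaded here.

```python
def project_token(token, tag_counts, tag_to_row_index, alias2tags):
    """Map one token to zero or more canonical tags (alias -> canonical)."""
    if token in tag_counts or token in tag_to_row_index:
        return [token]                      # already canonical
    return list(alias2tags.get(token, []))  # alias, or [] if out of vocabulary


def candidates_for_phrase(phrase, neighbors, tag_counts, tag_to_row_index,
                          alias2tags, nsfw_tags, allow_nsfw_tags):
    lookup = phrase.replace(" ", "_")
    best = {}  # canonical tag -> (score_fasttext, alias_token)

    # Projection + per-phrase dedupe: keep the highest similarity per canonical tag.
    for token, similarity in neighbors:
        for tag in project_token(token, tag_counts, tag_to_row_index, alias2tags):
            if tag not in best or similarity > best[tag][0]:
                best[tag] = (similarity, token)

    # Exact-match injection: the phrase's own projection is pinned at 1.0.
    for tag in project_token(lookup, tag_counts, tag_to_row_index, alias2tags):
        best[tag] = (1.0, lookup)

    # NSFW filtering.
    if not allow_nsfw_tags:
        best = {tag: v for tag, v in best.items() if tag not in nsfw_tags}
    return best


# Toy inputs (made up for illustration).
toy_neighbors = [("tshirt", 0.80), ("t-shirt", 0.90),
                 ("explicit_example", 0.70), ("zzz", 0.40)]
toy_result = candidates_for_phrase(
    "shirt", toy_neighbors,
    tag_counts={"shirt": 200, "t-shirt": 100, "explicit_example": 5},
    tag_to_row_index={},
    alias2tags={"tshirt": ["t-shirt"]},
    nsfw_tags={"explicit_example"},
    allow_nsfw_tags=False,
)
print(toy_result)
# {'t-shirt': (0.9, 't-shirt'), 'shirt': (1.0, 'shirt')}
```

In the toy run, the alias "tshirt" and the canonical "t-shirt" both map to t-shirt, and the higher-scoring token is recorded as alias_token; "zzz" is dropped as out of vocabulary.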

Context similarity (TF-IDF -> SVD cosine)

Stage 2 computes one query context vector for the entire request:

  1. Build a pseudo TF-IDF vector from the final phrase set (deduped + head nouns):

    • Convert each phrase to underscore form (same lookup rule).
    • Terms that exist in the TF-IDF vocabulary (underscore lookups) contribute (term_count * idf(term)).
    • OOV terms contribute nothing (but may be reported in verbose mode).
  2. Project to SVD space and L2-normalize:

    • query_vec = normalize(svd.transform(tfidf_vec))

If the query vector has zero norm (no recognized TF-IDF terms), then query_has_context = False and:

  • score_context = None for all candidates
  • score_combined = score_fasttext (FastText-only)

If query_has_context = True, compute per-candidate cosine similarity when possible:

  • For tags that have a TF-IDF/SVD row: score_context_by_tag[tag] = dot(query_vec, reduced_matrix_norm[row])
  • For tags that lack a TF-IDF/SVD row: initial score_context = None (may be imputed per-phrase)
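
The query-vector construction can be sketched in plain Python. This assumes the SVD projection is a linear map given by a matrix of component rows (as with scikit-learn's TruncatedSVD, where transform multiplies by the transposed components); the names build_query_vec, vocab_index, idf, and components are hypothetical, and the toy numbers are invented.

```python
import math
from collections import Counter


def build_query_vec(phrases, vocab_index, idf, components):
    """Return (query_vec or None, sorted OOV terms). Hypothetical helper."""
    counts = Counter(p.replace(" ", "_") for p in phrases)
    tfidf = [0.0] * len(vocab_index)
    oov_terms = []
    for term, term_count in counts.items():
        col = vocab_index.get(term)
        if col is None:
            oov_terms.append(term)            # OOV terms contribute nothing
        else:
            tfidf[col] = term_count * idf[col]

    # Linear projection into the reduced space: one dot product per component row.
    reduced = [sum(w * x for w, x in zip(row, tfidf)) for row in components]
    norm = math.sqrt(sum(x * x for x in reduced))
    if norm == 0.0:
        return None, sorted(oov_terms)        # query_has_context = False
    return [x / norm for x in reduced], sorted(oov_terms)


def context_score(query_vec, tag_vec_normalized):
    """Cosine similarity, assuming both vectors are already L2-normalized."""
    return sum(a * b for a, b in zip(query_vec, tag_vec_normalized))


toy_vocab = {"shirt": 0, "grey_shirt": 1}
toy_idf = [1.0, 2.0]
toy_components = [[1.0, 0.0], [0.0, 1.0]]     # identity projection, for illustration
toy_query_vec, toy_oov = build_query_vec(
    {"shirt", "grey shirt", "big shirt"}, toy_vocab, toy_idf, toy_components)
print(toy_oov)  # ['big_shirt']
```

A None query vector corresponds to query_has_context = False, in which case no context scores are computed.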

Missing context policy (per-phrase, q=0.10)

If query_has_context = True and a candidate tag has score_context = None:

  • For that phrase, compute default_context_for_phrase as the 10th percentile (q=0.10) of the available (non-None) context scores among that phrase's candidates.
  • If there are no available context scores for that phrase, default_context_for_phrase = 0.0.
  • Impute missing context scores using default_context_for_phrase and mark:
    • context_imputed = True
  • Otherwise:
    • context_imputed = False
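
A sketch of the per-phrase imputation follows. The contract does not say which percentile definition is used; this example assumes linear interpolation between order statistics (the default of numpy.quantile), and the helper names are hypothetical.

```python
def quantile_linear(values, q):
    """q-quantile with linear interpolation (assumed percentile definition)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)                      # floor, since pos >= 0
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def impute_context(rows, q=0.10):
    """Fill missing score_context values for one phrase's candidate rows."""
    available = [r["score_context"] for r in rows if r["score_context"] is not None]
    default_context_for_phrase = quantile_linear(available, q) if available else 0.0
    out = []
    for r in rows:
        missing = r["score_context"] is None
        out.append({
            **r,
            "score_context": default_context_for_phrase if missing else r["score_context"],
            "context_imputed": missing,
        })
    return out


toy_rows = [
    {"tag": "a", "score_context": 0.2},
    {"tag": "b", "score_context": 0.4},
    {"tag": "c", "score_context": 0.6},
    {"tag": "d", "score_context": None},   # no TF-IDF/SVD row
]
toy_imputed = impute_context(toy_rows)
```

With the toy rows, the default is 0.2 + 0.2 * (0.4 - 0.2), which is approximately 0.24, so tag "d" receives a low but nonzero context score rather than being penalized to zero.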

Score fusion (FastText + Context)

Compute a fused score per phrase candidate:

  • If query_has_context = False:

    • score_combined = score_fasttext
  • Else:

    • score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context
    • (score_context may be imputed as described above)
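
As a worked example of the fusion rule, here is a small sketch. The function name fuse is hypothetical, and the context_weight of 0.25 is chosen only for illustration; the contract does not state a default.

```python
def fuse(score_fasttext, score_context, context_weight, query_has_context):
    """Linear interpolation between FastText and context scores."""
    if not query_has_context:
        return score_fasttext          # FastText-only fallback
    return (1.0 - context_weight) * score_fasttext + context_weight * score_context


# 0.75 * 0.8 + 0.25 * 0.4, approximately 0.7
toy_fused = fuse(0.8, 0.4, context_weight=0.25, query_has_context=True)
toy_fallback = fuse(0.8, None, context_weight=0.25, query_has_context=False)
```

With context_weight = 0 the fused score equals score_fasttext, and with context_weight = 1 it equals score_context.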

Per-phrase truncation and must-include rule

After scoring candidates for a phrase:

  • Sort by score_combined descending.
  • Keep top per_phrase_final_k (typically 10).

Must-include rule (pinned exact phrase tags):

  • Let required_tags be the canonical tag(s) produced by projecting the phrase's own lookup (projected_lookup).
  • Each required tag must appear in that phrase's final top per_phrase_final_k list, even if its fused score would otherwise place it below the cutoff.
  • If the list is full, evict the lowest-ranked tag that is not required.
  • Note: required_tags may contain multiple canonicals if alias2tags maps a token to multiple tags.

This rule applies only to the phrase's own required tags. It does not inject tags into other phrases' lists.
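
The truncation and must-include rule can be sketched as below. The function name is hypothetical. One case is not specified by the contract: if a phrase has more required tags than per_phrase_final_k, this sketch lets the list grow past k rather than dropping a required tag; that is an assumption.

```python
def truncate_with_required(rows, required_tags, k):
    """Keep the top-k rows by score_combined, pinning required tags."""
    ranked = sorted(rows, key=lambda r: r["score_combined"], reverse=True)
    kept = ranked[:k]
    kept_tags = {r["tag"] for r in kept}

    for row in ranked[k:]:
        if row["tag"] not in required_tags or row["tag"] in kept_tags:
            continue
        # The list is full here: evict the lowest-ranked tag that is not required.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i]["tag"] not in required_tags:
                kept_tags.discard(kept.pop(i)["tag"])
                break
        kept.append(row)
        kept_tags.add(row["tag"])

    kept.sort(key=lambda r: r["score_combined"], reverse=True)
    return kept


toy_scored = [
    {"tag": "a", "score_combined": 0.9},
    {"tag": "b", "score_combined": 0.8},
    {"tag": "c", "score_combined": 0.7},
    {"tag": "shirt", "score_combined": 0.1},
]
toy_top = truncate_with_required(toy_scored, required_tags={"shirt"}, k=3)
print([r["tag"] for r in toy_top])
# ['a', 'b', 'shirt']
```

Here "shirt" would have been cut at rank 4, so it displaces "c", the lowest-ranked non-required tag.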


Merge across phrases (global candidate pool)

A canonical tag may appear in multiple per-phrase top-K lists. Stage 2 deduplicates tags into a single global record.

  • sources is the union of phrases whose per-phrase lists contained the tag.
  • score_fasttext is the maximum FastText score observed for the tag across those phrases.
  • score_context is the maximum context cosine observed for the tag across those phrases (with None treated as missing).
  • score_combined is the maximum fused score observed for the tag across those phrases.

Note:

  • These maxima may come from different phrases; the global candidate row does not necessarily correspond to any single phrase's row.
  • For tags with a TF-IDF row, score_context is phrase-invariant. Differences across phrases only arise for tags whose context score was imputed.

Finally:

  • Sort global candidates by score_combined descending.
  • Return top global_k candidates (and optionally all candidates if the app needs them).
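
A sketch of the merge step, assuming each per-phrase list is a list of dict rows keyed by the field names above. The function names are hypothetical, and emitting sources in sorted order is an assumption made to keep the output deterministic.

```python
def max_or_none(a, b):
    """Maximum of two values, treating None as missing."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def merge_phrase_lists(per_phrase, global_k):
    """Merge per-phrase top-K lists into one global candidate list."""
    merged = {}
    for phrase, rows in per_phrase.items():
        for r in rows:
            g = merged.setdefault(r["tag"], {
                "tag": r["tag"], "score_fasttext": None, "score_context": None,
                "score_combined": None, "sources": set(),
            })
            g["sources"].add(phrase)
            # Each score is maximized independently across phrases.
            for key in ("score_fasttext", "score_context", "score_combined"):
                g[key] = max_or_none(g[key], r[key])

    ordered = sorted(merged.values(), key=lambda g: g["score_combined"], reverse=True)
    for g in ordered:
        g["sources"] = sorted(g["sources"])
    return ordered[:global_k]


toy_per_phrase = {
    "big shirt": [
        {"tag": "shirt", "score_fasttext": 0.7, "score_context": 0.3, "score_combined": 0.6},
    ],
    "shirt": [
        {"tag": "shirt", "score_fasttext": 1.0, "score_context": 0.3, "score_combined": 1.0},
        {"tag": "t-shirt", "score_fasttext": 0.9, "score_context": 0.5, "score_combined": 0.8},
    ],
}
toy_global = merge_phrase_lists(toy_per_phrase, global_k=10)
```

In the toy data, "shirt" ends up with sources from both phrases and takes its FastText and fused scores from the phrase where it was an exact match.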

Output schema

Stage 2 return (non-verbose)

  • candidates: list[Candidate] (ordered)
    • tag: str (canonical)
    • score_combined: float
    • score_fasttext: float | None
    • score_context: float | None (None only when query_has_context=False or when missing)
    • count: int | None
    • sources: list[str]

Optional per-phrase debug report (verbose)

For each phrase:

  • phrase: str
  • normalized: str
  • lookup: str
  • tfidf_vocab: bool (lookup is in TF-IDF vocabulary)
  • oov_terms: list[str]
  • candidates: list[CandidateRow] (top per-phrase list)
    • tag: str
    • alias_token: str
    • score_fasttext: float
    • score_context: float | None
    • score_combined: float
    • context_imputed: bool
    • count: int | None
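
For reference, the two schemas above could be expressed as dataclasses along these lines. Field names and types follow the lists above; the class name PhraseReport and the choice of dataclasses over another container are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One row of the global (non-verbose) candidate list."""
    tag: str
    score_combined: float
    score_fasttext: Optional[float]
    score_context: Optional[float]
    count: Optional[int]
    sources: List[str]


@dataclass
class CandidateRow:
    """One row of a per-phrase debug list (verbose mode)."""
    tag: str
    alias_token: str
    score_fasttext: float
    score_context: Optional[float]
    score_combined: float
    context_imputed: bool
    count: Optional[int]


@dataclass
class PhraseReport:
    """Per-phrase debug report (hypothetical class name)."""
    phrase: str
    normalized: str
    lookup: str
    tfidf_vocab: bool
    oov_terms: List[str]
    candidates: List[CandidateRow]


toy_candidate = Candidate(
    tag="shirt", score_combined=1.0, score_fasttext=1.0,
    score_context=None, count=None, sources=["big shirt", "shirt"],
)
```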

Determinism and performance constraints

  • Artifact loading is lazy (load-on-first-use, cached thereafter).
  • No feature flags for old/new behavior: delete old code paths.
  • Logging must be read-only and must not affect results.
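
One common way to get load-on-first-use with caching in Python is functools.lru_cache. The sketch below is illustrative only: get_artifact and the stand-in loader are hypothetical and do not reflect how the project actually loads its artifacts.

```python
from functools import lru_cache

LOAD_LOG = []  # records real loads, only to make the caching visible


@lru_cache(maxsize=None)
def get_artifact(name):
    """Load an artifact on first use; later calls return the cached object."""
    LOAD_LOG.append(name)
    return {"name": name}  # stand-in for an expensive load (model, matrix, CSV)


first = get_artifact("toy_artifact")
second = get_artifact("toy_artifact")
print(first is second, LOAD_LOG)
# True ['toy_artifact']
```

Nothing is loaded at import time, and repeated requests reuse the same object, which keeps results independent of call order.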

NSFW tag source

  • nsfw_tags is sourced from word_rating_probabilities.csv with NSFW_THRESHOLD=0.95 as implemented in psq_rag.retrieval.state.