	# Retrieval Contract -- Stage 2 (Retrieval Grounding / Candidate Generation)

	Stage 2 performs retrieval grounding over a closed vocabulary of canonical e621-style tags.
	It does not "tag images", and it does not do free-form generation. Its job is to produce a high-recall,
	inspectable candidate pool for downstream closed-set selection.

	---

	## Inputs

	- `rewrite_phrases: list[str]`
	  - Output of Stage 1 query rewriting (comma-separated "tag-shaped" phrases).
	  - Not canonical tags. Not underscored. High recall is preferred.

	- `allow_nsfw_tags: bool`
	  - If false, filter out tags in the project's `nsfw_tags` set.

	- `verbose: bool`
	  - If true, return per-phrase debug reports.

	---

	## Normalization and phrase expansion

	1) Normalize rewrite phrases for internal processing:
	- lowercase
	- strip leading/trailing whitespace
	- collapse internal whitespace to a single space

	2) Treat the phrase list as a set (dedupe after normalization).

	3) Head-noun expansion:
	- For each multi-token phrase, add its head noun (last token) as an additional phrase.
	- Apply the same set semantics so duplicates are processed once.

	Example:
	- Input phrases: `["big shirt", "grey shirt"]`
	- Final phrase set: `{"big shirt", "grey shirt", "shirt"}`
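
The three steps above can be sketched in a few lines. This is a minimal illustration, not the project's implementation; the function name `expand_phrases` and the decision to drop empty phrases are assumptions.

```python
def expand_phrases(rewrite_phrases):
    """Normalize, dedupe, and add head nouns (hypothetical helper)."""
    # lower() handles case; split()/join strips the ends and collapses
    # internal whitespace runs to a single space.
    normalized = {" ".join(p.lower().split()) for p in rewrite_phrases}
    normalized.discard("")  # assumption: empty phrases are dropped
    # Head noun = last token of every multi-token phrase.
    head_nouns = {p.split(" ")[-1] for p in normalized if " " in p}
    return normalized | head_nouns


print(sorted(expand_phrases(["Big  shirt", " grey shirt ", "big shirt"])))
# ['big shirt', 'grey shirt', 'shirt']
```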

	---

	## Candidate generation per phrase (FastText neighbors + canonicalization)

	For each phrase `p` in the final phrase set:

	1) Convert to lookup form:
	- `lookup = p.replace(" ", "_")`

	2) Retrieve neighbors using FastText:
	- `neighbors = fasttext.most_similar(lookup, topn=per_phrase_k)`
	- Note: FastText neighbors may include alias tokens and other non-canonical strings.

	3) Project neighbor tokens to canonical tags (alias -> canonical):
	- If a neighbor token is already a canonical tag (token is in `tag_counts` OR token has a TF-IDF row in `tag_to_row_index`), it maps to itself.
	- Else if it is an alias, map it via `alias2tags[token]` (may map to multiple canonical tags).
	- Else, drop it (not in closed vocabulary).

	4) Deduplicate by canonical tag within this phrase:
	- Keep the canonical tag with the highest FastText similarity among all tokens that mapped to it.
	- Record the token that achieved that max similarity as `alias_token` for verbose reporting ("best token wins").

	5) Exact-match injection:
	- Project the phrase's own `lookup` through the same projection logic.
	- For each canonical tag produced by that projection, inject it into the candidate set with:
	  - `score_fasttext = 1.0`
	  - `alias_token = lookup`
	- This ensures the phrase's own canonical tag(s) appear in the candidate set, since `most_similar()` often does not return the query token itself.

	6) Apply NSFW filtering (if `allow_nsfw_tags=False`):
	- Drop candidate canonical tags that are present in `nsfw_tags`.

	Result: for each phrase, we have a set of canonical candidate tags with:
	- `score_fasttext`
	- `alias_token` (token that produced the best FastText score for that canonical tag)
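
Steps 3 to 6 can be sketched as follows. This is an illustration under stated assumptions: the neighbor list is passed in as plain `(token, similarity)` pairs rather than fetched from FastText, the function names `project_to_canonical` and `candidates_for_phrase` are hypothetical, and the toy tags and similarities are invented for the example.

```python
def project_to_canonical(token, tag_counts, tag_to_row_index, alias2tags):
    """Map one token to zero or more canonical tags."""
    if token in tag_counts or token in tag_to_row_index:
        return [token]                      # already canonical
    return list(alias2tags.get(token, []))  # alias, or [] if outside the vocabulary


def candidates_for_phrase(phrase, neighbors, tag_counts, tag_to_row_index,
                          alias2tags, nsfw_tags=frozenset(), allow_nsfw_tags=True):
    """Return {canonical_tag: (score_fasttext, alias_token)} for one phrase.

    `neighbors` is a list of (token, similarity) pairs standing in for the
    result of the FastText neighbor lookup.
    """
    lookup = phrase.replace(" ", "_")
    best = {}
    for token, sim in neighbors:
        for tag in project_to_canonical(token, tag_counts, tag_to_row_index, alias2tags):
            if tag not in best or sim > best[tag][0]:
                best[tag] = (sim, token)    # best token wins
    # Exact-match injection: the phrase's own projection is set to 1.0.
    for tag in project_to_canonical(lookup, tag_counts, tag_to_row_index, alias2tags):
        best[tag] = (1.0, lookup)
    if not allow_nsfw_tags:
        best = {tag: row for tag, row in best.items() if tag not in nsfw_tags}
    return best


# Toy data (hypothetical tags and similarities).
tag_counts = {"shirt": 1200, "t-shirt": 800}
alias2tags = {"tshirt": ["t-shirt"]}
neighbors = [("tshirt", 0.81), ("t-shirt", 0.74), ("shrit", 0.90)]
result = candidates_for_phrase("shirt", neighbors, tag_counts, {}, alias2tags)
print(result)
# {'t-shirt': (0.81, 'tshirt'), 'shirt': (1.0, 'shirt')}
```

Note that `"shrit"` is dropped despite having the highest similarity, because it is neither a canonical tag nor a known alias.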

	---

	## Context similarity (TF-IDF -> SVD cosine)

	Stage 2 computes one query context vector for the entire request:

	1) Build a pseudo TF-IDF vector from the final phrase set (deduped + head nouns):
	- Convert each phrase to underscore form (same `lookup` rule).
	- Terms that exist in the TF-IDF vocabulary (underscore lookups) contribute `(term_count * idf(term))`.
	- OOV terms contribute nothing (but may be reported in verbose mode).

	2) Project to SVD space and L2-normalize:
	- `query_vec = normalize(svd.transform(tfidf_vec))`

	If the query vector has zero norm (no recognized TF-IDF terms), then `query_has_context = False` and:
	- `score_context = None` for all candidates
	- `score_combined = score_fasttext` (FastText-only)

	If `query_has_context = True`, compute per-candidate cosine similarity when possible:
	- For tags that have a TF-IDF/SVD row: `score_context_by_tag[tag] = dot(query_vec, reduced_matrix_norm[row])`
	- For tags that lack a TF-IDF/SVD row: initial `score_context = None` (may be imputed per-phrase)
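
The query-vector construction and per-tag cosine can be sketched in plain Python. This is a minimal illustration, not the project's code: it assumes the SVD projection is a matrix product with a component basis (passed here as `components`, a list of rows), and the names `build_query_vec`, `context_score`, `vocab`, and `idf` are hypothetical stand-ins for the real artifacts.

```python
import math


def build_query_vec(phrases, vocab, idf, components):
    """Return (unit-length query vector or None, OOV lookups).

    vocab: term -> column index; idf: term -> idf weight;
    components: SVD basis as a list of rows, each of length len(vocab).
    """
    lookups = sorted(p.replace(" ", "_") for p in phrases)
    tfidf_vec = [0.0] * len(vocab)
    oov_terms = []
    for term in lookups:
        if term in vocab:
            tfidf_vec[vocab[term]] += idf[term]  # adds idf once per occurrence
        else:
            oov_terms.append(term)               # contributes nothing
    reduced = [sum(c * x for c, x in zip(row, tfidf_vec)) for row in components]
    norm = math.sqrt(sum(x * x for x in reduced))
    if norm == 0.0:
        return None, oov_terms                   # query_has_context = False
    return [x / norm for x in reduced], oov_terms


def context_score(query_vec, tag, tag_to_row_index, reduced_matrix_norm):
    """Cosine against a pre-normalized row, or None when no row exists."""
    row = tag_to_row_index.get(tag)
    if query_vec is None or row is None:
        return None
    return sum(q * r for q, r in zip(query_vec, reduced_matrix_norm[row]))


# Toy artifacts: two vocabulary terms and an identity "SVD" basis.
vocab = {"shirt": 0, "hat": 1}
idf = {"shirt": 2.0, "hat": 3.0}
components = [[1.0, 0.0], [0.0, 1.0]]
query_vec, oov = build_query_vec({"big shirt", "shirt"}, vocab, idf, components)
print(query_vec, oov)
# [1.0, 0.0] ['big_shirt']
```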

	### Missing context policy (per-phrase, q=0.10)

	If `query_has_context = True` and a candidate tag has `score_context = None`:
	- For that phrase, compute `default_context_for_phrase` as the 10th percentile (q=0.10) of the available (non-None) context scores among that phrase's candidates.
	- If there are no available context scores for that phrase, `default_context_for_phrase = 0.0`.
	- Impute missing context scores using `default_context_for_phrase` and mark `context_imputed = True`.

	For every candidate whose context score was not imputed, `context_imputed = False`.
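
A sketch of the imputation policy for one phrase, applicable only when `query_has_context = True`. The contract does not say which percentile interpolation is used; linear interpolation between the two nearest ranks is an assumption here, and the helper names are hypothetical.

```python
def quantile_linear(values, q):
    """q-quantile with linear interpolation (assumed interpolation method)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def impute_context(scores, q=0.10):
    """scores: tag -> context score or None.

    Returns tag -> (score_context, context_imputed) for one phrase.
    """
    available = [s for s in scores.values() if s is not None]
    default_context_for_phrase = quantile_linear(available, q) if available else 0.0
    return {
        tag: (default_context_for_phrase, True) if s is None else (s, False)
        for tag, s in scores.items()
    }


imputed = impute_context({"shirt": 0.0, "hat": 1.0, "no_row_tag": None})
```

With available scores `0.0` and `1.0`, the 10th percentile under linear interpolation is `0.1`, so `no_row_tag` receives roughly `0.1` with `context_imputed = True`.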

	---

	## Score fusion (FastText + Context)

	Compute a fused score per phrase candidate:

	- If `query_has_context = False`:
	  - `score_combined = score_fasttext`

	- Else:
	  - `score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context`
	  - (`score_context` may be imputed as described above)
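
As a worked example of the fusion rule, the sketch below uses a hypothetical `context_weight` of 0.25; the contract does not state the actual value, and the function name `fuse` is illustrative.

```python
def fuse(score_fasttext, score_context, context_weight, query_has_context):
    """Linear blend of the two scores; FastText-only when there is no context."""
    if not query_has_context:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context


# 0.75 * 0.8 + 0.25 * 0.4
print(round(fuse(0.8, 0.4, 0.25, True), 6))   # 0.7
print(fuse(0.8, None, 0.25, False))           # 0.8
```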

	---

	## Per-phrase truncation and must-include rule

	After scoring candidates for a phrase:
	- Sort by `score_combined` descending.
	- Keep top `per_phrase_final_k` (typically 10).

	Must-include rule (pinned exact phrase tags):
	- Let `required_tags` be the canonical tag(s) produced by projecting the phrase's own `lookup` (`projected_lookup`).
	- Each required tag must appear in that phrase's final top `per_phrase_final_k` list, even if its fused score would otherwise place it below the cutoff.
	- If the list is full, evict the lowest-ranked tag that is not required.
	- Note: `required_tags` may contain multiple canonicals if `alias2tags` maps a token to multiple tags.

	This rule applies only to the phrase's own required tags. It does not inject tags into other phrases' lists.
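
The truncation and must-include rule can be sketched as below. This is an illustration, not the project's code: rows are reduced to `(tag, score_combined)` pairs, the name `truncate_with_required` is hypothetical, and two edge cases are handled by assumption. A required tag absent from the scored rows (for example, removed by NSFW filtering) is not re-added, and if there are more required tags than slots the list is allowed to exceed `k`.

```python
def truncate_with_required(rows, required_tags, k):
    """rows: list of (tag, score_combined). Keep top k, pinning required tags."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    kept = ranked[:k]
    # Required tags that fell below the cutoff.
    missing = [r for r in ranked[k:] if r[0] in required_tags]
    for row in missing:
        # Evict the lowest-ranked kept row that is not itself required.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i][0] not in required_tags:
                del kept[i]
                break
        kept.append(row)
    kept.sort(key=lambda r: r[1], reverse=True)
    return kept


rows = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("req", 0.1)]
print(truncate_with_required(rows, {"req"}, k=3))
# [('a', 0.9), ('b', 0.8), ('req', 0.1)]
```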

	---

	## Merge across phrases (global candidate pool)

	A canonical tag may appear in multiple per-phrase top-K lists. Stage 2 deduplicates tags into a single global record.

	- `sources` is the union of phrases whose per-phrase lists contained the tag.
	- `score_fasttext` is the maximum FastText score observed for the tag across those phrases.
	- `score_context` is the maximum context cosine observed for the tag across those phrases (with `None` treated as missing).
	- `score_combined` is the maximum fused score observed for the tag across those phrases.

	Note:
	- These maxima may come from different phrases; the global candidate row does not necessarily correspond to any single phrase's row.
	- For tags with a TF-IDF row, `score_context` is phrase-invariant. Differences across phrases only arise for tags whose context score was imputed.

	Finally:
	- Sort global candidates by `score_combined` descending.
	- Return top `global_k` candidates (and optionally all candidates if the app needs them).
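
A sketch of the merge step, assuming each per-phrase row is a dict carrying `tag` and the three scores. The function name `merge_across_phrases` is hypothetical, and breaking score ties by tag name is an assumption added to keep the ordering deterministic.

```python
def _max_or_none(a, b):
    """Max of two values where None means 'missing'."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def merge_across_phrases(per_phrase, global_k):
    """per_phrase: phrase -> list of row dicts. Returns the top global_k records."""
    merged = {}
    score_keys = ("score_fasttext", "score_context", "score_combined")
    for phrase, rows in per_phrase.items():
        for row in rows:
            rec = merged.setdefault(
                row["tag"],
                {"tag": row["tag"], "sources": set(),
                 "score_fasttext": None, "score_context": None, "score_combined": None},
            )
            rec["sources"].add(phrase)
            for key in score_keys:
                rec[key] = _max_or_none(rec[key], row[key])
    for rec in merged.values():
        rec["sources"] = sorted(rec["sources"])
    # Assumption: ties on score_combined are broken by tag name.
    ranked = sorted(merged.values(), key=lambda r: (-r["score_combined"], r["tag"]))
    return ranked[:global_k]


per_phrase = {
    "big shirt": [
        {"tag": "shirt", "score_fasttext": 0.6, "score_context": 0.5, "score_combined": 0.58},
        {"tag": "hat", "score_fasttext": 0.3, "score_context": None, "score_combined": 0.3},
    ],
    "shirt": [
        {"tag": "shirt", "score_fasttext": 1.0, "score_context": 0.5, "score_combined": 0.9},
    ],
}
top = merge_across_phrases(per_phrase, global_k=2)
```

Here `shirt` ends up with `sources == ["big shirt", "shirt"]`, `score_fasttext == 1.0`, and `score_combined == 0.9`, ranked ahead of `hat`.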

	---

	## Output schema

	### Stage 2 return (non-verbose)
	- `candidates: list[Candidate]` (ordered)
	  - `tag: str` (canonical)
	  - `score_combined: float`
	  - `score_fasttext: float | None`
	  - `score_context: float | None` (None only when `query_has_context=False` or when missing)
	  - `count: int | None`
	  - `sources: list[str]`

	### Optional per-phrase debug report (verbose)
	For each phrase:
	- `phrase: str`
	- `normalized: str`
	- `lookup: str`
	- `tfidf_vocab: bool` (lookup is in TF-IDF vocabulary)
	- `oov_terms: list[str]`
	- `candidates: list[CandidateRow]` (top per-phrase list)
	  - `tag: str`
	  - `alias_token: str`
	  - `score_fasttext: float`
	  - `score_context: float | None`
	  - `score_combined: float`
	  - `context_imputed: bool`
	  - `count: int | None`
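
One way to express the two schemas is as dataclasses. This is only a sketch: the contract names `Candidate` and `CandidateRow` but does not say how they are implemented, the class name `PhraseReport` is hypothetical, and the default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """One global candidate (non-verbose return)."""
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: List[str] = field(default_factory=list)


@dataclass
class CandidateRow:
    """One row of a per-phrase top list (verbose report)."""
    tag: str
    alias_token: str
    score_fasttext: float
    score_context: Optional[float]
    score_combined: float
    context_imputed: bool
    count: Optional[int] = None


@dataclass
class PhraseReport:
    """Per-phrase debug report (hypothetical class name)."""
    phrase: str
    normalized: str
    lookup: str
    tfidf_vocab: bool
    oov_terms: List[str] = field(default_factory=list)
    candidates: List[CandidateRow] = field(default_factory=list)
```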

	---

	## Determinism and performance constraints

	- Artifact loading is lazy (load-on-first-use, cached thereafter).
	- No feature flags for old/new behavior: delete old code paths.
	- Logging must be read-only and must not affect results.
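
Load-on-first-use with caching can be sketched with `functools.lru_cache`. The loader below is a placeholder: the function name `get_artifact`, the `LOAD_CALLS` list, and the returned dict are illustrative only and do not reflect how the project actually loads its artifacts.

```python
from functools import lru_cache

LOAD_CALLS = []  # records real loads, for illustration only


@lru_cache(maxsize=None)
def get_artifact(name):
    """Load an artifact on first request; later calls return the cached object."""
    LOAD_CALLS.append(name)
    return {"name": name}  # placeholder for an expensive load from disk


first = get_artifact("tfidf")
second = get_artifact("tfidf")
print(first is second, LOAD_CALLS)
# True ['tfidf']
```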

	---

	## NSFW tag source

	- `nsfw_tags` is sourced from `word_rating_probabilities.csv` with `NSFW_THRESHOLD=0.95` as implemented in `psq_rag.retrieval.state`.