| # Retrieval Contract -- Stage 2 (Retrieval Grounding / Candidate Generation) | |
| Stage 2 performs **retrieval grounding** over a **closed vocabulary** of canonical e621-style tags. | |
| It does not "tag images", and it does not do free-form generation. Its job is to produce a high-recall, | |
| inspectable candidate pool for downstream **closed-set selection**. | |
| --- | |
| ## Inputs | |
| - `rewrite_phrases: list[str]` | |
| - Output of Stage 1 query rewriting (comma-separated "tag-shaped" phrases). | |
| - These are not canonical tags and use spaces rather than underscores. High recall is preferred over precision. | |
| - `allow_nsfw_tags: bool` | |
| - If false, filter out tags in the project's `nsfw_tags` set. | |
| - `verbose: bool` | |
| - If true, return per-phrase debug reports. | |
| --- | |
| ## Normalization and phrase expansion | |
| 1) Normalize rewrite phrases for internal processing: | |
| - lowercase | |
| - strip leading/trailing whitespace | |
| - collapse internal whitespace to a single space | |
| 2) Treat the phrase list as a **set** (dedupe after normalization). | |
| 3) **Head-noun expansion**: | |
| - For each multi-token phrase, add its head noun (last token) as an additional phrase. | |
| - Apply the same set semantics so duplicates are processed once. | |
| Example: | |
| - Input phrases: `["big shirt", "grey shirt"]` | |
| - Final phrase set: `{"big shirt", "grey shirt", "shirt"}` | |
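
The normalization and head-noun expansion steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `normalize_phrase` and `build_phrase_set` are hypothetical, and dropping empty phrases is an assumption the contract does not state.

```python
import re


def normalize_phrase(phrase: str) -> str:
    """Lowercase, trim, and collapse internal whitespace to single spaces."""
    return re.sub(r"\s+", " ", phrase.strip().lower())


def build_phrase_set(rewrite_phrases: list[str]) -> set[str]:
    """Normalize, dedupe, then add the last token of each multi-token phrase."""
    phrases = {normalize_phrase(p) for p in rewrite_phrases}
    phrases.discard("")  # assumption: empty phrases are ignored
    # The head noun is taken to be the last whitespace-separated token.
    head_nouns = {p.split(" ")[-1] for p in phrases if " " in p}
    return phrases | head_nouns


final_phrases = build_phrase_set(["Big  Shirt", "grey shirt ", "big shirt"])
```

Using a `set` for both the phrases and the head nouns gives the "processed once" behavior without any extra bookkeeping.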
| --- | |
| ## Candidate generation per phrase (FastText neighbors + canonicalization) | |
| For each phrase `p` in the final phrase set: | |
| 1) Convert to lookup form: | |
| - `lookup = p.replace(" ", "_")` | |
| 2) Retrieve neighbors using FastText: | |
| - `neighbors = fasttext.most_similar(lookup, topn=per_phrase_k)` | |
| - Note: FastText neighbors may include alias tokens and other non-canonical strings. | |
| 3) **Project neighbor tokens to canonical tags** (alias -> canonical): | |
| - If a neighbor token is already a canonical tag (token is in `tag_counts` OR token has a TF-IDF row in `tag_to_row_index`), it maps to itself. | |
| - Else if it is an alias, map it via `alias2tags[token]` (may map to multiple canonical tags). | |
| - Else, drop it (not in closed vocabulary). | |
| 4) **Deduplicate by canonical tag** within this phrase: | |
| - Keep the canonical tag with the highest FastText similarity among all tokens that mapped to it. | |
| - Record the token that achieved that max similarity as `alias_token` for verbose reporting ("best token wins"). | |
| 5) **Exact-match injection**: | |
| - Project the phrase's own `lookup` through the same projection logic. | |
| - For each canonical tag produced by that projection, inject it into the candidate set with: | |
| - `score_fasttext = 1.0` | |
| - `alias_token = lookup` | |
| - This guarantees that the phrase's own canonical tag is present, because `most_similar()` usually does not return the query token itself. | |
| 6) Apply NSFW filtering (if `allow_nsfw_tags=False`): | |
| - Drop candidate canonical tags that are present in `nsfw_tags`. | |
| Result: for each phrase, we have a set of canonical candidate tags with: | |
| - `score_fasttext` | |
| - `alias_token` (token that produced the best FastText score for that canonical tag) | |
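
Steps 3 to 6 above could look roughly like the sketch below. It assumes the FastText neighbors have already been fetched as `(token, similarity)` pairs, and it uses a hypothetical `canonical_tags` set to stand in for the "in `tag_counts` OR in `tag_to_row_index`" check. The function names and the toy data are illustrative only.

```python
def project_to_canonical(token, canonical_tags, alias2tags):
    """Map a token to canonical tags: itself, its alias targets, or nothing."""
    if token in canonical_tags:
        return [token]
    return list(alias2tags.get(token, []))  # empty list means "drop"


def candidates_for_phrase(lookup, neighbors, canonical_tags, alias2tags,
                          nsfw_tags, allow_nsfw_tags):
    """Return {canonical_tag: (score_fasttext, alias_token)} for one phrase."""
    best = {}
    # Steps 3-4: project each neighbor, keep the best-scoring token per tag.
    for token, similarity in neighbors:
        for tag in project_to_canonical(token, canonical_tags, alias2tags):
            if tag not in best or similarity > best[tag][0]:
                best[tag] = (similarity, token)
    # Step 5: exact-match injection with a fixed score of 1.0.
    for tag in project_to_canonical(lookup, canonical_tags, alias2tags):
        best[tag] = (1.0, lookup)
    # Step 6: NSFW filtering.
    if not allow_nsfw_tags:
        best = {t: v for t, v in best.items() if t not in nsfw_tags}
    return best


# Toy data (not taken from the real artifacts).
canonical_tags = {"shirt", "t-shirt", "clothing"}
alias2tags = {"tshirt": ["t-shirt"], "tee": ["t-shirt"]}
neighbors = [("tshirt", 0.81), ("tee", 0.74), ("clothing", 0.66), ("shrit", 0.90)]

result = candidates_for_phrase("shirt", neighbors, canonical_tags,
                               alias2tags, nsfw_tags=set(), allow_nsfw_tags=False)
```

In this toy run, `tshirt` and `tee` both project to `t-shirt`, so only the higher-scoring `tshirt` is recorded as `alias_token`, and the unknown token `shrit` is dropped despite its high similarity.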
| --- | |
| ## Context similarity (TF-IDF -> SVD cosine) | |
| Stage 2 computes one **query context vector** for the entire request: | |
| 1) Build a pseudo TF-IDF vector from the **final phrase set** (deduped + head nouns): | |
| - Convert each phrase to underscore form (same `lookup` rule). | |
| - Terms that exist in the TF-IDF vocabulary (underscore lookups) contribute `(term_count * idf(term))`. | |
| - OOV terms contribute nothing (but may be reported in verbose mode). | |
| 2) Project to SVD space and L2-normalize: | |
| - `query_vec = normalize(svd.transform(tfidf_vec))` | |
| If the query vector has zero norm (no recognized TF-IDF terms), then `query_has_context = False` and: | |
| - `score_context = None` for all candidates | |
| - `score_combined = score_fasttext` (FastText-only) | |
| If `query_has_context = True`, compute per-candidate cosine similarity when possible: | |
| - For tags that have a TF-IDF/SVD row: `score_context_by_tag[tag] = dot(query_vec, reduced_matrix_norm[row])` | |
| - For tags that lack a TF-IDF/SVD row: initial `score_context = None` (may be imputed per-phrase) | |
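
The query-vector construction above can be sketched in dependency-free Python. This sketch assumes the SVD step is a plain linear projection (as with scikit-learn's `TruncatedSVD`, where `transform(X)` equals `X @ components_.T`); the contract itself only names `svd.transform`. The names `build_query_vec`, `term_index`, `idf`, and `components` are hypothetical, and the matrices are toy values.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Return the unit-length vector, or None if the norm is zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else None


def build_query_vec(lookups, term_index, idf, components):
    """Pseudo TF-IDF over in-vocabulary lookups, projected and L2-normalized.

    `components` has one row per SVD component and one column per term.
    Returns (query_vec or None, oov_terms).
    """
    tfidf = [0.0] * len(term_index)
    oov_terms = []
    for term, count in Counter(lookups).items():
        if term in term_index:
            tfidf[term_index[term]] = count * idf[term]
        else:
            oov_terms.append(term)  # contributes nothing to the vector
    projected = [sum(w * x for w, x in zip(row, tfidf)) for row in components]
    return l2_normalize(projected), sorted(oov_terms)


def context_score(query_vec, tag_vec_norm):
    """Cosine similarity, given that both vectors are already unit length."""
    return sum(q * t for q, t in zip(query_vec, tag_vec_norm))


# Toy vocabulary with two SVD components.
term_index = {"shirt": 0, "grey_shirt": 1, "pants": 2}
idf = {"shirt": 1.0, "grey_shirt": 2.0, "pants": 1.5}
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

query_vec, oov_terms = build_query_vec(
    ["shirt", "grey_shirt", "big_shirt"], term_index, idf, components)
query_has_context = query_vec is not None
```

Returning `None` for a zero-norm vector makes the `query_has_context = False` branch explicit and avoids a division by zero.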
| ### Missing context policy (per-phrase, q=0.10) | |
| If `query_has_context = True` and a candidate tag has `score_context = None`: | |
| - For that phrase, compute `default_context_for_phrase` as the 10th percentile (q=0.10) of the available (non-None) context scores among that phrase's candidates. | |
| - If there are no available context scores for that phrase, `default_context_for_phrase = 0.0`. | |
| - Impute missing context scores using `default_context_for_phrase` and mark: | |
| - `context_imputed = True` | |
| Otherwise: | |
| - `context_imputed = False` | |
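
A minimal sketch of the per-phrase imputation policy follows. The contract fixes q=0.10 but does not say how the percentile is interpolated; linear interpolation between order statistics is assumed here, and the helper names are hypothetical.

```python
from typing import Optional


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation (assumed method)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def impute_context(scores: dict[str, Optional[float]], q: float = 0.10):
    """Return {tag: (score_context, context_imputed)} for one phrase."""
    available = [s for s in scores.values() if s is not None]
    # Fall back to 0.0 when the phrase has no available context scores.
    default = percentile(available, q) if available else 0.0
    return {
        tag: ((default, True) if s is None else (s, False))
        for tag, s in scores.items()
    }


imputed = impute_context({"shirt": 0.2, "clothing": 0.6, "rare_tag": None})
```

A low percentile keeps tags without a TF-IDF row in the running while ranking them near the bottom of the phrase's context scores.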
| --- | |
| ## Score fusion (FastText + Context) | |
| Compute a fused score per phrase candidate: | |
| - If `query_has_context = False`: | |
| - `score_combined = score_fasttext` | |
| - Else: | |
| - `score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context` | |
| - (`score_context` may be imputed as described above) | |
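
As a small worked sketch of the fusion rule, the function below applies the formula directly. The function name is hypothetical, and `context_weight = 0.25` is an illustrative value, not one taken from the contract.

```python
from typing import Optional


def fuse_scores(score_fasttext: float, score_context: Optional[float],
                context_weight: float, query_has_context: bool) -> float:
    """Linear blend of FastText and context scores, FastText-only without context."""
    if not query_has_context:
        return score_fasttext
    if score_context is None:
        # By this point a missing score should already have been imputed.
        raise ValueError("score_context must be imputed before fusion")
    return (1.0 - context_weight) * score_fasttext + context_weight * score_context


with_context = fuse_scores(0.8, 0.4, context_weight=0.25, query_has_context=True)
without_context = fuse_scores(0.8, None, context_weight=0.25, query_has_context=False)
```

With these inputs the blended score is 0.75 * 0.8 + 0.25 * 0.4 = 0.7, while the no-context case simply returns the FastText score of 0.8.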
| --- | |
| ## Per-phrase truncation and must-include rule | |
| After scoring candidates for a phrase: | |
| - Sort by `score_combined` descending. | |
| - Keep top `per_phrase_final_k` (typically 10). | |
| **Must-include rule (pinned exact phrase tags)**: | |
| - Let `required_tags` be the canonical tag(s) produced by projecting the phrase's own `lookup` (`projected_lookup`). | |
| - Each required tag must appear in that phrase's final top `per_phrase_final_k` list, even if its fused score would otherwise place it below the cutoff. | |
| - If the list is full, evict the lowest-ranked tag that is *not* required. | |
| - Note: `required_tags` may contain multiple canonicals if `alias2tags` maps a token to multiple tags. | |
| This rule applies **only to the phrase's own required tags**. It does not inject tags into other phrases' lists. | |
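
One way to realize the truncation and must-include rule is to reserve slots for the required tags and fill the remainder with the best non-required rows, which gives the same result as evicting the lowest-ranked non-required tag one at a time. This is a sketch under assumptions: the function name is hypothetical, rows are assumed to be unique per tag, and if a phrase has more required tags than `per_phrase_final_k` the list is allowed to exceed the limit (the contract does not cover that case).

```python
def truncate_with_required(rows, required_tags, per_phrase_final_k):
    """rows: list of (tag, score_combined). Returns the final per-phrase list."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    required_rows = [r for r in ranked if r[0] in required_tags]
    other_rows = [r for r in ranked if r[0] not in required_tags]
    # Required tags are pinned; remaining slots go to the best other tags.
    free_slots = max(per_phrase_final_k - len(required_rows), 0)
    kept = required_rows + other_rows[:free_slots]
    return sorted(kept, key=lambda r: r[1], reverse=True)


rows = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("shirt", 0.1)]
final_rows = truncate_with_required(rows, required_tags={"shirt"},
                                    per_phrase_final_k=3)
```

Here `shirt` would rank fourth on score alone, so it displaces `c`, the lowest-ranked tag that is not required.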
| --- | |
| ## Merge across phrases (global candidate pool) | |
| A canonical tag may appear in multiple per-phrase top-K lists. Stage 2 deduplicates tags into a single global record. | |
| - `sources` is the union of phrases whose per-phrase lists contained the tag. | |
| - `score_fasttext` is the maximum FastText score observed for the tag across those phrases. | |
| - `score_context` is the maximum context cosine observed for the tag across those phrases (with `None` treated as missing). | |
| - `score_combined` is the maximum fused score observed for the tag across those phrases. | |
| Note: | |
| - These maxima may come from different phrases; the global candidate row does not necessarily correspond to any single phrase's row. | |
| - For tags with a TF-IDF row, `score_context` is phrase-invariant. Differences across phrases only arise for tags whose context score was imputed. | |
| Finally: | |
| - Sort global candidates by `score_combined` descending. | |
| - Return top `global_k` candidates (and optionally all candidates if the app needs them). | |
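
The merge step can be sketched as a fieldwise maximum over per-phrase rows. The row layout (plain dicts) and the names `merge_phrase_lists` and `max_optional` are assumptions made for illustration.

```python
def max_optional(a, b):
    """Maximum of two values where None means 'missing'."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def merge_phrase_lists(per_phrase, global_k):
    """per_phrase: {phrase: [row dict, ...]}. Returns the top global candidates."""
    score_keys = ("score_fasttext", "score_context", "score_combined")
    merged = {}
    for phrase, rows in per_phrase.items():
        for row in rows:
            record = merged.setdefault(
                row["tag"],
                {"tag": row["tag"], "sources": set(),
                 **{key: None for key in score_keys}},
            )
            # Each score field is maximized independently across phrases.
            for key in score_keys:
                record[key] = max_optional(record[key], row[key])
            record["sources"].add(phrase)
    ranked = sorted(merged.values(), key=lambda r: r["score_combined"], reverse=True)
    return [dict(r, sources=sorted(r["sources"])) for r in ranked[:global_k]]


per_phrase = {
    "big shirt": [
        {"tag": "shirt", "score_fasttext": 0.7, "score_context": 0.5, "score_combined": 0.65},
        {"tag": "big", "score_fasttext": 0.6, "score_context": None, "score_combined": 0.6},
    ],
    "shirt": [
        {"tag": "shirt", "score_fasttext": 1.0, "score_context": 0.5, "score_combined": 0.875},
    ],
}
global_candidates = merge_phrase_lists(per_phrase, global_k=10)
```

Sorting `sources` before returning keeps the output order stable regardless of set iteration order.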
| --- | |
| ## Output schema | |
| ### Stage 2 return (non-verbose) | |
| - `candidates: list[Candidate]` (ordered) | |
| - `tag: str` (canonical) | |
| - `score_combined: float` | |
| - `score_fasttext: float | None` | |
| - `score_context: float | None` (None only when `query_has_context=False` or when no context score is available for the tag) | |
| - `count: int | None` | |
| - `sources: list[str]` | |
| ### Optional per-phrase debug report (verbose) | |
| For each phrase: | |
| - `phrase: str` | |
| - `normalized: str` | |
| - `lookup: str` | |
| - `tfidf_vocab: bool` (lookup is in TF-IDF vocabulary) | |
| - `oov_terms: list[str]` | |
| - `candidates: list[CandidateRow]` (top per-phrase list) | |
| - `tag: str` | |
| - `alias_token: str` | |
| - `score_fasttext: float` | |
| - `score_context: float | None` | |
| - `score_combined: float` | |
| - `context_imputed: bool` | |
| - `count: int | None` | |
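
The output schema above could be expressed as dataclasses, for example as below. The field names and types follow the lists above. The class name `PhraseReport` and the choice of defaults are assumptions, since the contract names only `Candidate` and `CandidateRow`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    """One row of the global (non-verbose) candidate list."""
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: list[str] = field(default_factory=list)


@dataclass
class CandidateRow:
    """One row of a per-phrase list in the verbose report."""
    tag: str
    alias_token: str
    score_fasttext: float
    score_combined: float
    score_context: Optional[float] = None
    context_imputed: bool = False
    count: Optional[int] = None


@dataclass
class PhraseReport:
    """Per-phrase debug report (hypothetical class name)."""
    phrase: str
    normalized: str
    lookup: str
    tfidf_vocab: bool
    oov_terms: list[str] = field(default_factory=list)
    candidates: list[CandidateRow] = field(default_factory=list)


example = Candidate(tag="shirt", score_combined=0.875, score_fasttext=1.0,
                    score_context=0.5, sources=["big shirt", "shirt"])
```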
| --- | |
| ## Determinism and performance constraints | |
| - Artifact loading is **lazy** (load-on-first-use, cached thereafter). | |
| - No feature flags for old/new behavior: delete old code paths. | |
| - Logging must be read-only and must not affect results. | |
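
Lazy, cached artifact loading is commonly done with `functools.lru_cache`; the sketch below shows that pattern. The loader name `get_artifact`, the placeholder return value, and the call counter are hypothetical and exist only to make the caching behavior visible.

```python
from functools import lru_cache

LOAD_CALLS: dict[str, int] = {}  # demonstration-only counter


@lru_cache(maxsize=None)
def get_artifact(name: str) -> dict:
    """Load an artifact on first use, then serve the cached object."""
    LOAD_CALLS[name] = LOAD_CALLS.get(name, 0) + 1
    # Placeholder for the real disk read (model, matrix, CSV, ...).
    return {"name": name}


first = get_artifact("tfidf_svd")
second = get_artifact("tfidf_svd")
```

Both calls return the same object and the loader body runs once, so nothing is read at import time and repeated requests pay no reload cost.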
| --- | |
| ## NSFW tag source | |
| - `nsfw_tags` is sourced from `word_rating_probabilities.csv` with `NSFW_THRESHOLD=0.95` as implemented in `psq_rag.retrieval.state`. | |