Retrieval Contract -- Stage 2 (Retrieval Grounding / Candidate Generation)
Stage 2 performs retrieval grounding over a closed vocabulary of canonical e621-style tags. It does not "tag images", and it does not do free-form generation. Its job is to produce a high-recall, inspectable candidate pool for downstream closed-set selection.
Inputs
rewrite_phrases: list[str]
- Output of Stage 1 query rewriting (comma-separated "tag-shaped" phrases).
- Not canonical tags. Not underscored. High recall is preferred.
allow_nsfw_tags: bool
- If false, filter out tags in the project's nsfw_tags set.
verbose: bool
- If true, return per-phrase debug reports.
Normalization and phrase expansion
- Normalize rewrite phrases for internal processing:
- lowercase
- strip leading/trailing whitespace
- collapse internal whitespace to a single space
Treat the phrase list as a set (dedupe after normalization).
Head-noun expansion:
- For each multi-token phrase, add its head noun (last token) as an additional phrase.
- Apply the same set semantics so duplicates are processed once.
Example:
- Input phrases: ["big shirt", "grey shirt"]
- Final phrase set: {"big shirt", "grey shirt", "shirt"}
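The normalization and head-noun rules above can be sketched as a small helper. The function name expand_phrases is hypothetical, and dropping phrases that are empty after normalization is an assumption the contract does not state.

```python
def expand_phrases(rewrite_phrases):
    """Normalize phrases, dedupe them, and add each multi-token phrase's head noun."""
    normalized = set()
    for phrase in rewrite_phrases:
        # lowercase, strip, and collapse internal whitespace to single spaces
        cleaned = " ".join(phrase.lower().split())
        if cleaned:  # assumption: empty phrases are discarded
            normalized.add(cleaned)

    expanded = set(normalized)
    for phrase in normalized:
        tokens = phrase.split(" ")
        if len(tokens) > 1:
            expanded.add(tokens[-1])  # head noun = last token
    return expanded


final_phrases = expand_phrases(["big shirt", "grey shirt"])
```

Because the result is a set, a head noun shared by several phrases is processed only once downstream.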
Candidate generation per phrase (FastText neighbors + canonicalization)
For each phrase p in the final phrase set:
- Convert to lookup form:
lookup = p.replace(" ", "_")
- Retrieve neighbors using FastText:
neighbors = fasttext.most_similar(lookup, topn=per_phrase_k)
- Note: FastText neighbors may include alias tokens and other non-canonical strings.
- Project neighbor tokens to canonical tags (alias -> canonical):
- If a neighbor token is already a canonical tag (token is in tag_counts OR token has a TF-IDF row in tag_to_row_index), it maps to itself.
- Else if it is an alias, map it via alias2tags[token] (may map to multiple canonical tags).
- Else, drop it (not in closed vocabulary).
- Deduplicate by canonical tag within this phrase:
- Keep the canonical tag with the highest FastText similarity among all tokens that mapped to it.
- Record the token that achieved that max similarity as alias_token for verbose reporting ("best token wins").
- Exact-match injection:
- Project the phrase's own lookup through the same projection logic.
- For each canonical tag produced by that projection, inject it into the candidate set with:
score_fasttext = 1.0
alias_token = lookup
- This ensures the phrase's own canonical tag appears even though most_similar() often does not return the query token itself.
- Apply NSFW filtering (if allow_nsfw_tags=False):
- Drop candidate canonical tags that are present in nsfw_tags.
Result: for each phrase, we have a set of canonical candidate tags with:
- score_fasttext
- alias_token (token that produced the best FastText score for that canonical tag)
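A minimal sketch of the per-phrase steps above, under stated assumptions: neighbors is passed in as a list of (token, similarity) pairs standing in for the FastText lookup, the helper names project_token and candidates_for_phrase are hypothetical, and exact-match injection is assumed to overwrite any lower-scored entry for the same tag.

```python
def project_token(token, tag_counts, tag_to_row_index, alias2tags):
    """Map a token to canonical tags; an empty list means the token is dropped."""
    if token in tag_counts or token in tag_to_row_index:
        return [token]  # already canonical
    return list(alias2tags.get(token, []))  # alias -> canonical(s), else nothing


def candidates_for_phrase(phrase, neighbors, tag_counts, tag_to_row_index,
                          alias2tags, nsfw_tags, allow_nsfw_tags):
    """Return {canonical_tag: (score_fasttext, alias_token)} for one phrase."""
    lookup = phrase.replace(" ", "_")
    best = {}
    for token, similarity in neighbors:
        for tag in project_token(token, tag_counts, tag_to_row_index, alias2tags):
            # "best token wins": keep the highest similarity per canonical tag
            if tag not in best or similarity > best[tag][0]:
                best[tag] = (similarity, token)

    # Exact-match injection: the phrase's own projection is pinned at 1.0.
    for tag in project_token(lookup, tag_counts, tag_to_row_index, alias2tags):
        best[tag] = (1.0, lookup)

    if not allow_nsfw_tags:
        best = {tag: row for tag, row in best.items() if tag not in nsfw_tags}
    return best


# Toy data, invented for illustration only.
demo = candidates_for_phrase(
    "big shirt",
    neighbors=[("tshirt", 0.8), ("t-shirt", 0.7), ("junk", 0.9), ("adult_tag", 0.6)],
    tag_counts={"t-shirt": 50, "adult_tag": 5},
    tag_to_row_index={},
    alias2tags={"tshirt": ["t-shirt"], "big_shirt": ["oversized_shirt"]},
    nsfw_tags={"adult_tag"},
    allow_nsfw_tags=False,
)
```

In this toy run "junk" is dropped as out of vocabulary, "t-shirt" keeps the alias token "tshirt" because it scored higher, and "oversized_shirt" enters through exact-match injection.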
Context similarity (TF-IDF -> SVD cosine)
Stage 2 computes one query context vector for the entire request:
- Build a pseudo TF-IDF vector from the final phrase set (deduped + head nouns):
- Convert each phrase to underscore form (same lookup rule).
- Terms that exist in the TF-IDF vocabulary (underscore lookups) contribute (term_count * idf(term)).
- OOV terms contribute nothing (but may be reported in verbose mode).
- Project to SVD space and L2-normalize:
query_vec = normalize(svd.transform(tfidf_vec))
If the query vector has zero norm (no recognized TF-IDF terms), then query_has_context = False and:
- score_context = None for all candidates
- score_combined = score_fasttext (FastText-only)
If query_has_context = True, compute per-candidate cosine similarity when possible:
- For tags that have a TF-IDF/SVD row:
score_context_by_tag[tag] = dot(query_vec, reduced_matrix_norm[row])
- For tags that lack a TF-IDF/SVD row: initial score_context = None (may be imputed per-phrase)
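The context computation can be illustrated in plain Python. This is a sketch, not the project's code: vocab_index, idf, and svd_components are hypothetical stand-ins for the fitted TF-IDF and SVD artifacts, and the SVD projection is modeled as a plain matrix product. The rows of reduced_matrix_norm are assumed to be L2-normalized already, so the dot product is a cosine.

```python
import math
from collections import Counter


def build_query_vec(phrases, vocab_index, idf, svd_components):
    """Return (query_vec or None, oov_terms) for the final phrase set."""
    counts = Counter(p.replace(" ", "_") for p in phrases)
    tfidf = [0.0] * len(idf)
    oov_terms = []
    for term, term_count in counts.items():
        col = vocab_index.get(term)
        if col is None:
            oov_terms.append(term)  # OOV terms contribute nothing
        else:
            tfidf[col] = term_count * idf[col]

    # Project into the reduced space: one dot product per SVD component.
    reduced = [sum(w * x for w, x in zip(row, tfidf)) for row in svd_components]
    norm = math.sqrt(sum(x * x for x in reduced))
    if norm == 0.0:
        return None, sorted(oov_terms)  # query_has_context = False
    return [x / norm for x in reduced], sorted(oov_terms)


def context_scores(query_vec, tags, tag_to_row_index, reduced_matrix_norm):
    """Cosine per tag, or None when the query or the tag has no vector."""
    scores = {}
    for tag in tags:
        row = tag_to_row_index.get(tag)
        if query_vec is None or row is None:
            scores[tag] = None
        else:
            scores[tag] = sum(q * r for q, r in zip(query_vec, reduced_matrix_norm[row]))
    return scores
```

Returning None for the query vector doubles as the query_has_context flag in this sketch.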
Missing context policy (per-phrase, q=0.10)
If query_has_context = True and a candidate tag has score_context = None:
- For that phrase, compute default_context_for_phrase as the 10th percentile (q=0.10) of the available (non-None) context scores among that phrase's candidates.
- If there are no available context scores for that phrase, default_context_for_phrase = 0.0.
- Impute missing context scores using default_context_for_phrase and mark: context_imputed = True
Otherwise:
- context_imputed = False
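A sketch of the imputation policy for one phrase, meant to run only when query_has_context = True. The contract does not say which percentile convention is used; linear interpolation between order statistics is assumed here, and the function names are hypothetical.

```python
import math


def quantile(values, q):
    """Quantile by linear interpolation between order statistics (assumed convention)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def impute_context(context_by_tag, q=0.10):
    """Return {tag: (score_context, context_imputed)} for one phrase's candidates."""
    available = [s for s in context_by_tag.values() if s is not None]
    default_context_for_phrase = quantile(available, q) if available else 0.0
    result = {}
    for tag, score in context_by_tag.items():
        if score is None:
            result[tag] = (default_context_for_phrase, True)
        else:
            result[tag] = (score, False)
    return result
```

Using a low percentile rather than the mean keeps tags without a TF-IDF row in play while ranking them near the bottom of the phrase's observed context scores.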
Score fusion (FastText + Context)
Compute a fused score per phrase candidate:
- If query_has_context = False: score_combined = score_fasttext
- Else: score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context
- (score_context may be imputed as described above)
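As a worked instance of the fusion rule, the sketch below uses context_weight = 0.25. That value is chosen only for illustration; the contract does not state the configured weight.

```python
def fuse_scores(score_fasttext, score_context, context_weight, query_has_context):
    """Blend FastText and context scores; fall back to FastText-only without context."""
    if not query_has_context:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context


# 0.75 * 0.8 + 0.25 * 0.4 = 0.6 + 0.1 = 0.7
fused = fuse_scores(0.8, 0.4, context_weight=0.25, query_has_context=True)
fasttext_only = fuse_scores(0.8, None, context_weight=0.25, query_has_context=False)
```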
Per-phrase truncation and must-include rule
After scoring candidates for a phrase:
- Sort by score_combined descending.
- Keep top per_phrase_final_k (typically 10).
Must-include rule (pinned exact phrase tags):
- Let required_tags be the canonical tag(s) produced by projecting the phrase's own lookup (projected_lookup).
- Each required tag must appear in that phrase's final top per_phrase_final_k list, even if its fused score would otherwise place it below the cutoff.
- If the list is full, evict the lowest-ranked tag that is not required.
- Note: required_tags may contain multiple canonicals if alias2tags maps a token to multiple tags.
This rule applies only to the phrase's own required tags. It does not inject tags into other phrases' lists.
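The truncation and must-include rule can be sketched as follows. Several details are assumptions the contract leaves open: ties are broken by input order, the final list is re-sorted by score after eviction, and if every kept tag is itself required the list is allowed to grow past per_phrase_final_k. The function name is hypothetical.

```python
def truncate_with_required(scored, required_tags, per_phrase_final_k=10):
    """scored maps tag -> score_combined; returns the phrase's final ordered tag list."""
    ranked = sorted(scored, key=lambda tag: scored[tag], reverse=True)
    kept = ranked[:per_phrase_final_k]
    # Required tags that fell below the cutoff must be pulled back in.
    missing = [tag for tag in ranked[per_phrase_final_k:] if tag in required_tags]

    for tag in missing:
        # Evict the lowest-ranked kept tag that is not required, if one exists.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i] not in required_tags:
                del kept[i]
                break
        kept.append(tag)

    kept.sort(key=lambda tag: scored[tag], reverse=True)
    return kept


final_tags = truncate_with_required(
    {"a": 0.9, "b": 0.8, "c": 0.7, "req": 0.1},
    required_tags={"req"},
    per_phrase_final_k=3,
)
```

Required tags that were removed earlier (for example by NSFW filtering) are absent from scored and are therefore not reintroduced here.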
Merge across phrases (global candidate pool)
A canonical tag may appear in multiple per-phrase top-K lists. Stage 2 deduplicates tags into a single global record.
- sources is the union of phrases whose per-phrase lists contained the tag.
- score_fasttext is the maximum FastText score observed for the tag across those phrases.
- score_context is the maximum context cosine observed for the tag across those phrases (with None treated as missing).
- score_combined is the maximum fused score observed for the tag across those phrases.
Note:
- These maxima may come from different phrases; the global candidate row does not necessarily correspond to any single phrase's row.
- For tags with a TF-IDF row, score_context is phrase-invariant. Differences across phrases only arise for tags whose context score was imputed.
Finally:
- Sort global candidates by score_combined descending.
- Return top global_k candidates (and optionally all candidates if the app needs them).
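A minimal sketch of the cross-phrase merge, assuming each per-phrase row is a plain dict with the keys shown; the real row type and the function name merge_phrase_lists are not specified by the contract.

```python
SCORE_KEYS = ("score_fasttext", "score_context", "score_combined")


def merge_phrase_lists(per_phrase, global_k):
    """per_phrase maps phrase -> list of row dicts; returns the top global_k records."""
    merged = {}
    for phrase, rows in per_phrase.items():
        for row in rows:
            record = merged.setdefault(
                row["tag"],
                {"tag": row["tag"], "sources": set(),
                 "score_fasttext": None, "score_context": None, "score_combined": None},
            )
            record["sources"].add(phrase)
            for key in SCORE_KEYS:
                value = row[key]
                # Per-field maximum; None is treated as missing, not as a value.
                if value is not None and (record[key] is None or value > record[key]):
                    record[key] = value

    ranked = sorted(merged.values(), key=lambda r: r["score_combined"], reverse=True)
    for record in ranked:
        record["sources"] = sorted(record["sources"])
    return ranked[:global_k]


pool = merge_phrase_lists(
    {
        "big shirt": [{"tag": "shirt", "score_fasttext": 0.7,
                       "score_context": 0.2, "score_combined": 0.6}],
        "shirt": [{"tag": "shirt", "score_fasttext": 1.0,
                   "score_context": 0.1, "score_combined": 0.9},
                  {"tag": "topwear", "score_fasttext": 0.5,
                   "score_context": None, "score_combined": 0.5}],
    },
    global_k=2,
)
```

In this toy pool the "shirt" record takes score_context from one phrase and the other two scores from the other, which is the mixed-origin case the note above describes.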
Output schema
Stage 2 return (non-verbose)
candidates: list[Candidate] (ordered)
- tag: str (canonical)
- score_combined: float
- score_fasttext: float | None
- score_context: float | None (None only when query_has_context=False or when missing)
- count: int | None
- sources: list[str]
Optional per-phrase debug report (verbose)
For each phrase:
phrase: str
normalized: str
lookup: str
tfidf_vocab: bool (lookup is in TF-IDF vocabulary)
oov_terms: list[str]
candidates: list[CandidateRow] (top per-phrase list)
- tag: str
- alias_token: str
- score_fasttext: float
- score_context: float | None
- score_combined: float
- context_imputed: bool
- count: int | None
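The two record shapes above could be expressed as dataclasses. This is one possible representation, not necessarily how the project defines them; the field names and types follow the schema, while the default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """Global (merged) candidate returned in non-verbose mode."""
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: List[str] = field(default_factory=list)


@dataclass
class CandidateRow:
    """Per-phrase candidate row used in the verbose debug report."""
    tag: str
    alias_token: str
    score_fasttext: float
    score_context: Optional[float]
    score_combined: float
    context_imputed: bool
    count: Optional[int] = None


example = Candidate(tag="shirt", score_combined=0.9, sources=["big shirt", "shirt"])
```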
Determinism and performance constraints
- Artifact loading is lazy (load-on-first-use, cached thereafter).
- No feature flags for old/new behavior: delete old code paths.
- Logging must be read-only and must not affect results.
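Load-on-first-use with caching can be sketched with functools.lru_cache. The loader name load_artifact and its placeholder return value are hypothetical; the real artifacts would be deserialized from disk at that point.

```python
from functools import lru_cache

LOAD_CALLS = []  # records real loads, only to make the caching visible


@lru_cache(maxsize=None)
def load_artifact(name):
    """Load an artifact on first use; later calls return the cached object."""
    LOAD_CALLS.append(name)
    return {"name": name}  # placeholder for the real deserialized artifact


calls_before_use = list(LOAD_CALLS)  # nothing is loaded at import time
first = load_artifact("tag_counts")
second = load_artifact("tag_counts")
```

Both calls return the same object, and the loader body runs once per artifact name.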
NSFW tag source
nsfw_tags is sourced from word_rating_probabilities.csv with NSFW_THRESHOLD=0.95, as implemented in psq_rag.retrieval.state.