# Retrieval Contract -- Stage 2 (Retrieval Grounding / Candidate Generation)
Stage 2 performs **retrieval grounding** over a **closed vocabulary** of canonical e621-style tags.
It does not "tag images", and it does not do free-form generation. Its job is to produce a high-recall,
inspectable candidate pool for downstream **closed-set selection**.
---
## Inputs
- `rewrite_phrases: list[str]`
  - Output of Stage 1 query rewriting (comma-separated "tag-shaped" phrases).
  - Not canonical tags. Not underscored. High recall is preferred.
- `allow_nsfw_tags: bool`
  - If false, filter out tags in the project's `nsfw_tags` set.
- `verbose: bool`
  - If true, return per-phrase debug reports.
---
## Normalization and phrase expansion
1) Normalize rewrite phrases for internal processing:
   - lowercase
   - strip leading/trailing whitespace
   - collapse internal whitespace to a single space
2) Treat the phrase list as a **set** (dedupe after normalization).
3) **Head-noun expansion**:
   - For each multi-token phrase, add its head noun (last token) as an additional phrase.
   - Apply the same set semantics so duplicates are processed once.
Example:
- Input phrases: `["big shirt", "grey shirt"]`
- Final phrase set: `{"big shirt", "grey shirt", "shirt"}`
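
The normalization and head-noun steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `build_phrase_set` is hypothetical, and dropping phrases that are empty after stripping is an assumption the contract does not state.

```python
import re


def build_phrase_set(rewrite_phrases: list[str]) -> set[str]:
    """Normalize phrases, dedupe them, and add head nouns (last tokens)."""
    normalized = set()
    for phrase in rewrite_phrases:
        # Lowercase, strip, and collapse internal whitespace runs to one space.
        cleaned = re.sub(r"\s+", " ", phrase.strip().lower())
        if cleaned:  # assumption: phrases that are empty after stripping are dropped
            normalized.add(cleaned)
    # Head-noun expansion: the last token of every multi-token phrase.
    heads = {p.split(" ")[-1] for p in normalized if " " in p}
    # Set union gives the "processed once" semantics for duplicates.
    return normalized | heads
```

Calling `build_phrase_set(["big shirt", "grey shirt"])` reproduces the example above.
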
---
## Candidate generation per phrase (FastText neighbors + canonicalization)
For each phrase `p` in the final phrase set:
1) Convert to lookup form:
   - `lookup = p.replace(" ", "_")`
2) Retrieve neighbors using FastText:
   - `neighbors = fasttext.most_similar(lookup, topn=per_phrase_k)`
   - Note: FastText neighbors may include alias tokens and other non-canonical strings.
3) **Project neighbor tokens to canonical tags** (alias -> canonical):
   - If a neighbor token is already a canonical tag (token is in `tag_counts` OR token has a TF-IDF row in `tag_to_row_index`), it maps to itself.
   - Else if it is an alias, map it via `alias2tags[token]` (may map to multiple canonical tags).
   - Else, drop it (not in closed vocabulary).
4) **Deduplicate by canonical tag** within this phrase:
   - For each canonical tag, keep the highest FastText similarity among all tokens that mapped to it.
   - Record the token that achieved that max similarity as `alias_token` for verbose reporting ("best token wins").
5) **Exact-match injection**:
   - Project the phrase's own `lookup` through the same projection logic.
   - For each canonical tag produced by that projection, inject it into the candidate set with:
     - `score_fasttext = 1.0`
     - `alias_token = lookup`
   - This ensures the phrase's own canonical tag(s) appear even though `most_similar()` often does not return the query token itself.
6) Apply NSFW filtering (if `allow_nsfw_tags=False`):
   - Drop candidate canonical tags that are present in `nsfw_tags`.
Result: for each phrase, we have a set of canonical candidate tags with:
- `score_fasttext`
- `alias_token` (token that produced the best FastText score for that canonical tag)
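
A rough sketch of steps 3 to 6, under a few assumptions: the neighbors arrive as a plain list of `(token, similarity)` pairs (so no FastText model is needed to run it), `canonical_tags` is a single set standing in for "in `tag_counts` or in `tag_to_row_index`", and the function names are hypothetical.

```python
def project_to_canonical(token, canonical_tags, alias2tags):
    """Map a token to its canonical tag(s); an empty list means 'drop'."""
    if token in canonical_tags:
        return [token]
    return list(alias2tags.get(token, []))


def candidates_for_phrase(phrase, neighbors, canonical_tags, alias2tags,
                          nsfw_tags=frozenset(), allow_nsfw_tags=True):
    """Return {canonical_tag: (score_fasttext, alias_token)} for one phrase."""
    lookup = phrase.replace(" ", "_")
    best = {}
    for token, sim in neighbors:
        for tag in project_to_canonical(token, canonical_tags, alias2tags):
            # Best token wins: keep the highest similarity per canonical tag.
            if tag not in best or sim > best[tag][0]:
                best[tag] = (sim, token)
    # Exact-match injection: the phrase's own projection is pinned at 1.0.
    for tag in project_to_canonical(lookup, canonical_tags, alias2tags):
        best[tag] = (1.0, lookup)
    if not allow_nsfw_tags:
        best = {t: v for t, v in best.items() if t not in nsfw_tags}
    return best
```

Keeping projection in its own helper means neighbor tokens and the phrase's own `lookup` go through exactly the same alias logic, which is what the exact-match step requires.
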
---
## Context similarity (TF-IDF -> SVD cosine)
Stage 2 computes one **query context vector** for the entire request:
1) Build a pseudo TF-IDF vector from the **final phrase set** (deduped + head nouns):
   - Convert each phrase to underscore form (same `lookup` rule).
   - Terms that exist in the TF-IDF vocabulary (underscore lookups) contribute `(term_count * idf(term))`.
   - OOV terms contribute nothing (but may be reported in verbose mode).
2) Project to SVD space and L2-normalize:
   - `query_vec = normalize(svd.transform(tfidf_vec))`
If the query vector has zero norm (no recognized TF-IDF terms), then `query_has_context = False` and:
- `score_context = None` for all candidates
- `score_combined = score_fasttext` (FastText-only)
If `query_has_context = True`, compute per-candidate cosine similarity when possible:
- For tags that have a TF-IDF/SVD row: `score_context_by_tag[tag] = dot(query_vec, reduced_matrix_norm[row])`
- For tags that lack a TF-IDF/SVD row: initial `score_context = None` (may be imputed per-phrase)
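
The query-vector construction can be illustrated in pure Python. This sketch assumes that `svd.transform` is a linear projection by a component matrix (as a truncated SVD would be) and that tag rows are already L2-normalized; `vocab_index`, `idf`, `components`, and both function names are placeholders rather than the project's actual objects.

```python
import math
from collections import Counter


def build_query_vector(phrases, vocab_index, idf, components):
    """Return the L2-normalized reduced query vector, or None if its norm is zero.

    vocab_index: term -> TF-IDF column; idf: per-column idf weights;
    components: projection rows, shape (n_components, n_terms).
    """
    counts = Counter(p.replace(" ", "_") for p in phrases)
    tfidf = [0.0] * len(idf)
    for term, n in counts.items():
        col = vocab_index.get(term)
        if col is not None:  # OOV terms contribute nothing
            tfidf[col] = n * idf[col]
    reduced = [sum(w * x for w, x in zip(row, tfidf)) for row in components]
    norm = math.sqrt(sum(v * v for v in reduced))
    if norm == 0.0:
        return None  # corresponds to query_has_context = False
    return [v / norm for v in reduced]


def context_score(query_vec, tag_row_norm):
    """Cosine similarity, given that both vectors are already L2-normalized."""
    return sum(q * t for q, t in zip(query_vec, tag_row_norm))
```

Returning `None` for a zero-norm vector gives callers a single check for the FastText-only fallback.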
### Missing context policy (per-phrase, q=0.10)
If `query_has_context = True` and a candidate tag has `score_context = None`:
- For that phrase, compute `default_context_for_phrase` as the 10th percentile (q=0.10) of the available (non-None) context scores among that phrase's candidates.
- If there are no available context scores for that phrase, `default_context_for_phrase = 0.0`.
- Impute missing context scores using `default_context_for_phrase` and mark:
- `context_imputed = True`
Otherwise:
- `context_imputed = False`
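
A small sketch of the per-phrase imputation policy. The contract only fixes q=0.10; the linear interpolation between order statistics used here is an assumption (it matches a common library default), and the function names are hypothetical.

```python
import math


def quantile_linear(values, q):
    """q-quantile using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def impute_context(scores, q=0.10):
    """scores: tag -> context score or None, for one phrase's candidates.

    Returns tag -> (score_context, context_imputed).
    """
    available = [s for s in scores.values() if s is not None]
    # With no available scores for this phrase, the default is 0.0.
    default = quantile_linear(available, q) if available else 0.0
    return {
        tag: (default, True) if s is None else (s, False)
        for tag, s in scores.items()
    }
```

A low percentile places tags without a TF-IDF/SVD row near the bottom of the phrase's context range without removing them from the fused ranking.
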
---
## Score fusion (FastText + Context)
Compute a fused score per phrase candidate:
- If `query_has_context = False`:
  - `score_combined = score_fasttext`
- Else:
  - `score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context`
  - (`score_context` may be imputed as described above)
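
The fusion rule is a plain linear blend. A minimal sketch follows; the function name and the example weight of 0.25 are illustrative only, since the contract does not state a value for `context_weight`.

```python
def fuse_scores(score_fasttext, score_context, context_weight, query_has_context):
    """Blend FastText and context scores; fall back to FastText-only without context."""
    if not query_has_context:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context


# Worked example with an assumed weight of 0.25:
# 0.75 * 0.8 + 0.25 * 0.4 is approximately 0.7
example = fuse_scores(0.8, 0.4, context_weight=0.25, query_has_context=True)
```
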
---
## Per-phrase truncation and must-include rule
After scoring candidates for a phrase:
- Sort by `score_combined` descending.
- Keep top `per_phrase_final_k` (typically 10).
**Must-include rule (pinned exact phrase tags)**:
- Let `required_tags` be the canonical tag(s) produced by projecting the phrase's own `lookup` (`projected_lookup`).
- Each required tag must appear in that phrase's final top `per_phrase_final_k` list, even if its fused score would otherwise place it below the cutoff.
- If the list is full, evict the lowest-ranked tag that is *not* required.
- Note: `required_tags` may contain multiple canonicals if `alias2tags` maps a token to multiple tags.
This rule applies **only to the phrase's own required tags**. It does not inject tags into other phrases' lists.
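
One way to implement truncation with pinned tags is sketched below. Two behaviors here are assumptions the contract does not spell out: a required tag that is absent from the scored set (for example, removed by NSFW filtering) is not re-added, and if the required tags alone outnumber `k`, the list is allowed to exceed `k`.

```python
def truncate_with_required(scored, required_tags, k):
    """scored: tag -> score_combined for one phrase. Returns a ranked tag list."""
    ranked = sorted(scored, key=lambda t: scored[t], reverse=True)
    kept = ranked[:k]
    # Required tags that fell below the cutoff must be pulled back in.
    missing = [t for t in ranked[k:] if t in required_tags]
    for tag in missing:
        # Evict the lowest-ranked tag that is not itself required.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i] not in required_tags:
                del kept[i]
                break
        kept.append(tag)
    return sorted(kept, key=lambda t: scored[t], reverse=True)
```

For example, with scores `{"a": 0.9, "b": 0.8, "c": 0.7, "shirt": 0.1}`, `required_tags={"shirt"}` and `k=3`, the lowest non-required tag `"c"` is evicted in favor of `"shirt"`.
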
---
## Merge across phrases (global candidate pool)
A canonical tag may appear in multiple per-phrase top-K lists. Stage 2 deduplicates tags into a single global record.
- `sources` is the union of phrases whose per-phrase lists contained the tag.
- `score_fasttext` is the maximum FastText score observed for the tag across those phrases.
- `score_context` is the maximum context cosine observed for the tag across those phrases (with `None` treated as missing).
- `score_combined` is the maximum fused score observed for the tag across those phrases.
Note:
- These maxima may come from different phrases; the global candidate row does not necessarily correspond to any single phrase's row.
- For tags with a TF-IDF row, `score_context` is phrase-invariant. Differences across phrases only arise for tags whose context score was imputed.
Finally:
- Sort global candidates by `score_combined` descending.
- Return top `global_k` candidates (and optionally all candidates if the app needs them).
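
The merge can be sketched as a field-wise maximum over per-phrase rows. The row layout (dicts keyed by the field names above) and the function name are assumptions for illustration; ties in `score_combined` keep first-seen order here, which the contract does not specify.

```python
def merge_phrase_lists(per_phrase, global_k):
    """per_phrase: phrase -> list of row dicts with keys
    tag, score_fasttext, score_context, score_combined."""

    def max_or_none(a, b):
        # None is treated as missing rather than as a value.
        if a is None:
            return b
        if b is None:
            return a
        return max(a, b)

    merged = {}
    for phrase, rows in per_phrase.items():
        for row in rows:
            rec = merged.setdefault(row["tag"], {
                "tag": row["tag"], "sources": [],
                "score_fasttext": None, "score_context": None,
                "score_combined": None,
            })
            if phrase not in rec["sources"]:
                rec["sources"].append(phrase)
            # Each maximum is taken independently, possibly from different phrases.
            for key in ("score_fasttext", "score_context", "score_combined"):
                rec[key] = max_or_none(rec[key], row[key])
    ranked = sorted(merged.values(), key=lambda r: r["score_combined"], reverse=True)
    return ranked[:global_k]
```
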
---
## Output schema
### Stage 2 return (non-verbose)
- `candidates: list[Candidate]` (ordered)
  - `tag: str` (canonical)
  - `score_combined: float`
  - `score_fasttext: float | None`
  - `score_context: float | None` (None only when `query_has_context=False` or when missing)
  - `count: int | None`
  - `sources: list[str]`
### Optional per-phrase debug report (verbose)
For each phrase:
- `phrase: str`
- `normalized: str`
- `lookup: str`
- `tfidf_vocab: bool` (lookup is in TF-IDF vocabulary)
- `oov_terms: list[str]`
- `candidates: list[CandidateRow]` (top per-phrase list)
  - `tag: str`
  - `alias_token: str`
  - `score_fasttext: float`
  - `score_context: float | None`
  - `score_combined: float`
  - `context_imputed: bool`
  - `count: int | None`
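
The two record types could be expressed as dataclasses along these lines. The field names and types follow the schema above (`Optional[X]` is equivalent to `X | None`); the default values are assumptions added for convenience, not part of the contract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    """Global (merged) candidate record returned by Stage 2."""
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: list[str] = field(default_factory=list)


@dataclass
class CandidateRow:
    """Per-phrase candidate row, present only in verbose reports."""
    tag: str
    alias_token: str
    score_fasttext: float
    score_context: Optional[float]
    score_combined: float
    context_imputed: bool
    count: Optional[int] = None
```
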
---
## Determinism and performance constraints
- Artifact loading is **lazy** (load-on-first-use, cached thereafter).
- No feature flags for old/new behavior: delete old code paths.
- Logging must be read-only and must not affect results.
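
Lazy, cached loading is commonly done with `functools.lru_cache`. The sketch below shows only the pattern: `_load_artifact` is a hypothetical stand-in for reading a model or matrix from disk, and `LOAD_CALLS` exists only to make the caching visible.

```python
from functools import lru_cache

LOAD_CALLS = []  # illustration only: records which artifacts were actually loaded


def _load_artifact(name: str) -> dict:
    # Hypothetical stand-in for an expensive disk read.
    LOAD_CALLS.append(name)
    return {"name": name}


@lru_cache(maxsize=None)
def get_artifact(name: str) -> dict:
    """Load on first use; later calls return the same cached object."""
    return _load_artifact(name)
```

Because the cache returns the same object on every call, callers should treat loaded artifacts as read-only, in line with the rule that logging must not affect results.
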
---
## NSFW tag source
- `nsfw_tags` is sourced from `word_rating_probabilities.csv` with `NSFW_THRESHOLD=0.95` as implemented in `psq_rag.retrieval.state`.