# Retrieval Contract -- Stage 2 (Retrieval Grounding / Candidate Generation)

Stage 2 performs **retrieval grounding** over a **closed vocabulary** of canonical e621-style tags.
It does not "tag images", and it does not do free-form generation. Its job is to produce a high-recall,
inspectable candidate pool for downstream **closed-set selection**.

---

## Inputs

- `rewrite_phrases: list[str]`
  - Output of Stage 1 query rewriting (comma-separated "tag-shaped" phrases).
  - Not canonical tags. Not underscored. High recall is preferred.

- `allow_nsfw_tags: bool`
  - If false, filter out tags in the project's `nsfw_tags` set.

- `verbose: bool`
  - If true, return per-phrase debug reports.

---

## Normalization and phrase expansion

1) Normalize rewrite phrases for internal processing:
- lowercase
- strip leading/trailing whitespace
- collapse internal whitespace to a single space

2) Treat the phrase list as a **set** (dedupe after normalization).

3) **Head-noun expansion**:
- For each multi-token phrase, add its head noun (last token) as an additional phrase.
- Apply the same set semantics so duplicates are processed once.

Example:
- Input phrases: `["big shirt", "grey shirt"]`
- Final phrase set: `{"big shirt", "grey shirt", "shirt"}`
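
The three steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `build_phrase_set` is hypothetical, and skipping phrases that are empty after normalization is an assumption the contract does not state.

```python
import re


def build_phrase_set(rewrite_phrases: list[str]) -> set[str]:
    """Normalize, dedupe, and add head nouns (hypothetical helper)."""
    normalized = set()
    for phrase in rewrite_phrases:
        # Lowercase, strip, and collapse internal whitespace to one space.
        cleaned = re.sub(r"\s+", " ", phrase.strip().lower())
        if cleaned:  # Assumption: phrases that normalize to "" are skipped.
            normalized.add(cleaned)

    # Head-noun expansion: the last token of every multi-token phrase.
    heads = {p.split(" ")[-1] for p in normalized if " " in p}

    # Set union gives the "process duplicates once" semantics.
    return normalized | heads


phrases = build_phrase_set(["big shirt", "grey shirt"])
```

Here `phrases` equals `{"big shirt", "grey shirt", "shirt"}`, matching the example above.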

---

## Candidate generation per phrase (FastText neighbors + canonicalization)

For each phrase `p` in the final phrase set:

1) Convert to lookup form:
- `lookup = p.replace(" ", "_")`

2) Retrieve neighbors using FastText:
- `neighbors = fasttext.most_similar(lookup, topn=per_phrase_k)`
- Note: FastText neighbors may include alias tokens and other non-canonical strings.

3) **Project neighbor tokens to canonical tags** (alias -> canonical):
- If a neighbor token is already a canonical tag (token is in `tag_counts` OR token has a TF-IDF row in `tag_to_row_index`), it maps to itself.
- Else if it is an alias, map it via `alias2tags[token]` (may map to multiple canonical tags).
- Else, drop it (not in closed vocabulary).

4) **Deduplicate by canonical tag** within this phrase:
- Keep the canonical tag with the highest FastText similarity among all tokens that mapped to it.
- Record the token that achieved that max similarity as `alias_token` for verbose reporting ("best token wins").

5) **Exact-match injection**:
- Project the phrase's own `lookup` through the same projection logic.
- For each canonical tag produced by that projection, inject it into the candidate set with:
  - `score_fasttext = 1.0`
  - `alias_token = lookup`
- This ensures the phrase's own canonical tag(s) appear in the candidate set, since `most_similar()` often does not return the query token itself.

6) Apply NSFW filtering (if `allow_nsfw_tags=False`):
- Drop candidate canonical tags that are present in `nsfw_tags`.

Result: for each phrase, we have a set of canonical candidate tags with:
- `score_fasttext`
- `alias_token` (token that produced the best FastText score for that canonical tag)
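
Steps 3 to 6 can be sketched as below. The sketch assumes the neighbors arrive as a list of `(token, similarity)` pairs and that `alias2tags` maps a token to a list of canonical tags. The function names `project_token` and `candidates_for_phrase` are hypothetical, and the FastText lookup itself is left out.

```python
def project_token(token, tag_counts, tag_to_row_index, alias2tags):
    """Map a raw token to zero or more canonical tags (hypothetical helper)."""
    if token in tag_counts or token in tag_to_row_index:
        return [token]                      # already canonical
    return list(alias2tags.get(token, []))  # alias, or [] if out of vocabulary


def candidates_for_phrase(lookup, neighbors, tag_counts, tag_to_row_index,
                          alias2tags, nsfw_tags, allow_nsfw_tags):
    """Return {canonical_tag: (score_fasttext, alias_token)} for one phrase."""
    best = {}
    for token, sim in neighbors:
        for tag in project_token(token, tag_counts, tag_to_row_index, alias2tags):
            # "Best token wins": keep the highest similarity per canonical tag.
            if tag not in best or sim > best[tag][0]:
                best[tag] = (sim, token)

    # Exact-match injection: the phrase's own projection is pinned at 1.0.
    for tag in project_token(lookup, tag_counts, tag_to_row_index, alias2tags):
        best[tag] = (1.0, lookup)

    if not allow_nsfw_tags:
        best = {t: v for t, v in best.items() if t not in nsfw_tags}
    return best


# Toy data, invented for illustration only.
demo = candidates_for_phrase(
    lookup="shirt",
    neighbors=[("tshirt", 0.8), ("tee", 0.9), ("unknown_tok", 0.95)],
    tag_counts={"shirt": 100, "t-shirt": 50},
    tag_to_row_index={"shirt": 0},
    alias2tags={"tshirt": ["t-shirt"], "tee": ["t-shirt"]},
    nsfw_tags=set(),
    allow_nsfw_tags=False,
)
```

In this toy run, `unknown_tok` is dropped, `t-shirt` keeps the score from `tee` (the better of its two aliases), and `shirt` is injected with a score of 1.0.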

---

## Context similarity (TF-IDF -> SVD cosine)

Stage 2 computes one **query context vector** for the entire request:

1) Build a pseudo TF-IDF vector from the **final phrase set** (deduped + head nouns):
- Convert each phrase to underscore form (same `lookup` rule).
- Terms that exist in the TF-IDF vocabulary (underscore lookups) contribute `(term_count * idf(term))`.
- OOV terms contribute nothing (but may be reported in verbose mode).

2) Project to SVD space and L2-normalize:
- `query_vec = normalize(svd.transform(tfidf_vec))`

If the query vector has zero norm (no recognized TF-IDF terms), then `query_has_context = False` and:
- `score_context = None` for all candidates
- `score_combined = score_fasttext` (FastText-only)

If `query_has_context = True`, compute per-candidate cosine similarity when possible:
- For tags that have a TF-IDF/SVD row: `score_context_by_tag[tag] = dot(query_vec, reduced_matrix_norm[row])`
- For tags that lack a TF-IDF/SVD row: initial `score_context = None` (may be imputed per-phrase)
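
A dependency-free sketch of the query-vector construction follows. It assumes the SVD projection is a plain linear map given as a list of component rows. The names `build_query_vec`, `vocab_index`, `idf`, and `components` are hypothetical stand-ins for the project's TF-IDF and SVD artifacts, not its actual API.

```python
import math
from collections import Counter


def build_query_vec(phrases, vocab_index, idf, components):
    """Return (query_vec, oov_terms); query_vec is None when the norm is zero."""
    counts = Counter(p.replace(" ", "_") for p in phrases)

    tfidf = [0.0] * len(vocab_index)
    oov_terms = []
    for term, term_count in counts.items():
        col = vocab_index.get(term)
        if col is None:
            oov_terms.append(term)           # OOV terms contribute nothing
        else:
            tfidf[col] = term_count * idf[col]

    # Assumption: projecting to SVD space is a matrix product with the
    # component rows (one row per reduced dimension).
    reduced = [sum(w * x for w, x in zip(row, tfidf)) for row in components]

    norm = math.sqrt(sum(v * v for v in reduced))
    if norm == 0.0:
        return None, sorted(oov_terms)       # query_has_context = False
    return [v / norm for v in reduced], sorted(oov_terms)


def cosine_to_query(query_vec, tag_row_norm):
    """Dot product of two L2-normalized vectors, i.e. their cosine."""
    return sum(a * b for a, b in zip(query_vec, tag_row_norm))
```

A return value of `None` for the vector corresponds to `query_has_context = False`, in which case every candidate falls back to its FastText score.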

### Missing context policy (per-phrase, q=0.10)

If `query_has_context = True` and a candidate tag has `score_context = None`:
- For that phrase, compute `default_context_for_phrase` as the 10th percentile (q=0.10) of the available (non-None) context scores among that phrase's candidates.
- If there are no available context scores for that phrase, `default_context_for_phrase = 0.0`.
- Impute missing context scores using `default_context_for_phrase` and mark:
  - `context_imputed = True`

Otherwise:
- `context_imputed = False`
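
The per-phrase imputation can be sketched as below. The contract fixes q=0.10 but does not say how the percentile is interpolated. The sketch assumes linear interpolation between the closest ranks, and the function name `impute_context` is hypothetical.

```python
from typing import Optional


def impute_context(scores: dict[str, Optional[float]], q: float = 0.10):
    """Return {tag: (score_context, context_imputed)} for one phrase."""
    available = sorted(s for s in scores.values() if s is not None)

    if available:
        # Assumption: linear interpolation between the two closest ranks.
        pos = q * (len(available) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(available) - 1)
        default = available[lo] + (pos - lo) * (available[hi] - available[lo])
    else:
        default = 0.0  # no context scores available for this phrase

    return {
        tag: (default, True) if s is None else (s, False)
        for tag, s in scores.items()
    }
```

Using a low percentile rather than the mean places tags without a TF-IDF row near the bottom of the phrase's context scores without zeroing them out.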

---

## Score fusion (FastText + Context)

Compute a fused score per phrase candidate:

- If `query_has_context = False`:
  - `score_combined = score_fasttext`

- Else:
  - `score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context`
  - (`score_context` may be imputed as described above)
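
The fusion rule as a small function. The function name `fuse` is hypothetical, and the `context_weight` of 0.25 in the example is illustrative, not the project's configured value.

```python
def fuse(score_fasttext, score_context, context_weight, query_has_context):
    """Linear blend of FastText and context scores (FastText-only fallback)."""
    if not query_has_context:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context


# With context_weight = 0.25: 0.75 * 0.8 + 0.25 * 0.4
blended = fuse(0.8, 0.4, context_weight=0.25, query_has_context=True)
fallback = fuse(0.8, None, context_weight=0.25, query_has_context=False)
```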

---

## Per-phrase truncation and must-include rule

After scoring candidates for a phrase:
- Sort by `score_combined` descending.
- Keep top `per_phrase_final_k` (typically 10).

**Must-include rule (pinned exact phrase tags)**:
- Let `required_tags` be the canonical tag(s) produced by projecting the phrase's own `lookup` (`projected_lookup`).
- Each required tag must appear in that phrase's final top `per_phrase_final_k` list, even if its fused score would otherwise place it below the cutoff.
- If the list is full, evict the lowest-ranked tag that is *not* required.
- Note: `required_tags` may contain multiple canonicals if `alias2tags` maps a token to multiple tags.

This rule applies **only to the phrase's own required tags**. It does not inject tags into other phrases' lists.
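
A sketch of truncation with the must-include rule follows. Two details are assumptions not stated in the contract: ties are broken by tag name to keep the order deterministic, and the list may exceed `per_phrase_final_k` when there are more required tags than slots. The function name `truncate_with_required` is hypothetical.

```python
def truncate_with_required(scored: dict[str, float],
                           required_tags: set[str],
                           k: int) -> list[str]:
    """Top-k tags by fused score, with required tags pinned into the list."""
    def rank_key(tag):
        # Assumption: ties are broken alphabetically for determinism.
        return (-scored[tag], tag)

    ranked = sorted(scored, key=rank_key)
    kept = ranked[:k]
    missing = [t for t in ranked[k:] if t in required_tags]

    for tag in missing:
        # Evict the lowest-ranked tag that is not required, if any exists.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i] not in required_tags:
                del kept[i]
                break
        # Assumption: if nothing can be evicted, the list grows past k.
        kept.append(tag)

    kept.sort(key=rank_key)
    return kept


top = truncate_with_required(
    {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}, required_tags={"d"}, k=3
)
```

In this example `d` falls below the cutoff on score alone, so `c` (the lowest-ranked non-required tag) is evicted to make room for it.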

---

## Merge across phrases (global candidate pool)

A canonical tag may appear in multiple per-phrase top-K lists. Stage 2 deduplicates tags into a single global record.

- `sources` is the union of phrases whose per-phrase lists contained the tag.
- `score_fasttext` is the maximum FastText score observed for the tag across those phrases.
- `score_context` is the maximum context cosine observed for the tag across those phrases (with `None` treated as missing).
- `score_combined` is the maximum fused score observed for the tag across those phrases.

Note:
- These maxima may come from different phrases; the global candidate row does not necessarily correspond to any single phrase's row.
- For tags with a TF-IDF row, `score_context` is phrase-invariant. Differences across phrases only arise for tags whose context score was imputed.

Finally:
- Sort global candidates by `score_combined` descending.
- Return top `global_k` candidates (and optionally all candidates if the app needs them).
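
The merge can be sketched as follows, assuming each per-phrase row is a dict with the keys used below. The function name `merge_phrase_lists` is hypothetical, and the alphabetical tie-break and sorted `sources` are assumptions added for determinism.

```python
def merge_phrase_lists(per_phrase: dict, global_k: int) -> list:
    """Merge {phrase: [row, ...]} into one ranked list of global records."""
    merged = {}
    for phrase, rows in per_phrase.items():
        for row in rows:
            rec = merged.get(row["tag"])
            if rec is None:
                merged[row["tag"]] = {
                    "tag": row["tag"],
                    "sources": {phrase},
                    "score_fasttext": row["score_fasttext"],
                    "score_context": row["score_context"],
                    "score_combined": row["score_combined"],
                }
                continue
            rec["sources"].add(phrase)
            rec["score_fasttext"] = max(rec["score_fasttext"], row["score_fasttext"])
            rec["score_combined"] = max(rec["score_combined"], row["score_combined"])
            # None is treated as missing, so it never wins the maximum.
            known = [c for c in (rec["score_context"], row["score_context"])
                     if c is not None]
            rec["score_context"] = max(known) if known else None

    ranked = sorted(merged.values(), key=lambda r: (-r["score_combined"], r["tag"]))
    return [{**r, "sources": sorted(r["sources"])} for r in ranked[:global_k]]
```

Because each maximum is taken independently, a merged record can combine the FastText score from one phrase with the fused score from another.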

---

## Output schema

### Stage 2 return (non-verbose)
- `candidates: list[Candidate]` (ordered)
  - `tag: str` (canonical)
  - `score_combined: float`
  - `score_fasttext: float | None`
  - `score_context: float | None` (`None` only when `query_has_context=False` or when no context score is available for the tag)
  - `count: int | None`
  - `sources: list[str]`

### Optional per-phrase debug report (verbose)
For each phrase:
- `phrase: str`
- `normalized: str`
- `lookup: str`
- `tfidf_vocab: bool` (lookup is in TF-IDF vocabulary)
- `oov_terms: list[str]`
- `candidates: list[CandidateRow]` (top per-phrase list)
  - `tag: str`
  - `alias_token: str`
  - `score_fasttext: float`
  - `score_context: float | None`
  - `score_combined: float`
  - `context_imputed: bool`
  - `count: int | None`

---

## Determinism and performance constraints

- Artifact loading is **lazy** (load-on-first-use, cached thereafter).
- No feature flags for old/new behavior: delete old code paths.
- Logging must be read-only and must not affect results.
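
One common way to get load-on-first-use with caching in Python is `functools.lru_cache`. The sketch below illustrates that pattern only and is not the project's loader: `_read_artifact`, `get_artifact`, and the `LOAD_CALLS` list are hypothetical.

```python
from functools import lru_cache

LOAD_CALLS = []  # Records real loads so the caching is observable.


def _read_artifact(name: str) -> dict:
    """Hypothetical stand-in for reading a model or matrix from disk."""
    LOAD_CALLS.append(name)
    return {"name": name}


@lru_cache(maxsize=None)
def get_artifact(name: str) -> dict:
    """Load on first use; later calls return the same cached object."""
    return _read_artifact(name)


first = get_artifact("tfidf")
second = get_artifact("tfidf")
```

Both calls return the same object, and `_read_artifact` runs only once.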

---

## NSFW tag source

- `nsfw_tags` is sourced from `word_rating_probabilities.csv` with `NSFW_THRESHOLD=0.95` as implemented in `psq_rag.retrieval.state`.