# Stage 1 — Query Rewriting Contract
## Purpose
Stage 1 (“Query Rewriting”) converts a free-form natural-language prompt into a
comma-separated list of short, tag-shaped phrases suitable for downstream
retrieval over a closed image-tag vocabulary.
This stage is not tagging, not normalization, and not validation.
Its sole role is to rewrite user intent into a retrieval-friendly surface form
with high recall.
---
## Inputs
- User prompt: an arbitrary string entered by the user.
- The input may include:
  - natural language
  - comma-separated phrases
  - Stable-Diffusion-style parentheses and weights
  - punctuation and spacing artifacts
No structural guarantees are assumed about the input.
---
## Pre-Rewrite Heuristics (Non-LLM)
Before the LLM rewrite is invoked, the system performs a lightweight heuristic
extraction:
- The prompt is split on "." and ","
- Segments with three or fewer whitespace-separated tokens are retained
- Case-insensitive deduplication is applied
This produces a small list of user-provided phrases that may later be appended
to the rewrite output for retrieval support.
This heuristic:
- is lossy
- is not authoritative
- exists only to preserve short explicit phrases if the rewrite fails or omits them
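
The three heuristic steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `extract_short_phrases` and the `max_tokens` parameter are hypothetical, and two details are assumptions (empty segments are dropped, and the first-seen casing of a duplicate is kept). The second call shows why the heuristic is lossy: splitting on "." also breaks Stable-Diffusion-style weights.

```python
import re


def extract_short_phrases(prompt: str, max_tokens: int = 3) -> list[str]:
    """Pull short, explicit phrases out of a raw prompt (hypothetical helper)."""
    seen: set[str] = set()
    phrases: list[str] = []
    # Split on "." and ",".
    for segment in re.split(r"[.,]", prompt):
        tokens = segment.split()
        # Assumption: drop empty segments; keep only segments of <= max_tokens tokens.
        if not tokens or len(tokens) > max_tokens:
            continue
        phrase = " ".join(tokens)
        key = phrase.casefold()  # case-insensitive deduplication key
        if key in seen:
            continue
        seen.add(key)
        phrases.append(phrase)
    return phrases


print(extract_short_phrases(
    "A red fox sitting in the snow, white background, White Background. solo"
))
# ['white background', 'solo']
print(extract_short_phrases("(fox:1.2), forest"))
# ['(fox:1', '2)', 'forest']
```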
---
## Rewrite Mechanism
Stage 1 uses a single deterministic LLM call with:
- temperature = 0.0
- no retries
- no streaming
- no structured output enforcement
The system prompt instructs the model to:
- output a comma-separated list
- use short, literal, tag-shaped phrases
- preserve coherent multi-word visual concepts
- avoid inventing details
- avoid demographic inference
- avoid guessing identities
The LLM output is treated as plain text.
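
A rough sketch of the call shape described above, under stated assumptions: the prompt wording in `SYSTEM_PROMPT` only paraphrases the listed instructions (the real prompt text is not given here), and `CompleteFn` is a hypothetical client interface rather than any specific library's API. A stub client is included so the sketch runs offline.

```python
from typing import Callable

# Hypothetical wording; it only paraphrases the instructions listed above.
SYSTEM_PROMPT = (
    "Rewrite the user's prompt as a comma-separated list of short, literal, "
    "tag-shaped phrases. Preserve coherent multi-word visual concepts. "
    "Do not invent details, infer demographics, or guess identities."
)

# Hypothetical client interface: (system_prompt, user_prompt, temperature) -> reply text.
CompleteFn = Callable[[str, str, float], str]


def rewrite_once(user_prompt: str, complete: CompleteFn) -> str:
    """Issue exactly one call at temperature 0.0 and return the raw reply."""
    # No retry loop, no streaming, no schema enforcement: the reply is plain text.
    return complete(SYSTEM_PROMPT, user_prompt, 0.0)


def stub_complete(system_prompt: str, user_prompt: str, temperature: float) -> str:
    """Stand-in for a real LLM client so the sketch runs without a network."""
    return "red fox, snow, white background"


print(rewrite_once("a red fox sitting in the snow", stub_complete))
```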
---
## Output Format
On success, Stage 1 returns:
- a single string
- containing comma-separated phrases
- with irregular spacing normalized
- truncated to roughly 800 characters
No further parsing, validation, or canonicalization is applied at this stage.
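
The spacing normalization and truncation could look roughly like the sketch below. The function name `format_rewrite` is hypothetical, the exact limit of 800 is an assumption (the text only says "approximately"), and the choice of ", " as the normalized separator is also assumed. Truncation here is a plain character cut, so the final phrase may be clipped mid-word.

```python
import re

MAX_REWRITE_CHARS = 800  # assumed exact value for "approximately 800"


def format_rewrite(raw: str, max_chars: int = MAX_REWRITE_CHARS) -> str:
    """Normalize spacing in the raw LLM text, then cap its length."""
    # Collapse every run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", raw)
    # Assumption: normalize each separator to a comma followed by one space.
    text = re.sub(r"\s*,\s*", ", ", text).strip()
    # Plain character cut; no phrase-boundary awareness.
    return text[:max_chars].rstrip()


print(format_rewrite("  red   fox ,snow,\n white  background "))
# red fox, snow, white background
```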
The rewrite may:
- reorder concepts
- merge or split phrasing
- introduce additional generic visual concepts (e.g. "white background")
---
## Failure and Fallback Behavior
If the LLM call:
- errors
- produces a refusal-like response
- returns empty output
then Stage 1 returns an empty string.
In downstream stages, this empty rewrite may be supplemented by the heuristic
phrases extracted earlier, but Stage 1 itself does not attempt recovery.
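
The failure path and the downstream supplementation might be sketched as below. Everything named here is hypothetical: `safe_rewrite`, `supplement`, and especially `REFUSAL_MARKERS`, since this document says only "refusal-like" and does not define how refusals are detected. `supplement` belongs to a downstream stage and is shown only to illustrate how an empty rewrite can still yield queries.

```python
from typing import Callable

# Hypothetical markers; the actual refusal check is not specified in this document.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def safe_rewrite(user_prompt: str, rewrite: Callable[[str], str]) -> str:
    """Return the rewrite text, or an empty string on any failure."""
    try:
        output = rewrite(user_prompt)
    except Exception:
        return ""  # the LLM call errored; no retry, no recovery
    text = (output or "").strip()
    if not text:
        return ""  # empty output
    if text.casefold().startswith(REFUSAL_MARKERS):
        return ""  # refusal-like response
    return text


def supplement(rewrite_text: str, heuristic_phrases: list[str]) -> str:
    """Downstream step (not Stage 1): append heuristic phrases to the rewrite."""
    parts = [rewrite_text] if rewrite_text else []
    return ", ".join(parts + heuristic_phrases)


print(supplement(safe_rewrite("a fox", lambda p: "I'm sorry, I can't help."), ["fox"]))
# fox
```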
---
## Explicit Non-Guarantees
Stage 1 does not guarantee that:
- output phrases correspond to known vocabulary tags
- phrases are unique
- phrases are canonicalized
- phrases are mutually exclusive
- all user concepts are preserved
- added concepts reflect ground truth
Stage 2 must not assume any of the above.
---
## Contract Boundary with Stage 2
Stage 1 guarantees only that:
- output is either an empty string (on failure) or a comma-separated list of short phrases
- phrases are intended to be retrieval queries, not canonical tags
- output is deterministic for a given input
Stage 2 is responsible for:
- normalization
- deduplication
- head-noun expansion
- vocabulary grounding
- alias handling
- scoring and ranking
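
To illustrate the boundary, here is one way a Stage 2 entry point could consume Stage 1 output without relying on any of the non-guarantees. It covers only the first two responsibilities (normalization and deduplication); the function name `split_rewrite` and the use of case folding as the normalization rule are assumptions, not a description of the actual Stage 2 code.

```python
def split_rewrite(rewrite: str) -> list[str]:
    """Turn Stage 1 text into normalized, unique query phrases (hypothetical)."""
    seen: set[str] = set()
    queries: list[str] = []
    for part in rewrite.split(","):
        # Assumed normalization: collapse whitespace and case-fold.
        phrase = " ".join(part.split()).casefold()
        # Tolerate empty pieces and duplicates, which Stage 1 does not rule out.
        if phrase and phrase not in seen:
            seen.add(phrase)
            queries.append(phrase)
    return queries


print(split_rewrite("Red Fox, red fox, , snow"))
# ['red fox', 'snow']
print(split_rewrite(""))
# []
```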
---
## Summary (Interview-Safe)
Stage 1 is a deterministic query-rewriting step that reshapes free-form text into
retrieval-friendly phrase queries. It intentionally favors recall and
surface-form alignment over correctness or canonicalization, delegating all
grounding and validation to later stages.