# Quick Start for New Sessions

## What is Prompt Squirrel?
A RAG system that converts natural language prompts → e621-style tags for furry art generation.

## Three-Stage Pipeline
1. **Stage 1 (Rewrite)**: Natural language → tag-shaped phrases (LLM)
2. **Stage 2 (Retrieval)**: Phrases → candidate tags (FastText + TF-IDF/SVD, closed vocab)
3. **Stage 3 (Selection)**: Candidates → final selected tags (LLM)
   - **Stage 3s (Structural, optional)**: Selected tags → structural inferences (e.g., clothing → topless). This is an add-on pass after Stage 3, not a fourth stage.
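
The stage flow above can be sketched as plain function composition. This is only an illustration: `run_pipeline`, `PipelineResult`, and the callable parameters are hypothetical names, not the repo's actual API, and the check that Stage 3 may only return tags that Stage 2 retrieved is an assumption drawn from the "Candidates → final selected tags" description.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PipelineResult:
    phrases: list[str]      # Stage 1 output: tag-shaped phrases
    candidates: list[str]   # Stage 2 output: candidate tags
    selected: list[str]     # Stage 3 output: final tags
    structural: list[str] = field(default_factory=list)  # Stage 3s output


def run_pipeline(
    prompt: str,
    rewrite: Callable[[str], list[str]],            # hypothetical Stage 1 hook
    retrieve: Callable[[list[str]], list[str]],     # hypothetical Stage 2 hook
    select: Callable[[str, list[str]], list[str]],  # hypothetical Stage 3 hook
    infer_structural: Optional[Callable[[list[str]], list[str]]] = None,
) -> PipelineResult:
    phrases = rewrite(prompt)
    candidates = retrieve(phrases)
    # Assumption: selection may only keep tags that retrieval produced.
    allowed = set(candidates)
    selected = [tag for tag in select(prompt, candidates) if tag in allowed]
    # Stage 3s is optional, so it is skipped when no hook is given.
    structural = infer_structural(selected) if infer_structural else []
    return PipelineResult(phrases, candidates, selected, structural)
```

Passing each stage in as a callable keeps the boundaries explicit: retrieval never sees the selection logic, and selection only sees what retrieval returned.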

## Latest Features (Feb 13-14, 2026)
- **Tag Categorization**: Organized suggestions by e621 checklist categories (species, clothing, posture, etc.)
- **Category Parser**: Parses checklist with tiers (CRITICAL/IMPORTANT/NICE_TO_HAVE/META) and constraints
- **Evaluation Metrics**: Per-category P/R/F1, ranking metrics (MRR, P@K, nDCG)
- **Multi-select Constraints**: Fixed body_type, species, gender to allow multiple tags
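
For reference, the metrics named above have standard definitions, sketched below with binary relevance (a tag is either in the ground truth or not). The function names are illustrative and are not taken from `scripts/eval_pipeline.py` or `scripts/eval_categorized.py`; MRR is the mean of `reciprocal_rank` over all evaluated samples.

```python
import math


def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one sample (or one category)."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def reciprocal_rank(ranked: list[str], gold: set[str]) -> float:
    """1 / rank of the first relevant tag, or 0.0 if none is relevant."""
    for rank, tag in enumerate(ranked, start=1):
        if tag in gold:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of the top-k ranked tags that are relevant."""
    if k <= 0:
        return 0.0
    return sum(tag in gold for tag in ranked[:k]) / k


def ndcg_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, tag in enumerate(ranked[:k], start=1)
        if tag in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Per-category numbers follow by restricting both `predicted` and `gold` to the tags of one checklist category before calling `precision_recall_f1`.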

## Key Files
- `app.py` - Gradio web interface
- `psq_rag/tagging/categorized_suggestions.py` - Category-based tag suggestions
- `psq_rag/tagging/category_parser.py` - Parse e621 checklist
- `scripts/eval_pipeline.py` - Main evaluation harness
- `scripts/eval_categorized.py` - Per-category metrics
- `scripts/analyze_threshold_grid.py` - Threshold grid analysis (score/global rank/phrase rank)
- `scripts/analyze_caption_evident_audit.py` - Caption-evident audit vs retrieval
- `docs/retrieval_contract.md` - Stage 2 spec
- `docs/stage3_contract.md` - Stage 3 spec
- `tagging_checklist.txt` - e621 tagging guidelines

## Running Code
```bash
# Always run from the repo root (Windows venv layout shown;
# on Linux/macOS the interpreter is .venv/bin/python instead)
.venv/Scripts/python.exe -m pip install -r requirements.txt
.venv/Scripts/python.exe app.py
```

## Recent Git History (Last 5 commits)
```
0f73a4b - Fix eval_categorized.py to work with eval_pipeline.py output
ff407fc - Remove binary PNG files (use Hugging Face XET storage instead)
8ba971a - Add eval results for debugging
51b7109 - Add ranking metrics infrastructure to eval pipeline
edba146 - Add per-category evaluation metrics script
```

## Key Contracts to Remember
1. **Stage boundaries are strict**: Don't mix retrieval (Stage 2) with selection (Stage 3)
2. **Keep diffs small**: One focused change per commit
3. **Code matches contracts**: Update code to match docs, not vice versa
4. **No feature flags**: Delete old code paths, no legacy behavior switches

## Quick Orientation Commands
```bash
# View project structure
ls -la

# View recent commits
git log --oneline -10

# Check current branch
git branch

# List Python modules
ls -la psq_rag/

# View evaluation results
ls -la data/eval_results/
```

## Common Tasks
- **Add category**: Edit `tagging_checklist.txt`, then update the parser (`psq_rag/tagging/category_parser.py`)
- **Eval changes**: Run `scripts/eval_pipeline.py`, then `scripts/eval_categorized.py`
- **Threshold sweeps**: Run `scripts/analyze_threshold_grid.py` (see `--mode score|rank|phrase_rank`)
- **Caption-evident audit**: Run `scripts/analyze_caption_evident_audit.py`
- **Test retrieval**: Use `scripts/smoke_test.py`
- **Debug Stage 3**: Use `scripts/stage3_debug.py`. `--phrases` is optional; if omitted, the script runs the Stage 1 rewrite first, then Stage 2 retrieval on the rewritten phrases.
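
To make the idea behind a score-threshold sweep concrete, here is a minimal sketch. It is not the logic of `scripts/analyze_threshold_grid.py`; `sweep_score_thresholds` and its input shape (a list of `(tag, score)` pairs plus a ground-truth set) are assumptions chosen for illustration.

```python
def sweep_score_thresholds(
    scored: list[tuple[str, float]],
    gold: set[str],
    thresholds: list[float],
) -> list[tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each cut-off."""
    rows = []
    for threshold in thresholds:
        # Keep only tags whose retrieval score clears the threshold.
        kept = {tag for tag, score in scored if score >= threshold}
        tp = len(kept & gold)
        precision = tp / len(kept) if kept else 0.0
        recall = tp / len(gold) if gold else 0.0
        rows.append((threshold, precision, recall))
    return rows
```

Lowering the threshold can only keep more tags, so recall never decreases as the cut-off drops, while precision may move in either direction.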

## Data Artifacts (Lazy-loaded)
- FastText embeddings (semantic similarity)
- TF-IDF + SVD matrices (context similarity)
- Alias → canonical tag mappings
- Tag counts, implications, groups, wiki definitions
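
Lazy loading here means an artifact is read from disk only on first use and then reused. A minimal sketch of that pattern, using the alias mapping as the example; the `Artifacts` class, the `aliases.json` file name, and the JSON format are all hypothetical and do not describe how the repo stores its data.

```python
import json
from functools import cached_property
from pathlib import Path


class Artifacts:
    """Holds data artifacts and loads each one on first access only."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.loads = 0  # counts disk reads, for illustration only

    @cached_property
    def alias_to_canonical(self) -> dict[str, str]:
        # Runs once; cached_property stores the result on the instance.
        self.loads += 1
        with open(self.root / "aliases.json", encoding="utf-8") as fh:  # hypothetical file
            return json.load(fh)

    def canonical(self, tag: str) -> str:
        """Map an alias to its canonical tag; unknown tags pass through unchanged."""
        return self.alias_to_canonical.get(tag, tag)
```

Constructing `Artifacts` is cheap because nothing is read until `canonical()` is first called, which matters when a script only needs a subset of the artifacts.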

## Eval Datasets
- `data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_expanded.jsonl` - Base eval set (implication-expanded GT)
- `data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl` - Caption-evident GT subset (10 samples); used to estimate retrieval ceiling from text
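
One way to read "retrieval ceiling" is the micro-averaged recall of caption-evident ground-truth tags among the retrieved candidates. The sketch below assumes that interpretation; the JSONL field names (`caption`, `tags`) and the `retrieve` callable are placeholders, since the actual schema is not described here.

```python
import json
from typing import Callable, Iterable


def read_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def micro_recall(
    samples: list[dict],
    retrieve: Callable[[str], list[str]],
    text_key: str = "caption",  # hypothetical field name
    gt_key: str = "tags",       # hypothetical field name
) -> float:
    """Share of all ground-truth tags that appear among retrieved candidates."""
    hits = total = 0
    for sample in samples:
        gold = set(sample[gt_key])
        candidates = set(retrieve(sample[text_key]))
        hits += len(gold & candidates)
        total += len(gold)
    return hits / total if total else 0.0
```

With a real file, `read_jsonl` can be fed an open file handle directly, because iterating a text file yields its lines.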

## New Eval Features (Feb 2026)
- `eval_pipeline.py` now logs Stage 3 selection scores and ranks:
  - `stage3_selected_scores` (retrieval score)
  - `stage3_selected_ranks` (global rank)
  - `stage3_selected_phrase_ranks` (per-phrase rank)
- New CLI flag: `--per-phrase-final-k` to control per-phrase retrieval cap
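
The three logged quantities (retrieval score, global rank, per-phrase rank) and the per-phrase cap can be related as in the sketch below. This is an assumed reading, not the code in `eval_pipeline.py`: `cap_and_rank` is a hypothetical helper, and keeping the best-scoring occurrence when a tag is retrieved for several phrases is a choice made for the example.

```python
def cap_and_rank(
    per_phrase: dict[str, list[tuple[str, float]]],
    per_phrase_final_k: int,
) -> dict[str, dict[str, float]]:
    """Cap each phrase's candidates at k, then rank the surviving tags globally."""
    best: dict[str, tuple[float, int]] = {}  # tag -> (score, phrase_rank)
    for scored in per_phrase.values():
        # Per-phrase rank: position within this phrase's own ranking, after the cap.
        ranked = sorted(scored, key=lambda item: -item[1])[:per_phrase_final_k]
        for phrase_rank, (tag, score) in enumerate(ranked, start=1):
            if tag not in best or score > best[tag][0]:
                best[tag] = (score, phrase_rank)
    # Global rank: position across all phrases, ordered by score.
    ordered = sorted(best.items(), key=lambda item: -item[1][0])
    return {
        tag: {"score": score, "global_rank": global_rank, "phrase_rank": phrase_rank}
        for global_rank, (tag, (score, phrase_rank)) in enumerate(ordered, start=1)
    }
```

A tag can therefore have a good per-phrase rank but a poor global rank when its phrase produced generally low scores, which is why both are worth logging.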

## NSFW Handling
- NSFW tags are filtered via `word_rating_probabilities.csv` (threshold 0.95)
- Stage 2 removes NSFW tags when `allow_nsfw_tags=False`
- Stage 3 doesn't need policy flags, since Stage 2 already filters; any check there would be defense-in-depth only
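
A minimal sketch of threshold-based filtering follows. The CSV column names (`word`, `nsfw_probability`) and the reading that a tag counts as NSFW when its probability is at or above 0.95 are assumptions; only the file name, the threshold value, and the `allow_nsfw_tags` flag come from the notes above.

```python
import csv
from typing import Iterable, TextIO


def load_nsfw_tags(csv_file: TextIO, threshold: float = 0.95) -> set[str]:
    """Collect tags whose NSFW probability meets the threshold."""
    reader = csv.DictReader(csv_file)
    return {
        row["word"]  # hypothetical column name
        for row in reader
        if float(row["nsfw_probability"]) >= threshold  # hypothetical column name
    }


def filter_candidates(
    candidates: Iterable[str],
    nsfw_tags: set[str],
    allow_nsfw_tags: bool = False,
) -> list[str]:
    """Drop NSFW tags from Stage 2 candidates unless explicitly allowed."""
    if allow_nsfw_tags:
        return list(candidates)
    return [tag for tag in candidates if tag not in nsfw_tags]
```

Doing this once in Stage 2 means Stage 3 only ever sees candidates that already respect the policy.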