# Prompt Squirrel RAG - Project Summary

## Overview

**Prompt Squirrel** is an advanced RAG (Retrieval-Augmented Generation) system designed to transform natural language prompts into structured e621-style tags for furry art image generation. The system uses a multi-stage pipeline combining FastText embeddings, TF-IDF similarity, SVD dimensionality reduction, and LLM-based selection to convert user prompts into canonical, well-formed tag sets.

### Key Metadata
- **Platform**: Hugging Face Gradio application
- **License**: Apache 2.0
- **Python Version**: 3.10.12
- **Gradio SDK**: 5.43.1
- **Main Application**: `app.py`

---

## System Architecture

The system implements a **three-stage pipeline**, plus an optional structural inference step (Stage 3s), with strict contracts between stages:

### Stage 1: Query Rewriting (LLM-based)
- **Purpose**: Transform natural language into "tag-shaped" comma-separated phrases
- **Input**: User's natural language prompt
- **Output**: List of normalized phrases (not canonical tags yet)
- **Module**: `psq_rag.llm.rewrite`

### Stage 2: Retrieval Grounding / Candidate Generation (Closed Vocabulary)
- **Purpose**: Generate high-recall candidate pool from a closed vocabulary of canonical e621 tags
- **Key Operations**:
  - Phrase normalization and head-noun expansion
  - FastText neighbor retrieval with alias-to-canonical projection
  - TF-IDF/SVD context similarity scoring
  - Score fusion (FastText + context weighted)
  - Per-phrase top-K truncation with must-include rules
  - Global candidate pool merging
- **Contract**: `docs/retrieval_contract.md`
- **Module**: `psq_rag.retrieval.psq_retrieval`
- **State Management**: `psq_rag.retrieval.state`
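
The fusion, truncation, and merge operations listed above can be sketched roughly as follows. This is an illustration only, not the code in `psq_rag.retrieval.psq_retrieval`: the function names, the 0.7/0.3 weights, and the dictionary layout are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Scored = List[Tuple[str, float]]


def fuse_scores(fasttext: Dict[str, float], context: Dict[str, float],
                w_fasttext: float = 0.7, w_context: float = 0.3) -> Dict[str, float]:
    """Weighted sum of FastText and context similarity per candidate tag.

    Assumes every tag in `fasttext` already has a context score
    (the weights are illustrative, not the project's values).
    """
    return {tag: w_fasttext * ft + w_context * context[tag]
            for tag, ft in fasttext.items()}


def top_k_with_must_include(fused: Dict[str, float], k: int,
                            must_include: Set[str]) -> Scored:
    """Keep the k best tags, then append must-include tags below the cutoff."""
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k] + [(t, s) for t, s in ranked[k:] if t in must_include]


def merge_pools(per_phrase: Iterable[Scored]) -> Dict[str, float]:
    """Union the per-phrase lists, keeping each tag's best fused score."""
    pool: Dict[str, float] = {}
    for candidates in per_phrase:
        for tag, score in candidates:
            if tag not in pool or score > pool[tag]:
                pool[tag] = score
    return pool
```

Sorting on `(-score, tag)` keeps the ranking deterministic when two tags tie, which fits the contract-driven design described in this document.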

### Stage 3: Closed-Set Selection (LLM-based)
- **Purpose**: Select final tags from Stage 2 candidates
- **Modes**:
  - Single-shot (one LLM call for all candidates)
  - Chunked map-union (parallel processing with no LLM reduce)
- **Contract**: `docs/stage3_contract.md`
- **Module**: `psq_rag.llm.select`
- **Output**: Selected tags with optional rationale codes (explicit, strong_implied, weak_implied, style_or_meta, other)
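
A minimal sketch of the chunked map-union mode, assuming a hypothetical `select_chunk` callable that stands in for one LLM call over a chunk of candidates. The chunk size, worker count, and function names are illustrative and are not taken from `psq_rag.llm.select`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split the candidate list into consecutive chunks of at most `size`."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def map_union_select(candidates: Sequence[str],
                     select_chunk: Callable[[List[str]], List[str]],
                     chunk_size: int = 50,
                     max_workers: int = 4) -> List[str]:
    """Run the selector on each chunk in parallel and union the picks.

    There is no reduce call: the per-chunk results are simply merged,
    deduplicated, and restricted to the closed candidate set.
    """
    chunks = chunked(candidates, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(select_chunk, chunks))  # order is preserved

    allowed = set(candidates)
    seen = set()
    selected: List[str] = []
    for picks in results:
        for tag in picks:
            # Discard anything the selector returned from outside the candidates.
            if tag in allowed and tag not in seen:
                seen.add(tag)
                selected.append(tag)
    return selected
```

Filtering against `allowed` is what keeps the selection closed-set even if a selector call returns a tag that was never offered.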

### Stage 3s: Structural Tag Inference (Optional)
- **Purpose**: Infer structural/implied tags from selected tags
- **Example**: clothing → topless/bottomless based on what clothing is present
- **Implementation**: Group-based system using wiki data
- **Module**: `psq_rag.llm.select.llm_infer_structural_tags`

---

## Recent Development Work

### Tag Categorization Pipeline (Latest Work - Feb 13-14, 2026)

Implementation of a categorized tag suggestion system based on the e621 tagging checklist:

1. **Category Parser** (`psq_rag/tagging/category_parser.py`):
   - Parses e621 tagging checklist into structured categories
   - Supports category tiers: CRITICAL, IMPORTANT, NICE_TO_HAVE, META
   - Constraint types: exactly_one, multi, multi_or_none
   - Categories: body_type, species, gender, clothing, location, perspective, posture, etc.

2. **Categorized Suggestions** (`psq_rag/tagging/categorized_suggestions.py`):
   - Generates TF-IDF similarity-ranked suggestions per category
   - Identifies already-selected tags by category
   - Organizes suggestions for guided user tagging

3. **Evaluation Infrastructure**:
   - **eval_pipeline.py**: Core evaluation script with parallel processing
   - **eval_categorized.py**: Per-category metrics (precision, recall, F1)
   - Support for ranking metrics (MRR, Precision@K, nDCG)
   - Ground truth annotation expansion via tag implications
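
For reference, the ranking metrics named above can be computed as in the sketch below. It uses the standard binary-relevance definitions; the function names are illustrative, and the evaluation scripts may differ in details such as how short rankings are handled.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant tag, or 0.0 if none is retrieved."""
    for rank, tag in enumerate(ranked, start=1):
        if tag in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    values = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant tags (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(1 for tag in ranked[:k] if tag in relevant) / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, tag in enumerate(ranked[:k], start=1)
              if tag in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```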

### Structural Inference System (Feb 12-13, 2026)
- Redesigned as group-based system using wiki definitions
- Extracted tag groups and wiki data from e621
- Added diagnostic scripts for clothing inference
- Improved topless/bottomless definitions to prevent confusion
- Fixed Windows encoding issues

### Tag Implication System (Feb 10-11, 2026)
- Integrated `tag_implications-2023-07-20.csv`
- Automatic tag expansion (e.g., fox → canine → canid → mammal)
- Expanded ground truth annotations for evaluation
- Leaf-only metrics to avoid penalizing implied tags
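
Transitive expansion and the leaf-only view can be sketched as below. The implication map is written by hand, because the column layout of the CSV is not described here, and the function names are assumptions for the example.

```python
from typing import Dict, Iterable, Set


def expand_implications(tags: Iterable[str],
                        implies: Dict[str, Set[str]]) -> Set[str]:
    """Return the tags plus everything they imply, transitively."""
    expanded = set(tags)
    stack = list(expanded)
    while stack:
        tag = stack.pop()
        for parent in implies.get(tag, ()):
            if parent not in expanded:  # also guards against cycles
                expanded.add(parent)
                stack.append(parent)
    return expanded


def leaf_tags(tags: Iterable[str], implies: Dict[str, Set[str]]) -> Set[str]:
    """Drop every tag that is implied by another tag in the same set."""
    tag_set = set(tags)
    implied: Set[str] = set()
    for tag in tag_set:
        implied |= expand_implications([tag], implies) - {tag}
    return tag_set - implied


# Hand-written map mirroring the fox example above.
IMPLIES = {"fox": {"canine"}, "canine": {"canid"}, "canid": {"mammal"}}
```

With this map, `expand_implications({"fox"}, IMPLIES)` yields `fox`, `canine`, `canid`, and `mammal`, while `leaf_tags` applied to that expanded set returns only `fox`. Scoring on leaves avoids counting one correct species tag as four hits, or one missing species tag as four misses.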

### Evaluation Enhancements (Feb 10-14, 2026)
- Added `--min-why` threshold filtering (explicit, strong_implied, weak_implied)
- Per-tag evidence tracking
- Compact eval output format
- Retrieval gap analysis scripts
- Multiple eval runs with different configurations
- Stored eval results in `data/eval_results/`
- Added per-phrase retrieval cap flag: `--per-phrase-final-k`
- Added Stage 3 selection score/rank logging for post-hoc threshold analysis
- Added score/global-rank/phrase-rank grid analysis script
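
As an illustration of `--min-why` filtering, the sketch below keeps a selected tag only when its rationale code is at least as strong as the threshold. The ordering explicit > strong_implied > weak_implied follows the list above. Dropping tags with any other code (such as `style_or_meta` or `other`) is an assumption of this sketch, as are the function name and the input layout.

```python
from typing import Dict, List

# Strongest evidence first; these are the threshold values listed above.
WHY_ORDER = ("explicit", "strong_implied", "weak_implied")


def filter_by_min_why(selected: Dict[str, str], min_why: str) -> List[str]:
    """Keep tags whose rationale code is at least as strong as `min_why`.

    `selected` maps tag -> rationale code. Codes outside WHY_ORDER are
    dropped here, which is an assumption of this sketch.
    """
    if min_why not in WHY_ORDER:
        raise ValueError(f"unknown min_why threshold: {min_why!r}")
    allowed = set(WHY_ORDER[:WHY_ORDER.index(min_why) + 1])
    return [tag for tag, why in selected.items() if why in allowed]
```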

### Code Quality Improvements
- Removed binary PNG files (migrated to Hugging Face XET storage)
- Fixed eval_categorized.py compatibility with eval_pipeline.py output
- Enhanced diagnostic and analysis scripts
- Ensured tagging checklist loads from repo root if present
- Forced UTF-8 stdout/stderr in eval pipeline to avoid Windows encoding crashes
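
One common way to force UTF-8 output is `TextIOWrapper.reconfigure`, available since Python 3.7. The sketch below is a plausible version of that fix, not necessarily what the eval pipeline does. The helper name and the `errors="replace"` policy are assumptions.

```python
import sys
from typing import Iterable, Optional, TextIO


def force_utf8_stdio(streams: Optional[Iterable[TextIO]] = None) -> None:
    """Switch text streams to UTF-8 so non-ASCII output cannot crash printing.

    Defaults to sys.stdout and sys.stderr. Streams without a
    `reconfigure` method (for example, captured output) are left alone.
    """
    if streams is None:
        streams = (sys.stdout, sys.stderr)
    for stream in streams:
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(encoding="utf-8", errors="replace")
```

Calling `force_utf8_stdio()` once at script start-up, before any output is written, covers consoles whose default code page cannot encode characters such as the 🐿️ mascot.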

---

## Key Data Files

### Artifacts (loaded lazily)
- **FastText embeddings**: Compressed format for semantic similarity
- **TF-IDF vectors + SVD**: For context-based tag similarity
- **Alias mappings**: Non-canonical → canonical tag projection
- **Tag counts**: Frequency information from corpus
- **Tag implications**: Hierarchical tag relationships (e.g., species → family)
- **Tag groups** (`data/tag_groups.json`): Structured tag families for inference
- **Tag wiki definitions** (`data/tag_wiki_defs.json`): e621 wiki data for tags

### Configuration
- **tagging_checklist.txt**: e621 tagging guidelines and categories
- **word_rating_probabilities.csv**: NSFW tag classification (threshold 0.95)
- **fluffyrock_3m.csv**: Large tag corpus dataset
- **SamplePrompts.csv**: Test prompts for development
- **TagDocumentation.txt**: e621 tag documentation

### Evaluation
- **data/eval_samples/**: Test images with ground truth annotations
- **data/eval_results/**: Stored evaluation results (JSONL format)
- **eval_analysis.txt**: Latest per-category performance metrics
- **data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl**: Caption-evident GT subset (10 samples) for retrieval-ceiling audits

---

## Development Practices (from AGENTS.md)

### Environment
- **OS**: Windows / PowerShell (though currently running on Linux)
- **Python**: Always use `.venv\Scripts\python.exe` (Windows) or the equivalent venv interpreter on other platforms
- **Working Directory**: Always run from repo root

### Code Discipline
- Keep diffs small: one issue or one focused step per patch
- Do not rewrite large files
- Do not move logic across modules unless contract requires it
- Preserve stage boundaries: rewriting (LLM) vs retrieval (candidate generation) vs selection (index-only)

### Contracts
- Follow contracts in `docs/` directory
- If the code's behavior conflicts with a contract, update the code to match the contract (not vice versa)
- Contracts define deterministic behavior; do not add feature flags to keep old and new paths side by side

---


## Testing & Evaluation

### Scripts
- **scripts/eval_pipeline.py**: Main evaluation harness
  - Parallel processing support
  - Multiple min_why thresholds
  - Ground truth comparison with implications expansion
  - `--per-phrase-final-k` retrieval cap control
  - Logs `stage3_selected_scores`, `stage3_selected_ranks`, `stage3_selected_phrase_ranks`

- **scripts/eval_categorized.py**: Per-category evaluation
  - Precision, recall, F1 per category
  - Constraint validation (exactly_one, multi, etc.)
  - Tier-based aggregation (CRITICAL, IMPORTANT, etc.)

- **scripts/analyze_compact_eval.py**: Compact evaluation analysis
- **scripts/analyze_retrieval_gaps.py**: Retrieval gap identification
- **scripts/analyze_threshold_grid.py**: Post-hoc threshold grids (score/global rank/phrase rank)
- **scripts/analyze_caption_evident_audit.py**: Caption-evident audit vs retrieval (optional implication expansion)
- **scripts/diagnose_structural_clothing.py**: Clothing inference diagnostics
- **scripts/extract_wiki_data.py**: e621 wiki data extraction
- **scripts/smoke_test.py**: Quick pipeline validation
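
The per-category metrics and constraint checks can be sketched as below. The constraint semantics are assumed, not taken from the checklist parser: `exactly_one` means one tag, `multi` means one or more, and `multi_or_none` means any number. All function names are illustrative.

```python
from typing import Dict, Set, Tuple


def prf1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one pair of tag sets."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_category_metrics(predicted: Set[str], gold: Set[str],
                         categories: Dict[str, Set[str]]
                         ) -> Dict[str, Tuple[float, float, float]]:
    """Score each category on only the tags that belong to it."""
    return {name: prf1(predicted & members, gold & members)
            for name, members in categories.items()}


def satisfies_constraint(constraint: str, n_selected: int) -> bool:
    """Check a category's tag count against its (assumed) constraint meaning."""
    if constraint == "exactly_one":
        return n_selected == 1
    if constraint == "multi":
        return n_selected >= 1
    if constraint == "multi_or_none":
        return n_selected >= 0
    raise ValueError(f"unknown constraint: {constraint!r}")
```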

### Sample Scripts
- **scripts/rewrite_playground.py**: Stage 1 testing
- **scripts/stage3_debug.py**: Stage 3 debugging
- **scripts/test_categorized_suggestions.py**: Category suggestion testing
- **scripts/test_parser_only.py**: Parser validation

---

## Gradio Application (app.py)

### Features
- Image upload support
- Natural language prompt input
- Multi-stage pipeline execution
- Verbose retrieval reporting (optional)
- NSFW tag filtering (configurable)
- Final prompt composition with deduplication
- Mascot branding (🐿️ squirrel)

### Configuration
- `allow_nsfw_tags`: NSFW content filtering
- `verbose_retrieval`: Debug output for Stage 2
- `verbose_retrieval_limit`: Max candidates to display (20)
- Logging: Controlled via `PSQ_LOG_LEVEL` environment variable
- Gradio analytics disabled to avoid threading errors
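
Reading the log level from `PSQ_LOG_LEVEL` might look like the sketch below. The `psq_rag` logger name, the `INFO` default, and the fallback for unrecognized values are assumptions, not confirmed behavior of `app.py`.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Set the package logger level from PSQ_LOG_LEVEL and return it.

    Unknown level names fall back to INFO (an assumption of this sketch).
    """
    name = os.environ.get("PSQ_LOG_LEVEL", default).strip().upper()
    level = getattr(logging, name, None)  # e.g. logging.DEBUG == 10
    if not isinstance(level, int):
        level = logging.INFO
    logging.getLogger("psq_rag").setLevel(level)
    return level
```

For example, launching the app with `PSQ_LOG_LEVEL=debug` set in the environment would make `configure_logging()` return `logging.DEBUG`.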

---

## Key Technical Innovations

1. **Alias-to-Canonical Projection**: Handles non-canonical tag variants and projects them to e621 canonical forms

2. **Head-Noun Expansion**: Automatically extracts head nouns from multi-word phrases (e.g., "big shirt" → also search "shirt")

3. **Dual Scoring**: FastText semantic similarity + TF-IDF/SVD context similarity with weighted fusion

4. **Must-Include Rules**: Ensures exact phrase matches appear in results even if scores are lower

5. **Context Imputation**: Handles missing context scores with per-phrase 10th percentile fallback

6. **Chunked Map-Union**: Scalable LLM selection for large candidate sets without LLM reduce overhead

7. **Tag Implications**: Automatic hierarchical tag expansion for complete tagging

8. **Categorized Suggestions**: TF-IDF-ranked suggestions organized by e621 checklist categories
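
Two of the techniques above, head-noun expansion and context imputation, are small enough to sketch directly. The last-token heuristic for the head noun, the linear-interpolation percentile, and the 0.0 fallback when no score is observed are assumptions of this example. The actual retrieval code may use different rules.

```python
import math
from typing import Dict, List, Optional


def expand_head_noun(phrase: str) -> List[str]:
    """Return the normalized phrase plus its last token as an extra query."""
    tokens = phrase.lower().split()
    if not tokens:
        return []
    full = " ".join(tokens)
    return [full] if len(tokens) == 1 else [full, tokens[-1]]


def percentile(values: List[float], q: float) -> float:
    """q-th percentile (0-100) of a non-empty list, linearly interpolated."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def impute_context(scores: Dict[str, Optional[float]],
                   q: float = 10.0) -> Dict[str, float]:
    """Fill missing context scores for one phrase with a low percentile.

    `scores` maps candidate tag -> context score, or None when missing.
    """
    observed = [s for s in scores.values() if s is not None]
    fallback = percentile(observed, q) if observed else 0.0
    return {tag: fallback if s is None else s for tag, s in scores.items()}
```

Here `expand_head_noun("big shirt")` returns `["big shirt", "shirt"]`. A low percentile such as the 10th gives a candidate with no context score a below-average value without zeroing it out of the fused ranking.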

---

## Next Steps / Known Work Areas

Based on recent commits, the project is actively working on:

1. **Tag Categorization Pipeline**: Providing structured, category-based tag suggestions to users

2. **Evaluation Metrics**: Comprehensive per-category and ranking metrics to measure system quality

3. **Structural Inference**: Improving clothing and body-state inference from selected tags

4. **Ground Truth Quality**: Expanding and improving evaluation datasets with proper implication handling

5. **Production Deployment**: Optimizing for Hugging Face Spaces deployment (binary file handling, logging, etc.)

---

## Dependencies

Core packages (requirements.txt):
- gradio==4.44.1 / gradio-client==1.3.0 - Web interface
- hnswlib==0.8.0 - Fast nearest neighbor search
- numpy==1.25.1 - Numerical operations
- scikit-learn==1.4.1.post1 - ML utilities (TF-IDF, SVD)
- h5py==3.8.0 - HDF5 file handling
- compress-fasttext - Compressed FastText embeddings
- lark-parser - Grammar parsing (for prompt parsing)
- scipy==1.12.0 - Scientific computing
- gensim==4.3.2 - Word embeddings
- huggingface_hub<1.0 - Dataset/model hosting
- rapidfuzz>=3.0 - Fast string matching

---

## Summary of Session Work

In this session (and recent sessions based on git history), we have:

1. ✅ **Built tag categorization infrastructure** based on e621 checklist
2. ✅ **Created category parser** with tier and constraint support
3. ✅ **Implemented TF-IDF-based categorized suggestions**
4. ✅ **Added comprehensive evaluation metrics** (per-category P/R/F1, ranking metrics)
5. ✅ **Fixed multi-select constraint handling** for body_type, species, gender
6. ✅ **Improved structural inference system** with group-based wiki data approach
7. ✅ **Enhanced evaluation pipeline** with parallel processing and implication expansion
8. ✅ **Added diagnostic and analysis tools** for debugging and quality assessment
9. ✅ **Cleaned up binary files** and moved to proper XET storage on Hugging Face

The project now has a complete three-stage pipeline with optional structural inference, an evaluation harness that reports per-category and ranking metrics, and category-based tag organization aligned with the e621 tagging checklist.