# Prompt Squirrel RAG - Project Summary

## Overview

**Prompt Squirrel** is an advanced RAG (Retrieval-Augmented Generation) system designed to transform natural language prompts into structured e621-style tags for furry art image generation. The system uses a multi-stage pipeline combining FastText embeddings, TF-IDF similarity, SVD dimensionality reduction, and LLM-based selection to convert user prompts into canonical, well-formed tag sets.

### Key Metadata

- **Platform**: Hugging Face Gradio application
- **License**: Apache 2.0
- **Python Version**: 3.10.12
- **Gradio SDK**: 5.43.1
- **Main Application**: `app.py`

---

## System Architecture

The system implements a **three-stage pipeline** with strict contracts between stages:

### Stage 1: Query Rewriting (LLM-based)

- **Purpose**: Transform natural language into "tag-shaped" comma-separated phrases
- **Input**: User's natural language prompt
- **Output**: List of normalized phrases (not canonical tags yet)
- **Module**: `psq_rag.llm.rewrite`

### Stage 2: Retrieval Grounding / Candidate Generation (Closed Vocabulary)

- **Purpose**: Generate high-recall candidate pool from a closed vocabulary of canonical e621 tags
- **Key Operations**:
  - Phrase normalization and head-noun expansion
  - FastText neighbor retrieval with alias-to-canonical projection
  - TF-IDF/SVD context similarity scoring
  - Score fusion (FastText + context weighted)
  - Per-phrase top-K truncation with must-include rules
  - Global candidate pool merging
- **Contract**: `docs/retrieval_contract.md`
- **Module**: `psq_rag.retrieval.psq_retrieval`
- **State Management**: `psq_rag.retrieval.state`

### Stage 3: Closed-Set Selection (LLM-based)

- **Purpose**: Select final tags from Stage 2 candidates
- **Modes**:
  - Single-shot (one LLM call for all candidates)
  - Chunked map-union (parallel processing with no LLM reduce)
- **Contract**: `docs/stage3_contract.md`
- **Module**: `psq_rag.llm.select`
- **Output**: Selected tags with optional rationale codes
  (explicit, strong_implied, weak_implied, style_or_meta, other)

### Stage 3s: Structural Tag Inference (Optional)

- **Purpose**: Infer structural/implied tags from selected tags
- **Example**: clothing → topless/bottomless based on what clothing is present
- **Implementation**: Group-based system using wiki data
- **Module**: `psq_rag.llm.select.llm_infer_structural_tags`

---

## Recent Development Work

### Tag Categorization Pipeline (Latest Work - Feb 13-14, 2026)

Implementation of a categorized tag suggestion system based on the e621 tagging checklist:

1. **Category Parser** (`psq_rag/tagging/category_parser.py`):
   - Parses e621 tagging checklist into structured categories
   - Supports category tiers: CRITICAL, IMPORTANT, NICE_TO_HAVE, META
   - Constraint types: exactly_one, multi, multi_or_none
   - Categories: body_type, species, gender, clothing, location, perspective, posture, etc.

2. **Categorized Suggestions** (`psq_rag/tagging/categorized_suggestions.py`):
   - Generates TF-IDF similarity-ranked suggestions per category
   - Identifies already-selected tags by category
   - Organizes suggestions for guided user tagging

3. **Evaluation Infrastructure**:
   - **eval_pipeline.py**: Core evaluation script with parallel processing
   - **eval_categorized.py**: Per-category metrics (precision, recall, F1)
   - Support for ranking metrics (MRR, Precision@K, nDCG)
   - Ground truth annotation expansion via tag implications

### Structural Inference System (Feb 12-13, 2026)

- Redesigned as group-based system using wiki definitions
- Extracted tag groups and wiki data from e621
- Added diagnostic scripts for clothing inference
- Improved topless/bottomless definitions to prevent confusion
- Fixed Windows encoding issues

### Tag Implication System (Feb 10-11, 2026)

- Integrated `tag_implications-2023-07-20.csv`
- Automatic tag expansion (e.g., fox → canine → canid → mammal)
- Expanded ground truth annotations for evaluation
- Leaf-only metrics to avoid penalizing implied tags

### Evaluation Enhancements (Feb 10-14, 2026)

- Added `--min-why` threshold filtering (explicit, strong_implied, weak_implied)
- Per-tag evidence tracking
- Compact eval output format
- Retrieval gap analysis scripts
- Multiple eval runs with different configurations
- Stored eval results in `data/eval_results/`
- Added per-phrase retrieval cap flag: `--per-phrase-final-k`
- Added Stage 3 selection score/rank logging for post-hoc threshold analysis
- Added score/global-rank/phrase-rank grid analysis script

### Code Quality Improvements

- Removed binary PNG files (migrated to Hugging Face XET storage)
- Fixed eval_categorized.py compatibility with eval_pipeline.py output
- Enhanced diagnostic and analysis scripts
- Ensured tagging checklist loads from repo root if present
- Forced UTF-8 stdout/stderr in eval pipeline to avoid Windows encoding crashes

---

## Key Data Files

### Artifacts (loaded lazily)

- **FastText embeddings**: Compressed format for semantic similarity
- **TF-IDF vectors + SVD**: For context-based tag similarity
- **Alias mappings**: Non-canonical → canonical tag projection
- **Tag counts**: Frequency information from
  corpus
- **Tag implications**: Hierarchical tag relationships (e.g., species → family)
- **Tag groups** (`data/tag_groups.json`): Structured tag families for inference
- **Tag wiki definitions** (`data/tag_wiki_defs.json`): E621 wiki data for tags

### Configuration

- **tagging_checklist.txt**: E621 tagging guidelines and categories
- **word_rating_probabilities.csv**: NSFW tag classification (threshold 0.95)
- **fluffyrock_3m.csv**: Large tag corpus dataset
- **SamplePrompts.csv**: Test prompts for development
- **TagDocumentation.txt**: E621 tag documentation

### Evaluation

- **data/eval_samples/**: Test images with ground truth annotations
- **data/eval_results/**: Stored evaluation results (JSONL format)
- **eval_analysis.txt**: Latest per-category performance metrics
- **data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl**: Caption-evident GT subset (10 samples) for retrieval-ceiling audits

---

## Development Practices (from AGENTS.md)

### Environment

- **OS**: Windows / PowerShell (though currently running on Linux)
- **Python**: Always use `.venv\Scripts\python.exe` (Windows) or venv equivalent
- **Working Directory**: Always run from repo root

### Code Discipline

- Keep diffs small: one issue or one focused step per patch
- Do not rewrite large files
- Do not move logic across modules unless contract requires it
- Preserve stage boundaries: rewriting (LLM) vs retrieval (candidate generation) vs selection (index-only)

### Contracts

- Follow contracts in `docs/` directory
- If behavior conflicts with code, update code to match contract (not vice versa)
- Contracts define deterministic behavior, no feature flags for old/new paths

---

## Testing & Evaluation

### Scripts

- **scripts/eval_pipeline.py**: Main evaluation harness
  - Parallel processing support
  - Multiple min_why thresholds
  - Ground truth comparison with implications expansion
  - `--per-phrase-final-k` retrieval cap control
  - Logs `stage3_selected_scores`,
    `stage3_selected_ranks`, `stage3_selected_phrase_ranks`
- **scripts/eval_categorized.py**: Per-category evaluation
  - Precision, recall, F1 per category
  - Constraint validation (exactly_one, multi, etc.)
  - Tier-based aggregation (CRITICAL, IMPORTANT, etc.)
- **scripts/analyze_compact_eval.py**: Compact evaluation analysis
- **scripts/analyze_retrieval_gaps.py**: Retrieval gap identification
- **scripts/analyze_threshold_grid.py**: Post-hoc threshold grids (score/global rank/phrase rank)
- **scripts/analyze_caption_evident_audit.py**: Caption-evident audit vs retrieval (optional implication expansion)
- **scripts/diagnose_structural_clothing.py**: Clothing inference diagnostics
- **scripts/extract_wiki_data.py**: E621 wiki data extraction
- **scripts/smoke_test.py**: Quick pipeline validation

### Sample Scripts

- **scripts/rewrite_playground.py**: Stage 1 testing
- **scripts/stage3_debug.py**: Stage 3 debugging
- **scripts/test_categorized_suggestions.py**: Category suggestion testing
- **scripts/test_parser_only.py**: Parser validation

---

## Gradio Application (app.py)

### Features

- Image upload support
- Natural language prompt input
- Multi-stage pipeline execution
- Verbose retrieval reporting (optional)
- NSFW tag filtering (configurable)
- Final prompt composition with deduplication
- Mascot branding (🐿️ squirrel)

### Configuration

- `allow_nsfw_tags`: NSFW content filtering
- `verbose_retrieval`: Debug output for Stage 2
- `verbose_retrieval_limit`: Max candidates to display (20)
- Logging: Controlled via `PSQ_LOG_LEVEL` environment variable
- Gradio analytics disabled to avoid threading errors

---

## Key Technical Innovations

1. **Alias-to-Canonical Projection**: Handles non-canonical tag variants and projects them to e621 canonical forms
2. **Head-Noun Expansion**: Automatically extracts head nouns from multi-word phrases (e.g., "big shirt" → also search "shirt")
3. **Dual Scoring**: FastText semantic similarity + TF-IDF/SVD context similarity with weighted fusion
4. **Must-Include Rules**: Ensures exact phrase matches appear in results even if scores are lower
5. **Context Imputation**: Handles missing context scores with per-phrase 10th percentile fallback
6. **Chunked Map-Union**: Scalable LLM selection for large candidate sets without LLM reduce overhead
7. **Tag Implications**: Automatic hierarchical tag expansion for complete tagging
8. **Categorized Suggestions**: TF-IDF-ranked suggestions organized by e621 checklist categories

---

## Next Steps / Known Work Areas

Based on recent commits, the project is actively working on:

1. **Tag Categorization Pipeline**: Providing structured, category-based tag suggestions to users
2. **Evaluation Metrics**: Comprehensive per-category and ranking metrics to measure system quality
3. **Structural Inference**: Improving clothing and body-state inference from selected tags
4. **Ground Truth Quality**: Expanding and improving evaluation datasets with proper implication handling
5. **Production Deployment**: Optimizing for Hugging Face Spaces deployment (binary file handling, logging, etc.)

---

## Dependencies

Core packages (requirements.txt):

- gradio==4.44.1 / gradio-client==1.3.0 - Web interface
- hnswlib==0.8.0 - Fast nearest neighbor search
- numpy==1.25.1 - Numerical operations
- scikit-learn==1.4.1.post1 - ML utilities (TF-IDF, SVD)
- h5py==3.8.0 - HDF5 file handling
- compress-fasttext - Compressed FastText embeddings
- lark-parser - Grammar parsing (for prompt parsing)
- scipy==1.12.0 - Scientific computing
- gensim==4.3.2 - Word embeddings
- huggingface_hub<1.0 - Dataset/model hosting
- rapidfuzz>=3.0 - Fast string matching

---

## Summary of Session Work

In this session (and recent sessions based on git history), we have:

1. ✅ **Built tag categorization infrastructure** based on e621 checklist
2. ✅ **Created category parser** with tier and constraint support
3. ✅ **Implemented TF-IDF-based categorized suggestions**
4. ✅ **Added comprehensive evaluation metrics** (per-category P/R/F1, ranking metrics)
5. ✅ **Fixed multi-select constraint handling** for body_type, species, gender
6. ✅ **Improved structural inference system** with group-based wiki data approach
7. ✅ **Enhanced evaluation pipeline** with parallel processing and implication expansion
8. ✅ **Added diagnostic and analysis tools** for debugging and quality assessment
9. ✅ **Cleaned up binary files** and moved to proper XET storage on Hugging Face

The project now has a full three-stage pipeline, comprehensive evaluation infrastructure, and category-based tag organization aligned with e621's tagging best practices.
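
To illustrate the dual scoring, context imputation, and must-include rules listed under Key Technical Innovations, here is a minimal sketch in plain Python. The function names, the equal fusion weights, the linear-interpolated percentile, and the choice to append must-include tags beyond K are all assumptions made for illustration; they are not taken from `psq_rag.retrieval.psq_retrieval`, and `docs/retrieval_contract.md` remains the authoritative description of Stage 2 behavior.

```python
from __future__ import annotations

import math


def percentile(values: list[float], q: float) -> float:
    """Linear-interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def fuse_scores(
    fasttext: dict[str, float],
    context: dict[str, float],
    w_fasttext: float = 0.5,  # hypothetical weight, not from the project
    w_context: float = 0.5,   # hypothetical weight, not from the project
) -> dict[str, float]:
    """Weighted fusion of FastText and context scores for one phrase.

    Candidates with no context score receive the 10th percentile of the
    context scores that are known for this phrase (0.0 if none are known).
    """
    known = [context[tag] for tag in fasttext if tag in context]
    fallback = percentile(known, 10) if known else 0.0
    return {
        tag: w_fasttext * ft + w_context * context.get(tag, fallback)
        for tag, ft in fasttext.items()
    }


def top_k_with_must_include(
    fused: dict[str, float], k: int, must_include: set[str]
) -> list[str]:
    """Keep the top-k tags by fused score, then append must-include tags.

    Appending (rather than displacing a top-k tag) is an assumption here.
    """
    ranked = sorted(fused, key=lambda tag: (-fused[tag], tag))
    kept = ranked[:k]
    kept.extend(tag for tag in ranked[k:] if tag in must_include)
    return kept


# Hypothetical scores for the phrase "big shirt".
fasttext_scores = {"shirt": 0.9, "t-shirt": 0.8, "clothing": 0.7, "big_shirt": 0.4}
context_scores = {"shirt": 0.6, "t-shirt": 0.5, "clothing": 0.2}

fused = fuse_scores(fasttext_scores, context_scores)
kept = top_k_with_must_include(fused, k=2, must_include={"big_shirt"})
print(kept)
```

In this made-up example `big_shirt` has no context score, so it is imputed from the other candidates of the same phrase, and it survives truncation only because it is marked as must-include.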