Prompt Squirrel RAG - Project Summary
Overview
Prompt Squirrel is an advanced RAG (Retrieval-Augmented Generation) system designed to transform natural language prompts into structured e621-style tags for furry art image generation. The system uses a multi-stage pipeline combining FastText embeddings, TF-IDF similarity, SVD dimensionality reduction, and LLM-based selection to convert user prompts into canonical, well-formed tag sets.
Key Metadata
- Platform: Hugging Face Gradio application
- License: Apache 2.0
- Python Version: 3.10.12
- Gradio SDK: 5.43.1
- Main Application: app.py
System Architecture
The system implements a three-stage pipeline with strict contracts between stages:
Stage 1: Query Rewriting (LLM-based)
- Purpose: Transform natural language into "tag-shaped" comma-separated phrases
- Input: User's natural language prompt
- Output: List of normalized phrases (not canonical tags yet)
- Module: psq_rag.llm.rewrite
Stage 2: Retrieval Grounding / Candidate Generation (Closed Vocabulary)
- Purpose: Generate high-recall candidate pool from a closed vocabulary of canonical e621 tags
- Key Operations:
- Phrase normalization and head-noun expansion
- FastText neighbor retrieval with alias-to-canonical projection
- TF-IDF/SVD context similarity scoring
- Score fusion (FastText + context weighted)
- Per-phrase top-K truncation with must-include rules
- Global candidate pool merging
- Contract: docs/retrieval_contract.md
- Module: psq_rag.retrieval.psq_retrieval
- State Management: psq_rag.retrieval.state
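The fusion, truncation, and merging steps listed for Stage 2 can be pictured with a minimal sketch. Everything named here (Candidate, fuse, the equal 0.5/0.5 weights, and the choice to append must-include tags beyond K instead of displacing others) is an assumption for illustration; the authoritative behavior is whatever docs/retrieval_contract.md specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one retrieved tag."""
    tag: str
    fasttext_score: float
    context_score: float


def fuse(c, w_fasttext=0.5, w_context=0.5):
    """Weighted sum of the two similarity scores (weights are assumed)."""
    return w_fasttext * c.fasttext_score + w_context * c.context_score


def top_k_with_must_include(candidates, k, must_include=frozenset()):
    """Keep the k best fused scores, then append any must-include tags that were cut."""
    ranked = sorted(candidates, key=fuse, reverse=True)
    kept = ranked[:k]
    kept_tags = {c.tag for c in kept}
    for c in ranked[k:]:
        # Must-include tags survive truncation even when their fused score is low.
        if c.tag in must_include and c.tag not in kept_tags:
            kept.append(c)
            kept_tags.add(c.tag)
    return [c.tag for c in kept]


def merge_pools(per_phrase_tags):
    """Union of the per-phrase lists, preserving first-seen order."""
    merged = {}
    for tags in per_phrase_tags:
        for tag in tags:
            merged.setdefault(tag, None)
    return list(merged)


phrase_candidates = [
    Candidate("shirt", 0.9, 0.8),
    Candidate("t-shirt", 0.8, 0.7),
    Candidate("big_shirt", 0.2, 0.1),
]
per_phrase = top_k_with_must_include(phrase_candidates, k=2, must_include={"big_shirt"})
pool = merge_pools([per_phrase, ["shirt", "clothing"]])
```

Keeping truncation per phrase and merging afterwards means one phrase with many strong neighbors cannot crowd out candidates for another phrase.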
Stage 3: Closed-Set Selection (LLM-based)
- Purpose: Select final tags from Stage 2 candidates
- Modes:
- Single-shot (one LLM call for all candidates)
- Chunked map-union (parallel processing with no LLM reduce)
- Contract: docs/stage3_contract.md
- Module: psq_rag.llm.select
- Output: Selected tags with optional rationale codes (explicit, strong_implied, weak_implied, style_or_meta, other)
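As a rough illustration of the chunked map-union mode, the sketch below splits the candidate pool into chunks, runs a selector on each chunk in parallel, and unions the results with no reduce call. The function names, the chunk size, and the stub selector standing in for the LLM call are all hypothetical; the real interface lives in psq_rag.llm.select and docs/stage3_contract.md.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def select_map_union(candidates, select_chunk, chunk_size=50, max_workers=4):
    """Run `select_chunk` on each chunk in parallel and union the picks."""
    chunks = chunked(candidates, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in chunk order, so output is deterministic.
        per_chunk = list(pool.map(select_chunk, chunks))
    allowed = set(candidates)
    selected, seen = [], set()
    for picks in per_chunk:
        for tag in picks:
            # Closed-set guard: drop anything not in the candidate pool.
            if tag in allowed and tag not in seen:
                seen.add(tag)
                selected.append(tag)
    return selected


def stub_selector(chunk):
    """Stand-in for an LLM call: keeps tags starting with 'f', plus one invalid tag."""
    return [t for t in chunk if t.startswith("f")] + ["not_a_candidate"]


result = select_map_union(["fox", "cat", "fur", "dog", "feral"], stub_selector, chunk_size=2)
```

Because the union is a plain set operation, adding chunks adds only selector calls; there is no final LLM pass whose prompt grows with the pool.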
Stage 3s: Structural Tag Inference (Optional)
- Purpose: Infer structural/implied tags from selected tags
- Example: clothing → topless/bottomless based on what clothing is present
- Implementation: Group-based system using wiki data
- Module: psq_rag.llm.select.llm_infer_structural_tags
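The clothing example can be made concrete with a small deterministic sketch. The actual module is LLM-based and driven by wiki-derived groups, so this is not that implementation: the group names, their members, and the rule that a structural tag is inferred only when some clothing is present are assumptions chosen to show the idea.

```python
# Hypothetical tag groups; the real ones come from extracted e621 group/wiki data.
TAG_GROUPS = {
    "topwear": {"shirt", "jacket", "hoodie"},
    "bottomwear": {"pants", "shorts", "skirt"},
}


def infer_structural_tags(selected):
    """Infer topless/bottomless from which clothing groups are represented."""
    has_top = bool(selected & TAG_GROUPS["topwear"])
    has_bottom = bool(selected & TAG_GROUPS["bottomwear"])
    inferred = set()
    # Assumed rule: only infer when at least one garment is present, so a
    # prompt that never mentions clothing yields no structural tags.
    if has_bottom and not has_top:
        inferred.add("topless")
    if has_top and not has_bottom:
        inferred.add("bottomless")
    return inferred


example = infer_structural_tags({"fox", "pants"})
```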
Recent Development Work
Tag Categorization Pipeline (Latest Work - Feb 13-14, 2026)
Implementation of a categorized tag suggestion system based on the e621 tagging checklist:
Category Parser (psq_rag/tagging/category_parser.py):
- Parses the e621 tagging checklist into structured categories
- Supports category tiers: CRITICAL, IMPORTANT, NICE_TO_HAVE, META
- Constraint types: exactly_one, multi, multi_or_none
- Categories: body_type, species, gender, clothing, location, perspective, posture, etc.
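One way to model the tiers and constraint types is sketched below. The class names and the constraint semantics (exactly_one means exactly one match, multi means at least one, multi_or_none means any number including zero) are assumptions, as are the tier and constraint assigned to each example category; the parser in category_parser.py may represent them differently.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRITICAL = "CRITICAL"
    IMPORTANT = "IMPORTANT"
    NICE_TO_HAVE = "NICE_TO_HAVE"
    META = "META"


class Constraint(Enum):
    EXACTLY_ONE = "exactly_one"
    MULTI = "multi"
    MULTI_OR_NONE = "multi_or_none"


@dataclass(frozen=True)
class Category:
    name: str
    tier: Tier
    constraint: Constraint
    tags: frozenset


def satisfies(category, selected):
    """Check whether the selected tags meet the category's constraint."""
    matches = len(category.tags & set(selected))
    if category.constraint is Constraint.EXACTLY_ONE:
        return matches == 1
    if category.constraint is Constraint.MULTI:
        return matches >= 1
    return True  # MULTI_OR_NONE accepts zero or more matches


# Illustrative categories only; tiers and constraints here are not from the checklist.
species = Category("species", Tier.CRITICAL, Constraint.MULTI, frozenset({"fox", "wolf"}))
demo = Category("demo_exactly_one", Tier.IMPORTANT, Constraint.EXACTLY_ONE, frozenset({"a", "b"}))
```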
Categorized Suggestions (psq_rag/tagging/categorized_suggestions.py):
- Generates TF-IDF similarity-ranked suggestions per category
- Identifies already-selected tags by category
- Organizes suggestions for guided user tagging
Evaluation Infrastructure:
- eval_pipeline.py: Core evaluation script with parallel processing
- eval_categorized.py: Per-category metrics (precision, recall, F1)
- Support for ranking metrics (MRR, Precision@K, nDCG)
- Ground truth annotation expansion via tag implications
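For reference, the ranking metrics named above can be written in a few lines using their standard binary-relevance definitions. This is a sketch, not the code in eval_pipeline.py, and the function names are hypothetical. MRR is the mean of reciprocal_rank over all evaluated samples.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant tag, or 0.0 if none is retrieved."""
    for rank, tag in enumerate(ranked, start=1):
        if tag in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions that hold a relevant tag."""
    if k <= 0:
        return 0.0
    return sum(tag in relevant for tag in ranked[:k]) / k


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, tag in enumerate(ranked[:k], start=1)
        if tag in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(samples):
    """Average reciprocal rank over (ranked, relevant) pairs."""
    if not samples:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in samples) / len(samples)


ranked = ["cat", "fox", "dog", "wolf"]
relevant = {"fox", "wolf"}
rr = reciprocal_rank(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
ndcg = ndcg_at_k(ranked, relevant, 4)
```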
Structural Inference System (Feb 12-13, 2026)
- Redesigned as group-based system using wiki definitions
- Extracted tag groups and wiki data from e621
- Added diagnostic scripts for clothing inference
- Improved topless/bottomless definitions to prevent confusion
- Fixed Windows encoding issues
Tag Implication System (Feb 10-11, 2026)
- Integrated tag_implications-2023-07-20.csv
- Automatic tag expansion (e.g., fox → canine → canid → mammal)
- Expanded ground truth annotations for evaluation
- Leaf-only metrics to avoid penalizing implied tags
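Implication expansion and the leaf-only view can be sketched as a transitive closure over a direct-implication map. The map below is hand-built from the fox example and is not loaded from the CSV; the function names are hypothetical.

```python
def expand(tags, implications):
    """Add every tag reachable by following direct implications transitively."""
    expanded = set(tags)
    stack = list(tags)
    while stack:
        for implied in implications.get(stack.pop(), ()):
            if implied not in expanded:
                expanded.add(implied)
                stack.append(implied)
    return expanded


def leaves(tags, implications):
    """Keep only tags that no other tag in the set implies."""
    tags = set(tags)
    implied_by_others = set()
    for tag in tags:
        implied_by_others |= expand({tag}, implications) - {tag}
    return tags - implied_by_others


# Direct implications only; the chain follows from the closure.
IMPLICATIONS = {"fox": {"canine"}, "canine": {"canid"}, "canid": {"mammal"}}

full = expand({"fox"}, IMPLICATIONS)
leaf_only = leaves({"fox", "canine", "mammal", "solo"}, IMPLICATIONS)
```

Scoring on leaves means a prediction of fox is not counted three extra times for canine, canid, and mammal, and a system that omits those implied tags is not penalized for it.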
Evaluation Enhancements (Feb 10-14, 2026)
- Added --min-why threshold filtering (explicit, strong_implied, weak_implied)
- Per-tag evidence tracking
- Compact eval output format
- Retrieval gap analysis scripts
- Multiple eval runs with different configurations
- Stored eval results in data/eval_results/
- Added per-phrase retrieval cap flag: --per-phrase-final-k
- Added Stage 3 selection score/rank logging for post-hoc threshold analysis
- Added score/global-rank/phrase-rank grid analysis script
Code Quality Improvements
- Removed binary PNG files (migrated to Hugging Face XET storage)
- Fixed eval_categorized.py compatibility with eval_pipeline.py output
- Enhanced diagnostic and analysis scripts
- Ensured tagging checklist loads from repo root if present
- Forced UTF-8 stdout/stderr in eval pipeline to avoid Windows encoding crashes
Key Data Files
Artifacts (loaded lazily)
- FastText embeddings: Compressed format for semantic similarity
- TF-IDF vectors + SVD: For context-based tag similarity
- Alias mappings: Non-canonical → canonical tag projection
- Tag counts: Frequency information from corpus
- Tag implications: Hierarchical tag relationships (e.g., species → family)
- Tag groups (data/tag_groups.json): Structured tag families for inference
- Tag wiki definitions (data/tag_wiki_defs.json): E621 wiki data for tags
Configuration
- tagging_checklist.txt: E621 tagging guidelines and categories
- word_rating_probabilities.csv: NSFW tag classification (threshold 0.95)
- fluffyrock_3m.csv: Large tag corpus dataset
- SamplePrompts.csv: Test prompts for development
- TagDocumentation.txt: E621 tag documentation
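The 0.95 NSFW threshold could be applied along the following lines. The column layout of word_rating_probabilities.csv is not described here, so the sketch uses an in-memory mapping with made-up probabilities; treating a score at or above the threshold as NSFW and treating unknown tags as safe are both assumptions.

```python
NSFW_THRESHOLD = 0.95


def filter_tags(tags, nsfw_probability, allow_nsfw_tags=False, threshold=NSFW_THRESHOLD):
    """Drop tags whose NSFW probability reaches the threshold, unless NSFW is allowed."""
    if allow_nsfw_tags:
        return list(tags)
    # Assumption: tags missing from the table default to probability 0.0 (kept).
    return [t for t in tags if nsfw_probability.get(t, 0.0) < threshold]


# Made-up probabilities for illustration only.
probabilities = {"fox": 0.02, "example_nsfw_tag": 0.99, "borderline_tag": 0.94}
safe = filter_tags(["fox", "example_nsfw_tag", "borderline_tag", "unknown"], probabilities)
```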
Evaluation
- data/eval_samples/: Test images with ground truth annotations
- data/eval_results/: Stored evaluation results (JSONL format)
- eval_analysis.txt: Latest per-category performance metrics
- data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl: Caption-evident GT subset (10 samples) for retrieval-ceiling audits
Development Practices (from AGENTS.md)
Environment
- OS: Windows / PowerShell (though currently running on Linux)
- Python: Always use .venv\Scripts\python.exe (Windows) or venv equivalent
- Working Directory: Always run from repo root
Code Discipline
- Keep diffs small: one issue or one focused step per patch
- Do not rewrite large files
- Do not move logic across modules unless contract requires it
- Preserve stage boundaries: rewriting (LLM) vs retrieval (candidate generation) vs selection (index-only)
Contracts
- Follow contracts in the docs/ directory
- If behavior conflicts with code, update code to match contract (not vice versa)
- Contracts define deterministic behavior, no feature flags for old/new paths
Testing & Evaluation
Scripts
scripts/eval_pipeline.py: Main evaluation harness
- Parallel processing support
- Multiple min_why thresholds
- Ground truth comparison with implications expansion
- --per-phrase-final-k retrieval cap control
- Logs stage3_selected_scores, stage3_selected_ranks, stage3_selected_phrase_ranks
scripts/eval_categorized.py: Per-category evaluation
- Precision, recall, F1 per category
- Constraint validation (exactly_one, multi, etc.)
- Tier-based aggregation (CRITICAL, IMPORTANT, etc.)
scripts/analyze_compact_eval.py: Compact evaluation analysis
scripts/analyze_retrieval_gaps.py: Retrieval gap identification
scripts/analyze_threshold_grid.py: Post-hoc threshold grids (score/global rank/phrase rank)
scripts/analyze_caption_evident_audit.py: Caption-evident audit vs retrieval (optional implication expansion)
scripts/diagnose_structural_clothing.py: Clothing inference diagnostics
scripts/extract_wiki_data.py: E621 wiki data extraction
scripts/smoke_test.py: Quick pipeline validation
Sample Scripts
- scripts/rewrite_playground.py: Stage 1 testing
- scripts/stage3_debug.py: Stage 3 debugging
- scripts/test_categorized_suggestions.py: Category suggestion testing
- scripts/test_parser_only.py: Parser validation
Gradio Application (app.py)
Features
- Image upload support
- Natural language prompt input
- Multi-stage pipeline execution
- Verbose retrieval reporting (optional)
- NSFW tag filtering (configurable)
- Final prompt composition with deduplication
- Mascot branding (🐿️ squirrel)
Configuration
- allow_nsfw_tags: NSFW content filtering
- verbose_retrieval: Debug output for Stage 2
- verbose_retrieval_limit: Max candidates to display (20)
- Logging: Controlled via PSQ_LOG_LEVEL environment variable
- Gradio analytics disabled to avoid threading errors
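Reading the log level from PSQ_LOG_LEVEL might look like the sketch below. Only the variable name comes from this summary; the INFO default, the acceptance of standard level names in any case, and the fallback for unrecognized values are assumptions.

```python
import logging
import os


def configure_logging(default="INFO"):
    """Set the root logger level from PSQ_LOG_LEVEL, falling back to INFO."""
    name = os.environ.get("PSQ_LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        # Unrecognized value: fall back instead of failing at startup.
        level = logging.INFO
    logging.basicConfig()  # no-op if the root logger already has handlers
    logging.getLogger().setLevel(level)
    return level
```

Running with PSQ_LOG_LEVEL=debug set in the environment would then enable debug output without a code change.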
Key Technical Innovations
Alias-to-Canonical Projection: Handles non-canonical tag variants and projects them to e621 canonical forms
Head-Noun Expansion: Automatically extracts head nouns from multi-word phrases (e.g., "big shirt" → also search "shirt")
Dual Scoring: FastText semantic similarity + TF-IDF/SVD context similarity with weighted fusion
Must-Include Rules: Ensures exact phrase matches appear in results even if scores are lower
Context Imputation: Handles missing context scores with per-phrase 10th percentile fallback
Chunked Map-Union: Scalable LLM selection for large candidate sets without LLM reduce overhead
Tag Implications: Automatic hierarchical tag expansion for complete tagging
Categorized Suggestions: TF-IDF-ranked suggestions organized by e621 checklist categories
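Two of the techniques above, head-noun expansion and context imputation, are small enough to sketch directly. The last-token heuristic for the head noun, the linear-interpolated percentile, and the 0.0 fallback when a phrase has no context scores at all are assumptions; the real rules are defined by the retrieval contract.

```python
def expand_head_noun(phrase):
    """Return the phrase plus its head noun, taken here as the last token."""
    tokens = phrase.split()
    if len(tokens) < 2:
        return [phrase]
    return [phrase, tokens[-1]]


def percentile(values, q):
    """Linear-interpolated percentile of a non-empty list, q in [0, 100]."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty list")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def impute_context(scores, q=10.0):
    """Replace missing (None) context scores with the phrase's q-th percentile."""
    known = [s for s in scores.values() if s is not None]
    if not known:
        return {tag: 0.0 for tag in scores}  # assumed fallback
    fill = percentile(known, q)
    return {tag: (fill if s is None else s) for tag, s in scores.items()}


queries = expand_head_noun("big shirt")
imputed = impute_context({"shirt": 1.0, "t-shirt": 0.0, "big_shirt": None})
```

Filling with a low percentile instead of zero keeps a candidate without context evidence in the running while still ranking it below most candidates that have it.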
Next Steps / Known Work Areas
Based on recent commits, the project is actively working on:
Tag Categorization Pipeline: Providing structured, category-based tag suggestions to users
Evaluation Metrics: Comprehensive per-category and ranking metrics to measure system quality
Structural Inference: Improving clothing and body-state inference from selected tags
Ground Truth Quality: Expanding and improving evaluation datasets with proper implication handling
Production Deployment: Optimizing for Hugging Face Spaces deployment (binary file handling, logging, etc.)
Dependencies
Core packages (requirements.txt):
- gradio==4.44.1 / gradio-client==1.3.0 - Web interface
- hnswlib==0.8.0 - Fast nearest neighbor search
- numpy==1.25.1 - Numerical operations
- scikit-learn==1.4.1.post1 - ML utilities (TF-IDF, SVD)
- h5py==3.8.0 - HDF5 file handling
- compress-fasttext - Compressed FastText embeddings
- lark-parser - Grammar parsing (for prompt parsing)
- scipy==1.12.0 - Scientific computing
- gensim==4.3.2 - Word embeddings
- huggingface_hub<1.0 - Dataset/model hosting
- rapidfuzz>=3.0 - Fast string matching
Summary of Session Work
In this session (and recent sessions based on git history), we have:
- ✅ Built tag categorization infrastructure based on e621 checklist
- ✅ Created category parser with tier and constraint support
- ✅ Implemented TF-IDF-based categorized suggestions
- ✅ Added comprehensive evaluation metrics (per-category P/R/F1, ranking metrics)
- ✅ Fixed multi-select constraint handling for body_type, species, gender
- ✅ Improved structural inference system with group-based wiki data approach
- ✅ Enhanced evaluation pipeline with parallel processing and implication expansion
- ✅ Added diagnostic and analysis tools for debugging and quality assessment
- ✅ Cleaned up binary files and moved to proper XET storage on Hugging Face
The project now has a full three-stage pipeline, evaluation infrastructure covering per-category and ranking metrics, and category-based tag organization aligned with e621's tagging checklist.