Prompt_Squirrel_RAG / PROJECT_SUMMARY.md

Food Desert

Add eval audit tools, caption-evident set, and logging

73f56cf 4 months ago

Prompt Squirrel RAG - Project Summary

Overview

Prompt Squirrel is a RAG (Retrieval-Augmented Generation) system that transforms natural-language prompts into structured e621-style tags for furry art image generation. A multi-stage pipeline combines FastText embeddings, TF-IDF similarity, SVD dimensionality reduction, and LLM-based selection to convert user prompts into canonical, well-formed tag sets.

Key Metadata

Platform: Hugging Face Gradio application
License: Apache 2.0
Python Version: 3.10.12
Gradio SDK: 5.43.1
Main Application: app.py

System Architecture

The system implements a three-stage pipeline, plus an optional structural-inference step (Stage 3s), with strict contracts between stages:

Stage 1: Query Rewriting (LLM-based)

Purpose: Transform natural language into "tag-shaped" comma-separated phrases
Input: User's natural language prompt
Output: List of normalized phrases (not canonical tags yet)
Module: psq_rag.llm.rewrite

Stage 2: Retrieval Grounding / Candidate Generation (Closed Vocabulary)

Purpose: Generate high-recall candidate pool from a closed vocabulary of canonical e621 tags
Key Operations:
- Phrase normalization and head-noun expansion
- FastText neighbor retrieval with alias-to-canonical projection
- TF-IDF/SVD context similarity scoring
- Score fusion (FastText + context weighted)
- Per-phrase top-K truncation with must-include rules
- Global candidate pool merging
Contract: docs/retrieval_contract.md
Module: psq_rag.retrieval.psq_retrieval
State Management: psq_rag.retrieval.state
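
The fusion, truncation, and merging steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in psq_rag.retrieval.psq_retrieval: the function names, the 0.7/0.3 weights, and the choice to append must-include tags after the top-K cut are all assumptions made for the example, and the authoritative behavior is whatever docs/retrieval_contract.md specifies.

```python
from typing import Dict, List, Set


def fuse_scores(
    fasttext: Dict[str, float],
    context: Dict[str, float],
    w_fasttext: float = 0.7,  # hypothetical weight, not taken from the project
    w_context: float = 0.3,   # hypothetical weight, not taken from the project
) -> Dict[str, float]:
    """Weighted sum of FastText and context scores per candidate tag.

    Tags without a context score get 0.0 in this sketch (a simplification).
    """
    return {
        tag: w_fasttext * score + w_context * context.get(tag, 0.0)
        for tag, score in fasttext.items()
    }


def top_k_with_must_include(
    fused: Dict[str, float], k: int, must_include: Set[str]
) -> List[str]:
    """Keep the k best tags, then append must-include tags that were cut."""
    ranked = sorted(fused, key=lambda t: (-fused[t], t))  # ties broken by name
    kept = ranked[:k]
    kept += [t for t in ranked[k:] if t in must_include]
    return kept


def merge_pools(per_phrase: List[List[str]]) -> List[str]:
    """Order-preserving union of the per-phrase candidate lists."""
    seen: Set[str] = set()
    merged: List[str] = []
    for pool in per_phrase:
        for tag in pool:
            if tag not in seen:
                seen.add(tag)
                merged.append(tag)
    return merged
```

With k=2 and "big_shirt" marked as must-include, a low-scoring exact match such as "big_shirt" still reaches the pool, which is the property the must-include rule is meant to guarantee.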

Stage 3: Closed-Set Selection (LLM-based)

Purpose: Select final tags from Stage 2 candidates
Modes:
- Single-shot (one LLM call for all candidates)
- Chunked map-union (parallel processing with no LLM reduce)
Contract: docs/stage3_contract.md
Module: psq_rag.llm.select
Output: Selected tags with optional rationale codes (explicit, strong_implied, weak_implied, style_or_meta, other)
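
As a rough illustration of the chunked map-union mode, the sketch below splits the candidate list into chunks, runs a selector on each chunk in parallel, and unions the results without a final LLM reduce call. The names select_map_union and select_chunk, the chunk size, and the thread pool are assumptions for the example; select_chunk stands in for whatever LLM call psq_rag.llm.select actually makes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def select_map_union(
    candidates: Sequence[str],
    select_chunk: Callable[[List[str]], List[str]],  # stand-in for one LLM call
    chunk_size: int = 50,   # hypothetical default
    max_workers: int = 4,   # hypothetical default
) -> List[str]:
    """Run the selector on each chunk in parallel and union the results."""
    chunks = [
        list(candidates[i:i + chunk_size])
        for i in range(0, len(candidates), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in chunk order, so the union is deterministic.
        results = list(pool.map(select_chunk, chunks))

    allowed = set(candidates)  # closed set: discard anything not offered
    seen = set()
    selected: List[str] = []
    for chunk_result in results:
        for tag in chunk_result:
            if tag in allowed and tag not in seen:
                seen.add(tag)
                selected.append(tag)
    return selected
```

Filtering against the original candidate list keeps the selection closed-set even if a selector call returns a tag it was never shown.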

Stage 3s: Structural Tag Inference (Optional)

Purpose: Infer structural/implied tags from selected tags
Example: clothing → topless/bottomless based on what clothing is present
Implementation: Group-based system using wiki data
Module: psq_rag.llm.select.llm_infer_structural_tags

Recent Development Work

Tag Categorization Pipeline (Latest Work - Feb 13-14, 2026)

Implementation of a categorized tag suggestion system based on the e621 tagging checklist:

Category Parser (psq_rag/tagging/category_parser.py):
- Parses e621 tagging checklist into structured categories
- Supports category tiers: CRITICAL, IMPORTANT, NICE_TO_HAVE, META
- Constraint types: exactly_one, multi, multi_or_none
- Categories: body_type, species, gender, clothing, location, perspective, posture, etc.
Categorized Suggestions (psq_rag/tagging/categorized_suggestions.py):
- Generates TF-IDF similarity-ranked suggestions per category
- Identifies already-selected tags by category
- Organizes suggestions for guided user tagging
Evaluation Infrastructure:
- eval_pipeline.py: Core evaluation script with parallel processing
- eval_categorized.py: Per-category metrics (precision, recall, F1)
- Support for ranking metrics (MRR, Precision@K, nDCG)
- Ground truth annotation expansion via tag implications
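
For reference, the ranking metrics named above have standard definitions that can be sketched as below. This uses binary relevance (a tag is either in the ground truth or not) and hypothetical function names; it is not the code in eval_categorized.py, which may differ in details such as how ties or empty ground truth are handled.

```python
import math
from typing import List, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant tag, or 0.0 if none is found."""
    for rank, tag in enumerate(ranked, start=1):
        if tag in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], truths: List[Set[str]]) -> float:
    """Average reciprocal rank over several samples."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, t) for r, t in zip(runs, truths)) / len(runs)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k suggestions that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for tag in ranked[:k] if tag in relevant) / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, tag in enumerate(ranked[:k], start=1)
        if tag in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```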

Structural Inference System (Feb 12-13, 2026)

Redesigned as group-based system using wiki definitions
Extracted tag groups and wiki data from e621
Added diagnostic scripts for clothing inference
Improved topless/bottomless definitions to prevent confusion
Fixed Windows encoding issues

Tag Implication System (Feb 10-11, 2026)

Integrated tag_implications-2023-07-20.csv
Automatic tag expansion (e.g., fox → canine → canid → mammal)
Expanded ground truth annotations for evaluation
Leaf-only metrics to avoid penalizing implied tags
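
The expansion and leaf-only ideas can be sketched as a transitive closure over an implication map. The dictionary below is a hand-written stand-in for the fox example above; how the project actually loads tag_implications-2023-07-20.csv (column names, filtering) is not shown in this summary, so the loader is left out and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def expand_implications(tags: Iterable[str], implies: Dict[str, Set[str]]) -> Set[str]:
    """Return the tags plus everything they imply, directly or transitively."""
    expanded: Set[str] = set()
    stack = list(tags)
    while stack:
        tag = stack.pop()
        if tag in expanded:
            continue  # already visited; also guards against cycles
        expanded.add(tag)
        stack.extend(implies.get(tag, ()))
    return expanded


def leaf_tags(tags: Iterable[str], implies: Dict[str, Set[str]]) -> Set[str]:
    """Keep only tags that no other tag in the set implies."""
    tag_set = set(tags)
    implied_by_others: Set[str] = set()
    for tag in tag_set:
        implied_by_others |= expand_implications([tag], implies) - {tag}
    return tag_set - implied_by_others


# Illustrative implication chain taken from the example in the text.
IMPLIES = {"fox": {"canine"}, "canine": {"canid"}, "canid": {"mammal"}}
```

Scoring on leaf_tags of the expanded ground truth means a prediction of "fox" is not penalized for omitting "canine", "canid", and "mammal".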

Evaluation Enhancements (Feb 10-14, 2026)

Added --min-why threshold filtering (explicit, strong_implied, weak_implied)
Per-tag evidence tracking
Compact eval output format
Retrieval gap analysis scripts
Multiple eval runs with different configurations
Stored eval results in data/eval_results/
Added per-phrase retrieval cap flag: --per-phrase-final-k
Added Stage 3 selection score/rank logging for post-hoc threshold analysis
Added score/global-rank/phrase-rank grid analysis script
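
One plausible reading of the --min-why threshold is an ordered cutoff over the Stage 3 rationale codes, sketched below. The ordering (explicit strongest, weak_implied weakest) follows the order in which the codes are listed; dropping tags whose code is outside that ordering, such as style_or_meta or other, is an assumption of this sketch and may not match eval_pipeline.py.

```python
from typing import Dict, List

# Strongest evidence first; the cutoff keeps everything at or above min_why.
WHY_ORDER = ["explicit", "strong_implied", "weak_implied"]


def filter_by_min_why(selected: Dict[str, str], min_why: str) -> List[str]:
    """Keep tags whose rationale code is at least as strong as min_why.

    `selected` maps tag -> rationale code. Codes not in WHY_ORDER are dropped
    here, which is an assumption made for this sketch.
    """
    cutoff = WHY_ORDER.index(min_why)  # raises ValueError for unknown thresholds
    return [
        tag
        for tag, why in selected.items()
        if why in WHY_ORDER and WHY_ORDER.index(why) <= cutoff
    ]
```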

Code Quality Improvements

Removed binary PNG files (migrated to Hugging Face XET storage)
Fixed eval_categorized.py compatibility with eval_pipeline.py output
Enhanced diagnostic and analysis scripts
Ensured tagging checklist loads from repo root if present
Forced UTF-8 stdout/stderr in eval pipeline to avoid Windows encoding crashes
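
A common way to force UTF-8 on the standard streams, which may or may not be exactly what the eval pipeline does, is to reconfigure them at startup. The helper name below is hypothetical; TextIOWrapper.reconfigure is available from Python 3.7 onward.

```python
import sys


def force_utf8_stdio() -> None:
    """Switch stdout/stderr to UTF-8 so non-ASCII tags do not crash on Windows consoles."""
    for stream in (sys.stdout, sys.stderr):
        # Redirected or wrapped streams may not offer reconfigure(); skip those.
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(encoding="utf-8", errors="replace")
```

Calling this once at the top of a script's main entry point is usually enough; errors="replace" trades exact output for never raising UnicodeEncodeError.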

Key Data Files

Artifacts (loaded lazily)

FastText embeddings: Compressed format for semantic similarity
TF-IDF vectors + SVD: For context-based tag similarity
Alias mappings: Non-canonical → canonical tag projection
Tag counts: Frequency information from corpus
Tag implications: Hierarchical tag relationships (e.g., species → family)
Tag groups (data/tag_groups.json): Structured tag families for inference
Tag wiki definitions (data/tag_wiki_defs.json): e621 wiki data for tags

Configuration

tagging_checklist.txt: e621 tagging guidelines and categories
word_rating_probabilities.csv: NSFW tag classification (threshold 0.95)
fluffyrock_3m.csv: Large tag corpus dataset
SamplePrompts.csv: Test prompts for development
TagDocumentation.txt: e621 tag documentation
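
The 0.95 threshold on word_rating_probabilities.csv suggests a simple probability cutoff, sketched here. The column layout of the CSV is not described in this summary, so the example takes an already-loaded tag-to-probability mapping; the function name, the inclusive comparison, and treating unknown tags as safe are all assumptions.

```python
from typing import Dict, Iterable, List


def filter_nsfw(
    tags: Iterable[str],
    nsfw_prob: Dict[str, float],   # tag -> probability, loaded elsewhere
    allow_nsfw_tags: bool = False,
    threshold: float = 0.95,
) -> List[str]:
    """Drop tags whose NSFW probability meets the threshold, unless allowed."""
    if allow_nsfw_tags:
        return list(tags)
    # Tags missing from the table are kept (assumption of this sketch).
    return [t for t in tags if nsfw_prob.get(t, 0.0) < threshold]
```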

Evaluation

data/eval_samples/: Test images with ground truth annotations
data/eval_results/: Stored evaluation results (JSONL format)
eval_analysis.txt: Latest per-category performance metrics
data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl: Caption-evident GT subset (10 samples) for retrieval-ceiling audits

Development Practices (from AGENTS.md)

Environment

OS: Windows / PowerShell (though currently running on Linux)
Python: Always use .venv\Scripts\python.exe (Windows) or venv equivalent
Working Directory: Always run from repo root

Code Discipline

Keep diffs small: one issue or one focused step per patch
Do not rewrite large files
Do not move logic across modules unless contract requires it
Preserve stage boundaries: rewriting (LLM) vs retrieval (candidate generation) vs selection (index-only)

Contracts

Follow contracts in docs/ directory
If the code's behavior conflicts with a contract, update the code to match the contract (not vice versa)
Contracts define deterministic behavior; do not add feature flags for old/new paths

Testing & Evaluation

Scripts

scripts/eval_pipeline.py: Main evaluation harness
- Parallel processing support
- Multiple min_why thresholds
- Ground truth comparison with implications expansion
- --per-phrase-final-k retrieval cap control
- Logs stage3_selected_scores, stage3_selected_ranks, stage3_selected_phrase_ranks
scripts/eval_categorized.py: Per-category evaluation
- Precision, recall, F1 per category
- Constraint validation (exactly_one, multi, etc.)
- Tier-based aggregation (CRITICAL, IMPORTANT, etc.)
scripts/analyze_compact_eval.py: Compact evaluation analysis
scripts/analyze_retrieval_gaps.py: Retrieval gap identification
scripts/analyze_threshold_grid.py: Post-hoc threshold grids (score/global rank/phrase rank)
scripts/analyze_caption_evident_audit.py: Caption-evident audit vs retrieval (optional implication expansion)
scripts/diagnose_structural_clothing.py: Clothing inference diagnostics
scripts/extract_wiki_data.py: e621 wiki data extraction
scripts/smoke_test.py: Quick pipeline validation
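
The constraint validation listed for scripts/eval_categorized.py could look roughly like the sketch below. The semantics are inferred from the constraint names alone (exactly_one means one tag, multi means at least one, multi_or_none means any number including zero) and should be read as an assumption, as should the function names.

```python
from typing import Dict, List, Sequence


def check_constraint(constraint: str, tags: Sequence[str]) -> bool:
    """Return True if the tags chosen for one category satisfy its constraint."""
    count = len(set(tags))
    if constraint == "exactly_one":
        return count == 1
    if constraint == "multi":
        return count >= 1
    if constraint == "multi_or_none":
        return True
    raise ValueError(f"unknown constraint type: {constraint!r}")


def find_violations(
    constraints: Dict[str, str],           # category -> constraint type
    selected: Dict[str, Sequence[str]],    # category -> chosen tags
) -> List[str]:
    """List the categories whose selected tags break their constraint."""
    return [
        category
        for category, constraint in constraints.items()
        if not check_constraint(constraint, selected.get(category, []))
    ]
```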

Sample Scripts

scripts/rewrite_playground.py: Stage 1 testing
scripts/stage3_debug.py: Stage 3 debugging
scripts/test_categorized_suggestions.py: Category suggestion testing
scripts/test_parser_only.py: Parser validation

Gradio Application (app.py)

Features

Image upload support
Natural language prompt input
Multi-stage pipeline execution
Verbose retrieval reporting (optional)
NSFW tag filtering (configurable)
Final prompt composition with deduplication
Mascot branding (🐿️ squirrel)

Configuration

allow_nsfw_tags: NSFW content filtering
verbose_retrieval: Debug output for Stage 2
verbose_retrieval_limit: Max candidates to display (20)
Logging: Controlled via PSQ_LOG_LEVEL environment variable
Gradio analytics disabled to avoid threading errors
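
Reading the log level from PSQ_LOG_LEVEL can be done with the standard logging module along these lines. The logger name "psq_rag", the INFO default, and the fallback for unrecognized values are assumptions of this sketch, not details confirmed by app.py.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Set the package logger level from PSQ_LOG_LEVEL and return the level used."""
    name = os.environ.get("PSQ_LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unrecognized value: fall back quietly
    logging.getLogger("psq_rag").setLevel(level)  # hypothetical logger name
    return level
```

Setting PSQ_LOG_LEVEL=DEBUG before launching the app would then raise verbosity without a code change.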

Key Technical Innovations

Alias-to-Canonical Projection: Handles non-canonical tag variants and projects them to e621 canonical forms
Head-Noun Expansion: Automatically extracts head nouns from multi-word phrases (e.g., "big shirt" → also search "shirt")
Dual Scoring: FastText semantic similarity + TF-IDF/SVD context similarity with weighted fusion
Must-Include Rules: Ensures exact phrase matches appear in results even if scores are lower
Context Imputation: Handles missing context scores with per-phrase 10th percentile fallback
Chunked Map-Union: Scalable LLM selection for large candidate sets without LLM reduce overhead
Tag Implications: Automatic hierarchical tag expansion for complete tagging
Categorized Suggestions: TF-IDF-ranked suggestions organized by e621 checklist categories
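
To make the context imputation item concrete, the sketch below fills in missing context scores for one phrase with the 10th percentile of the scores that are present. The linear-interpolation percentile, the 0.0 fallback when no score is available at all, and the function names are assumptions; the retrieval contract is the authority on the exact rule.

```python
import math
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def impute_context(candidates: List[str], context: Dict[str, float]) -> Dict[str, float]:
    """Give every candidate of one phrase a context score.

    Candidates without a score receive the 10th percentile of the observed
    scores for the same phrase, so they rank low but are not zeroed out.
    """
    observed = [context[t] for t in candidates if t in context]
    fallback = percentile(observed, 10) if observed else 0.0
    return {t: context.get(t, fallback) for t in candidates}
```

Using a low percentile rather than zero keeps a candidate with a strong FastText score competitive after fusion even when its context score is unavailable.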

Next Steps / Known Work Areas

Based on recent commits, the project is actively working on:

Tag Categorization Pipeline: Providing structured, category-based tag suggestions to users
Evaluation Metrics: Comprehensive per-category and ranking metrics to measure system quality
Structural Inference: Improving clothing and body-state inference from selected tags
Ground Truth Quality: Expanding and improving evaluation datasets with proper implication handling
Production Deployment: Optimizing for Hugging Face Spaces deployment (binary file handling, logging, etc.)

Dependencies

Core packages (requirements.txt):

gradio==4.44.1 / gradio-client==1.3.0 - Web interface
hnswlib==0.8.0 - Fast nearest neighbor search
numpy==1.25.1 - Numerical operations
scikit-learn==1.4.1.post1 - ML utilities (TF-IDF, SVD)
h5py==3.8.0 - HDF5 file handling
compress-fasttext - Compressed FastText embeddings
lark-parser - Grammar parsing (for prompt parsing)
scipy==1.12.0 - Scientific computing
gensim==4.3.2 - Word embeddings
huggingface_hub<1.0 - Dataset/model hosting
rapidfuzz>=3.0 - Fast string matching

Summary of Session Work

In this session (and recent sessions based on git history), we have:

✅ Built tag categorization infrastructure based on e621 checklist
✅ Created category parser with tier and constraint support
✅ Implemented TF-IDF-based categorized suggestions
✅ Added comprehensive evaluation metrics (per-category P/R/F1, ranking metrics)
✅ Fixed multi-select constraint handling for body_type, species, gender
✅ Improved structural inference system with group-based wiki data approach
✅ Enhanced evaluation pipeline with parallel processing and implication expansion
✅ Added diagnostic and analysis tools for debugging and quality assessment
✅ Cleaned up binary files and moved to proper XET storage on Hugging Face

The project now has a full three-stage pipeline, evaluation infrastructure covering per-category and ranking metrics, and category-based tag organization aligned with e621's tagging checklist.