
Prompt Squirrel RAG - Project Summary

Overview

Prompt Squirrel is an advanced RAG (Retrieval-Augmented Generation) system designed to transform natural language prompts into structured e621-style tags for furry art image generation. The system uses a multi-stage pipeline combining FastText embeddings, TF-IDF similarity, SVD dimensionality reduction, and LLM-based selection to convert user prompts into canonical, well-formed tag sets.

Key Metadata

  • Platform: Hugging Face Gradio application
  • License: Apache 2.0
  • Python Version: 3.10.12
  • Gradio SDK: 5.43.1
  • Main Application: app.py

System Architecture

The system implements a three-stage pipeline with strict contracts between stages, plus an optional structural-inference step (Stage 3s):

Stage 1: Query Rewriting (LLM-based)

  • Purpose: Transform natural language into "tag-shaped" comma-separated phrases
  • Input: User's natural language prompt
  • Output: List of normalized phrases (not canonical tags yet)
  • Module: psq_rag.llm.rewrite

Stage 2: Retrieval Grounding / Candidate Generation (Closed Vocabulary)

  • Purpose: Generate high-recall candidate pool from a closed vocabulary of canonical e621 tags
  • Key Operations:
    • Phrase normalization and head-noun expansion
    • FastText neighbor retrieval with alias-to-canonical projection
    • TF-IDF/SVD context similarity scoring
    • Score fusion (FastText + context weighted)
    • Per-phrase top-K truncation with must-include rules
    • Global candidate pool merging
  • Contract: docs/retrieval_contract.md
  • Module: psq_rag.retrieval.psq_retrieval
  • State Management: psq_rag.retrieval.state
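
The fusion, truncation, and merging steps listed above can be sketched as follows. This is a minimal illustration, not the code in psq_rag.retrieval.psq_retrieval: the function names, the 0.5/0.5 weights, and the `missing_context` placeholder are all assumptions made for the example, and the authoritative behavior is whatever docs/retrieval_contract.md specifies.

```python
from typing import Dict, List, Set


def fuse_scores(
    fasttext: Dict[str, float],
    context: Dict[str, float],
    w_fasttext: float = 0.5,        # hypothetical weight
    w_context: float = 0.5,         # hypothetical weight
    missing_context: float = 0.0,   # placeholder for tags with no context score
) -> Dict[str, float]:
    """Weighted sum of the two similarity scores for each candidate tag."""
    return {
        tag: w_fasttext * ft + w_context * context.get(tag, missing_context)
        for tag, ft in fasttext.items()
    }


def top_k_with_must_include(
    fused: Dict[str, float], k: int, must_include: Set[str]
) -> List[str]:
    """Keep the k best tags, then append must-include tags that missed the cut."""
    ranked = sorted(fused, key=lambda t: (-fused[t], t))
    return ranked[:k] + [t for t in ranked[k:] if t in must_include]


def merge_pools(per_phrase: Dict[str, List[str]]) -> List[str]:
    """Union the per-phrase lists into one pool, keeping first-seen order."""
    pool: List[str] = []
    seen: Set[str] = set()
    for tags in per_phrase.values():
        for tag in tags:
            if tag not in seen:
                seen.add(tag)
                pool.append(tag)
    return pool


fused = fuse_scores(
    {"shirt": 0.9, "t-shirt": 0.6, "big_shirt": 0.3},
    {"shirt": 0.5, "t-shirt": 0.6},
)
kept = top_k_with_must_include(fused, k=2, must_include={"big_shirt"})
print(kept)  # ['shirt', 't-shirt', 'big_shirt']
```

Appending must-include tags after the cut, rather than reserving slots inside the top K, means the per-phrase list can exceed K; the real contract may resolve this differently.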

Stage 3: Closed-Set Selection (LLM-based)

  • Purpose: Select final tags from Stage 2 candidates
  • Modes:
    • Single-shot (one LLM call for all candidates)
    • Chunked map-union (parallel processing with no LLM reduce)
  • Contract: docs/stage3_contract.md
  • Module: psq_rag.llm.select
  • Output: Selected tags with optional rationale codes (explicit, strong_implied, weak_implied, style_or_meta, other)
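
As a rough sketch of the chunked map-union mode, the example below splits the candidate pool into chunks, runs a selector on each chunk in parallel, and unions the results without a second LLM pass. The selector is a stub standing in for an LLM call, and `select_map_union`, `chunk_size`, and `max_workers` are hypothetical names rather than the actual API of psq_rag.llm.select. The selector returns indices into its chunk, which keeps the selection closed-set: an index that does not point at a candidate is simply dropped.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split the candidate pool into consecutive chunks of at most `size` tags."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def select_map_union(
    candidates: Sequence[str],
    select_chunk: Callable[[List[str]], List[int]],
    chunk_size: int = 50,
    max_workers: int = 4,
) -> List[str]:
    """Map: select per chunk in parallel. Union: merge picks with no reduce call."""
    chunks = chunk(candidates, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        picks_per_chunk = list(pool.map(select_chunk, chunks))

    selected: List[str] = []
    seen = set()
    for tags, picks in zip(chunks, picks_per_chunk):
        for i in picks:
            # Out-of-range indices cannot name a candidate, so they are ignored.
            if 0 <= i < len(tags) and tags[i] not in seen:
                seen.add(tags[i])
                selected.append(tags[i])
    return selected


def stub_selector(tags: List[str]) -> List[int]:
    """Stand-in for the LLM: picks 'fox' and fur tags, plus one invalid index."""
    return [i for i, t in enumerate(tags) if t == "fox" or t.endswith("_fur")] + [99]


result = select_map_union(
    ["fox", "wolf", "red_fur", "blue_fur", "tail"], stub_selector, chunk_size=2
)
print(result)  # ['fox', 'red_fur', 'blue_fur']
```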

Stage 3s: Structural Tag Inference (Optional)

  • Purpose: Infer structural/implied tags from selected tags
  • Example: infer topless/bottomless from which clothing tags are present
  • Implementation: Group-based system using wiki data
  • Module: psq_rag.llm.select.llm_infer_structural_tags

Recent Development Work

Tag Categorization Pipeline (Latest Work - Feb 13-14, 2026)

Implementation of a categorized tag suggestion system based on the e621 tagging checklist:

  1. Category Parser (psq_rag/tagging/category_parser.py):

    • Parses e621 tagging checklist into structured categories
    • Supports category tiers: CRITICAL, IMPORTANT, NICE_TO_HAVE, META
    • Constraint types: exactly_one, multi, multi_or_none
    • Categories: body_type, species, gender, clothing, location, perspective, posture, etc.
  2. Categorized Suggestions (psq_rag/tagging/categorized_suggestions.py):

    • Generates TF-IDF similarity-ranked suggestions per category
    • Identifies already-selected tags by category
    • Organizes suggestions for guided user tagging
  3. Evaluation Infrastructure:

    • eval_pipeline.py: Core evaluation script with parallel processing
    • eval_categorized.py: Per-category metrics (precision, recall, F1)
    • Support for ranking metrics (MRR, Precision@K, nDCG)
    • Ground truth annotation expansion via tag implications
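
To make the tier and constraint vocabulary concrete, here is a small sketch of how a parsed category and its constraint check might look. The `Category` dataclass and `satisfies` function are hypothetical and not taken from category_parser.py. The reading of the constraints (exactly_one means one tag, multi means at least one, multi_or_none means any number including zero) is an assumption inferred from the names, and the tier assignments and tag names in the example are made up.

```python
from dataclasses import dataclass
from typing import List

TIERS = ("CRITICAL", "IMPORTANT", "NICE_TO_HAVE", "META")
CONSTRAINTS = ("exactly_one", "multi", "multi_or_none")


@dataclass(frozen=True)
class Category:
    name: str
    tier: str
    constraint: str

    def __post_init__(self) -> None:
        # Reject values outside the vocabulary described in the summary.
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")
        if self.constraint not in CONSTRAINTS:
            raise ValueError(f"unknown constraint: {self.constraint}")


def satisfies(category: Category, selected: List[str]) -> bool:
    """Check the selected tags for one category against its constraint."""
    n = len(set(selected))
    if category.constraint == "exactly_one":
        return n == 1
    if category.constraint == "multi":
        return n >= 1
    return True  # multi_or_none accepts any count, including zero


species = Category("species", "CRITICAL", "multi")              # tier is illustrative
perspective = Category("perspective", "NICE_TO_HAVE", "exactly_one")
print(satisfies(species, ["fox", "wolf"]))                      # True
print(satisfies(perspective, ["front_view", "side_view"]))      # False
```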

Structural Inference System (Feb 12-13, 2026)

  • Redesigned as group-based system using wiki definitions
  • Extracted tag groups and wiki data from e621
  • Added diagnostic scripts for clothing inference
  • Improved topless/bottomless definitions to prevent confusion
  • Fixed Windows encoding issues

Tag Implication System (Feb 10-11, 2026)

  • Integrated tag_implications-2023-07-20.csv
  • Automatic tag expansion (e.g., fox → canine → canid → mammal)
  • Expanded ground truth annotations for evaluation
  • Leaf-only metrics to avoid penalizing implied tags
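
The expansion and leaf-only ideas can be illustrated with a small in-memory implication map. This sketch does not read tag_implications-2023-07-20.csv (its column layout is not described here), and "leaf-only" is interpreted as "drop any tag that another tag in the same set already implies", which is an assumption about how the metric is defined.

```python
from typing import Dict, Iterable, Set

# Toy implication map (child -> direct parents), mirroring the example above.
IMPLICATIONS: Dict[str, Set[str]] = {
    "fox": {"canine"},
    "canine": {"canid"},
    "canid": {"mammal"},
}


def expand(tags: Iterable[str], implications: Dict[str, Set[str]]) -> Set[str]:
    """Return the tags plus everything they transitively imply."""
    expanded = set(tags)
    stack = list(expanded)
    while stack:
        tag = stack.pop()
        for parent in implications.get(tag, ()):
            if parent not in expanded:
                expanded.add(parent)
                stack.append(parent)
    return expanded


def leaf_only(tags: Iterable[str], implications: Dict[str, Set[str]]) -> Set[str]:
    """Drop tags that are implied by some other tag in the same set."""
    tag_set = set(tags)
    implied: Set[str] = set()
    for tag in tag_set:
        implied |= expand({tag}, implications) - {tag}
    return tag_set - implied


print(sorted(expand({"fox"}, IMPLICATIONS)))
# ['canid', 'canine', 'fox', 'mammal']
print(sorted(leaf_only({"fox", "canine", "mammal", "solo"}, IMPLICATIONS)))
# ['fox', 'solo']
```

Scoring only the leaves means a system that predicts "fox" is not rewarded three extra times for "canine", "canid", and "mammal", and one that predicts only "mammal" is not credited as if it had found the species.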

Evaluation Enhancements (Feb 10-14, 2026)

  • Added --min-why threshold filtering (explicit, strong_implied, weak_implied)
  • Per-tag evidence tracking
  • Compact eval output format
  • Retrieval gap analysis scripts
  • Multiple eval runs with different configurations
  • Stored eval results in data/eval_results/
  • Added per-phrase retrieval cap flag: --per-phrase-final-k
  • Added Stage 3 selection score/rank logging for post-hoc threshold analysis
  • Added score/global-rank/phrase-rank grid analysis script
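
A minimal sketch of what --min-why filtering could look like is shown below. It assumes the three listed codes are ordered from strongest to weakest evidence and that tags carrying any other code (style_or_meta, other) are excluded when a threshold is set; the real eval_pipeline.py may treat those codes differently. The function name and the input shape (a tag-to-code mapping) are hypothetical.

```python
from typing import Dict, List

# Strongest evidence first; order assumed from the flag description.
WHY_ORDER = ["explicit", "strong_implied", "weak_implied"]


def filter_min_why(selected: Dict[str, str], min_why: str) -> List[str]:
    """Keep tags whose rationale code is at least as strong as `min_why`."""
    cutoff = WHY_ORDER.index(min_why)  # raises ValueError for unknown thresholds
    return [
        tag
        for tag, why in selected.items()
        if why in WHY_ORDER and WHY_ORDER.index(why) <= cutoff
    ]


selected = {
    "fox": "explicit",
    "tail": "strong_implied",
    "forest": "weak_implied",
    "sketch": "style_or_meta",
}
print(filter_min_why(selected, "strong_implied"))  # ['fox', 'tail']
```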

Code Quality Improvements

  • Removed binary PNG files from the repository (migrated to Hugging Face Xet storage)
  • Fixed eval_categorized.py compatibility with eval_pipeline.py output
  • Enhanced diagnostic and analysis scripts
  • Ensured tagging checklist loads from repo root if present
  • Forced UTF-8 stdout/stderr in eval pipeline to avoid Windows encoding crashes
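
One common way to force UTF-8 on the standard streams is `TextIOWrapper.reconfigure`, available since Python 3.7. The helper below is a sketch of that approach, not necessarily what the eval pipeline does; the `errors="replace"` choice and the function name are assumptions.

```python
import sys


def force_utf8_stdio() -> None:
    """Switch stdout/stderr to UTF-8 so non-ASCII tags do not crash legacy consoles."""
    for stream in (sys.stdout, sys.stderr):
        # Streams replaced by test harnesses or GUIs may lack reconfigure().
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(encoding="utf-8", errors="replace")


force_utf8_stdio()
```

Calling this once at the top of a script avoids `UnicodeEncodeError` on Windows consoles whose default code page cannot encode characters found in tags or prompts.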

Key Data Files

Artifacts (loaded lazily)

  • FastText embeddings: Compressed format for semantic similarity
  • TF-IDF vectors + SVD: For context-based tag similarity
  • Alias mappings: Non-canonical → canonical tag projection
  • Tag counts: Frequency information from corpus
  • Tag implications: Hierarchical tag relationships (e.g., species → family)
  • Tag groups (data/tag_groups.json): Structured tag families for inference
  • Tag wiki definitions (data/tag_wiki_defs.json): e621 wiki data for tags
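
Lazy loading of this kind is often implemented with a memoized loader, so that each artifact is read from disk only on first use. The sketch below shows that pattern for a JSON artifact; `load_json_artifact` is a hypothetical helper and the project's actual loading code may be organized differently.

```python
import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_json_artifact(path: str) -> dict:
    """Read a JSON artifact on first request and reuse the parsed object afterwards."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)
```

With this in place, a call such as `load_json_artifact("data/tag_groups.json")` pays the parsing cost once per process, and code paths that never touch tag groups never load the file. Because the cached object is shared, callers should treat it as read-only.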

Configuration

  • tagging_checklist.txt: e621 tagging guidelines and categories
  • word_rating_probabilities.csv: NSFW tag classification (threshold 0.95)
  • fluffyrock_3m.csv: Large tag corpus dataset
  • SamplePrompts.csv: Test prompts for development
  • TagDocumentation.txt: e621 tag documentation
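
The 0.95 threshold noted for word_rating_probabilities.csv suggests a simple probability cut. The sketch below shows one way such a filter could work. The column names (`tag`, `nsfw_probability`), the inclusive `>=` comparison, and the sample rows are all assumptions, since the file's actual layout is not described here.

```python
import csv
import io
from typing import Iterable, List, Set

NSFW_THRESHOLD = 0.95  # value taken from the summary above


def load_nsfw_tags(csv_text: str, threshold: float = NSFW_THRESHOLD) -> Set[str]:
    """Collect tags whose NSFW probability meets the threshold (hypothetical columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["tag"] for row in reader if float(row["nsfw_probability"]) >= threshold
    }


def filter_tags(tags: Iterable[str], nsfw_tags: Set[str], allow_nsfw_tags: bool) -> List[str]:
    """Drop NSFW tags unless the caller has explicitly allowed them."""
    if allow_nsfw_tags:
        return list(tags)
    return [t for t in tags if t not in nsfw_tags]


sample = "tag,nsfw_probability\nfox,0.01\nexample_explicit_tag,0.99\n"
nsfw = load_nsfw_tags(sample)
print(filter_tags(["fox", "example_explicit_tag"], nsfw, allow_nsfw_tags=False))
# ['fox']
```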

Evaluation

  • data/eval_samples/: Test images with ground truth annotations
  • data/eval_results/: Stored evaluation results (JSONL format)
  • eval_analysis.txt: Latest per-category performance metrics
  • data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl: Caption-evident GT subset (10 samples) for retrieval-ceiling audits

Development Practices (from AGENTS.md)

Environment

  • OS: Windows / PowerShell (though currently running on Linux)
  • Python: Always use .venv\Scripts\python.exe (Windows) or the equivalent virtual-environment interpreter on other platforms
  • Working Directory: Always run from repo root

Code Discipline

  • Keep diffs small: one issue or one focused step per patch
  • Do not rewrite large files
  • Do not move logic across modules unless contract requires it
  • Preserve stage boundaries: rewriting (LLM) vs retrieval (candidate generation) vs selection (index-only)

Contracts

  • Follow contracts in docs/ directory
  • If the code's behavior conflicts with a contract, update the code to match the contract (not vice versa)
  • Contracts define deterministic behavior; do not add feature flags for old/new paths

Testing & Evaluation

Scripts

  • scripts/eval_pipeline.py: Main evaluation harness

    • Parallel processing support
    • Multiple min_why thresholds
    • Ground truth comparison with implications expansion
    • --per-phrase-final-k retrieval cap control
    • Logs stage3_selected_scores, stage3_selected_ranks, stage3_selected_phrase_ranks
  • scripts/eval_categorized.py: Per-category evaluation

    • Precision, recall, F1 per category
    • Constraint validation (exactly_one, multi, etc.)
    • Tier-based aggregation (CRITICAL, IMPORTANT, etc.)
  • scripts/analyze_compact_eval.py: Compact evaluation analysis

  • scripts/analyze_retrieval_gaps.py: Retrieval gap identification

  • scripts/analyze_threshold_grid.py: Post-hoc threshold grids (score/global rank/phrase rank)

  • scripts/analyze_caption_evident_audit.py: Caption-evident audit vs retrieval (optional implication expansion)

  • scripts/diagnose_structural_clothing.py: Clothing inference diagnostics

  • scripts/extract_wiki_data.py: e621 wiki data extraction

  • scripts/smoke_test.py: Quick pipeline validation
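
For reference, the set-based and ranking metrics named in this summary (precision, recall, F1, MRR, Precision@K, nDCG) have standard definitions, sketched below with binary relevance. These are textbook formulas written for illustration, not excerpts from eval_pipeline.py or eval_categorized.py, and the function names are hypothetical.

```python
import math
from typing import List, Set, Tuple


def precision_recall_f1(predicted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1; empty denominators yield 0.0."""
    tp = len(predicted & truth)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(truth) if truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def reciprocal_rank(ranked: List[str], truth: Set[str]) -> float:
    """1 / rank of the first relevant tag; MRR is the mean of this over samples."""
    for rank, tag in enumerate(ranked, start=1):
        if tag in truth:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: List[str], truth: Set[str], k: int) -> float:
    """Fraction of the top k ranked tags that are relevant."""
    return sum(tag in truth for tag in ranked[:k]) / k


def ndcg_at_k(ranked: List[str], truth: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, tag in enumerate(ranked[:k], start=1)
        if tag in truth
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, len(truth)) + 1))
    return dcg / ideal if ideal else 0.0


truth = {"fox", "tail", "solo", "red_fur"}
ranked = ["forest", "fox", "tail"]
print(precision_recall_f1(set(ranked), truth)[:2])  # precision 2/3, recall 0.5
print(reciprocal_rank(ranked, truth))               # 0.5
print(precision_at_k(ranked, truth, k=2))           # 0.5
```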

Sample Scripts

  • scripts/rewrite_playground.py: Stage 1 testing
  • scripts/stage3_debug.py: Stage 3 debugging
  • scripts/test_categorized_suggestions.py: Category suggestion testing
  • scripts/test_parser_only.py: Parser validation

Gradio Application (app.py)

Features

  • Image upload support
  • Natural language prompt input
  • Multi-stage pipeline execution
  • Verbose retrieval reporting (optional)
  • NSFW tag filtering (configurable)
  • Final prompt composition with deduplication
  • Mascot branding (🐿️ squirrel)

Configuration

  • allow_nsfw_tags: NSFW content filtering
  • verbose_retrieval: Debug output for Stage 2
  • verbose_retrieval_limit: Max candidates to display (20)
  • Logging: Controlled via PSQ_LOG_LEVEL environment variable
  • Gradio analytics disabled to avoid threading errors
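
A log level driven by PSQ_LOG_LEVEL can be wired up with the standard logging module, as in the sketch below. The `configure_logging` helper, the INFO fallback, and the format string are assumptions; only the environment variable name comes from the summary.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Configure root logging from PSQ_LOG_LEVEL and return the numeric level used."""
    name = os.environ.get("PSQ_LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back when the value is not a known level name
    logging.basicConfig(
        level=level, format="%(asctime)s %(levelname)s %(name)s: %(message)s"
    )
    return level
```

Under this sketch, setting `PSQ_LOG_LEVEL=DEBUG` before launching app.py would surface debug messages, while an unset or unrecognized value falls back to INFO. Note that `logging.basicConfig` is a no-op if the root logger already has handlers.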

Key Technical Innovations

  1. Alias-to-Canonical Projection: Handles non-canonical tag variants and projects them to e621 canonical forms

  2. Head-Noun Expansion: Automatically extracts head nouns from multi-word phrases (e.g., "big shirt" → also search "shirt")

  3. Dual Scoring: FastText semantic similarity + TF-IDF/SVD context similarity with weighted fusion

  4. Must-Include Rules: Ensures exact phrase matches appear in results even if scores are lower

  5. Context Imputation: Handles missing context scores with per-phrase 10th percentile fallback

  6. Chunked Map-Union: Scalable LLM selection for large candidate sets without LLM reduce overhead

  7. Tag Implications: Automatic hierarchical tag expansion for complete tagging

  8. Categorized Suggestions: TF-IDF-ranked suggestions organized by e621 checklist categories
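
Two of these ideas, head-noun expansion and context imputation, are easy to show in miniature. The sketch below makes simplifying assumptions: the head noun is taken to be the last whitespace-separated token (a rough English heuristic), and the imputed value is the linearly interpolated 10th percentile of the context scores observed for the same phrase. The function names are hypothetical, and the retrieval contract remains the authority on the real behavior.

```python
from typing import Dict, List, Sequence


def expand_head_noun(phrase: str) -> List[str]:
    """Return the phrase plus its last token, e.g. 'big shirt' -> also 'shirt'."""
    tokens = phrase.split()
    if len(tokens) < 2:
        return [phrase]
    return [phrase, tokens[-1]]


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile with linear interpolation between closest ranks."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty sequence")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def impute_context(
    candidates: Sequence[str], context: Dict[str, float], q: float = 10.0
) -> Dict[str, float]:
    """Fill missing context scores with the q-th percentile of the observed ones."""
    observed = [context[t] for t in candidates if t in context]
    fallback = percentile(observed, q) if observed else 0.0  # 0.0 is an assumption
    return {t: context.get(t, fallback) for t in candidates}


print(expand_head_noun("big shirt"))  # ['big shirt', 'shirt']
scores = impute_context(["shirt", "t-shirt", "topwear"], {"shirt": 0.0, "t-shirt": 1.0})
print(round(scores["topwear"], 3))    # 0.1
```

Using a low percentile instead of zero keeps a candidate with no context score in contention without letting it outrank candidates that have real context evidence.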


Next Steps / Known Work Areas

Based on recent commits, active work areas are:

  1. Tag Categorization Pipeline: Providing structured, category-based tag suggestions to users

  2. Evaluation Metrics: Comprehensive per-category and ranking metrics to measure system quality

  3. Structural Inference: Improving clothing and body-state inference from selected tags

  4. Ground Truth Quality: Expanding and improving evaluation datasets with proper implication handling

  5. Production Deployment: Optimizing for Hugging Face Spaces deployment (binary file handling, logging, etc.)


Dependencies

Core packages (requirements.txt):

  • gradio==4.44.1 / gradio-client==1.3.0 - Web interface
  • hnswlib==0.8.0 - Fast nearest neighbor search
  • numpy==1.25.1 - Numerical operations
  • scikit-learn==1.4.1.post1 - ML utilities (TF-IDF, SVD)
  • h5py==3.8.0 - HDF5 file handling
  • compress-fasttext - Compressed FastText embeddings
  • lark-parser - Grammar parsing (for prompt parsing)
  • scipy==1.12.0 - Scientific computing
  • gensim==4.3.2 - Word embeddings
  • huggingface_hub<1.0 - Dataset/model hosting
  • rapidfuzz>=3.0 - Fast string matching

Summary of Session Work

In this session (and recent sessions based on git history), we have:

  1. Built tag categorization infrastructure based on e621 checklist
  2. Created category parser with tier and constraint support
  3. Implemented TF-IDF-based categorized suggestions
  4. Added comprehensive evaluation metrics (per-category P/R/F1, ranking metrics)
  5. Fixed multi-select constraint handling for body_type, species, gender
  6. Improved structural inference system with group-based wiki data approach
  7. Enhanced evaluation pipeline with parallel processing and implication expansion
  8. Added diagnostic and analysis tools for debugging and quality assessment
  9. Removed binary files from the repository and moved them to Xet storage on Hugging Face

The project now has a full three-stage pipeline, per-category and ranking-based evaluation infrastructure, and category-based tag organization aligned with the e621 tagging checklist.