Clean assistant workflow references

Files changed (5)

docs/superpowers/plans/2026-03-23-gradio-spam-classifier.md +0 -1105
docs/superpowers/plans/2026-03-23-mlx-spam-classifier.md +0 -848
docs/superpowers/plans/2026-04-14-spam-xai-v2-simplify.md +0 -383
docs/superpowers/specs/2026-03-23-gradio-spam-classifier-design.md +0 -298
docs/superpowers/specs/2026-03-23-mlx-spam-classifier-design.md +0 -311

docs/superpowers/plans/2026-03-23-gradio-spam-classifier.md DELETED

@@ -1,1105 +0,0 @@
-# Gradio Spam Classifier Implementation Plan
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-**Goal:** Build a fresh, beginner-level Gradio spam classifier app with LIME, SHAP, and plain-English explanations — replacing the old Streamlit project.
-**Architecture:** New standalone project (`spam-classifier-gradio/`) that symlinks data from the old project. Three Python files: `utils.py` (shared preprocessing), `train.py` (model training + comparison), `app.py` (Gradio UI). Models saved to `models/` directory.
-**Tech Stack:** Python, scikit-learn, Gradio, LIME, SHAP, NLTK, pandas, numpy, matplotlib, joblib
-**Spec:** `docs/superpowers/specs/2026-03-23-gradio-spam-classifier-design.md`
----
-### Task 1: Project Scaffolding
-**Files:**
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/requirements.txt`
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/CLAUDE.md`
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/CHANGELOG.md`
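-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/models/.gitkeep`
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/.gitignore`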
-- [ ] **Step 1: Create the project directory and models folder**
-```bash
-mkdir -p "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/models"
-```
-- [ ] **Step 2: Create symlink to data from old project**
-```bash
-ln -s "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/data" "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/data"
-```
-Verify symlink works:
-```bash
-ls "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/data/spam_Emails_data.csv"
-```
-Expected: file should be listed (not "No such file")
-- [ ] **Step 3: Create requirements.txt**
-```
-numpy>=1.24.0
-pandas>=2.0.0
-matplotlib>=3.7.0
-scikit-learn>=1.3.0
-scipy>=1.11.0
-nltk>=3.8.0
-lime>=0.2.0
-shap>=0.44.0
-gradio>=4.0.0
-joblib>=1.3.0
-tqdm>=4.65.0
-```
-- [ ] **Step 4: Create CLAUDE.md**
-```markdown
-# CLAUDE.md
-## Project Context
-Spam email classifier with Gradio UI for ENGT 375 (Applied Machine Learning, Spring 2026, ODU).
-Uses scikit-learn (Random Forest, Logistic Regression, SVM ensemble) with LIME and SHAP for explainability.
-## Code Style
-- Beginner-level Python: explicit for-loops, clear variable names, comments explaining why
-- No advanced patterns (decorators, metaclasses, complex comprehensions)
-- Reference course concepts in comments where applicable
-## How to Run
-1. Install deps: `pip install -r requirements.txt`
-2. Train models: `python train.py`
-3. Launch app: `python app.py`
-## Key Files
-- `utils.py` — Shared text preprocessing and feature engineering (24 metadata features)
-- `train.py` — Data loading, model comparison (RF/LR/SVM), VotingClassifier ensemble, saves to models/
-- `app.py` — Gradio UI with Result, LIME, and SHAP tabs
-## Data
-- `data/` is a symlink to `../spam-xai-project/data/`
-- Sources: Kaggle spam CSV + GitHub email-dataset
-```
-- [ ] **Step 5: Create initial CHANGELOG.md**
-```markdown
-# Changelog
-All notable changes to this project will be documented in this file.
-This serves as a reference for writing the course paper's methodology section.
-## v0.1.0 — 2026-03-23
-### Initial Project Setup
-- Created fresh Gradio-based spam classifier project
-- Symlinked data from old spam-xai-project
-- Set up requirements.txt with core dependencies
-```
-- [ ] **Step 6: Create .gitignore**
-```
-__pycache__/
-*.pyc
-.pytest_cache/
-models/*.joblib
-models/*.json
-# data is a symlink, so no trailing slash (a trailing slash matches only real directories)
-data
-*.egg-info/
-.DS_Store
-```
-- [ ] **Step 7: Initialize git repo**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
-git init
-git add requirements.txt CLAUDE.md CHANGELOG.md models/.gitkeep .gitignore
-git commit -m "chore: scaffold project with requirements, CLAUDE.md, CHANGELOG"
-```
-- [ ] **Step 8: Install dependencies**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
-pip install -r requirements.txt
-```
-Verify: `python -c "import gradio; print(gradio.__version__)"`
-Expected: version 4.x or higher
-- [ ] **Step 9: Download NLTK stopwords (if not already present)**
-```bash
-python -c "import nltk; nltk.download('stopwords', quiet=True); print('OK')"
-```
-Expected: `OK`
----
-### Task 2: Utilities Module (`utils.py`)
-**Files:**
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/utils.py`
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/test_utils.py`
-- [ ] **Step 1: Write test for `preprocess_text`**
-Create `test_utils.py`:
-```python
-# Tests for the shared utility functions
-# Run: python -m pytest test_utils.py -v
-def test_preprocess_text_strips_html():
-    from utils import preprocess_text
-    result = preprocess_text('<b>Hello</b> world')
-    assert '<' not in result
-    assert '>' not in result
-def test_preprocess_text_removes_urls():
-    from utils import preprocess_text
-    result = preprocess_text('Visit http://example.com for details')
-    assert 'http' not in result
-    assert 'example' not in result
-def test_preprocess_text_removes_emails():
-    from utils import preprocess_text
-    result = preprocess_text('Contact user@example.com for info')
-    assert '@' not in result
-def test_preprocess_text_lowercases():
-    from utils import preprocess_text
-    result = preprocess_text('HELLO WORLD')
-    # After stemming, should be lowercase
-    assert result == result.lower()
-def test_preprocess_text_removes_stopwords():
-    from utils import preprocess_text
-    result = preprocess_text('this is a test of the system')
-    assert 'this' not in result.split()
-    assert 'the' not in result.split()
-def test_preprocess_text_empty_input():
-    from utils import preprocess_text
-    result = preprocess_text('')
-    assert result == ''
-```
-- [ ] **Step 2: Run tests to verify they fail**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
-python -m pytest test_utils.py -v
-```
-Expected: FAIL — `ModuleNotFoundError: No module named 'utils'`
-- [ ] **Step 3: Write `utils.py` — phrase lists, constants, and `preprocess_text`**
-Copy the preprocessing logic from `spam-xai-project/utils_student.py` (lines 1-76, which include the phrase lists at lines 20-58). Remove Ollama/LLM references. Add the `FEATURE_DESCRIPTIONS` dict and `META_FEATURE_NAMES` list.
-The file should contain:
-- Imports: `re`, `numpy`, `nltk.corpus.stopwords`, `nltk.stem.PorterStemmer`
-- Phrase lists: `spam_context_phrases`, `ham_context_phrases`, `registration_phrases`, `url_shorteners`, `legitimate_platforms`
-- `META_FEATURE_NAMES` — list of 24 strings
-- `FEATURE_DESCRIPTIONS` — dict mapping each metadata feature name to a human-readable string
-- `preprocess_text(text)` — same logic as `utils_student.py:65-76`
-- `compute_metadata_features(texts)` — placeholder (next step)
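The cleaning steps that the Step 1 tests expect can be sketched as follows. This is only an illustration: the real logic lives in `utils_student.py`, which is not shown here. The name `preprocess_text_sketch` and the tiny `STOPWORDS` set are hypothetical stand-ins (the plan uses NLTK's stopword list), and Porter stemming is left out so the sketch runs with the standard library alone.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords (assumption, not the plan's list)
STOPWORDS = {"this", "is", "a", "of", "the", "for", "and", "to", "in"}


def preprocess_text_sketch(text):
    """Lowercase, strip HTML/URLs/emails, keep letters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # anything that is not a letter
    kept_words = []
    for word in text.split():
        if word not in STOPWORDS:
            kept_words.append(word)
    return " ".join(kept_words)
```

The order matters: URLs and email addresses are removed before punctuation is stripped, otherwise fragments such as `example` and `com` would survive as ordinary words.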
-- [ ] **Step 4: Run preprocessing tests to verify they pass**
-```bash
-python -m pytest test_utils.py -v
-```
-Expected: All preprocessing tests PASS
-- [ ] **Step 5: Write tests for `compute_metadata_features`**
-Add to `test_utils.py`:
-```python
-import numpy as np
-def test_compute_metadata_features_shape():
-    from utils import compute_metadata_features
-    result = compute_metadata_features(['Hello world!', 'Buy now!!!'])
-    assert isinstance(result, np.ndarray)
-    assert result.shape == (2, 24)
-def test_compute_metadata_features_exclamation_density():
-    from utils import compute_metadata_features
-    # "Buy now!!!" has 3 exclamation marks, 1 sentence
-    result = compute_metadata_features(['Buy now!!!'])
-    exclamation_density = result[0][0]
-    assert exclamation_density == 3.0
-def test_compute_metadata_features_dollar_count():
-    from utils import compute_metadata_features
-    result = compute_metadata_features(['Win $100 or $200'])
-    dollar_count = result[0][1]
-    assert dollar_count == 2
-def test_compute_metadata_features_spam_phrases():
-    from utils import compute_metadata_features
-    result = compute_metadata_features(['Act now! Buy now!'])
-    spam_phrase_count = result[0][3]
-    assert spam_phrase_count >= 2  # 'act now' and 'buy now'
-```
-- [ ] **Step 6: Run tests to verify new tests fail**
-```bash
-python -m pytest test_utils.py -v
-```
-Expected: The new `test_compute_metadata_features_*` tests FAIL (function is placeholder)
-- [ ] **Step 7: Implement `compute_metadata_features` in `utils.py`**
-Copy the full 24-feature computation logic from `spam-xai-project/utils_student.py` lines 82-236. This is the exact same code — explicit `for` loops, same feature order, same comments.
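As a rough illustration of the explicit-loop style, the sketch below computes only the first two features that the Step 5 tests index (exclamation density at position 0, dollar count at position 1). The sentence-splitting rule is an assumption chosen to agree with the test comment ("3 exclamation marks, 1 sentence"); the actual definitions are in `utils_student.py`, and the real function returns a NumPy array with 24 columns rather than a list. The name `metadata_features_sketch` is hypothetical.

```python
import re


def metadata_features_sketch(texts):
    """Return one [exclamation_density, dollar_count] row per email."""
    rows = []
    for text in texts:
        # Assumed sentence definition: non-empty chunks between runs of . ! ?
        chunks = re.split(r"[.!?]+", text)
        sentence_count = 0
        for chunk in chunks:
            if chunk.strip():
                sentence_count += 1
        if sentence_count == 0:
            sentence_count = 1  # avoid dividing by zero on empty input
        exclamation_density = text.count("!") / sentence_count
        dollar_count = float(text.count("$"))
        rows.append([exclamation_density, dollar_count])
    return rows
```

Keeping the feature order fixed is what lets the tests (and later `META_FEATURE_NAMES`) refer to features by column position.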
-- [ ] **Step 8: Run all tests to verify they pass**
-```bash
-python -m pytest test_utils.py -v
-```
-Expected: All tests PASS (preprocessing + metadata)
-- [ ] **Step 9: Commit**
-```bash
-git add utils.py test_utils.py
-git commit -m "feat: add utils.py with preprocessing and 24 metadata features"
-```
----
-### Task 3: Training Script (`train.py`)
-**Files:**
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/train.py`
-This task builds the full training pipeline. No separate test file — the training script itself prints classification reports and saves a `training_report.json` as verification.
-- [ ] **Step 1: Write `train.py` — complete file**
-Write the full `train.py` file. Here is the complete code:
-```python
-# Train the spam classifier — compare models and build ensemble
-# ENGT 375 Project - Spring 2026 - ODU
-# Run: python train.py
-import json
-import warnings
-import numpy as np
-import pandas as pd
-from pathlib import Path
-from sklearn.model_selection import train_test_split
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.ensemble import RandomForestClassifier, VotingClassifier
-from sklearn.linear_model import LogisticRegression
-from sklearn.svm import SVC
-from sklearn.preprocessing import MinMaxScaler
-from sklearn.metrics import (classification_report, f1_score,
-                             accuracy_score, precision_score, recall_score,
-                             precision_recall_curve)
-from scipy.sparse import hstack, csr_matrix
-from tqdm import tqdm
-import joblib
-from utils import preprocess_text, compute_metadata_features, META_FEATURE_NAMES
-warnings.filterwarnings('ignore', category=FutureWarning)
-warnings.filterwarnings('ignore', category=DeprecationWarning)
-# Paths
-project_dir = Path(__file__).parent
-data_dir = project_dir / 'data'
-models_dir = project_dir / 'models'
-random_state = 42
-kaggle_csv = data_dir / 'spam_Emails_data.csv'
-github_dataset_dir = data_dir / 'email-dataset-main' / 'email-dataset-main' / 'dataset'
-KAGGLE_CAP = 100000  # stratified sample to keep training fast
-# ---- Data Loading ----
-print('Starting model training...')
-# Load Kaggle 190K spam dataset (stratified sample)
-df = pd.DataFrame(columns=['text', 'label'])
-if kaggle_csv.exists():
-    print('Loading Kaggle spam dataset...')
-    df_kaggle = pd.read_csv(kaggle_csv)
-    # Normalize column names — Kaggle CSV has 'label' and 'text' columns
-    # Labels are 'Ham'/'Spam' — normalize to lowercase
-    df_kaggle = df_kaggle.rename(columns={'label': 'label_raw', 'text': 'text'})
-    df_kaggle['label'] = df_kaggle['label_raw'].str.strip().str.lower().map({'ham': 'ham', 'spam': 'spam'})
-    df_kaggle = df_kaggle[['text', 'label']].dropna(subset=['label', 'text'])
-    print('  Kaggle total: %d emails' % len(df_kaggle))
-    # Stratified sample to cap size (same as old project)
-    if len(df_kaggle) > KAGGLE_CAP:
-        # GroupBy.sample keeps the 'label' column in the result; newer pandas
-        # versions drop the grouping column inside groupby(...).apply(...)
-        df_kaggle = df_kaggle.groupby('label').sample(
-            frac=KAGGLE_CAP / len(df_kaggle),
-            random_state=random_state,
-        )
-        print('  Kaggle after stratified cap: %d emails' % len(df_kaggle))
-    df = df_kaggle.reset_index(drop=True)
-else:
-    print('WARNING: Kaggle CSV not found at %s' % str(kaggle_csv))
-# Load GitHub email-dataset (individual text files)
-# dataset/1/ = ham, dataset/2/ = spam
-if github_dataset_dir.exists():
-    print('Loading GitHub email-dataset...')
-    github_rows = []
-    for subdir, lbl in [('1', 'ham'), ('2', 'spam')]:
-        folder = github_dataset_dir / subdir
-        if folder.exists():
-            for fpath in folder.iterdir():
-                if fpath.is_file():
-                    try:
-                        content = fpath.read_text(encoding='utf-8', errors='replace')
-                        if content.strip():
-                            github_rows.append({'text': content, 'label': lbl})
-                    except Exception:
-                        pass
-    if github_rows:
-        df_github = pd.DataFrame(github_rows)
-        print('  GitHub dataset: %d emails (%d ham, %d spam)' % (
-            len(df_github),
-            len(df_github[df_github['label'] == 'ham']),
-            len(df_github[df_github['label'] == 'spam'])
-        ))
-        df = pd.concat([df, df_github], ignore_index=True)
-else:
-    print('WARNING: GitHub email-dataset not found at %s' % str(github_dataset_dir))
-if len(df) == 0:
-    raise RuntimeError('No training data found! Check that data/ symlink is valid.')
-# Deduplicate
-before_dedup = len(df)
-df = df.drop_duplicates(subset=['text']).reset_index(drop=True)
-print('Combined dataset: %d emails (removed %d duplicates)' % (len(df), before_dedup - len(df)))
-print('  Ham: %d, Spam: %d' % (len(df[df['label'] == 'ham']), len(df[df['label'] == 'spam'])))
-# ---- Preprocessing & Feature Engineering ----
-print('Preprocessing text...')
-df['clean_text'] = df['text'].apply(preprocess_text)
-# TF-IDF: same parameters as the old project for comparable results
-print('Building TF-IDF features (max 3000, ngrams 1-3)...')
-tfidf = TfidfVectorizer(
-    max_features=3000,
-    ngram_range=(1, 3),
-    min_df=2,
-    max_df=0.90,
-    sublinear_tf=True,
-)
-X_tfidf = tfidf.fit_transform(df['clean_text'])
-# 24 metadata features (exclamation density, dollar signs, caps ratio, etc.)
-print('Computing 24 metadata features...')
-X_meta = compute_metadata_features(df['text'].values)
-# Scale metadata to 0-1 range so they match TF-IDF scale
-# Without this, features like email_length (could be 1000+) would dominate
-meta_scaler = MinMaxScaler()
-X_meta_scaled = meta_scaler.fit_transform(X_meta)
-# Combine TF-IDF + metadata into one feature matrix
-X_combined = hstack([X_tfidf, csr_matrix(X_meta_scaled)])
-feature_names = list(tfidf.get_feature_names_out()) + META_FEATURE_NAMES
-# Encode labels: 1 = spam, 0 = ham
-y = (df['label'] == 'spam').astype(int)
-print('Total features: %d (%d TF-IDF + %d metadata)' % (
-    len(feature_names), X_tfidf.shape[1], len(META_FEATURE_NAMES)))
-```
-The rest of `train.py` (model comparison, ensemble, saving) is added in Steps 2-4 below — but they are all part of the same file, appended after this code.
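To make the scaling comment in the code above concrete, here is a small pure-Python sketch of what min-max scaling does to a single feature column. The helper `min_max_scale` and the sample email lengths are made up for illustration; the plan itself uses scikit-learn's `MinMaxScaler`.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the 0-1 range using (v - min) / (max - min)."""
    lowest = min(column)
    highest = max(column)
    scaled = []
    for value in column:
        if highest == lowest:
            scaled.append(0.0)  # constant column: nothing to spread out
        else:
            scaled.append((value - lowest) / (highest - lowest))
    return scaled


# Hypothetical email lengths in characters
email_lengths = [120, 1500, 640]
scaled_lengths = min_max_scale(email_lengths)
```

Note that the fitted scaler remembers the minimum and maximum from the training data, so an email seen later in the app can scale to a value slightly outside 0-1 if it is more extreme than anything in training.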
-- [ ] **Step 2: Add model comparison section**
-After building `X_combined` and `y`, add:
-- 70/30 stratified split
-- Train RF, LR, SVM individually
-- Print classification reports for each
-- Collect F1 scores
-```python
-# ---- Train/Test Split ----
-X_train, X_test, y_train, y_test = train_test_split(
-    X_combined, y, test_size=0.3, random_state=random_state, stratify=y
-)
-print('Train: %d, Test: %d' % (X_train.shape[0], X_test.shape[0]))
-# Helper to collect metrics for the training report
-def get_metrics(y_true, y_pred):
-    """Compute accuracy, precision, recall, F1 for the spam (1) class."""
-    return {
-        'accuracy': round(accuracy_score(y_true, y_pred), 4),
-        'precision': round(precision_score(y_true, y_pred), 4),
-        'recall': round(recall_score(y_true, y_pred), 4),
-        'f1': round(f1_score(y_true, y_pred), 4),
-    }
-# ---- Model Comparison ----
-# Train three classifiers individually and compare
-print('\n--- Random Forest ---')
-rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, class_weight='balanced', random_state=random_state)
-rf.fit(X_train, y_train)
-rf_pred = rf.predict(X_test)
-rf_metrics = get_metrics(y_test, rf_pred)
-print(classification_report(y_test, rf_pred, target_names=['Ham', 'Spam']))
-print('\n--- Logistic Regression ---')
-lr = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=random_state)
-lr.fit(X_train, y_train)
-lr_pred = lr.predict(X_test)
-lr_metrics = get_metrics(y_test, lr_pred)
-print(classification_report(y_test, lr_pred, target_names=['Ham', 'Spam']))
-print('\n--- SVM (Linear) ---')
-svm = SVC(kernel='linear', class_weight='balanced', probability=True, random_state=random_state)
-svm.fit(X_train, y_train)
-svm_pred = svm.predict(X_test)
-svm_metrics = get_metrics(y_test, svm_pred)
-print(classification_report(y_test, svm_pred, target_names=['Ham', 'Spam']))
-```
-- [ ] **Step 3: Add VotingClassifier ensemble and threshold optimization**
-```python
-# Build VotingClassifier with all three models
-# VotingClassifier retrains the models internally, so we pass fresh estimators
-print('\n--- Voting Ensemble ---')
-voting = VotingClassifier(
-    estimators=[
-        ('rf', RandomForestClassifier(n_estimators=200, n_jobs=-1, class_weight='balanced', random_state=random_state)),
-        ('lr', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=random_state)),
-        ('svm', SVC(kernel='linear', class_weight='balanced', probability=True, random_state=random_state)),
-    ],
-    voting='soft',
-    n_jobs=-1
-)
-voting.fit(X_train, y_train)
-voting_pred = voting.predict(X_test)
-voting_metrics = get_metrics(y_test, voting_pred)
-print(classification_report(y_test, voting_pred, target_names=['Ham', 'Spam']))
-# Find optimal threshold using precision-recall curve
-# We want the highest threshold where ham predictions are >= 99% precise
-y_proba = voting.predict_proba(X_test)[:, 1]
-precision, recall, thresholds_pr = precision_recall_curve(y_test, y_proba)
-y_test_arr = np.array(y_test)  # convert to numpy to avoid pandas .values issues
-best_threshold = 0.50
-for t in sorted(thresholds_pr, reverse=True):
-    predicted_ham_mask = y_proba < t
-    if predicted_ham_mask.sum() == 0:
-        continue
-    ham_precision = (y_test_arr[predicted_ham_mask] == 0).sum() / predicted_ham_mask.sum()
-    if ham_precision >= 0.99:
-        best_threshold = t
-        break
-optimal_threshold = best_threshold
-print('Optimal threshold (99%% ham precision): %.4f' % optimal_threshold)
-```
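The threshold search above can be traced by hand on a toy example. The sketch below mirrors the loop in plain Python; `pick_threshold` and the five sample predictions are hypothetical, and the candidate thresholds are simply the distinct predicted probabilities, which is an assumption standing in for the values returned by `precision_recall_curve`.

```python
def pick_threshold(y_true, y_proba, target=0.99, default=0.50):
    """Return the highest threshold whose predicted-ham group is >= target pure."""
    for threshold in sorted(set(y_proba), reverse=True):
        # Emails scoring below the threshold would be labelled ham
        predicted_ham_labels = []
        for label, proba in zip(y_true, y_proba):
            if proba < threshold:
                predicted_ham_labels.append(label)
        if len(predicted_ham_labels) == 0:
            continue
        ham_precision = predicted_ham_labels.count(0) / len(predicted_ham_labels)
        if ham_precision >= target:
            return threshold
    return default


# Toy data: 0 = ham, 1 = spam
toy_labels = [0, 0, 0, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.60, 0.90]
toy_threshold = pick_threshold(toy_labels, toy_scores)
```

At a threshold of 0.90 one spam email (score 0.60) would be labelled ham, giving 75% ham precision, so the loop moves on. At 0.60 every email below the threshold is truly ham, so 0.60 is returned.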
-- [ ] **Step 4: Add model saving section**
-```python
-# Save all model artifacts
-models_dir.mkdir(exist_ok=True)
-joblib.dump(voting, models_dir / 'voting_model.joblib')
-joblib.dump(tfidf, models_dir / 'tfidf_vectorizer.joblib')
-joblib.dump(meta_scaler, models_dir / 'meta_scaler.joblib')
-joblib.dump(feature_names, models_dir / 'feature_names.joblib')
-joblib.dump(optimal_threshold, models_dir / 'optimal_threshold.joblib')
-# Save 200-row training sample for LIME
-X_train_dense = X_train.toarray()
-rng = np.random.RandomState(random_state)
-sample_idx = rng.choice(X_train_dense.shape[0], size=min(200, X_train_dense.shape[0]), replace=False)
-training_sample = X_train_dense[sample_idx]
-joblib.dump(training_sample, models_dir / 'training_sample.joblib')
-# Save training report as JSON (includes accuracy, precision, recall, F1 per spec)
-report = {
-    'random_forest': rf_metrics,
-    'logistic_regression': lr_metrics,
-    'svm': svm_metrics,
-    'voting_ensemble': voting_metrics,
-    'optimal_threshold': round(optimal_threshold, 4),
-    'best_single_model': max(
-        [('random_forest', rf_metrics['f1']),
-         ('logistic_regression', lr_metrics['f1']),
-         ('svm', svm_metrics['f1'])],
-        key=lambda x: x[1]
-    )[0],
-}
-with open(models_dir / 'training_report.json', 'w') as f:
-    json.dump(report, f, indent=2)
-print('\nAll models saved to models/')
-print('Training report: %s' % json.dumps(report, indent=2))
-```
-- [ ] **Step 5: Run training**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
-python train.py
-```
-Expected output:
-- Loading messages for Kaggle and GitHub datasets
-- Classification reports for RF, LR, SVM, and Voting Ensemble
-- Optimal threshold printed
-- "All models saved to models/"
-- Files in `models/`: `voting_model.joblib`, `tfidf_vectorizer.joblib`, `meta_scaler.joblib`, `feature_names.joblib`, `optimal_threshold.joblib`, `training_sample.joblib`, `training_report.json`
-Verify:
-```bash
-ls models/
-cat models/training_report.json
-```
-- [ ] **Step 6: Commit**
-```bash
-git add train.py
-git commit -m "feat: add train.py with RF/LR/SVM comparison and VotingClassifier ensemble"
-```
-Note: Model artifacts are already in `.gitignore` from Task 1.
----
-### Task 4: Gradio App — Basic Classification (`app.py`)
-**Files:**
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/app.py`
-- [ ] **Step 1: Write `app.py` — model loading and classification function**
-```python
-# Spam Email Classifier with XAI Explanations — Gradio App
-# ENGT 375 Project - Spring 2026 - ODU
-# Run: python app.py
-import numpy as np
-import joblib
-import matplotlib
-matplotlib.use('Agg')  # Non-interactive backend for Gradio
-import matplotlib.pyplot as plt
-import gradio as gr
-from pathlib import Path
-from scipy.sparse import hstack, csr_matrix
-from utils import (preprocess_text, compute_metadata_features,
-                   META_FEATURE_NAMES, FEATURE_DESCRIPTIONS)
-# Paths
-project_dir = Path(__file__).parent
-models_dir = project_dir / 'models'
-# Load trained model artifacts
-# These are created by running train.py first
-def load_models():
-    """Load all saved model files. Returns None values if models not found."""
-    try:
-        model = joblib.load(models_dir / 'voting_model.joblib')
-        vectorizer = joblib.load(models_dir / 'tfidf_vectorizer.joblib')
-        scaler = joblib.load(models_dir / 'meta_scaler.joblib')
-        feature_names = joblib.load(models_dir / 'feature_names.joblib')
-        threshold = joblib.load(models_dir / 'optimal_threshold.joblib')
-        training_sample = joblib.load(models_dir / 'training_sample.joblib')
-        return model, vectorizer, scaler, feature_names, threshold, training_sample
-    except FileNotFoundError:
-        return None, None, None, None, None, None
-model, vectorizer, scaler, feature_names, threshold, training_sample = load_models()
-def classify_email(email_text):
-    """Classify a single email. Returns (label, confidence, combined_features_sparse_matrix)."""
-    if model is None:
-        return "Models not found. Run `python train.py` first.", 0.0, "No model available."
-    if not email_text or not email_text.strip():
-        return "Please enter email text or upload a file.", 0.0, ""
-    # Step 1: Preprocess
-    clean = preprocess_text(email_text)
-    # Step 2: TF-IDF
-    tfidf_vec = vectorizer.transform([clean])
-    # Step 3: Metadata features
-    meta = compute_metadata_features([email_text])
-    meta_scaled = scaler.transform(meta)
-    # Step 4: Combine
-    combined = hstack([tfidf_vec, csr_matrix(meta_scaled)])
-    # Step 5: Predict
-    proba = model.predict_proba(combined)[0][1]  # probability of spam
-    is_spam = proba >= threshold
-    label = "SPAM" if is_spam else "HAM (Not Spam)"
-    confidence = proba if is_spam else (1 - proba)
-    return label, confidence, combined
-```
-- [ ] **Step 2: Add plain-English summary function**
-```python
-def generate_summary(label, confidence, email_text, lime_explanation=None):
-    """Generate a plain-English explanation of the classification."""
-    # Get metadata feature values for this email
-    meta = compute_metadata_features([email_text])
-    meta_values = meta[0]
-    summary_lines = []
-    summary_lines.append("This email was classified as **%s** (%.0f%% confidence).\n" % (label, confidence * 100))
-    summary_lines.append("**Key factors:**\n")
-    # If we have LIME results, use those for the top factors
-    if lime_explanation is not None:
-        feature_weights = lime_explanation.as_list()
-        for feat_name, weight in feature_weights[:5]:
-            direction = "toward spam" if weight > 0 else "toward ham"
-            summary_lines.append("- **%s** pushes %s" % (feat_name, direction))
-    else:
-        # Fallback: report notable metadata values
-        for i, name in enumerate(META_FEATURE_NAMES):
-            val = meta_values[i]
-            if val > 0:
-                desc = FEATURE_DESCRIPTIONS.get(name, name)
-                summary_lines.append("- %s: %.2f" % (desc, val))
-    return "\n".join(summary_lines)
-```
-- [ ] **Step 3: Add example emails**
-```python
-# Example emails for quick testing (from the old project)
-EXAMPLE_EMAILS = [
-    ["Subject: URGENT - You Have Won $5,000,000!!!\n\nDear Friend,\n\nCONGRATULATIONS!!! You have been selected as the winner of our international lottery program!!!\nTo claim your $5,000,000 USD prize, click the link below IMMEDIATELY and provide your bank details.\n\nACT NOW - This offer expires in 24 hours!!!\n\nClick here: http://totally-legit-prize.com/claim\nSend $500 processing fee to unlock your winnings.\n\nBest regards,\nDr. Prince Mohammed"],
-    ["Subject: Team sync Thursday 2pm\n\nHi everyone,\n\nJust a reminder that we have our weekly team sync this Thursday at 2pm in Conference Room B.\n\nAgenda:\n- Sprint review\n- Q2 planning discussion\n- New hire onboarding update\n\nPlease come prepared with your status updates.\n\nThanks,\nSarah"],
-    ["Subject: Your account has been compromised!\n\nDear Customer,\n\nWe detected suspicious activity on your account. Click here immediately to verify your identity: http://secure-bank-login.com/verify\n\nIf you do not verify within 24 hours, your account will be permanently locked.\n\nSecurity Team"],
-    ["Subject: Thanksgiving dinner plans\n\nHi everyone!\n\nI wanted to start planning for Thanksgiving dinner. I'm thinking we could do it at my place this year. What does everyone think about 4pm?\n\nLet me know if you have any dietary restrictions or if you want to bring a dish.\n\nLove,\nMom"],
-]
-```
-- [ ] **Step 4: Add Gradio interface with basic Result tab**
-```python
-def classify_and_explain(email_text, file_obj):
-    """Main function called by Gradio. Returns result text, LIME plot, SHAP plot."""
-    # Handle file upload
-    if file_obj is not None:
-        try:
-            email_text = Path(file_obj.name).read_text(encoding='utf-8', errors='replace')
-        except Exception:
-            return "Could not read file. Please upload a .txt file.", None, None
-    if not email_text or not email_text.strip():
-        return "Please enter email text or upload a file.", None, None
-    if model is None:
-        return "Models not found. Run `python train.py` first.", None, None
-    # Classify
-    label, confidence, combined = classify_email(email_text)
-    # Generate summary (LIME will be added in Task 5)
-    summary = generate_summary(label, confidence, email_text)
-    return summary, None, None  # LIME and SHAP plots added in later tasks
-# Build the Gradio interface
-with gr.Blocks(title="Spam Email Classifier with XAI") as demo:
-    gr.Markdown("# Spam Email Classifier with XAI Explanations")
-    gr.Markdown("Paste an email below or upload a .txt file to classify it as spam or ham.")
-    with gr.Row():
-        with gr.Column(scale=1):
-            email_input = gr.Textbox(
-                label="Email Text",
-                placeholder="Paste email content here...",
-                lines=12,
-            )
-            file_input = gr.File(label="Or upload a .txt file", file_types=[".txt"])
-            classify_btn = gr.Button("Classify Email", variant="primary")
-            gr.Examples(
-                examples=EXAMPLE_EMAILS,
-                inputs=email_input,
-                label="Example Emails",
-            )
-        with gr.Column(scale=1):
-            with gr.Tab("Result"):
-                result_output = gr.Markdown(label="Classification Result")
-            with gr.Tab("LIME Explanation"):
-                lime_output = gr.Plot(label="LIME")
-            with gr.Tab("SHAP — Metadata Feature Importance"):
-                shap_output = gr.Plot(label="SHAP")
-    classify_btn.click(
-        fn=classify_and_explain,
-        inputs=[email_input, file_input],
-        outputs=[result_output, lime_output, shap_output],
-    )
-if __name__ == '__main__':
-    demo.launch()
-```
-- [ ] **Step 5: Test the basic app launches**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
-python app.py
-```
-Expected: Gradio launches at `http://127.0.0.1:7860`. Open it in a browser, select the first example email (the lottery scam signed "Dr. Prince Mohammed"), and click Classify Email. The Result tab should show the plain-English summary. The LIME and SHAP tabs stay empty (None) for now.
-Stop the server with Ctrl+C after verifying.
-- [ ] **Step 6: Commit**
-```bash
-git add app.py
-git commit -m "feat: add Gradio app with classification and plain-English summary"
-```
----
-### Task 5: LIME Explanation Tab
-**Files:**
-- Modify: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/app.py`
-- [ ] **Step 1: Add LIME imports and explainer setup**
-Add to the top of `app.py` after existing imports:
-```python
-import lime
-import lime.lime_tabular
-```
-After the `load_models()` call, add:
-```python
-# Set up LIME explainer using the saved training sample
-# LIME needs a sample of training data to understand feature distributions
-if training_sample is not None:
-    lime_explainer = lime.lime_tabular.LimeTabularExplainer(
-        training_data=training_sample,
-        feature_names=feature_names,
-        class_names=['Ham', 'Spam'],
-        mode='classification',
-    )
-else:
-    lime_explainer = None
-```
-- [ ] **Step 2: Add `generate_lime_plot` function**
-```python
-def generate_lime_plot(combined_features):
-    """Generate a LIME explanation for the classified email.
-    Returns (figure, explanation), or (None, None) if LIME is unavailable.
-    """
-    if lime_explainer is None:
-        return None, None
-    # Convert sparse matrix to dense array for LIME
-    # This is fine for a single email - only a problem with thousands
-    instance = combined_features.toarray()[0]
-    # LIME explains this single prediction
-    explanation = lime_explainer.explain_instance(
-        instance,
-        model.predict_proba,
-        num_features=10,
-    )
-    # Create matplotlib figure
-    fig = explanation.as_pyplot_figure()
-    fig.set_size_inches(10, 6)
-    fig.tight_layout()
-    return fig, explanation
-```
-- [ ] **Step 3: Update `classify_and_explain` to include LIME**
-Replace the `classify_and_explain` function body to call `generate_lime_plot` and pass the LIME explanation to `generate_summary`:
-```python
-def classify_and_explain(email_text, file_obj):
-    # ... (file handling and validation unchanged) ...
-    # Classify
-    label, confidence, combined = classify_email(email_text)
-    # LIME explanation
-    lime_fig = None
-    lime_exp = None
-    if lime_explainer is not None:
-        lime_fig, lime_exp = generate_lime_plot(combined)
-    # Generate summary using LIME results
-    summary = generate_summary(label, confidence, email_text, lime_explanation=lime_exp)
-    return summary, lime_fig, None  # SHAP added in Task 6
-```
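The step above passes the LIME explanation into `generate_summary`, but the plan does not show how the explanation becomes text. A minimal sketch, assuming the summary is built from the `(feature, weight)` pairs that LIME's `Explanation.as_list()` returns. The helper name `describe_lime_pairs` and the sentence wording are hypothetical, not part of the plan:

```python
def describe_lime_pairs(pairs, top_n=5):
    """Turn (feature, weight) pairs into plain-English lines.

    Assumes positive weights push toward the explained class (Spam)
    and negative weights push away from it (toward Ham).
    """
    # Strongest contributions first, regardless of sign
    ranked = sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)
    lines = []
    for feature, weight in ranked[:top_n]:
        if weight > 0:
            direction = 'spam'
        else:
            direction = 'ham'
        lines.append('%s pushed toward %s (weight %.3f)' % (feature, direction, abs(weight)))
    return lines


# Hypothetical pairs in the shape returned by explanation.as_list()
example_pairs = [('free > 0.12', 0.21), ('meeting > 0.05', -0.30), ('url_count <= 0.00', -0.02)]
for line in describe_lime_pairs(example_pairs, top_n=2):
    print(line)
```

Sorting by absolute weight keeps strong ham signals in the summary instead of listing only spam signals.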
-- [ ] **Step 4: Test LIME tab works**
-```bash
-python app.py
-```
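-Open the browser and classify an example email. The LIME tab should show a horizontal bar chart with feature names and their contributions. LIME's default plot explains class 1 (`Spam` here): green bars support the spam prediction, red bars push against it (toward ham).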
-Stop server with Ctrl+C.
-- [ ] **Step 5: Commit**
-```bash
-git add app.py
-git commit -m "feat: add LIME explanation tab with feature importance plot"
-```
----
-### Task 6: SHAP Explanation Tab
-**Files:**
-- Modify: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/app.py`
-- [ ] **Step 1: Add SHAP import**
-Add at the top of `app.py`:
-```python
-import shap
-```
-- [ ] **Step 2: Add `generate_shap_plot` function**
-This uses KernelExplainer on metadata features only (24 features) for speed:
-```python
-def generate_shap_plot(email_text):
-    """Generate a SHAP bar chart for metadata features only.
-    We only use the 24 metadata features (not 3000+ TF-IDF features)
-    because KernelExplainer would be too slow on the full feature set.
-    """
-    if model is None or training_sample is None:
-        return None
-    # Compute metadata features for this email
-    meta = compute_metadata_features([email_text])
-    meta_scaled = scaler.transform(meta)
-    # Get metadata columns from training sample for background
-    # Metadata features are the last 24 columns in the combined feature matrix
-    n_meta = len(META_FEATURE_NAMES)
-    background_meta = training_sample[:50, -n_meta:]
-    # We need a predict function that works on metadata-only input
-    # We'll create a wrapper that fills in zeros for TF-IDF features
-    n_tfidf = training_sample.shape[1] - n_meta
-    def predict_with_meta_only(meta_array):
-        """Predict using only metadata features (pad TF-IDF with zeros)."""
-        zeros = np.zeros((meta_array.shape[0], n_tfidf))
-        full = np.hstack([zeros, meta_array])
-        return model.predict_proba(full)
-    # Create SHAP explainer with small background sample
-    explainer = shap.KernelExplainer(predict_with_meta_only, background_meta)
-    shap_values = explainer.shap_values(meta_scaled, nsamples=100)
-    # shap_values format depends on SHAP version:
-    # - Older versions: list of arrays [ham_values, spam_values]
-    # - Newer versions (0.45+): one array shaped (n_samples, n_features, n_classes)
-    # We want the SHAP values for the spam class (class 1)
-    if isinstance(shap_values, list) and len(shap_values) == 2:
-        # List format: [ham_values, spam_values], each is (n_samples, n_features)
-        spam_shap = shap_values[1][0]
-    elif isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
-        # 3D array: first sample, all features, class 1 (spam)
-        spam_shap = shap_values[0, :, 1]
-    elif isinstance(shap_values, np.ndarray) and shap_values.ndim == 2:
-        # 2D array (n_samples, n_features): only produced by a single-output
-        # predict function, so the values already refer to one class
-        spam_shap = shap_values[0]
-    else:
-        # Fallback: try to use as-is
-        spam_shap = np.array(shap_values).flatten()[:n_meta]
-    # Create bar chart
-    fig, ax = plt.subplots(figsize=(10, 6))
-    sorted_idx = np.argsort(np.abs(spam_shap))
-    top_idx = sorted_idx[-10:]  # top 10 features
-    colors = ['#ff4444' if v > 0 else '#4444ff' for v in spam_shap[top_idx]]
-    ax.barh(
-        [META_FEATURE_NAMES[i] for i in top_idx],
-        spam_shap[top_idx],
-        color=colors,
-    )
-    ax.set_xlabel('SHAP Value (impact on spam prediction)')
-    ax.set_title('SHAP — Metadata Feature Importance')
-    ax.axvline(x=0, color='gray', linestyle='--', linewidth=0.5)
-    fig.tight_layout()
-    return fig
-```
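The chart code above selects the ten largest SHAP values by magnitude and colors them by sign. A dependency-free sketch of that selection step, using plain lists instead of NumPy arrays. The feature names and values are made up for illustration:

```python
def top_features_by_magnitude(names, values, k=10):
    """Return (name, value, color) for the k largest |value| entries, smallest first."""
    # Sort indices by absolute value, ascending, like np.argsort(np.abs(values))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    selected = order[-k:]
    rows = []
    for i in selected:
        # Red pushes toward spam, blue pushes toward ham (same colors as the plan)
        color = '#ff4444' if values[i] > 0 else '#4444ff'
        rows.append((names[i], values[i], color))
    return rows


# Hypothetical metadata features and SHAP values
names = ['url_count', 'caps_ratio', 'exclamation_count', 'word_count']
values = [0.08, -0.01, 0.20, -0.12]
for row in top_features_by_magnitude(names, values, k=3):
    print(row)
```

Keeping the ascending order means `barh` draws the largest bar at the top, because the first item is placed at the bottom of the axis.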
-- [ ] **Step 3: Update `classify_and_explain` to include SHAP**
-Replace the last return line:
-```python
-    # SHAP explanation (metadata features only)
-    shap_fig = generate_shap_plot(email_text)
-    return summary, lime_fig, shap_fig
-```
-- [ ] **Step 4: Test SHAP tab works**
-```bash
-python app.py
-```
-Open browser, classify an email. The SHAP tab should show a horizontal bar chart with metadata feature names. Red bars = pushes toward spam, blue = pushes toward ham. May take a few seconds (KernelExplainer is slower than TreeExplainer).
-Stop server with Ctrl+C.
-- [ ] **Step 5: Commit**
-```bash
-git add app.py
-git commit -m "feat: add SHAP metadata feature importance tab"
-```
----
-### Task 7: Update CHANGELOG and Final Polish
-**Files:**
-- Modify: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/CHANGELOG.md`
-- [ ] **Step 1: Update CHANGELOG.md with full v0.1.0 entry**
-```markdown
-# Changelog
-All notable changes to this project will be documented in this file.
-This serves as a reference for writing the course paper's methodology section.
-## v0.1.0 — 2026-03-23
-### Initial Build
-- Created fresh project with Gradio UI (replacing old Streamlit version)
-- Ported preprocessing and 24 metadata features from old project's utils_student.py
-- Loaded Kaggle spam dataset (~190K emails, capped at 100K) + GitHub email-dataset
-- Trained and compared 3 models: Random Forest, Logistic Regression, SVM
-- Combined all 3 into a VotingClassifier (soft voting) for better accuracy
-- Built Gradio interface with:
-  - Text input + file upload
-  - Result tab with plain-English summary (top 5 factors)
-  - LIME explanation tab (full feature space, top 10 features)
-  - SHAP tab (metadata features only, KernelExplainer)
-- 4 built-in example emails for quick testing
-- All paths cross-platform (macOS compatible, no Windows .bat files)
-- No LLM/Ollama dependency — pure scikit-learn
-```
-- [ ] **Step 2: Run the app end-to-end and verify all tabs work**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
-python app.py
-```
-Verify in browser:
-1. Paste Nigerian Prince email → Result tab shows SPAM with high confidence
-2. LIME tab shows feature importance bar chart
-3. SHAP tab shows metadata feature bar chart
-4. Paste meeting invite email → Result tab shows HAM
-5. Upload a .txt file → works
-6. Examples dropdown → works
-- [ ] **Step 3: Commit final state**
-```bash
-git add CHANGELOG.md
-git commit -m "docs: update CHANGELOG with v0.1.0 initial build"
-```
----
-### Task 8: Retroactive Changelog for Old Project
-**Files:**
-- Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/CHANGELOG.md`
-This is a documentation-only task. No code changes to the old project.
-- [ ] **Step 1: Examine old project file timestamps and code comments**
-Check modification dates:
-```bash
-ls -lt "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/"
-```
-Read code comments with "Change" markers:
-```bash
-grep -n "Change\|change\|--- " "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/app.py" | head -20
-grep -n "Change\|change\|--- " "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/retrain.py" | head -20
-```
-- [ ] **Step 2: Write `CHANGELOG.md` for old project**
-Reconstruct the development timeline from file dates and code comments. The changelog should cover:
-- Initial Streamlit app (`app.py`) with basic spam classification
-- Feature engineering evolution (11 → 24 metadata features)
-- Addition of LIME, SHAP, ELI5 explanations
-- Ollama/Qwen LLM integration for AI explanations
-- Student version creation (`app_student.py`, `utils_student.py`, `retrain_student.py`)
-- Context-aware phrase lists (Change 4)
-- Domain whitelist (Change 2)
-- Newsletter augmentation (Change 9)
-- OCR support addition
-- Dark mode fixes
-Structure as dated version entries using file modification timestamps.
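Grouping files by modification date is the mechanical part of this step. A rough sketch using only the standard library. The function name `files_by_date` is hypothetical, and modification times only approximate when a change was made, since copying or re-saving a file resets them:

```python
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def files_by_date(project_dir, pattern='*.py'):
    """Map 'YYYY-MM-DD' -> sorted list of file names last modified that day."""
    grouped = defaultdict(list)
    for path in Path(project_dir).glob(pattern):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        grouped[modified.strftime('%Y-%m-%d')].append(path.name)
    # Oldest date first, so the result reads like a timeline
    return {day: sorted(names) for day, names in sorted(grouped.items())}


if __name__ == '__main__':
    for day, names in files_by_date('.').items():
        print(day, ', '.join(names))
```

Each date key can then become one dated entry in the reconstructed changelog, with the code comments supplying the description.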
-- [ ] **Step 3: Commit to old project (if git initialized) or just save**
-If the old project has no git repo, just save the file. The user can decide whether to initialize git later.
----
-### Task Summary
-| Task | Description | Depends On |
-|------|-------------|------------|
-| 1 | Project scaffolding (dirs, symlink, deps, git) | — |
-| 2 | `utils.py` with preprocessing + 24 features | Task 1 |
-| 3 | `train.py` with model comparison + ensemble | Task 2 |
-| 4 | `app.py` basic Gradio UI + classification | Task 3 |
-| 5 | LIME explanation tab | Task 4 |
-| 6 | SHAP explanation tab | Task 5 |
-| 7 | CHANGELOG update + final verification | Task 6 |
-| 8 | Retroactive changelog for old project | — (independent) |
-Tasks 1-7 are sequential. Task 8 can run in parallel with any task.

docs/superpowers/plans/2026-03-23-mlx-spam-classifier.md DELETED Viewed

@@ -1,848 +0,0 @@
-# MLX Spam Classifier Implementation Plan
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-**Goal:** Fine-tune Qwen3.5-0.8B on spam classification using Apple MLX with LoRA, build a Gradio UI with classify + chat tabs, and create comprehensive documentation.
-**Architecture:** New standalone project (`spam-classifier-mlx/`). Three main scripts: `prepare_data.py` (generates training JSONL using local 9B model), `fine_tune.py` (LoRA fine-tuning wrapper), `app.py` (Gradio UI). A `docs/` folder contains all reference documentation. Models downloaded from HuggingFace.
-**Tech Stack:** Python, Apple MLX, mlx-lm, Gradio, pandas, numpy
-**Spec:** `docs/superpowers/specs/2026-03-23-mlx-spam-classifier-design.md`
-**Agent Team:**
-- **Implementer agents** — one per task, writes code
-- **QA agent** — dispatched after each task to verify: plan compliance, code quality, wiring correctness, beginner-level code style matching ENGT 375 lectures
----
-### Task 1: Project Scaffolding + Documentation Folder
-**Files:**
-- Create: `spam-classifier-mlx/requirements.txt`
-- Create: `spam-classifier-mlx/.gitignore`
-- Create: `spam-classifier-mlx/CLAUDE.md`
-- Create: `spam-classifier-mlx/CHANGELOG.md`
-- Create: `spam-classifier-mlx/docs/README.md`
-- Create: `spam-classifier-mlx/docs/01-what-is-mlx.md`
-- Create: `spam-classifier-mlx/docs/02-what-is-lora.md`
-- Create: `spam-classifier-mlx/docs/03-training-guide.md`
-- Create: `spam-classifier-mlx/docs/04-mlx-lm-reference.md`
-- Create: `spam-classifier-mlx/docs/05-deployment-guide.md`
-- [ ] **Step 1: Create project directory, subdirectories, and data symlink**
-```bash
-mkdir -p "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"/{models,adapters,training_data,docs}
-ln -s "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/data" \
-      "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/data"
-```
-Verify: `ls "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/data/spam_Emails_data.csv"`
-- [ ] **Step 2: Create requirements.txt**
-```
-mlx-lm[train]>=0.31.0
-gradio==4.19.2
-numpy>=1.24.0
-pandas>=2.0.0
-```
-Note: `mlx-lm[train]` installs `mlx`, `transformers`, `safetensors`, `sentencepiece`, `tiktoken`, and training deps. No need to list `mlx` separately.
-- [ ] **Step 3: Create .gitignore**
-```
-__pycache__/
-*.pyc
-.pytest_cache/
-venv/
-models/
-adapters/
-fused_model/
-training_data/
-data/
-*.egg-info/
-.DS_Store
-```
-- [ ] **Step 4: Create CLAUDE.md**
-```markdown
-# CLAUDE.md
-## Project Context
-Fine-tuned LLM spam classifier using Apple MLX for ENGT 375 (Applied Machine Learning, Spring 2026, ODU).
-Uses LoRA fine-tuning on Qwen3.5-0.8B-MLX-9bit for spam/ham classification with natural language explanations.
-## Code Style
-- Beginner-level Python: explicit for-loops, clear variable names, comments explaining why
-- No advanced patterns (decorators, metaclasses, complex comprehensions)
-- Reference course concepts in comments where applicable
-## How to Run
-1. Create venv: `python3 -m venv venv && source venv/bin/activate`
-2. Install deps: `pip install -r requirements.txt`
-3. Generate training data: `python3 prepare_data.py` (~20-40 min, needs 9B model)
-4. Fine-tune model: `python3 fine_tune.py` (~10-20 min)
-5. Launch app: `python3 app.py`
-## Key Files
-- `prepare_data.py` — Generates training JSONL using local Qwen3.5-9B model
-- `fine_tune.py` — Wrapper around mlx_lm.lora for LoRA fine-tuning
-- `app.py` — Gradio UI with Classify and Chat tabs
-- `docs/` — Reference documentation (MLX guide, LoRA explanation, training guide, etc.)
-## Data
-- `data/` is a symlink to `../spam-xai-project/data/`
-- `training_data/` contains generated JSONL (created by prepare_data.py)
-- `models/` contains downloaded base model
-- `adapters/` contains LoRA weights (created by fine_tune.py)
-```
-- [ ] **Step 5: Create initial CHANGELOG.md**
-```markdown
-# Changelog
-All notable changes to this project will be documented in this file.
-This serves as a reference for writing the course paper's methodology section.
-## v0.1.0 — 2026-03-23
-### Initial Project Setup
-- Created project scaffold for MLX-based spam classifier
-- Set up documentation folder with MLX, LoRA, training, and deployment guides
-- Symlinked data from spam-xai-project
-```
-- [ ] **Step 6: Create docs/README.md**
-```markdown
-# Documentation
-Reference guides for the MLX spam classifier project.
-| Document | Description |
-|----------|-------------|
-| [01-what-is-mlx.md](01-what-is-mlx.md) | What is Apple MLX and why use it on Apple Silicon |
-| [02-what-is-lora.md](02-what-is-lora.md) | LoRA fine-tuning explained for beginners |
-| [03-training-guide.md](03-training-guide.md) | Step-by-step: preparing data, fine-tuning, evaluating |
-| [04-mlx-lm-reference.md](04-mlx-lm-reference.md) | mlx-lm CLI commands and Python API reference |
-| [05-deployment-guide.md](05-deployment-guide.md) | How to deploy to Hugging Face Spaces |
-```
-- [ ] **Step 7: Create docs/01-what-is-mlx.md**
-Write a beginner-friendly guide (~200 words) covering:
-- MLX is Apple's ML framework built for Apple Silicon (M1/M2/M3/M4 chips)
-- Uses unified memory — GPU and CPU share the same RAM (no copying data between them)
-- Alternative to PyTorch/TensorFlow for Mac users
-- Why it matters: fine-tune LLMs on your laptop instead of needing a cloud GPU
-- Reference: https://github.com/ml-explore/mlx
-- [ ] **Step 8: Create docs/02-what-is-lora.md**
-Write a beginner-friendly guide (~300 words) covering:
-- Normal fine-tuning: update ALL model weights (billions of parameters, huge memory)
-- LoRA: freeze the original weights, add tiny "adapter" matrices alongside them
-- Only train the adapters (~1-5% of total parameters)
-- Result: quality close to full fine-tuning on many tasks, at a fraction of the memory and time
-- QLoRA: base model is quantized (compressed) to save even more memory
-- Analogy: "Instead of rewriting a textbook, you add sticky notes with corrections"
-- Reference: https://arxiv.org/abs/2106.09685
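The "tiny adapter" claim can be checked with arithmetic. For one weight matrix of shape d x k, LoRA with rank r trains two small matrices of shapes d x r and r x k, so r(d + k) parameters instead of d*k. A sketch with illustrative dimensions; the sizes and rank below are assumptions, not the actual Qwen configuration:

```python
def lora_trainable_fraction(d, k, rank):
    """Fraction of one d x k weight matrix's parameter count that LoRA trains."""
    full_params = d * k
    # LoRA adds B (d x rank) and A (rank x k); only these are updated
    adapter_params = rank * (d + k)
    return adapter_params / full_params


# Illustrative numbers only: a 1024 x 1024 projection with rank 8
fraction = lora_trainable_fraction(1024, 1024, 8)
print('Trainable share of this matrix: %.2f%%' % (fraction * 100))
```

The share shrinks as the matrix grows, which is why the savings are largest on big models.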
-- [ ] **Step 9: Create docs/03-training-guide.md**
-Write a step-by-step guide (~400 words) covering:
-1. Prepare your data as JSONL (chat format with system/user/assistant messages)
-2. Download the base model from HuggingFace
-3. Run `mlx_lm.lora --model <path> --train --data <dir>` with key flags
-4. Monitor training loss (should decrease over iterations)
-5. Evaluate with `--test` flag (lower perplexity = better)
-6. Test with `mlx_lm.generate` to see real outputs
-7. Fuse adapter into base model with `mlx_lm.fuse` for deployment
-8. Memory tips: `--grad-checkpoint`, reduce `--batch-size`, reduce `--num-layers`
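For step 5, perplexity is the exponential of the average per-token cross-entropy loss, so the two numbers always move together. A small sketch of the conversion; the loss values are invented for illustration:

```python
import math


def perplexity_from_loss(mean_token_loss):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_token_loss)


# Hypothetical losses before and after fine-tuning
for label, loss in [('before', 2.0), ('after', 1.0)]:
    print('%s: loss %.2f -> perplexity %.2f' % (label, loss, perplexity_from_loss(loss)))
```

A perplexity of 1.0 would mean the model predicts every test token with certainty, so real values sit above that.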
-- [ ] **Step 10: Create docs/04-mlx-lm-reference.md**
-Write a reference card with all mlx-lm commands:
-- `mlx_lm.lora` — all flags: `--model`, `--train`, `--test`, `--data`, `--iters`, `--batch-size`, `--learning-rate`, `--num-layers`, `--adapter-path`, `--mask-prompt`, `--grad-checkpoint`, `--fine-tune-type` (lora/dora/full)
-- `mlx_lm.generate` — `--model`, `--adapter-path`, `--prompt`, `--max-tokens`
-- `mlx_lm.fuse` — `--model`, `--adapter-path`, `--upload-repo`, `--export-gguf`
-- Python API: `from mlx_lm import load, generate`
-- Data format: JSONL chat, completions, text formats with examples
-- Source: https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/LORA.md
-- [ ] **Step 11: Create docs/05-deployment-guide.md**
-Write a guide (~200 words) covering:
-- Problem: HF Spaces runs Linux, not Apple Silicon. MLX won't work there.
-- Solution: Fuse adapter → convert to transformers format → deploy with torch
-- Step 1: `mlx_lm.fuse --model models/Qwen3.5-0.8B-MLX-9bit`
-- Step 2: Upload fused model to HuggingFace Hub
-- Step 3: Create `app_hf.py` using `transformers` instead of `mlx_lm`
-- Step 4: Create HF Space with `requirements.txt` listing `transformers`, `torch`, `gradio`
-- Reference: https://huggingface.co/docs/hub/spaces-sdks-gradio
-- [ ] **Step 12: Initialize git repo and commit**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
-git init
-git add requirements.txt .gitignore CLAUDE.md CHANGELOG.md docs/
-git commit -m "chore: scaffold project with docs, requirements, CLAUDE.md, CHANGELOG"
-```
-- [ ] **Step 13: Create venv and install dependencies**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
-python3 -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
-```
-Verify: `python3 -c "import mlx_lm; print('mlx-lm OK')"`
-- [ ] **Step 14: Download the base model**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
-source venv/bin/activate
-huggingface-cli download inferencerlabs/Qwen3.5-0.8B-MLX-9bit --local-dir models/Qwen3.5-0.8B-MLX-9bit
-```
-Verify: `ls models/Qwen3.5-0.8B-MLX-9bit/config.json` (config.json always exists; safetensors may be sharded into multiple files)
-**Fallback model:** If LoRA training on Qwen3.5-0.8B produces poor results (the model uses non-standard GatedDeltaNet attention layers), try `mlx-community/Qwen2.5-0.5B-Instruct-4bit` as an alternative — standard transformer architecture, explicitly supported by mlx-lm.
-Quick test:
-```bash
-mlx_lm.generate --model models/Qwen3.5-0.8B-MLX-9bit --prompt "Hello, how are you?" --max-tokens 50
-```
-Expected: Model generates a short text response.
----
-### Task 2: Data Preparation (`prepare_data.py`)
-**Files:**
-- Create: `spam-classifier-mlx/prepare_data.py`
-- [ ] **Step 1: Write the COMPLETE prepare_data.py**
-This is the most complex script. The implementing agent MUST include all of the following in the file. Key technical details:
-**CRITICAL — Chat Template:** The `mlx_lm.generate()` Python API does NOT auto-apply the chat template. You MUST use `tokenizer.apply_chat_template()` before calling `generate()`. Qwen3.5 uses ChatML format (`<|im_start|>system\n...<|im_end|>`).
-**CRITICAL — Thinking Mode:** Qwen3.5 outputs `<think>...</think>` tags by default. Pass `enable_thinking=False` in `apply_chat_template` to suppress this. If thinking tokens leak into training data, the fine-tuned model will learn to produce them.
-**CRITICAL — Response Parsing:** Strip any `<think>...</think>` blocks from responses as a safety measure, then extract the first line (SPAM or HAM) for validation.
-The complete file must contain:
-```python
-# Generate training data for the spam classifier
-# ENGT 375 Project - Spring 2026 - ODU
-#
-# This script uses the local Qwen3.5-9B model to generate
-# classification explanations for each email, then saves
-# them as JSONL files for fine-tuning the 0.8B model.
-#
-# Run: python3 prepare_data.py
-# Requires: ~/MLXModels/mlx-community/Qwen3.5-9B-OptiQ-4bit/
-# Time: ~30-60 minutes (700 sampled emails through the 9B model)
-import json
-import re
-import random
-import pandas as pd
-from pathlib import Path
-from mlx_lm import load, generate
-# Paths
-project_dir = Path(__file__).parent
-data_dir = project_dir / 'data'
-output_dir = project_dir / 'training_data'
-output_dir.mkdir(exist_ok=True)
-# The 9B model generates explanations (smarter than the 0.8B we'll fine-tune)
-MODEL_9B_PATH = str(Path.home() / 'MLXModels' / 'mlx-community' / 'Qwen3.5-9B-OptiQ-4bit')
-random.seed(42)
-SYSTEM_PROMPT = "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."
-CLASSIFY_PROMPT = """Classify this email as SPAM or HAM. Give your classification on the first line, then explain your reasoning in 2-3 sentences. Be specific about what words, patterns, or signals you noticed.
-Email:
-{email_text}"""
-# Hardcoded Q&A topics for conversational training data
-QA_PROMPTS = [
-    "Why do spam emails often use urgency language like 'act now' or 'limited time'?",
-    "What is the difference between spam and phishing emails?",
-    "How can you tell if a marketing email from a legitimate company is not spam?",
-    "Why do spam emails use dollar signs and large numbers?",
-    "What makes newsletters sometimes look like spam to filters?",
-    "What are common red flags in email headers that indicate spam?",
-    "Why do spam emails sometimes misspell words intentionally?",
-    "How do spammers try to bypass email filters?",
-    "What should I do if I receive a suspicious email?",
-    "What is a ham email?",
-    # ... (implement at least 50 diverse prompts covering spam patterns,
-    #      email security, classification techniques, etc.)
-]
-# The implementing agent should expand this to 50 prompts.
-def strip_thinking(text):
-    """Remove any <think>...</think> blocks from the model's response."""
-    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
-    return cleaned.strip()
-def generate_with_chat_template(model, tokenizer, system_msg, user_msg, max_tokens=200):
-    """Generate a response using the proper chat template.
-    IMPORTANT: The mlx_lm.generate() Python API does NOT auto-apply the
-    chat template. We must format the prompt ourselves using
-    tokenizer.apply_chat_template(). Without this, the model gets raw
-    text and produces garbage output.
-    """
-    messages = [
-        {"role": "system", "content": system_msg},
-        {"role": "user", "content": user_msg},
-    ]
-    # apply_chat_template converts messages to the ChatML format the model expects
-    # enable_thinking=False suppresses the <think>...</think> chain-of-thought output
-    prompt = tokenizer.apply_chat_template(
-        messages,
-        tokenize=False,
-        add_generation_prompt=True,
-        enable_thinking=False,
-    )
-    response = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
-    # Safety: strip any thinking tags that slipped through
-    response = strip_thinking(response)
-    return response
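-def parse_classification(response_text):
-    """Extract SPAM or HAM from the first line of the model's response."""
-    first_line = response_text.strip().split('\n')[0].upper()
-    # Look for the labels as whole words, so "HAMMER" does not count as HAM
-    labels = set(re.findall(r'\b(SPAM|HAM)\b', first_line))
-    # Accept only an unambiguous first line; "HAM, not spam" is rejected
-    # instead of being read as spam because SPAM was checked first
-    if len(labels) != 1:
-        return None
-    return labels.pop().lower()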
-def format_as_jsonl(system_prompt, user_content, assistant_content):
-    """Format one training example as a JSONL chat message dict."""
-    return json.dumps({
-        "messages": [
-            {"role": "system", "content": system_prompt},
-            {"role": "user", "content": user_content},
-            {"role": "assistant", "content": assistant_content},
-        ]
-    })
-```
-Then the script body must:
-1. Load Kaggle CSV, oversample to 350 spam + 350 ham (700 total, to account for mismatches after validation)
-2. Load the 9B model with `load(MODEL_9B_PATH)`
-3. For each email (printing progress every 10), truncate to 500 chars, call `generate_with_chat_template()`, parse classification, validate against ground truth
-4. Keep matches, discard mismatches, print running success rate
-5. Format matches as JSONL using `format_as_jsonl()`
-6. Generate 50 conversational Q&A pairs using the same 9B model (with a conversational system prompt)
-7. Combine classify + Q&A examples, shuffle
-8. Split: first 500 → `train.jsonl`, remaining → `test.jsonl`
-9. Print 10 random examples for manual inspection
-10. Print final stats (total examples, train/test split, match rate)
-11. Unload model (del model) to free memory
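Steps 4 through 8 above amount to filter, combine, shuffle, and split. A minimal sketch under stated assumptions: `classify_examples` holds (JSONL string, predicted label, true label) tuples, `qa_examples` holds JSONL strings, and the helper name `build_splits` is hypothetical:

```python
import random


def build_splits(classify_examples, qa_examples, train_size=500, seed=42):
    """Keep validated examples, add Q&A pairs, shuffle, and split."""
    kept = []
    for line, predicted, truth in classify_examples:
        # Discard examples where the 9B model disagreed with the dataset label
        if predicted == truth:
            kept.append(line)
    combined = kept + list(qa_examples)
    random.Random(seed).shuffle(combined)
    # First train_size examples train the model, the rest are held out
    return combined[:train_size], combined[train_size:]


# Tiny hypothetical input to show the shapes involved
classify = [('{"id": 1}', 'spam', 'spam'), ('{"id": 2}', 'ham', 'spam'), ('{"id": 3}', 'ham', 'ham')]
train, test = build_splits(classify, ['{"id": "qa"}'], train_size=2)
print(len(train), len(test))
```

If fewer than `train_size` examples survive validation, the test split comes back empty, so the script should print the actual counts rather than assume a 500/100 split.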
-- [ ] **Step 2: Run prepare_data.py**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
-source venv/bin/activate
-python3 prepare_data.py
-```
-Expected output:
-- "Loading 9B model..." (takes ~30s)
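-- Progress: "Processing email 1/700..." through "700/700" (the script samples 350 spam + 350 ham)
-- "Matched: X/700, Mismatched: Y/700"
-- "Generated 50 conversational Q&A pairs"
-- "Saved 500 examples to training_data/train.jsonl"
-- "Saved N examples to training_data/test.jsonl", where N is whatever remains after the first 500 (about 100 if roughly 550 emails match)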
-- 10 sample examples printed for inspection
-Verify:
-```bash
-wc -l training_data/train.jsonl training_data/test.jsonl
-python3 -c "import json; [json.loads(l) for l in open('training_data/train.jsonl')]; print('Valid JSONL')"
-```
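The one-liner above only proves that each line parses as JSON. A slightly stricter check, sketched here, also confirms that every record has the system/user/assistant message order that `format_as_jsonl` writes. The function name `check_chat_jsonl` is hypothetical:

```python
import json


def check_chat_jsonl(lines):
    """Return a list of (line_number, problem) for malformed chat records."""
    problems = []
    expected_roles = ['system', 'user', 'assistant']
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, 'not valid JSON'))
            continue
        messages = record.get('messages') if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append((number, 'missing "messages" list'))
            continue
        roles = [m.get('role') for m in messages if isinstance(m, dict)]
        if roles != expected_roles:
            problems.append((number, 'unexpected roles: %s' % roles))
    return problems


good = json.dumps({'messages': [
    {'role': 'system', 'content': 's'},
    {'role': 'user', 'content': 'u'},
    {'role': 'assistant', 'content': 'a'},
]})
print(check_chat_jsonl([good, '{"messages": []}', 'not json']))
```

To run it on the real file, pass an open file handle, for example `check_chat_jsonl(open('training_data/train.jsonl'))`; an empty list means every record has the expected shape.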
-- [ ] **Step 3: Commit**
-```bash
-git add prepare_data.py
-git commit -m "feat: add prepare_data.py for training data generation with 9B model"
-```
----
-### Task 3: Fine-Tuning Script (`fine_tune.py`)
-**Files:**
-- Create: `spam-classifier-mlx/fine_tune.py`
-- [ ] **Step 1: Write fine_tune.py**
-```python
-# Fine-tune Qwen3.5-0.8B on spam classification using LoRA
-# ENGT 375 Project - Spring 2026 - ODU
-# This is a wrapper around mlx_lm.lora that sets up the right
-# parameters for our spam classification task.
-#
-# Run: python3 fine_tune.py
-# Requires: models/Qwen3.5-0.8B-MLX-9bit/ and training_data/train.jsonl
-# Time: ~10-20 minutes on M4 Pro
-import subprocess
-import sys
-from pathlib import Path
-project_dir = Path(__file__).parent
-model_path = project_dir / 'models' / 'Qwen3.5-0.8B-MLX-9bit'
-data_path = project_dir / 'training_data'
-adapter_path = project_dir / 'adapters'
-def check_prerequisites():
-    """Make sure the model and training data exist before we start."""
-    if not model_path.exists():
-        print('ERROR: Base model not found at %s' % model_path)
-        print('Download it first:')
-        print('  huggingface-cli download inferencerlabs/Qwen3.5-0.8B-MLX-9bit --local-dir models/Qwen3.5-0.8B-MLX-9bit')
-        sys.exit(1)
-    train_file = data_path / 'train.jsonl'
-    if not train_file.exists():
-        print('ERROR: Training data not found at %s' % train_file)
-        print('Generate it first: python3 prepare_data.py')
-        sys.exit(1)
-    print('Model found: %s' % model_path)
-    print('Training data found: %s' % train_file)
-def run_training():
-    """Run LoRA fine-tuning using mlx_lm.lora CLI."""
-    print('\nStarting LoRA fine-tuning...')
-    print('This will take about 10-20 minutes on M4 Pro.')
-    print('The model has 24 transformer layers — we are adding small')
-    print('LoRA adapter matrices to each layer and only training those.\n')
-    # Build the command
-    # mlx_lm.lora is the CLI tool from the mlx-lm package
-    cmd = [
-        sys.executable, '-m', 'mlx_lm.lora',
-        '--model', str(model_path),
-        '--train',
-        '--data', str(data_path),
-        '--iters', '600',
-        '--batch-size', '2',
-        '--learning-rate', '1e-5',
-        '--num-layers', '24',
-        '--adapter-path', str(adapter_path),
-        '--mask-prompt',
-        '--grad-checkpoint',
-    ]
-    print('Running: %s\n' % ' '.join(cmd))
-    # Run the training and show output in real time
-    result = subprocess.run(cmd)
-    if result.returncode != 0:
-        print('\nERROR: Training failed with exit code %d' % result.returncode)
-        sys.exit(1)
-    print('\nTraining complete!')
-    print('Adapter weights saved to: %s' % adapter_path)
-def run_evaluation():
-    """Evaluate the fine-tuned model on the test set."""
-    test_file = data_path / 'test.jsonl'
-    if not test_file.exists():
-        print('No test.jsonl found — skipping evaluation.')
-        return
-    print('\nEvaluating on test set...')
-    cmd = [
-        sys.executable, '-m', 'mlx_lm.lora',
-        '--model', str(model_path),
-        '--adapter-path', str(adapter_path),
-        '--data', str(data_path),
-        '--test',
-    ]
-    subprocess.run(cmd)
-def test_generation():
-    """Quick test: classify a sample email with the fine-tuned model."""
-    print('\n--- Quick Test ---')
-    test_prompt = 'Classify this email:\n\nSubject: You Won $5M!!!\nDear Friend, CONGRATULATIONS!!! Click here to claim your prize!'
-    system_prompt = "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."
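-    cmd = [
-        sys.executable, '-m', 'mlx_lm.generate',
-        '--model', str(model_path),
-        '--adapter-path', str(adapter_path),
-        # Pass the same system prompt used in training (it was defined but unused before)
-        '--system-prompt', system_prompt,
-        '--prompt', test_prompt,
-        '--max-tokens', '200',
-    ]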
-    # Note: the CLI auto-applies the chat template, but we should verify
-    # the output looks like a proper classification response
-    subprocess.run(cmd)
-    print('\n--- End Test ---')
-if __name__ == '__main__':
-    check_prerequisites()
-    run_training()
-    run_evaluation()
-    test_generation()
-    print('\nAll done! You can now run: python3 app.py')
-```
-- [ ] **Step 2: Run fine-tuning** (this takes ~10-20 minutes)
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
-source venv/bin/activate
-python3 fine_tune.py
-```
-Expected:
-- Prerequisites check passes
-- Training output shows iteration number and loss (loss should decrease)
-- Evaluation prints perplexity on test set
-- Quick test generates a spam classification response
-- "All done!" at the end
-Verify:
-```bash
-ls adapters/
-```
-Expected: adapter config and weight files present
-- [ ] **Step 3: Commit**
-```bash
-git add fine_tune.py
-git commit -m "feat: add fine_tune.py LoRA training wrapper for Qwen3.5-0.8B"
-```
----
-### Task 4: Gradio App (`app.py`)
-**Files:**
-- Create: `spam-classifier-mlx/app.py`
-- [ ] **Step 1: Write app.py — model loading and classify function**
-```python
-# Spam Email Classifier — Fine-Tuned LLM with Gradio UI
-# ENGT 375 Project - Spring 2026 - ODU
-# Uses Qwen3.5-0.8B fine-tuned with LoRA on spam/ham data
-#
-# Run: python3 app.py
-# Requires: models/Qwen3.5-0.8B-MLX-9bit/ and adapters/
-import gradio as gr
-from pathlib import Path
-from mlx_lm import load, generate
-# Paths
-project_dir = Path(__file__).parent
-model_path = str(project_dir / 'models' / 'Qwen3.5-0.8B-MLX-9bit')
-adapter_path = str(project_dir / 'adapters')
-# System prompt tells the model what role to play
-SYSTEM_PROMPT = "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."
-CHAT_SYSTEM_PROMPT = "You are a spam email analysis expert. You can classify emails as spam or ham, explain your reasoning, and answer questions about email security and spam patterns."
-# Load model at startup (only happens once)
-print('Loading fine-tuned model...')
-try:
-    model, tokenizer = load(model_path, adapter_path=adapter_path)
-    print('Model loaded successfully!')
-    MODEL_LOADED = True
-except Exception as e:
-    print('Could not load model: %s' % str(e))
-    print('Run python3 fine_tune.py first.')
-    model, tokenizer = None, None
-    MODEL_LOADED = False
-```
-Then add these functions (CRITICAL — must use chat template, not raw prompts):
-- `build_classify_prompt(email_text)`:
-  ```python
-  messages = [
-      {"role": "system", "content": SYSTEM_PROMPT},
-      {"role": "user", "content": "Classify this email:\n\n" + email_text[:500]},
-  ]
-  return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
-  ```
-- `classify_email(email_text, file_obj)` — handles file upload, calls `build_classify_prompt`, then `generate(model, tokenizer, prompt, max_tokens=300)`, strips `<think>` tags, returns markdown result
-- `chat_respond(message, history)`:
-  ```python
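-  # Gradio 4.19.2 ChatInterface passes history as a list of [user_message, assistant_message] pairs
-  messages = [{"role": "system", "content": CHAT_SYSTEM_PROMPT}]
-  for user_msg, assistant_msg in history:
-      messages.append({"role": "user", "content": user_msg})
-      if assistant_msg is not None:
-          messages.append({"role": "assistant", "content": assistant_msg})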
-  messages.append({"role": "user", "content": message})
-  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
-  response = generate(model, tokenizer, prompt, max_tokens=500)
-  # Strip any thinking tags
-  import re
-  response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
-  return response
-  ```
-- Example emails (same 4 as sklearn project)
-- Gradio Blocks layout with two tabs:
-  - Tab 1 "Classify": Textbox + File + Examples → Markdown output
-  - Tab 2 "Chat": gr.ChatInterface with chat_respond function
-- `demo.launch()` at bottom
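Both `classify_email` and `chat_respond` need to strip `<think>` tags from the model output, so the regex shown in `chat_respond` could live in one shared helper. A minimal sketch, assuming a helper named `strip_think_tags` (a hypothetical name, not part of the plan):

```python
import re


def strip_think_tags(response_text):
    # Remove any <think>...</think> block, even if it spans several lines
    cleaned_text = re.sub(r'<think>.*?</think>', '', response_text, flags=re.DOTALL)
    # Trim leftover whitespace around the answer
    return cleaned_text.strip()
```

With this in place, each function would call `strip_think_tags(response)` right after `generate(...)` instead of repeating the regex.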
-- [ ] **Step 2: Test the app launches**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
-source venv/bin/activate
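-# Note: `timeout` is a GNU coreutils tool and may be missing on stock macOS (`brew install coreutils` provides it, possibly as `gtimeout`)
-timeout 15 python3 app.py 2>&1 | grep -E "Running|loaded|Error" || echo "Check output"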
-```
-Expected: "Model loaded successfully!" and "Running on local URL: http://127.0.0.1:7860"
-- [ ] **Step 3: Commit**
-```bash
-git add app.py
-git commit -m "feat: add Gradio app with Classify and Chat tabs"
-```
----
-### Task 5: Launch Scripts
-**Files:**
-- Create: `spam-classifier-mlx/launch.command`
-- Create: `spam-classifier-mlx/launch-notebook.command`
-- [ ] **Step 1: Create launch.command**
-```bash
-#!/bin/bash
-# Double-click this file in Finder to launch the Spam Classifier UI
-cd "$(dirname "$0")"
-source venv/bin/activate
-echo "Starting MLX Spam Classifier..."
-echo "Opening http://127.0.0.1:7860 in your browser..."
-sleep 2 && open http://127.0.0.1:7860 &
-python3 app.py
-```
-```bash
-chmod +x "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/launch.command"
-```
-- [ ] **Step 2: Create launch-notebook.command**
-```bash
-#!/bin/bash
-# Double-click this file in Finder to open the project notebook
-cd "$(dirname "$0")"
-source venv/bin/activate
-pip install jupyter -q 2>/dev/null
-echo "Opening notebook..."
-jupyter notebook spam_classifier_mlx.ipynb
-```
-```bash
-chmod +x "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/launch-notebook.command"
-```
-- [ ] **Step 3: Commit**
-```bash
-git add launch.command launch-notebook.command
-git commit -m "feat: add launch scripts for Gradio app and Jupyter notebook"
-```
----
-### Task 6: Project Notebook (`spam_classifier_mlx.ipynb`)
-**Files:**
-- Create: `spam-classifier-mlx/spam_classifier_mlx.ipynb`
-- [ ] **Step 1: Create the notebook**
-Create a Jupyter notebook with these sections (each as markdown + code cells):
-1. **Title + Introduction** — "Fine-Tuning a Small LLM for Spam Classification with Apple MLX"
-   - What is this project, why LLM approach vs traditional ML
-   - Course context (ENGT 375)
-2. **What is MLX?** — Markdown explaining Apple MLX (reference docs/01-what-is-mlx.md)
-3. **What is LoRA?** — Markdown explaining LoRA (reference docs/02-what-is-lora.md)
-   - Diagram concept: original weights frozen, small adapter matrices added
-4. **Environment Setup** — Code cell: check mlx-lm version, check model exists
-5. **Data Loading** — Code cells:
-   - Load Kaggle CSV, show shape and class distribution
-   - Sample 10 spam + 10 ham, display them
-   - Explain the strategy: 600 emails → generate explanations → JSONL
-6. **Inspecting Training Data** — Code cells:
-   - Load train.jsonl, show 5 examples
-   - Count label distribution in training data
-   - Show average response length
-7. **Fine-Tuning** — Code cells:
-   - Show the mlx_lm.lora command that was run
-   - Display training config (iters, batch_size, num_layers, etc.)
-   - If adapters exist, show adapter file sizes
-   - Explain what happened during training
-8. **Evaluation** — Code cells:
-   - Load model + adapter
-   - Test on 5 example emails (3 from training set, 2 new)
-   - Show the model's responses
-   - Compare to ground truth labels
-9. **Comparison with sklearn** — Markdown + code:
-   - Table: sklearn VotingClassifier (97.4% accuracy) vs fine-tuned LLM
-   - Test the Lenovo email — does the LLM handle it better?
-   - Discussion: when does each approach win?
-10. **Results and Conclusions** — Markdown:
-    - Summary of findings
-    - Limitations (0.8B model capacity, training data quality)
-    - Future work (bigger model, more training data, RLHF)
-Code style: Beginner-friendly, `%%time` on slow cells, `print()` for results.
-- [ ] **Step 2: Commit**
-```bash
-git add spam_classifier_mlx.ipynb
-git commit -m "feat: add project notebook for course submission"
-```
----
-### Task 7: CHANGELOG Update + Final Verification
-**Files:**
-- Modify: `spam-classifier-mlx/CHANGELOG.md`
-- [ ] **Step 1: Update CHANGELOG.md with full v0.1.0 entry**
-```markdown
-# Changelog
-All notable changes to this project will be documented in this file.
-This serves as a reference for writing the course paper's methodology section.
-## v0.1.0 — 2026-03-23
-### Initial Build
-- Created project with comprehensive documentation (docs/ folder):
-  - What is MLX guide
-  - What is LoRA guide
-  - Step-by-step training guide
-  - mlx-lm CLI reference
-  - Hugging Face deployment guide
-- Generated training data: 500 train + 100 test examples
-  - Used local Qwen3.5-9B-OptiQ-4bit to generate classification explanations
-  - 450 email classify examples + 50 conversational Q&A pairs
-  - Validated against ground truth labels
-- Fine-tuned Qwen3.5-0.8B-MLX-9bit with LoRA:
-  - 600 iterations, batch size 2, learning rate 1e-5
-  - All 24 layers with LoRA adapters
-  - QLoRA (automatic — base model is 9-bit quantized)
-  - ~10-20 minutes on M4 Pro
-- Built Gradio interface with:
-  - Classify tab: paste email → get SPAM/HAM + explanation
-  - Chat tab: conversational Q&A about spam patterns
-  - 4 built-in example emails
-- Project notebook for course submission
-- macOS native — runs on Apple Silicon via MLX
-```
-- [ ] **Step 2: Verify end-to-end**
-```bash
-cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
-source venv/bin/activate
-# Check all files exist
-ls prepare_data.py fine_tune.py app.py requirements.txt CLAUDE.md CHANGELOG.md
-ls docs/*.md
-ls training_data/train.jsonl training_data/test.jsonl
-ls adapters/
-ls models/Qwen3.5-0.8B-MLX-9bit/
-# Quick model test
-python3 -c "
-from mlx_lm import load, generate
-model, tok = load('models/Qwen3.5-0.8B-MLX-9bit', adapter_path='adapters')
-result = generate(model, tok, prompt='Is this spam? Hello, meeting at 3pm.', max_tokens=100)
-print(result)
-"
-```
-- [ ] **Step 3: Commit final state**
-```bash
-git add CHANGELOG.md
-git commit -m "docs: update CHANGELOG with v0.1.0 build details"
-```
----
-### Task Summary
-| Task | Description | Depends On | Estimated Time |
-|------|-------------|------------|----------------|
-| 1 | Scaffolding + docs + venv + model download | — | 15 min |
-| 2 | `prepare_data.py` (generate training JSONL) | Task 1 | 30-45 min (9B model generation) |
-| 3 | `fine_tune.py` (LoRA training wrapper) | Task 2 | 15-25 min (includes training time) |
-| 4 | `app.py` (Gradio UI with classify + chat) | Task 3 | 10 min |
-| 5 | Launch scripts (.command files) | Task 4 | 2 min |
-| 6 | Project notebook | Task 3 | 15 min |
-| 7 | CHANGELOG + final verification | Tasks 5 and 6 | 5 min |
-Tasks 1-5 are strictly sequential. Task 6 depends on Task 3 (needs adapters to exist) but can run in parallel with Tasks 4-5.
-### QA Agent Checklist (run after each task)
-The QA agent verifies after each task:
-1. **Plan compliance:** Does the implementation match what the plan specified?
-2. **Code quality:** No syntax errors, no import errors, files run without crashing
-3. **Wiring:** Do the pieces connect? (prepare_data output → fine_tune input → app.py loads result)
-4. **Beginner-level code:** Explicit loops, clear variable names, comments explaining why, no advanced patterns
-5. **Documentation:** CHANGELOG updated, docs accurate, CLAUDE.md correct

docs/superpowers/plans/2026-04-14-spam-xai-v2-simplify.md DELETED Viewed

@@ -1,383 +0,0 @@
-# Spam XAI Project v2 — Simplification Plan
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-**Goal:** Create a `spam-xai-project-v2/` folder that is a simplified, beginner-friendly copy of `spam-xai-project/`, reducing `app.py` from 674 lines to roughly 400–450 lines without removing any features, and update the CHANGELOG to document v2.0.
-**Architecture:** Copy the entire project folder to `spam-xai-project-v2/`, then simplify `app.py` in place — breaking up repeated patterns, flattening nested functions, and merging near-duplicate functions. All existing features (LIME, SHAP, ELI5, Gradio UI, feedback logging) are kept. No model files are changed. Lecture patterns from Module_5A and Module_7B are referenced for plot style only.
-**Tech Stack:** Python 3.11, scikit-learn, gradio, lime, shap, eli5, joblib, matplotlib
-**Beginner-code rules (must follow throughout):**
-- Use plain `for` loops; no list comprehensions and no lambdas
-- No decorators
-- No `functools`, no `lru_cache`, no `partial`
-- Variable names spell out what they hold (e.g., `spam_words` not `sw`)
-- Every non-obvious line gets a short plain-English comment
----
-## File Structure
-| File | Action | Notes |
-|------|--------|-------|
-| `spam-xai-project-v2/app.py` | Simplify (main work) | Target ~420 lines |
-| `spam-xai-project-v2/CHANGELOG.md` | Update | Add v2.0 entry |
-| All other files in `spam-xai-project-v2/` | Copy unchanged | models/, data/, utils.py, retrain.py, etc. |
----
-## Task 1: Copy the project folder
-**Files:**
-- Create: `spam-xai-project-v2/` (full copy of `spam-xai-project/`)
-- [ ] **Step 1: Copy the folder**
-Run from the `LLM Project/` directory:
-```bash
-cp -r "spam-xai-project" "spam-xai-project-v2"
-```
-- [ ] **Step 2: Verify the copy**
-Run:
-```bash
-ls "spam-xai-project-v2/"
-```
-Expected: Same files as `spam-xai-project/` — `app.py`, `utils.py`, `retrain.py`, `CHANGELOG.md`, `models/`, `data/`, etc.
-- [ ] **Step 3: Confirm app.py line count**
-Run:
-```bash
-wc -l "spam-xai-project-v2/app.py"
-```
-Expected: 674 lines (unchanged copy)
-- [ ] **Step 4: Commit the copy as a baseline**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/"
-git commit -m "feat: add spam-xai-project-v2 baseline copy (pre-simplification)"
-```
----
-## Task 2: Merge the two duplicate feedback handler functions
-**Files:**
-- Modify: `spam-xai-project-v2/app.py` lines 395–425
-**What exists now:** Two functions `handle_correct()` and `handle_wrong()` that share ~90% of their code. Both unpack the same hidden state string, both call `log_feedback()`, and both call `count_corrections()`. The only difference is that `handle_wrong()` also takes a `user_label` parameter.
-**What to do:** Replace both with a single `handle_feedback()` function that takes an extra `is_correct` flag and optional `user_label`. Then wire Gradio buttons to call this one function.
-- [ ] **Step 1: Read the current feedback handlers**
-Read `spam-xai-project-v2/app.py` lines 390–435 to see the exact current code before changing anything.
-- [ ] **Step 2: Replace both handlers with one function**
-Find both `handle_correct` and `handle_wrong` function definitions and replace them with a single `handle_feedback` function that does:
-1. Parse the hidden state string (same as before)
-2. If `is_correct` is True, log positive feedback
-3. If `is_correct` is False, log a correction using `user_label`
-4. Return the correction count string
-- [ ] **Step 3: Update the Gradio button wiring**
-In the Gradio UI section (around line 430+), find where `.click()` is called on the correct/wrong buttons and update both to call `handle_feedback` with the right arguments (`is_correct=True` or `is_correct=False`).
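One way to read Steps 2 and 3 together: Gradio calls a `.click()` handler with the values of its input components as positional arguments, so the `is_correct` flag cannot be passed from the button itself. Under the plan's no-lambda, no-`partial` rules, two thin wrapper functions can supply it. The sketch below is illustrative only. The hidden-state layout (`"<predicted_label>|<email_text>"`), the signatures of `log_feedback` and `count_corrections`, and the wrapper names are all assumptions, since the real `app.py` code is not shown in this plan.

```python
# Hypothetical stand-ins for the project's real helpers (actual signatures unknown)
feedback_log = []


def log_feedback(email_text, predicted_label, correct_label):
    entry = {
        "email_text": email_text,
        "predicted_label": predicted_label,
        "correct_label": correct_label,
    }
    feedback_log.append(entry)


def count_corrections():
    # A correction is any entry where the user disagreed with the model
    count = 0
    for entry in feedback_log:
        if entry["predicted_label"] != entry["correct_label"]:
            count = count + 1
    return count


def handle_feedback(hidden_state, is_correct, user_label=None):
    # Assumed hidden-state layout: "<predicted_label>|<email_text>"
    parts = hidden_state.split("|", 1)
    predicted_label = parts[0]
    email_text = parts[1]
    if is_correct:
        correct_label = predicted_label
    else:
        correct_label = user_label
    log_feedback(email_text, predicted_label, correct_label)
    return "Corrections logged so far: " + str(count_corrections())


# Thin wrappers: each button's .click() would point at one of these
def on_correct_click(hidden_state):
    return handle_feedback(hidden_state, True)


def on_wrong_click(hidden_state, user_label):
    return handle_feedback(hidden_state, False, user_label)
```

The wrappers add a few lines back, but all of the shared parsing and logging stays in one place.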
-- [ ] **Step 4: Run the app to verify feedback still works**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING/LLM Project/spam-xai-project-v2"
-python3 app.py
-```
-Expected: App launches without errors. Test the "That's correct" and "That's wrong" buttons with a sample email.
-- [ ] **Step 5: Commit**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/app.py"
-git commit -m "refactor(v2): merge duplicate feedback handlers into one handle_feedback function"
-```
----
-## Task 3: Flatten the nested SHAP function
-**Files:**
-- Modify: `spam-xai-project-v2/app.py` lines 119–165
-**What exists now:** `generate_shap_explanation()` contains an inner function `predict_with_meta_only()` defined inside it (lines 130–135). This pattern (function inside a function) is confusing for beginners.
-**What to do:** Move `predict_with_meta_only()` out to be a regular top-level function, defined before `generate_shap_explanation()`. No logic changes — just move it up.
-- [ ] **Step 1: Read lines 119–170 of app.py**
-Read the exact current code for `generate_shap_explanation` and its nested function.
-- [ ] **Step 2: Cut the inner function out and paste it as a top-level function**
-Place `predict_with_meta_only()` as a standalone function right above `generate_shap_explanation()`. Add a short comment explaining what it does in plain English.
-- [ ] **Step 3: Remove the inner definition from inside generate_shap_explanation**
-The body of `generate_shap_explanation` should now just call `predict_with_meta_only` as a normal function (it was already calling it this way — it just won't be defined inside anymore).
-- [ ] **Step 4: Run the app to verify SHAP still works**
-```bash
-python3 app.py
-```
-Expected: App launches. Classify a sample email and confirm the SHAP tab shows a chart.
-- [ ] **Step 5: Commit**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/app.py"
-git commit -m "refactor(v2): move nested SHAP predict function to top-level"
-```
----
-## Task 4: Simplify generate_comparison() repeated extraction logic
-**Files:**
-- Modify: `spam-xai-project-v2/app.py` lines 246–288
-**What exists now:** `generate_comparison()` extracts the top-3 features from LIME, SHAP, and ELI5 using three near-identical blocks. Each block does: get the values, sort them, take the top 3. This is the same operation written out three times.
-**What to do:** Write a plain helper function `get_top_features(explanation, method_name)` above `generate_comparison()` that handles one explanation object and returns a list of the top-3 feature names. Then call it three times in a simple loop inside `generate_comparison()`.
-Keep the output (the markdown comparison table) exactly the same — only the internal logic changes.
-- [ ] **Step 1: Read lines 246–295 of app.py**
-Read the exact current code.
-- [ ] **Step 2: Write the helper function**
-Add a plain `get_top_features(explanation, method_name)` function above `generate_comparison()`. It takes one explanation object and a string name, and returns a list of 3 feature name strings. Write it with a plain `for` loop, no comprehensions.
-- [ ] **Step 3: Rewrite generate_comparison() to use the helper**
-The function body should:
-1. Call `get_top_features` once for LIME, once for SHAP, once for ELI5
-2. Store the results in three plain lists
-3. Find the overlap (features that appear in all three) with a plain `for` loop
-4. Build and return the same markdown table as before
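The helper and the overlap loop described above could look roughly like this. It is a sketch under one stated assumption: each explanation has already been converted to a list of `(feature_name, weight)` pairs. The real LIME, SHAP, and ELI5 objects in `app.py` may need a conversion step first. `get_absolute_weight` and `find_shared_features` are hypothetical names.

```python
def get_absolute_weight(pair):
    # pair is (feature_name, weight); rank by size of the weight, ignoring sign
    return abs(pair[1])


def get_top_features(explanation, method_name):
    # Assumption: explanation is a list of (feature_name, weight) pairs
    if len(explanation) == 0:
        print("No features available for " + method_name)
    ranked_pairs = sorted(explanation, key=get_absolute_weight, reverse=True)
    top_features = []
    for pair in ranked_pairs:
        if len(top_features) < 3:
            top_features.append(pair[0])
    return top_features


def find_shared_features(lime_features, shap_features, eli5_features):
    # Keep only the features that all three methods agree on
    shared_features = []
    for feature_name in lime_features:
        if feature_name in shap_features and feature_name in eli5_features:
            shared_features.append(feature_name)
    return shared_features
```

A named key function is used with `sorted` instead of a lambda so the sketch stays within the beginner-code rules.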
-- [ ] **Step 4: Run the app and verify comparison tab is unchanged**
-```bash
-python3 app.py
-```
-Expected: Classify an email. The "Compare" tab output looks identical to the original.
-- [ ] **Step 5: Commit**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/app.py"
-git commit -m "refactor(v2): extract top-feature helper to replace 3x repeated extraction in generate_comparison"
-```
----
-## Task 5: Simplify generate_plain_summary() badge logic
-**Files:**
-- Modify: `spam-xai-project-v2/app.py` lines 196–245
-**What exists now:** `generate_plain_summary()` is 50 lines. It has nested ternary operators to pick badge text, color, and icon based on label and confidence — all mixed into one dense block. This is hard to read at a beginner level.
-**What to do:** Pull the badge/icon selection out into a small helper function `get_result_badge(label, confidence)` that uses plain `if/elif/else` statements to return a dictionary with keys `color`, `icon`, and `text`. Then `generate_plain_summary()` just calls it and uses the returned values.
-- [ ] **Step 1: Read lines 196–250 of app.py**
-Read the exact current code.
-- [ ] **Step 2: Write get_result_badge()**
-Add a new function `get_result_badge(label, confidence)` above `generate_plain_summary()` using only `if/elif/else` — no ternaries. It returns a plain dictionary like:
-```
-{"color": "red", "icon": "🚨", "text": "SPAM"}
-```
-- [ ] **Step 3: Simplify generate_plain_summary()**
-Replace the nested ternary block with a single call to `get_result_badge()`. The rest of the markdown assembly stays the same. The function output (the markdown string returned) must be identical to the original.
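A possible shape for `get_result_badge()`, using only `if/elif/else`. The 0.8 confidence cutoff and every badge other than the red SPAM one are placeholders invented for this sketch. The real values must be copied from the existing ternary block so the output stays identical.

```python
def get_result_badge(label, confidence):
    # Assumption: label is "spam" or "ham" and confidence is between 0 and 1
    if label == "spam" and confidence >= 0.8:
        badge = {"color": "red", "icon": "🚨", "text": "SPAM"}
    elif label == "spam":
        badge = {"color": "orange", "icon": "⚠️", "text": "LIKELY SPAM"}
    elif confidence >= 0.8:
        badge = {"color": "green", "icon": "✅", "text": "HAM"}
    else:
        badge = {"color": "gray", "icon": "❔", "text": "LIKELY HAM"}
    return badge
```

`generate_plain_summary()` would then read `badge["color"]`, `badge["icon"]`, and `badge["text"]` where the ternaries used to be.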
-- [ ] **Step 4: Run the app and verify the Result tab looks identical**
-```bash
-python3 app.py
-```
-Expected: Classify an email. The summary card/badge looks exactly the same as in the original project.
-- [ ] **Step 5: Commit**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/app.py"
-git commit -m "refactor(v2): extract badge/icon logic into get_result_badge helper"
-```
----
-## Task 6: Add comments to classify_and_explain() orchestrator
-**Files:**
-- Modify: `spam-xai-project-v2/app.py` lines 339–394
-**What exists now:** `classify_and_explain()` is a 56-line function that calls 6 different functions in sequence. There are no section comments explaining the flow. For a beginner reading this for the first time, it is not obvious why 7 values are returned or what the hidden state string is for.
-**What to do:** Add short plain-English section comments (not docstrings, not multi-line blocks — just `# one-line comments`) at the start of each logical step: input handling, classification, each explainer call, hidden state packing, and return. Do not change any logic.
-- [ ] **Step 1: Read lines 339–400 of app.py**
-Read the exact current function.
-- [ ] **Step 2: Add section comments**
-Insert comments before each logical group:
-- Before the file vs text input check: `# Figure out if the user pasted text or uploaded a file`
-- Before classify_email(): `# Run the email through the model to get spam/ham prediction`
-- Before each explainer call: `# Generate [LIME/SHAP/ELI5] explanation`
-- Before the hidden state pack: `# Pack the email and prediction into one string so the feedback buttons can use it later`
-- Before return: `# Send all results back to the Gradio interface`
-- [ ] **Step 3: Verify app still runs**
-```bash
-python3 app.py
-```
-Expected: No errors. Classify a test email to confirm all tabs still populate.
-- [ ] **Step 4: Commit**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/app.py"
-git commit -m "docs(v2): add plain-English section comments to classify_and_explain orchestrator"
-```
----
-## Task 7: Final line count check and app.py header comment
-**Files:**
-- Modify: `spam-xai-project-v2/app.py` top of file
-- [ ] **Step 1: Check the final line count**
-```bash
-wc -l "spam-xai-project-v2/app.py"
-```
-Expected: ~420–450 lines (down from 674). If still above 460, re-read and identify any remaining duplicate blocks before continuing.
-- [ ] **Step 2: Add a short header comment to app.py**
-At the very top of `spam-xai-project-v2/app.py`, before the imports, add a 4-line block comment explaining what the file does, for a beginner reader:
-```
-# app.py — Spam Email Classifier with Explanations
-# This file runs the Gradio web app.
-# It loads a trained model, classifies an email as spam or not spam,
-# and shows three different explanations of why it made that choice.
-```
-- [ ] **Step 3: Run the app one final time end-to-end**
-```bash
-python3 app.py
-```
-Expected: Launches cleanly. Test all 7 tabs: Result, LIME, SHAP, ELI5, Compare, Summary, How It Works. Confirm feedback buttons work.
-- [ ] **Step 4: Commit**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/app.py"
-git commit -m "docs(v2): add top-of-file header comment to app.py"
-```
----
-## Task 8: Update CHANGELOG.md with v2.0 entry
-**Files:**
-- Modify: `spam-xai-project-v2/CHANGELOG.md`
-**What to do:** Add a new version entry at the top of the file for v2.0. It should document each simplification made, which lines changed, and what the beginner-friendliness goal was. Use the same markdown format as the existing v1.5 entry.
-- [ ] **Step 1: Read the top of CHANGELOG.md**
-Read the first 60 lines of `spam-xai-project-v2/CHANGELOG.md` to see the exact format of the existing version entries (headings, bullet style, date format).
-- [ ] **Step 2: Write the v2.0 CHANGELOG entry**
-Add this entry at the very top of the file, above any existing entries, following the format exactly as found in Step 1:
-```markdown
-## [v2.0] — 2026-04-14
-### Summary
-Simplified `app.py` from 674 lines to ~420 lines for a beginner audience (ENGT 375, Spring 2026).
-No features were removed. All tabs, explanations, and feedback logging work identically.
-This version lives in `spam-xai-project-v2/`.
-### Changes
-- **Merged duplicate feedback handlers** — `handle_correct()` and `handle_wrong()` (which shared 90% of their code) were combined into one `handle_feedback()` function with an `is_correct` flag. Saves ~20 lines and removes confusing duplication.
-- **Flattened nested SHAP function** — `predict_with_meta_only()` was defined inside `generate_shap_explanation()`. Moved it to the top level so it reads like a normal function. No logic change.
-- **Simplified comparison feature extraction** — `generate_comparison()` had three near-identical code blocks to get the top-3 features from LIME, SHAP, and ELI5 separately. Replaced with a single `get_top_features()` helper called three times.
-- **Simplified badge logic** — `generate_plain_summary()` used nested ternary operators to pick the spam/ham badge color, icon, and text. Replaced with a plain `get_result_badge()` function using `if/elif/else` statements.
-- **Added section comments to orchestrator** — `classify_and_explain()` (the main function that runs when you click Classify) had no comments explaining its steps. Added short plain-English comments so a student can follow the flow.
-- **Added file header comment** — Four-line comment at the top of `app.py` explaining what the file does in plain English.
-### Files Changed
-- `app.py` — simplified (674 → ~420 lines)
-- `CHANGELOG.md` — this entry
-### Files Unchanged
-- `utils.py`, `retrain.py`, `retrain_student.py`, `train_ensemble.py`
-- All model artifacts in `models/`
-- All data in `data/`
-- All notebooks in `notebooks/`
-```
-- [ ] **Step 3: Verify CHANGELOG looks correct**
-Read the top 80 lines of `spam-xai-project-v2/CHANGELOG.md` to confirm the new entry is above the old entries and the formatting matches.
-- [ ] **Step 4: Commit**
-```bash
-cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
-git add "LLM Project/spam-xai-project-v2/CHANGELOG.md"
-git commit -m "docs(v2): add v2.0 CHANGELOG entry documenting all simplifications"
-```
----
-## Self-Review Checklist
-- [x] All 5 simplification areas from the audit are covered (feedback handlers, SHAP nested fn, comparison loop, badge logic, orchestrator comments)
-- [x] No features removed — LIME, SHAP, ELI5, Gradio UI, feedback logging all stay
-- [x] Every new function uses plain if/else, plain for loops — no comprehensions, no decorators
-- [x] CHANGELOG entry is written out fully — no TBDs
-- [x] Each task ends with a working app run before committing
-- [x] v2 lives in a new folder — original `spam-xai-project/` is untouched
-- [x] No model files are touched

docs/superpowers/specs/2026-03-23-gradio-spam-classifier-design.md DELETED Viewed

@@ -1,298 +0,0 @@
-# Design: Spam Email Classifier with Gradio UI
-**Date:** 2026-03-23
-**Project:** ENGT 375 — Applied Machine Learning, Spring 2026, ODU
-**Goal:** Create a fresh, beginner-level spam classifier with a Gradio web interface, replacing the old Streamlit-based project. Runs on macOS. Includes LIME, SHAP, and plain-English explanations.
----
-## 1. Project Structure
-```
-/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/
-├── app.py              # Gradio UI (main entry point)
-├── train.py            # Train & compare models, save best ensemble
-├── utils.py            # Shared preprocessing & feature engineering
-├── requirements.txt    # Dependencies
-├── CHANGELOG.md        # Version history for paper reference
-├── CLAUDE.md           # Claude Code project instructions
-├── models/             # Saved trained models (generated by train.py)
-└── data/               # Symlinked from spam-xai-project/data
-```
-**Location:** New folder `spam-classifier-gradio/` alongside `spam-xai-project/` in the Applied Machine Learning directory.
-**Data strategy:** Symlink `data/` from `spam-xai-project/data` to avoid duplicating ~500MB+ of email corpora.
----
-## 2. Data Pipeline (`train.py`)
-### Data Sources (matching old project's `retrain_student.py`)
-1. **Kaggle spam dataset**: `data/spam_Emails_data.csv` — ~190K emails, stratified-sampled to 100K cap
-2. **GitHub email-dataset**: `data/email-dataset-main/email-dataset-main/dataset/` — subfolder `1/` = ham, `2/` = spam (individual `.txt` files)
-Note: The old project's `retrain_student.py` uses these two sources (not `emails_raw.csv`). We match the same sources for comparable results.
-### Steps
-1. Load Kaggle CSV, normalize columns to `text` and `label` (lowercase: spam/ham)
-2. Stratified-sample Kaggle to 100K cap (same as old project)
-3. Load GitHub email-dataset by reading `.txt` files from `dataset/1/` (ham) and `dataset/2/` (spam)
-4. Combine both datasets
-5. Deduplicate by exact text match (after combining, before splitting)
-6. Preprocess text using `utils.preprocess_text()`
-7. TF-IDF vectorization: `TfidfVectorizer(max_features=3000, ngram_range=(1,3), min_df=2, max_df=0.90, sublinear_tf=True)` — exact same params as old project
-8. Compute 24 metadata features using `utils.compute_metadata_features()`
-9. Scale metadata features with `MinMaxScaler` (so they match TF-IDF range)
-10. Combine TF-IDF + metadata via `scipy.sparse.hstack`
-### Class Balancing Strategy
-- Use `class_weight='balanced'` on all classifiers (same as old project) rather than data-level undersampling
-- This lets sklearn adjust weights inversely proportional to class frequency without discarding training data
-### Train/Test Split
-- 70/30 split (`test_size=0.3`), stratified — same as old project for comparable metrics
-- `random_state=42` for reproducibility
-### Model Comparison
-Train three classifiers individually and print classification reports for each:
-- **Random Forest**: `RandomForestClassifier(n_jobs=-1, class_weight='balanced', random_state=42)`
-- **Logistic Regression**: `LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)`
-- **SVM**: `SVC(kernel='linear', class_weight='balanced', probability=True, random_state=42)`
-Compare using: accuracy, precision, recall, F1-score on the test set.
-### Ensemble
-1. Train all three models individually, collect F1 scores
-2. Construct a **new** `VotingClassifier(voting='soft')` using all three estimators (or top 2 if one significantly underperforms)
-3. Fit the VotingClassifier on the training data (it retrains the sub-estimators internally)
-4. **No separate `CalibratedClassifierCV` wrapper** — soft voting already averages probability outputs, and SVM with `probability=True` already uses Platt scaling internally
-5. Find optimal classification threshold using precision-recall curve on test set
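The spec does not say which criterion picks the "optimal" threshold in step 5. The sketch below assumes the goal is to maximize F1 and scans candidate thresholds in plain Python rather than calling scikit-learn, purely to show the idea. `find_best_threshold` is a hypothetical name.

```python
def find_best_threshold(true_labels, spam_probabilities):
    # true_labels: 1 for spam, 0 for ham; spam_probabilities: model outputs in [0, 1]
    best_threshold = 0.5
    best_f1 = -1.0
    for step in range(1, 100):
        threshold = step / 100
        true_positives = 0
        false_positives = 0
        false_negatives = 0
        for index in range(len(true_labels)):
            predicted_spam = spam_probabilities[index] >= threshold
            actual_spam = true_labels[index] == 1
            if predicted_spam and actual_spam:
                true_positives = true_positives + 1
            elif predicted_spam and not actual_spam:
                false_positives = false_positives + 1
            elif actual_spam:
                false_negatives = false_negatives + 1
        if true_positives == 0:
            f1 = 0.0
        else:
            precision = true_positives / (true_positives + false_positives)
            recall = true_positives / (true_positives + false_negatives)
            f1 = 2 * precision * recall / (precision + recall)
        # Strict > keeps the lowest threshold when several tie
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    return best_threshold, best_f1
```

Choosing the threshold on the same test set that is used for reporting can make the reported metrics look optimistic. A separate validation split is a common alternative.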
-### Saved Artifacts (in `models/`)
-- `tfidf_vectorizer.joblib`
-- `meta_scaler.joblib`
-- `voting_model.joblib`
-- `feature_names.joblib` (list of all feature names: TF-IDF + metadata)
-- `optimal_threshold.joblib`
-- `training_sample.joblib` (200-row sample of training data, needed for LIME explainer)
-- `training_report.json` — schema: `{"random_forest": {"accuracy": float, "f1": float, "precision": float, "recall": float}, "logistic_regression": {...}, "svm": {...}, "voting_ensemble": {...}, "best_single_model": str}`
----
-## 3. Feature Engineering (`utils.py`)
-### Text Preprocessing (`preprocess_text`)
-Copied from `utils_student.py` logic:
-- Strip HTML tags
-- Remove URLs and email addresses
-- Remove non-alphabetic characters
-- Lowercase
-- Remove stopwords (NLTK English)
-- Porter stemming
-### Metadata Features (`compute_metadata_features`)
-24 features, copied from `utils_student.py`:
-1. exclamation_density (per sentence)
-2. dollar_sign_count
-3. caps_word_ratio
-4. spam_phrase_count (from phrase list)
-5. ham_phrase_count (from phrase list)
-6. net_spam_context (spam - ham phrase count)
-7. url_count
-8. html_tag_count
-9. email_length
-10. avg_sentence_length
-11. capitalization_ratio
-12. has_specific_date
-13. has_specific_time
-14. date_reference_count
-15. has_unsubscribe
-16. has_physical_address
-17. has_proper_greeting
-18. has_contact_info
-19. registration_language_score
-20. cta_to_info_ratio
-21. shortener_url_ratio
-22. legitimate_platform_count
-23. gov_edu_url_count
-24. question_mark_count
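To give a feel for the style of these features, here is a sketch of the first three. The exact definitions in `utils_student.py` are not shown in this spec, so the sentence splitting and the all-caps rule below are assumptions, and `compute_example_features` is a hypothetical name.

```python
import re


def compute_example_features(email_text):
    # Split into sentences on ., !, or ? and ignore empty pieces
    sentences = []
    for piece in re.split(r'[.!?]+', email_text):
        if piece.strip() != "":
            sentences.append(piece)
    sentence_count = max(len(sentences), 1)

    exclamation_density = email_text.count("!") / sentence_count
    dollar_sign_count = email_text.count("$")

    # Count words written entirely in capitals (at least two letters)
    words = email_text.split()
    caps_word_count = 0
    for word in words:
        letters_only = re.sub(r'[^A-Za-z]', '', word)
        if len(letters_only) >= 2 and letters_only.isupper():
            caps_word_count = caps_word_count + 1
    if len(words) == 0:
        caps_word_ratio = 0.0
    else:
        caps_word_ratio = caps_word_count / len(words)

    return {
        "exclamation_density": exclamation_density,
        "dollar_sign_count": dollar_sign_count,
        "caps_word_ratio": caps_word_ratio,
    }
```

Returning a dictionary keyed by feature name keeps the values easy to match against `FEATURE_DESCRIPTIONS` later.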
-### Phrase Lists
-Same lists from `utils_student.py`: `spam_context_phrases`, `ham_context_phrases`, `registration_phrases`, `url_shorteners`, `legitimate_platforms`.
-### Human-Readable Feature Descriptions
-A dictionary mapping feature names to plain-English descriptions, used by the summary generator:
-```python
-FEATURE_DESCRIPTIONS = {
-    'exclamation_density': 'Exclamation marks per sentence',
-    'dollar_sign_count': 'Dollar signs found',
-    'caps_word_ratio': 'ALL-CAPS word ratio',
-    ...
-}
-```
-### Code Style
-- Beginner-friendly: explicit `for` loops instead of one-liner comprehensions
-- Comments explaining *why*, referencing course concepts (e.g., "scaling features so kNN/SVM treats them equally — Module 7A")
-- No advanced Python patterns
----
-## 4. Gradio App (`app.py`)
-### Interface Layout
-- **Input area:**
-  - `gr.Textbox` — paste email text directly
-  - `gr.File` — upload `.txt` file (reads content into the text box)
-  - `gr.Examples` — 3-4 pre-loaded example emails
-- **Output tabs** (using `gr.Tab`):
-  - **Result** — spam/ham label, confidence %, plain-English summary
-  - **LIME** — LIME explanation plot (matplotlib figure)
-  - **SHAP** — SHAP feature importance bar chart (matplotlib figure)
-Note: `.eml` file support is out of scope (MIME parsing is complex). Only plain `.txt` files accepted for upload.
-### Example Emails (built-in)
-1. Nigerian Prince spam (obvious spam)
-2. Legitimate newsletter (ham)
-3. Phishing attempt (subtle spam)
-4. Normal personal email (ham)
-### Plain-English Summary
-Uses the LIME explanation output (already computed for the LIME tab) to get the top 5 most influential features and their contributions. Maps feature indices to human-readable descriptions via `FEATURE_DESCRIPTIONS` dict, then formats as bullet points:
-> This email was classified as **SPAM** (92% confidence) because:
-> - High exclamation density (3.2 per sentence)
-> - Contains spam phrases: 'act now', 'you have won'
-> - 4 suspicious URLs detected
-> - High ALL-CAPS word ratio (18%)
-For TF-IDF features (word-level), the summary says "Contains word: '[word]'" with contribution direction.
-### LIME Configuration
-- Use `lime.lime_tabular.LimeTabularExplainer`
-- Training data: loaded from `training_sample.joblib` (200-row dense sample saved during training)
-- Feature names: loaded from `feature_names.joblib`
-- `num_features=10` for explanation plots
-- The explainer works on the combined dense feature matrix (TF-IDF + metadata converted to dense for the single email being explained — this is fine for one sample at a time)
-### SHAP Configuration
-- Use `shap.KernelExplainer` with a small background sample (50 rows from training_sample)
-- This is model-agnostic and works with the VotingClassifier
-- For performance: only compute SHAP on the metadata features (24 features) — not the full 3000+ TF-IDF features. This keeps SHAP fast (<5 seconds) and the bar chart readable.
-- The SHAP tab title should note: "SHAP — Metadata Feature Importance"
-### Error Handling
-- **Models not trained**: Show a clear message "Models not found. Run `python train.py` first." and disable the classify button
-- **Empty input**: Show "Please enter email text or upload a file."
-- **Invalid file**: Show "Could not read file. Please upload a .txt file."
-### Classification Flow
-1. Read email text from input (textbox or uploaded file)
-2. Preprocess with `utils.preprocess_text()`
-3. TF-IDF transform with saved vectorizer
-4. Compute metadata features with `utils.compute_metadata_features()`
-5. Scale metadata with saved scaler
-6. Combine TF-IDF + metadata
-7. Predict with voting model
-8. Apply optimal threshold
-9. Generate LIME explanation (full feature space)
-10. Generate SHAP explanation (metadata features only)
-11. Generate plain-English summary (from LIME output)
-12. Return all outputs to Gradio tabs
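Steps 7 and 8 can be sketched as follows, assuming the voting model yields a spam probability between 0 and 1 and that the tuned threshold was saved during training. The function names and the example threshold values are hypothetical.

```python
def apply_threshold(spam_probability, threshold):
    """Return 'SPAM' when the spam probability reaches the tuned threshold."""
    if spam_probability >= threshold:
        return "SPAM"
    return "HAM"


def confidence_for(label, spam_probability):
    """Confidence is the probability of whichever class was chosen."""
    if label == "SPAM":
        return spam_probability
    return 1.0 - spam_probability
```

With a threshold of 0.4, a probability of 0.45 is labelled SPAM; with the default 0.5 it would be HAM. This is why the threshold is applied explicitly instead of relying on the model's own hard prediction.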
----
-## 5. macOS Compatibility
-### Removed from old project
-- No `.bat` files — launch with `python app.py`
-- No hardcoded `C:\Users\balfo\...` paths
-- No `pytesseract` / OCR dependency
-- No `streamlit_js_eval` / browser localStorage
-- No Streamlit at all
-### Path handling
-- All paths use `pathlib.Path(__file__).parent` (cross-platform)
-- Data accessed via symlink to existing corpus
----
-## 6. Dependencies (`requirements.txt`)
-```
-numpy>=1.24.0
-pandas>=2.0.0
-matplotlib>=3.7.0
-scikit-learn>=1.3.0
-scipy>=1.11.0
-nltk>=3.8.0
-lime>=0.2.0
-shap>=0.44.0
-gradio>=4.0.0
-joblib>=1.3.0
-tqdm>=4.65.0
-```
-**Not included:** streamlit, eli5, wordcloud, seaborn, pytesseract, Pillow, streamlit-js-eval, requests (no Ollama).
----
-## 7. Changelog (`CHANGELOG.md`)
-Maintained in the new project root. Format:
-```markdown
-## vX.Y.Z — YYYY-MM-DD
-### Title
-- What changed and why
-```
-Updated every time a change or improvement is made. Serves as the primary reference for writing the course paper's methodology section.
----
-## 8. Retroactive Changelog for Old Project (separate task)
-Create `CHANGELOG.md` in `spam-xai-project/` by reconstructing the development history from:
-- File modification timestamps
-- Code comments (e.g., "Change 4: Context-aware phrase lists")
-- Progression from `app.py` → `app_student.py`, `retrain.py` → `retrain_student.py`
-- Feature additions visible in the code (11 features → 24 features, LLM integration, OCR support, etc.)
-This is a documentation task, separate from the new Gradio project implementation.
----
-## 9. Accuracy Improvements Over Old Project
-### Data Quality
-- **Deduplication**: Remove exact-duplicate emails that inflate metrics
-- **Same data sources**: Kaggle + GitHub email-dataset (matching old project)
-### Model Comparison
-- Old project: Random Forest only
-- New project: Compare RF, Logistic Regression, SVM — pick best by F1
-### Ensemble
-- Combine models via `VotingClassifier` (soft voting) — expected to beat any single model by roughly 2-5%, a rule-of-thumb estimate to verify against the model comparison in `train.py`
-- Use `class_weight='balanced'` on all sub-estimators (same approach as old project)
-### Same Feature Set
-- Keep the 24 metadata features from `utils_student.py` — they're well-designed
-- Keep 3000 TF-IDF features with identical vectorizer params — proven effective
----
-## 10. Out of Scope
-- No LLM/Ollama integration
-- No OCR/image support
-- No ELI5
-- No `.eml` file parsing (only `.txt` uploads)
-- No deployment (Hugging Face Spaces, Vercel, etc.) — local only
-- No deep learning models
-- No fine-tuning
-- No database or persistent storage

docs/superpowers/specs/2026-03-23-mlx-spam-classifier-design.md DELETED

@@ -1,311 +0,0 @@
-# Design: Spam Classifier with Fine-Tuned LLM (MLX)
-**Date:** 2026-03-23
-**Project:** ENGT 375 — Applied Machine Learning, Spring 2026, ODU
-**Goal:** Fine-tune Qwen3.5-0.8B on spam/ham email classification using Apple MLX with LoRA, then build a Gradio UI with classify and chat modes. Optionally deploy to Hugging Face Spaces.
----
-## 1. Project Structure
-```
-/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/
-├── prepare_data.py         # Generate training data using Qwen3.5-9B
-├── fine_tune.py            # Fine-tune Qwen3.5-0.8B with LoRA
-├── app.py                  # Gradio UI (classify tab + chat tab)
-├── requirements.txt        # Dependencies
-├── .gitignore              # Exclude model weights, cache, etc.
-├── CLAUDE.md               # Project instructions
-├── CHANGELOG.md            # Version history for paper reference
-├── launch.command          # Double-click to run Gradio app
-├── launch-notebook.command # Double-click to open notebook
-├── spam_classifier_mlx.ipynb # Project notebook for submission
-├── data/                   # Symlink → spam-xai-project/data
-├── training_data/          # Generated JSONL files
-│   ├── train.jsonl         # 500 training examples
-│   └── test.jsonl          # 100 test examples
-├── adapters/               # LoRA adapter weights (fine-tuning output)
-├── fused_model/            # (optional) Merged model for deployment
-└── models/                 # Base model (downloaded from HuggingFace)
-    └── Qwen3.5-0.8B-MLX-9bit/
-```
-**Location:** New folder `spam-classifier-mlx/` created alongside `spam-xai-project/` and `spam-classifier-gradio/` in the Applied Machine Learning directory.
-**Data strategy:** Symlink `data/` from `spam-xai-project/data/` (same as the Gradio project).
----
-## 2. Environment Setup
-### Python Environment
-- Create a dedicated venv: `python3 -m venv venv`
-- Activate: `source venv/bin/activate`
-- Python 3.9+ required (system Python 3.9 on this Mac is fine; 3.11+ preferred if available)
-### Install MLX-LM
-```bash
-pip install "mlx-lm[train]"
-```
-This installs: `mlx`, `mlx-lm`, `transformers`, `safetensors`, `sentencepiece`, `tiktoken` (transitive deps).
-### Hardware
-- **Machine:** MacBook Pro M4 Pro, 24GB unified RAM
-- **Memory budget — data generation:** ~6-8GB (9B model at 4-bit). Close other large apps during this phase.
-- **Memory budget — fine-tuning:** ~3-5GB (0.8B model at 9-bit + LoRA + gradient checkpointing). Comfortable on 24GB.
-- **Memory budget — inference:** ~1GB (0.8B model + adapter). Very light.
-### Models
-- **Base model for fine-tuning:** `inferencerlabs/Qwen3.5-0.8B-MLX-9bit` from Hugging Face
-  - 9-bit quantized, 847MB on disk, ~0.84 GiB in memory, ~231 tokens/s
-  - 24 transformer layers
-  - Source: https://huggingface.co/inferencerlabs/Qwen3.5-0.8B-MLX-9bit
-  - Download: `mlx_lm.generate --model inferencerlabs/Qwen3.5-0.8B-MLX-9bit --prompt "test"` (auto-downloads on first use, or `huggingface-cli download inferencerlabs/Qwen3.5-0.8B-MLX-9bit --local-dir models/Qwen3.5-0.8B-MLX-9bit`)
-- **Model for generating training data:** Local `Qwen3.5-9B-OptiQ-4bit` at `~/MLXModels/mlx-community/Qwen3.5-9B-OptiQ-4bit/`
-### MLX-LM CLI Reference (from official docs at github.com/ml-explore/mlx-lm)
-- **Train:** `mlx_lm.lora --model <path> --train --data <dir> --iters 600`
-- **Evaluate:** `mlx_lm.lora --model <path> --adapter-path adapters/ --data <dir> --test`
-- **Generate:** `mlx_lm.generate --model <path> --adapter-path adapters/ --prompt "..."`
-- **Fuse:** `mlx_lm.fuse --model <path>` → saves to `fused_model/`
-- **CLI flags use kebab-case:** `--mask-prompt`, `--grad-checkpoint`, `--num-layers`, `--batch-size`
-- **YAML config keys use underscores** (e.g., `batch_size`, `num_layers`), not the kebab-case CLI spellings
-- **Data format:** JSONL with chat format (see Section 3)
-- **QLoRA:** Automatic when base model is quantized
----
-## 3. Data Preparation (`prepare_data.py`)
-### Source Data
-- Kaggle spam dataset at `data/spam_Emails_data.csv` — **193,852 emails** (102,160 ham, 91,692 spam)
-- Stratified sample: 300 spam + 300 ham = 600 total for training data generation
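A minimal sketch of the stratified sample, assuming the CSV rows have already been loaded into a list of `(text, label)` tuples with labels `"spam"` and `"ham"`. The function name, the label spellings, and the seed are assumptions.

```python
import random


def stratified_sample(rows, per_class, seed=42):
    """Pick `per_class` spam rows and `per_class` ham rows at random.

    `rows` is a list of (text, label) tuples with label "spam" or "ham".
    """
    spam_rows = []
    ham_rows = []
    for text, label in rows:
        if label == "spam":
            spam_rows.append((text, label))
        elif label == "ham":
            ham_rows.append((text, label))

    # A fixed seed makes the sample reproducible between runs.
    rng = random.Random(seed)
    sample = rng.sample(spam_rows, per_class) + rng.sample(ham_rows, per_class)
    rng.shuffle(sample)
    return sample
```

For this spec the call would be `stratified_sample(rows, 300)`, giving 600 emails in total.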
-### Generating Explanations
-Use the local Qwen3.5-9B-OptiQ-4bit model via `mlx_lm` Python API to create natural language explanations for each email.
-```python
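-import os
-from mlx_lm import load, generate
-# Python does not expand "~" in path strings, so expand it explicitly
-model_path = os.path.expanduser("~/MLXModels/mlx-community/Qwen3.5-9B-OptiQ-4bit")
-model, tokenizer = load(model_path)
-response = generate(model, tokenizer, prompt=prompt_text, max_tokens=200)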
-```
-Prompt template:
-```
-Classify this email as SPAM or HAM. Give your classification on the first line,
-then explain your reasoning in 2-3 sentences. Be specific about what words,
-patterns, or signals you noticed.
-Email:
-{email_text_truncated_to_500_chars}
-```
-### Synthetic Conversational Data
-In addition to the 600 classify examples, generate ~50 synthetic Q&A conversation examples for the chat mode. Topics: why spam uses dollar signs, how phishing differs from spam, what makes newsletters look like spam, common spam patterns, etc. Generated via the 9B model with diverse hardcoded prompts.
-### Output Format (MLX-LM chat JSONL)
-```json
-{"messages": [
-  {"role": "system", "content": "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."},
-  {"role": "user", "content": "Classify this email:\n\nSubject: You Won $5M!!!..."},
-  {"role": "assistant", "content": "SPAM\n\nThis email uses classic lottery scam tactics: a large prize claim ($5M), urgency language ('act now'), and requests for bank details. The all-caps subject line and excessive exclamation marks are strong spam indicators."}
-]}
-```
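The record above is pretty-printed for readability; in the JSONL file each record has to occupy a single line. A minimal sketch of building and serializing one record, where the helper names `make_chat_record` and `to_jsonl_line` are hypothetical:

```python
import json

SYSTEM_PROMPT = (
    "You are an email spam classifier. Analyze the email and classify it "
    "as SPAM or HAM. Explain your reasoning."
)


def make_chat_record(email_text, assistant_reply):
    """Build one training example in the chat format shown above."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Classify this email:\n\n" + email_text},
        {"role": "assistant", "content": assistant_reply},
    ]
    return {"messages": messages}


def to_jsonl_line(record):
    """Serialize a record onto one line; json.dumps escapes embedded newlines."""
    return json.dumps(record, ensure_ascii=False)
```

Each line returned by `to_jsonl_line` can be written to `train.jsonl` or `test.jsonl` followed by a newline character.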
-### Data Quality Validation
-After generation, before saving:
-1. Parse each response to extract the classification label (first line: SPAM or HAM)
-2. Compare against ground truth from the Kaggle dataset
-3. Discard mismatches (9B model classified differently than ground truth) — resample replacement emails
-4. **Manual inspection:** Print 10 random examples for spot-checking quality and JSONL formatting
-5. Verify all JSONL lines parse correctly with `json.loads()`
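Steps 1 to 3 and step 5 could look roughly like this. It is a sketch under assumptions: the dictionary keys `response` and `truth` and all three function names are hypothetical, and resampling of replacement emails is left out.

```python
import json


def extract_label(response_text):
    """Return 'SPAM' or 'HAM' from the first line of a response, else None."""
    stripped = response_text.strip()
    if stripped == "":
        return None
    first_line = stripped.split("\n")[0].strip().upper()
    for label in ("SPAM", "HAM"):
        if first_line.startswith(label):
            return label
    return None


def keep_matching(examples):
    """Keep only examples whose generated label equals the ground truth.

    Each example is a dict with 'response' and 'truth' keys (hypothetical names).
    """
    kept = []
    for example in examples:
        if extract_label(example["response"]) == example["truth"]:
            kept.append(example)
    return kept


def all_lines_parse(jsonl_lines):
    """Check that every JSONL line is valid JSON."""
    for line in jsonl_lines:
        try:
            json.loads(line)
        except json.JSONDecodeError:
            return False
    return True
```

A response whose first line is neither SPAM nor HAM yields `None`, so it never matches the ground truth and is discarded along with the genuine mismatches.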
-### Split
-- `training_data/train.jsonl` — 500 examples (450 classify + 50 conversational)
-- `training_data/test.jsonl` — 100 examples (classify only, for perplexity evaluation)
-### Expected Time
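-~20-40 minutes (600 emails × one 9B-model response of up to 200 tokens each; the ~231 tok/s figure in Section 2 applies to the 0.8B model, so the 9B model will generate more slowly)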
----
-## 4. Fine-Tuning (`fine_tune.py`)
-### CLI Command (what the script wraps)
-```bash
-mlx_lm.lora \
-  --model models/Qwen3.5-0.8B-MLX-9bit \
-  --train \
-  --data training_data \
-  --iters 600 \
-  --batch-size 2 \
-  --learning-rate 1e-5 \
-  --num-layers 24 \
-  --adapter-path adapters \
-  --mask-prompt \
-  --grad-checkpoint
-```
-Key decisions:
-- `--mask-prompt` — only compute loss on the assistant's response, not the user's prompt
-- `--grad-checkpoint` — saves memory by recomputing activations during backward pass
-- `--batch-size 2` — conservative for 24GB memory
-- `--iters 600` — standard for small datasets (600 iterations × batch size 2 = 1,200 examples seen, ~2.4 epochs over 500 examples)
-- `--num-layers 24` — fine-tune all 24 transformer layers with LoRA (the model has exactly 24 layers)
-### Script Behavior
-`fine_tune.py` is a thin wrapper that:
-1. Checks if base model exists locally, downloads if not
-2. Checks if `training_data/train.jsonl` exists
-3. Runs the `mlx_lm.lora` command via subprocess
-4. Prints training loss progress
-5. Runs evaluation on test set (prints perplexity)
-6. Prints "Training complete! Adapter saved to adapters/"
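A minimal sketch of the wrapper's core (steps 2, 3, and 6), assuming the command is launched with `subprocess.run`. The function names are hypothetical; the flags are the ones listed in the CLI command above. The model download and test-set evaluation steps are omitted.

```python
import os
import subprocess


def build_lora_command(model_path, data_dir, adapter_path):
    """Return the mlx_lm.lora training command as a list of arguments."""
    command = [
        "mlx_lm.lora",
        "--model", model_path,
        "--train",
        "--data", data_dir,
        "--iters", "600",
        "--batch-size", "2",
        "--learning-rate", "1e-5",
        "--num-layers", "24",
        "--adapter-path", adapter_path,
        "--mask-prompt",
        "--grad-checkpoint",
    ]
    return command


def run_training(model_path, data_dir, adapter_path):
    """Check that training data exists, then launch training."""
    train_file = os.path.join(data_dir, "train.jsonl")
    if not os.path.exists(train_file):
        print("Training data not found: " + train_file)
        return False
    command = build_lora_command(model_path, data_dir, adapter_path)
    # check=True raises CalledProcessError if training exits with an error.
    subprocess.run(command, check=True)
    print("Training complete! Adapter saved to " + adapter_path + "/")
    return True
```

Passing the command as a list avoids shell quoting problems, which matters here because the project directory path contains spaces.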
-### Capacity Expectations
-The fine-tuned 0.8B model's explanations will be simpler than the 9B model's training data. This is expected — the 0.8B model has less capacity for nuanced reasoning. For the project, this is fine and actually demonstrates an interesting finding for the paper: how model size affects explanation quality.
-### Estimated Time
-~10-20 minutes on M4 Pro.
-### Output
-- `adapters/` — LoRA adapter weights (small, ~10-50MB)
-- Training loss curve printed to terminal
----
-## 5. Gradio App (`app.py`)
-### Model Loading
-At startup, load the base model + LoRA adapter:
-```python
-from mlx_lm import load, generate
-model, tokenizer = load("models/Qwen3.5-0.8B-MLX-9bit", adapter_path="adapters")
-```
-### Tab 1: Classify
-- **Input:** `gr.Textbox` (paste email, 12 lines) + `gr.File` (.txt upload) + `gr.Examples`
-- **Output:** `gr.Markdown` with classification result and explanation
-- **Flow:** Wrap email in system+user prompt template → `generate(model, tokenizer, prompt, max_tokens=300)` → display result
-- **Example emails:** Same 4 from the sklearn project:
-  1. Nigerian Prince spam
-  2. Team meeting invite (ham)
-  3. Phishing attempt
-  4. Family Thanksgiving email (ham)
-### Tab 2: Chat
-- **Input:** `gr.ChatInterface` for conversational back-and-forth
-- **System prompt:** "You are a spam email analysis expert. You can classify emails as spam or ham, explain your reasoning, and answer questions about email security and spam patterns."
-- **Features:** Conversation history maintained via Gradio's built-in chat state
-- **Generation:** `generate(model, tokenizer, prompt, max_tokens=500)` — no streaming, to keep the code simple (the mlx_lm Python API does provide `stream_generate`, but a full response at ~231 tok/s on this model arrives quickly enough without it)
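A sketch of how the chat tab might turn Gradio's conversation state into chat messages, assuming the history arrives as a list of `[user, assistant]` pairs (the format used by `gr.ChatInterface` in the pinned Gradio 4.x release). The function name `history_to_messages` is hypothetical.

```python
CHAT_SYSTEM_PROMPT = (
    "You are a spam email analysis expert. You can classify emails as spam "
    "or ham, explain your reasoning, and answer questions about email "
    "security and spam patterns."
)


def history_to_messages(message, history):
    """Convert [user, assistant] pairs plus the new message into chat messages."""
    messages = [{"role": "system", "content": CHAT_SYSTEM_PROMPT}]
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        # The assistant slot can be empty for a turn that has no reply yet.
        if assistant_turn is not None:
            messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": message})
    return messages
```

The resulting list would typically be passed through the tokenizer's chat template to produce the prompt string given to `generate`.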
-### Error Handling
-- Model/adapter not found → "Model not found. Run `python3 fine_tune.py` first."
-- Empty input → "Please enter email text or upload a file."
-- Generation: cap at 500 tokens max
----
-## 6. Dependencies (`requirements.txt`)
-```
-mlx>=0.22.0
-mlx-lm>=0.22.0
-gradio==4.19.2
-numpy>=1.24.0
-pandas>=2.0.0
-```
-Notes:
-- `gradio==4.19.2` pinned because 4.44.1 has a bug with Python 3.9 on this Mac (confirmed in the sklearn project)
-- `mlx-lm` pulls in `transformers`, `safetensors`, `sentencepiece`, `tiktoken` as transitive deps
-- `huggingface-hub` comes as a transitive dep of `mlx-lm` — no need to pin separately
----
-## 7. .gitignore
-```
-__pycache__/
-*.pyc
-.pytest_cache/
-venv/
-models/
-adapters/
-fused_model/
-training_data/
-data/
-*.egg-info/
-.DS_Store
-```
-Model weights, adapters, and generated training data are excluded from git (too large). The scripts that create them are tracked.
----
-## 8. Notebook (`spam_classifier_mlx.ipynb`)
-Step-by-step guide for course submission:
-1. **Introduction** — What is fine-tuning? What is MLX? Why Apple Silicon?
-2. **What is LoRA?** — Explain low-rank adaptation in simple terms (frozen base + small trainable matrices)
-3. **Environment Setup** — Installing mlx-lm, downloading the model
-4. **Data Preparation** — Loading emails, generating explanations with 9B model, formatting as JSONL
-5. **Inspecting Training Data** — Show 5 examples, discuss quality
-6. **Fine-Tuning** — Running mlx_lm.lora, monitoring loss, understanding the process
-7. **Evaluation** — Test perplexity, manual testing with example emails
-8. **Building the Gradio Interface** — Code walkthrough
-9. **Comparison with sklearn Approach** — Side-by-side: accuracy, edge cases (Lenovo email), explainability style
-10. **Results and Conclusions**
-Code style: Beginner-friendly, explicit loops, comments referencing course concepts.
----
-## 9. Optional: Hugging Face Spaces Deployment
-**Problem:** HF Spaces runs Linux servers (CPU/GPU), not Apple Silicon. MLX requires Apple Silicon.
-**Solution:** After fine-tuning locally:
-1. Fuse LoRA adapter: `mlx_lm.fuse --model models/Qwen3.5-0.8B-MLX-9bit`
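-2. The fused model can be uploaded to Hugging Face; because the base model is MLX-quantized, the fused weights will likely need to be de-quantized before `transformers` can load them on other hardware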
-3. Create a separate `app_hf.py` that uses `transformers` + `torch` instead of `mlx_lm`
-4. Deploy to HF Spaces with appropriate `requirements.txt`
-This is a stretch goal — the main project works locally with MLX.
----
-## 10. CHANGELOG
-Same format as the sklearn project:
-```markdown
-## vX.Y.Z — YYYY-MM-DD
-### Title
-- What changed and why
-```
-Starting at v0.1.0. Updated with every change, improvement, bug fix, or finding.
----
-## 11. Code Style
-- Beginner-level Python matching ENGT 375 lecture style
-- Explicit `for` loops instead of comprehensions
-- Comments explaining *why*, referencing course concepts
-- Variable names that read like English
-- No decorators, no metaclasses, no advanced patterns
-- Each file under ~300 lines
----
-## 12. Out of Scope
-- No full fine-tuning (LoRA only)
-- No RLHF/DPO alignment
-- No sklearn/LIME/SHAP (that's the other project)
-- No Ollama dependency
-- No multi-GPU training
-- No custom tokenizer training