VoltageVagabond committed on
Commit
4617d12
·
verified ·
1 Parent(s): 6673246

Clean assistant workflow references

docs/superpowers/plans/2026-03-23-gradio-spam-classifier.md DELETED
@@ -1,1105 +0,0 @@
1
- # Gradio Spam Classifier Implementation Plan
2
-
3
- > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
4
-
5
- **Goal:** Build a fresh, beginner-level Gradio spam classifier app with LIME, SHAP, and plain-English explanations — replacing the old Streamlit project.
6
-
7
- **Architecture:** New standalone project (`spam-classifier-gradio/`) that symlinks data from the old project. Three Python files: `utils.py` (shared preprocessing), `train.py` (model training + comparison), `app.py` (Gradio UI). Models saved to `models/` directory.
8
-
9
- **Tech Stack:** Python, scikit-learn, Gradio, LIME, SHAP, NLTK, pandas, numpy, matplotlib, joblib
10
-
11
- **Spec:** `docs/superpowers/specs/2026-03-23-gradio-spam-classifier-design.md`
12
-
13
- ---
14
-
15
- ### Task 1: Project Scaffolding
16
-
17
- **Files:**
18
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/requirements.txt`
19
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/CLAUDE.md`
20
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/CHANGELOG.md`
21
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/models/.gitkeep`
22
-
23
- - [ ] **Step 1: Create the project directory and models folder**
24
-
25
- ```bash
26
- mkdir -p "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/models"
27
- ```
28
-
29
- - [ ] **Step 2: Create symlink to data from old project**
30
-
31
- ```bash
32
- ln -s "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/data" "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/data"
33
- ```
34
-
35
- Verify symlink works:
36
- ```bash
37
- ls "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/data/spam_Emails_data.csv"
38
- ```
39
- Expected: file should be listed (not "No such file")
40
-
41
- - [ ] **Step 3: Create requirements.txt**
42
-
43
- ```
44
- numpy>=1.24.0
45
- pandas>=2.0.0
46
- matplotlib>=3.7.0
47
- scikit-learn>=1.3.0
48
- scipy>=1.11.0
49
- nltk>=3.8.0
50
- lime>=0.2.0
51
- shap>=0.44.0
52
- gradio>=4.0.0
53
- joblib>=1.3.0
54
- tqdm>=4.65.0
55
- ```
56
-
57
- - [ ] **Step 4: Create CLAUDE.md**
58
-
59
- ```markdown
60
- # CLAUDE.md
61
-
62
- ## Project Context
63
- Spam email classifier with Gradio UI for ENGT 375 (Applied Machine Learning, Spring 2026, ODU).
64
- Uses scikit-learn (Random Forest, Logistic Regression, SVM ensemble) with LIME and SHAP for explainability.
65
-
66
- ## Code Style
67
- - Beginner-level Python: explicit for-loops, clear variable names, comments explaining why
68
- - No advanced patterns (decorators, metaclasses, complex comprehensions)
69
- - Reference course concepts in comments where applicable
70
-
71
- ## How to Run
72
- 1. Install deps: `pip install -r requirements.txt`
73
- 2. Train models: `python train.py`
74
- 3. Launch app: `python app.py`
75
-
76
- ## Key Files
77
- - `utils.py` — Shared text preprocessing and feature engineering (24 metadata features)
78
- - `train.py` — Data loading, model comparison (RF/LR/SVM), VotingClassifier ensemble, saves to models/
79
- - `app.py` — Gradio UI with Result, LIME, and SHAP tabs
80
-
81
- ## Data
82
- - `data/` is a symlink to `../spam-xai-project/data/`
83
- - Sources: Kaggle spam CSV + GitHub email-dataset
84
- ```
85
-
86
- - [ ] **Step 5: Create initial CHANGELOG.md**
87
-
88
- ```markdown
89
- # Changelog
90
-
91
- All notable changes to this project will be documented in this file.
92
- This serves as a reference for writing the course paper's methodology section.
93
-
94
- ## v0.1.0 — 2026-03-23
95
- ### Initial Project Setup
96
- - Created fresh Gradio-based spam classifier project
97
- - Symlinked data from old spam-xai-project
98
- - Set up requirements.txt with core dependencies
99
- ```
100
-
101
- - [ ] **Step 6: Create .gitignore**
102
-
103
- ```
104
- __pycache__/
105
- *.pyc
106
- .pytest_cache/
107
- models/*.joblib
108
- models/*.json
109
- # data is a symlink; a trailing slash would match only real directories
- data
110
- *.egg-info/
111
- .DS_Store
112
- ```
113
-
114
- - [ ] **Step 7: Initialize git repo**
115
-
116
- ```bash
117
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
118
- git init
119
- git add requirements.txt CLAUDE.md CHANGELOG.md models/.gitkeep .gitignore
120
- git commit -m "chore: scaffold project with requirements, CLAUDE.md, CHANGELOG"
121
- ```
122
-
123
- - [ ] **Step 8: Install dependencies**
124
-
125
- ```bash
126
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
127
- pip install -r requirements.txt
128
- ```
129
-
130
- Verify: `python -c "import gradio; print(gradio.__version__)"`
131
- Expected: version 4.x or higher
132
-
133
- - [ ] **Step 9: Download NLTK stopwords (if not already present)**
134
-
135
- ```bash
136
- python -c "import nltk; nltk.download('stopwords', quiet=True); print('OK')"
137
- ```
138
-
139
- Expected: `OK`
140
-
141
- ---
142
-
143
- ### Task 2: Utilities Module (`utils.py`)
144
-
145
- **Files:**
146
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/utils.py`
147
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/test_utils.py`
148
-
149
- - [ ] **Step 1: Write test for `preprocess_text`**
150
-
151
- Create `test_utils.py`:
152
- ```python
153
- # Tests for the shared utility functions
154
- # Run: python -m pytest test_utils.py -v  (pytest is not in requirements.txt; install it with `pip install pytest`)
155
-
156
- def test_preprocess_text_strips_html():
157
- from utils import preprocess_text
158
- result = preprocess_text('<b>Hello</b> world')
159
- assert '<' not in result
160
- assert '>' not in result
161
-
162
- def test_preprocess_text_removes_urls():
163
- from utils import preprocess_text
164
- result = preprocess_text('Visit http://example.com for details')
165
- assert 'http' not in result
166
- assert 'example' not in result
167
-
168
- def test_preprocess_text_removes_emails():
169
- from utils import preprocess_text
170
- result = preprocess_text('Contact user@example.com for info')
171
- assert '@' not in result
172
-
173
- def test_preprocess_text_lowercases():
174
- from utils import preprocess_text
175
- result = preprocess_text('HELLO WORLD')
176
- # After stemming, should be lowercase
177
- assert result == result.lower()
178
-
179
- def test_preprocess_text_removes_stopwords():
180
- from utils import preprocess_text
181
- result = preprocess_text('this is a test of the system')
182
- assert 'this' not in result.split()
183
- assert 'the' not in result.split()
184
-
185
- def test_preprocess_text_empty_input():
186
- from utils import preprocess_text
187
- result = preprocess_text('')
188
- assert result == ''
189
- ```
190
-
191
- - [ ] **Step 2: Run tests to verify they fail**
192
-
193
- ```bash
194
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
195
- python -m pytest test_utils.py -v
196
- ```
197
- Expected: FAIL — `ModuleNotFoundError: No module named 'utils'`
198
-
199
- - [ ] **Step 3: Write `utils.py` — phrase lists, constants, and `preprocess_text`**
200
-
201
- Copy the preprocessing logic from `spam-xai-project/utils_student.py` (lines 1-76), plus phrase lists (lines 20-58). Remove Ollama/LLM references. Add the `FEATURE_DESCRIPTIONS` dict and `META_FEATURE_NAMES` list.
202
-
203
- The file should contain:
204
- - Imports: `re`, `numpy`, `nltk.corpus.stopwords`, `nltk.stem.PorterStemmer`
205
- - Phrase lists: `spam_context_phrases`, `ham_context_phrases`, `registration_phrases`, `url_shorteners`, `legitimate_platforms`
206
- - `META_FEATURE_NAMES` — list of 24 strings
207
- - `FEATURE_DESCRIPTIONS` — dict mapping each metadata feature name to a human-readable string
208
- - `preprocess_text(text)` — same logic as `utils_student.py:65-76`
209
- - `compute_metadata_features(texts)` — placeholder (next step)
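For orientation, here is a minimal sketch of the kind of cleaning `preprocess_text` is expected to do, written to satisfy the Step 1 tests. It is not the code from `utils_student.py`: the name `sketch_preprocess_text` and the tiny `SMALL_STOPWORDS` set are hypothetical stand-ins, and stemming is left out, whereas the plan calls for NLTK's stopword list and `PorterStemmer`.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: the
# real utils.py loads nltk.corpus.stopwords and also applies PorterStemmer).
SMALL_STOPWORDS = {'this', 'is', 'a', 'of', 'the', 'for', 'and', 'to'}


def sketch_preprocess_text(text):
    """Strip HTML, URLs, and email addresses, lowercase, drop stopwords."""
    text = re.sub(r'<[^>]+>', ' ', text)                # drop HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # drop URLs
    text = re.sub(r'\S+@\S+', ' ', text)                # drop email addresses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)            # keep letters only
    kept = []
    for word in text.lower().split():
        if word not in SMALL_STOPWORDS:
            kept.append(word)
    return ' '.join(kept)


print(sketch_preprocess_text('Visit http://example.com for details'))
```

Removing URLs and addresses before the letters-only pass matters: done in the other order, fragments such as `example` and `com` would survive as ordinary words.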
210
-
211
- - [ ] **Step 4: Run preprocessing tests to verify they pass**
212
-
213
- ```bash
214
- python -m pytest test_utils.py -v
215
- ```
216
- Expected: All preprocessing tests PASS
217
-
218
- - [ ] **Step 5: Write tests for `compute_metadata_features`**
219
-
220
- Add to `test_utils.py`:
221
- ```python
222
- import numpy as np
223
-
224
- def test_compute_metadata_features_shape():
225
- from utils import compute_metadata_features
226
- result = compute_metadata_features(['Hello world!', 'Buy now!!!'])
227
- assert isinstance(result, np.ndarray)
228
- assert result.shape == (2, 24)
229
-
230
- def test_compute_metadata_features_exclamation_density():
231
- from utils import compute_metadata_features
232
- # "Buy now!!!" has 3 exclamation marks, 1 sentence
233
- result = compute_metadata_features(['Buy now!!!'])
234
- exclamation_density = result[0][0]
235
- assert exclamation_density == 3.0
236
-
237
- def test_compute_metadata_features_dollar_count():
238
- from utils import compute_metadata_features
239
- result = compute_metadata_features(['Win $100 or $200'])
240
- dollar_count = result[0][1]
241
- assert dollar_count == 2
242
-
243
- def test_compute_metadata_features_spam_phrases():
244
- from utils import compute_metadata_features
245
- result = compute_metadata_features(['Act now! Buy now!'])
246
- spam_phrase_count = result[0][3]
247
- assert spam_phrase_count >= 2 # 'act now' and 'buy now'
248
- ```
249
-
250
- - [ ] **Step 6: Run tests to verify new tests fail**
251
-
252
- ```bash
253
- python -m pytest test_utils.py -v
254
- ```
255
- Expected: The new `test_compute_metadata_features_*` tests FAIL (function is placeholder)
256
-
257
- - [ ] **Step 7: Implement `compute_metadata_features` in `utils.py`**
258
-
259
- Copy the full 24-feature computation logic from `spam-xai-project/utils_student.py` lines 82-236. This is the exact same code — explicit `for` loops, same feature order, same comments.
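To make the Step 5 tests concrete, the sketch below computes only the first two columns they check (exclamation density at index 0, dollar count at index 1) with explicit loops. The function name and the sentence-splitting rule are assumptions made for illustration; the real 24-feature logic lives in `utils_student.py` and may count sentences differently.

```python
import re


def sketch_two_metadata_features(texts):
    """Return one [exclamation_density, dollar_count] row per email."""
    rows = []
    for text in texts:
        # Assumption: a sentence is any non-empty piece between runs of . ! ?
        pieces = re.split(r'[.!?]+', text)
        sentence_count = 0
        for piece in pieces:
            if piece.strip():
                sentence_count += 1
        if sentence_count == 0:
            sentence_count = 1  # avoid dividing by zero on empty text
        exclamation_density = text.count('!') / sentence_count
        dollar_count = float(text.count('$'))
        rows.append([exclamation_density, dollar_count])
    return rows


print(sketch_two_metadata_features(['Buy now!!!', 'Win $100 or $200']))
```

Under this splitting rule, `'Buy now!!!'` counts as one sentence with three exclamation marks, which matches the `3.0` the test expects.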
260
-
261
- - [ ] **Step 8: Run all tests to verify they pass**
262
-
263
- ```bash
264
- python -m pytest test_utils.py -v
265
- ```
266
- Expected: All tests PASS (preprocessing + metadata)
267
-
268
- - [ ] **Step 9: Commit**
269
-
270
- ```bash
271
- git add utils.py test_utils.py
272
- git commit -m "feat: add utils.py with preprocessing and 24 metadata features"
273
- ```
274
-
275
- ---
276
-
277
- ### Task 3: Training Script (`train.py`)
278
-
279
- **Files:**
280
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/train.py`
281
-
282
- This task builds the full training pipeline. No separate test file — the training script itself prints classification reports and saves a `training_report.json` as verification.
283
-
284
- - [ ] **Step 1: Write `train.py` (data loading and feature engineering)**
285
-
286
- Write the first part of `train.py`: data loading, preprocessing, and feature engineering. Steps 2-4 append the rest to the same file.
287
-
288
- ```python
289
- # Train the spam classifier — compare models and build ensemble
290
- # ENGT 375 Project - Spring 2026 - ODU
291
- # Run: python train.py
292
-
293
- import json
294
- import warnings
295
- import numpy as np
296
- import pandas as pd
297
- from pathlib import Path
298
- from sklearn.model_selection import train_test_split
299
- from sklearn.feature_extraction.text import TfidfVectorizer
300
- from sklearn.ensemble import RandomForestClassifier, VotingClassifier
301
- from sklearn.linear_model import LogisticRegression
302
- from sklearn.svm import SVC
303
- from sklearn.preprocessing import MinMaxScaler
304
- from sklearn.metrics import (classification_report, f1_score,
305
- accuracy_score, precision_score, recall_score,
306
- precision_recall_curve)
307
- from scipy.sparse import hstack, csr_matrix
308
- from tqdm import tqdm
309
- import joblib
310
- from utils import preprocess_text, compute_metadata_features, META_FEATURE_NAMES
311
-
312
- warnings.filterwarnings('ignore', category=FutureWarning)
313
- warnings.filterwarnings('ignore', category=DeprecationWarning)
314
-
315
- # Paths
316
- project_dir = Path(__file__).parent
317
- data_dir = project_dir / 'data'
318
- models_dir = project_dir / 'models'
319
- random_state = 42
320
-
321
- kaggle_csv = data_dir / 'spam_Emails_data.csv'
322
- github_dataset_dir = data_dir / 'email-dataset-main' / 'email-dataset-main' / 'dataset'
323
- KAGGLE_CAP = 100000 # stratified sample to keep training fast
324
-
325
-
326
- # ---- Data Loading ----
327
-
328
- print('Starting model training...')
329
-
330
- # Load Kaggle 190K spam dataset (stratified sample)
331
- df = pd.DataFrame(columns=['text', 'label'])
- if kaggle_csv.exists():
-     print('Loading Kaggle spam dataset...')
-     df_kaggle = pd.read_csv(kaggle_csv)
-     # Kaggle CSV has 'label' and 'text' columns; labels are 'Ham'/'Spam'.
-     # Keep the raw label and map a lowercase copy to 'ham'/'spam'.
-     df_kaggle = df_kaggle.rename(columns={'label': 'label_raw'})
-     df_kaggle['label'] = df_kaggle['label_raw'].str.strip().str.lower().map({'ham': 'ham', 'spam': 'spam'})
-     df_kaggle = df_kaggle[['text', 'label']].dropna(subset=['label', 'text'])
-     print(' Kaggle total: %d emails' % len(df_kaggle))
-     # Stratified sample to cap size (same as old project).
-     # An explicit loop avoids groupby().apply(), which can drop the grouping
-     # column ('label') in newer pandas releases.
-     if len(df_kaggle) > KAGGLE_CAP:
-         kaggle_total = len(df_kaggle)
-         sampled_parts = []
-         for lbl in ['ham', 'spam']:
-             group = df_kaggle[df_kaggle['label'] == lbl]
-             n_keep = int(KAGGLE_CAP * len(group) / kaggle_total)
-             sampled_parts.append(group.sample(n=n_keep, random_state=random_state))
-         df_kaggle = pd.concat(sampled_parts)
-         print(' Kaggle after stratified cap: %d emails' % len(df_kaggle))
-     df = df_kaggle.reset_index(drop=True)
- else:
-     print('WARNING: Kaggle CSV not found at %s' % str(kaggle_csv))
351
-
352
- # Load GitHub email-dataset (individual text files)
353
- # dataset/1/ = ham, dataset/2/ = spam
354
- if github_dataset_dir.exists():
355
- print('Loading GitHub email-dataset...')
356
- github_rows = []
357
- for subdir, lbl in [('1', 'ham'), ('2', 'spam')]:
358
- folder = github_dataset_dir / subdir
359
- if folder.exists():
360
- for fpath in folder.iterdir():
361
- if fpath.is_file():
362
- try:
363
- content = fpath.read_text(encoding='utf-8', errors='replace')
364
- if content.strip():
365
- github_rows.append({'text': content, 'label': lbl})
366
- except Exception:
367
- pass
368
- if github_rows:
369
- df_github = pd.DataFrame(github_rows)
370
- print(' GitHub dataset: %d emails (%d ham, %d spam)' % (
371
- len(df_github),
372
- len(df_github[df_github['label'] == 'ham']),
373
- len(df_github[df_github['label'] == 'spam'])
374
- ))
375
- df = pd.concat([df, df_github], ignore_index=True)
376
- else:
377
- print('WARNING: GitHub email-dataset not found at %s' % str(github_dataset_dir))
378
-
379
- if len(df) == 0:
380
- raise RuntimeError('No training data found! Check that data/ symlink is valid.')
381
-
382
- # Deduplicate
383
- before_dedup = len(df)
384
- df = df.drop_duplicates(subset=['text']).reset_index(drop=True)
385
- print('Combined dataset: %d emails (removed %d duplicates)' % (len(df), before_dedup - len(df)))
386
- print(' Ham: %d, Spam: %d' % (len(df[df['label'] == 'ham']), len(df[df['label'] == 'spam'])))
387
-
388
-
389
- # ---- Preprocessing & Feature Engineering ----
390
-
391
- print('Preprocessing text...')
392
- df['clean_text'] = df['text'].apply(preprocess_text)
393
-
394
- # TF-IDF: same parameters as the old project for comparable results
395
- print('Building TF-IDF features (max 3000, ngrams 1-3)...')
396
- tfidf = TfidfVectorizer(
397
- max_features=3000,
398
- ngram_range=(1, 3),
399
- min_df=2,
400
- max_df=0.90,
401
- sublinear_tf=True,
402
- )
403
- X_tfidf = tfidf.fit_transform(df['clean_text'])
404
-
405
- # 24 metadata features (exclamation density, dollar signs, caps ratio, etc.)
406
- print('Computing 24 metadata features...')
407
- X_meta = compute_metadata_features(df['text'].values)
408
-
409
- # Scale metadata to 0-1 range so they match TF-IDF scale
410
- # Without this, features like email_length (could be 1000+) would dominate
411
- meta_scaler = MinMaxScaler()
412
- X_meta_scaled = meta_scaler.fit_transform(X_meta)
413
-
414
- # Combine TF-IDF + metadata into one feature matrix
415
- X_combined = hstack([X_tfidf, csr_matrix(X_meta_scaled)])
416
- feature_names = list(tfidf.get_feature_names_out()) + META_FEATURE_NAMES
417
-
418
- # Encode labels: 1 = spam, 0 = ham
419
- y = (df['label'] == 'spam').astype(int)
420
-
421
- print('Total features: %d (%d TF-IDF + %d metadata)' % (
422
- len(feature_names), X_tfidf.shape[1], len(META_FEATURE_NAMES)))
423
- ```
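The comment about scaling metadata to the 0-1 range can be illustrated with a small worked example. This is a plain-Python sketch of the per-feature formula `(value - min) / (max - min)` that `MinMaxScaler` applies by default; the function name and the sample email lengths are made up for illustration.

```python
def sketch_min_max_scale(column):
    """Rescale a list of numbers to the 0-1 range, one feature at a time."""
    low = min(column)
    high = max(column)
    scaled = []
    for value in column:
        if high == low:
            scaled.append(0.0)  # constant column: nothing to spread out
        else:
            scaled.append((value - low) / (high - low))
    return scaled


# Hypothetical email lengths in characters
email_lengths = [120, 1500, 810]
print(sketch_min_max_scale(email_lengths))  # → [0.0, 1.0, 0.5]
```

After scaling, a raw length of 1500 contributes at most 1.0, which is the same order of magnitude as a TF-IDF weight, so no single metadata column dominates the combined matrix.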
424
-
425
- The rest of `train.py` (model comparison, ensemble, saving) is added in Steps 2-4 below. Each step's code is appended to the same file, directly after this block.
426
-
427
- - [ ] **Step 2: Add model comparison section**
428
-
429
- After building `X_combined` and `y`, add:
430
- - 70/30 stratified split
431
- - Train RF, LR, SVM individually
432
- - Print classification reports for each
433
- - Collect F1 scores
434
-
435
- ```python
436
- # ---- Train/Test Split ----
437
-
438
- X_train, X_test, y_train, y_test = train_test_split(
439
- X_combined, y, test_size=0.3, random_state=random_state, stratify=y
440
- )
441
- print('Train: %d, Test: %d' % (X_train.shape[0], X_test.shape[0]))
442
-
443
-
444
- # Helper to collect metrics for the training report
445
- def get_metrics(y_true, y_pred):
446
- """Compute accuracy, precision, recall, F1 for the spam (1) class."""
447
- return {
448
- 'accuracy': round(accuracy_score(y_true, y_pred), 4),
449
- 'precision': round(precision_score(y_true, y_pred), 4),
450
- 'recall': round(recall_score(y_true, y_pred), 4),
451
- 'f1': round(f1_score(y_true, y_pred), 4),
452
- }
453
-
454
-
455
- # ---- Model Comparison ----
456
- # Train three classifiers individually and compare
457
-
458
- print('\n--- Random Forest ---')
459
- rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, class_weight='balanced', random_state=random_state)
460
- rf.fit(X_train, y_train)
461
- rf_pred = rf.predict(X_test)
462
- rf_metrics = get_metrics(y_test, rf_pred)
463
- print(classification_report(y_test, rf_pred, target_names=['Ham', 'Spam']))
464
-
465
- print('\n--- Logistic Regression ---')
466
- lr = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=random_state)
467
- lr.fit(X_train, y_train)
468
- lr_pred = lr.predict(X_test)
469
- lr_metrics = get_metrics(y_test, lr_pred)
470
- print(classification_report(y_test, lr_pred, target_names=['Ham', 'Spam']))
471
-
472
- print('\n--- SVM (Linear) ---')
473
- svm = SVC(kernel='linear', class_weight='balanced', probability=True, random_state=random_state)
474
- svm.fit(X_train, y_train)
475
- svm_pred = svm.predict(X_test)
476
- svm_metrics = get_metrics(y_test, svm_pred)
477
- print(classification_report(y_test, svm_pred, target_names=['Ham', 'Spam']))
478
- ```
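As a reference for reading the classification reports, the sketch below recomputes the four numbers that `get_metrics` collects, using nothing but counts of true and false positives for the spam (1) class. The function name and the toy labels are assumptions for illustration; `train.py` itself relies on the scikit-learn metric functions.

```python
def sketch_spam_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the spam (1) class."""
    tp = 0  # spam predicted as spam
    fp = 0  # ham predicted as spam
    fn = 0  # spam predicted as ham
    tn = 0  # ham predicted as ham
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 1:
            fp += 1
        elif truth == 1 and guess == 0:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        'accuracy': round((tp + tn) / len(y_true), 4),
        'precision': round(precision, 4),
        'recall': round(recall, 4),
        'f1': round(f1, 4),
    }


# Toy labels: 4 spam and 4 ham emails, one spam email missed
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(sketch_spam_metrics(y_true, y_pred))
```

In this toy case precision is 1.0 (no ham was flagged) while recall is 0.75 (one spam slipped through), which is the trade-off the threshold step later tunes.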
479
-
480
- - [ ] **Step 3: Add VotingClassifier ensemble and threshold optimization**
481
-
482
- ```python
483
- # Build VotingClassifier with all three models
484
- # VotingClassifier retrains the models internally, so we pass fresh estimators
485
- print('\n--- Voting Ensemble ---')
486
- voting = VotingClassifier(
487
- estimators=[
488
- ('rf', RandomForestClassifier(n_estimators=200, n_jobs=-1, class_weight='balanced', random_state=random_state)),
489
- ('lr', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=random_state)),
490
- ('svm', SVC(kernel='linear', class_weight='balanced', probability=True, random_state=random_state)),
491
- ],
492
- voting='soft',
493
- n_jobs=-1
494
- )
495
- voting.fit(X_train, y_train)
496
- voting_pred = voting.predict(X_test)
497
- voting_metrics = get_metrics(y_test, voting_pred)
498
- print(classification_report(y_test, voting_pred, target_names=['Ham', 'Spam']))
499
-
- # Find optimal threshold using precision-recall curve
- # We want the highest threshold where ham predictions are >= 99% precise
- y_proba = voting.predict_proba(X_test)[:, 1]
- precision, recall, thresholds_pr = precision_recall_curve(y_test, y_proba)
- y_test_arr = np.array(y_test)  # convert to numpy to avoid pandas .values issues
- best_threshold = 0.50
- for t in sorted(thresholds_pr, reverse=True):
-     predicted_ham_mask = y_proba < t
-     if predicted_ham_mask.sum() == 0:
-         continue
-     ham_precision = (y_test_arr[predicted_ham_mask] == 0).sum() / predicted_ham_mask.sum()
-     if ham_precision >= 0.99:
-         best_threshold = t
-         break
- optimal_threshold = best_threshold
- print('Optimal threshold (99%% ham precision): %.4f' % optimal_threshold)
516
- ```
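The threshold search is easier to follow on a handful of emails. The sketch below mirrors the loop's logic in plain Python: walk candidate thresholds from high to low and stop at the first one where at least 99% of the emails scored below it are truly ham. The function name, the toy probabilities, and the use of the distinct probabilities as candidates are assumptions; `train.py` takes its candidates from `precision_recall_curve`.

```python
def sketch_find_threshold(y_true, spam_proba, target_ham_precision=0.99):
    """Highest threshold whose below-threshold emails are mostly real ham."""
    best_threshold = 0.50  # fallback if no candidate meets the target
    candidates = sorted(set(spam_proba), reverse=True)
    for t in candidates:
        ham_total = 0    # emails that would be labelled ham at threshold t
        ham_correct = 0  # of those, how many really are ham
        for truth, proba in zip(y_true, spam_proba):
            if proba < t:
                ham_total += 1
                if truth == 0:
                    ham_correct += 1
        if ham_total == 0:
            continue
        if ham_correct / ham_total >= target_ham_precision:
            best_threshold = t
            break
    return best_threshold


# Toy data: 1 = spam, 0 = ham, with the model's spam probability for each
y_true = [0, 0, 0, 1, 0, 1, 1]
spam_proba = [0.05, 0.10, 0.20, 0.35, 0.40, 0.80, 0.95]
print(sketch_find_threshold(y_true, spam_proba))  # → 0.35
```

Here the spam email scored 0.35 forces the threshold down to 0.35: any higher value would label that email ham and break the 99% target. The ham email at 0.40 is then flagged as spam, which is the cost of that choice.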
517
-
518
- - [ ] **Step 4: Add model saving section**
519
-
520
- ```python
521
- # Save all model artifacts
522
- models_dir.mkdir(exist_ok=True)
523
- joblib.dump(voting, models_dir / 'voting_model.joblib')
524
- joblib.dump(tfidf, models_dir / 'tfidf_vectorizer.joblib')
525
- joblib.dump(meta_scaler, models_dir / 'meta_scaler.joblib')
526
- joblib.dump(feature_names, models_dir / 'feature_names.joblib')
527
- joblib.dump(optimal_threshold, models_dir / 'optimal_threshold.joblib')
528
-
529
- # Save 200-row training sample for LIME
530
- X_train_dense = X_train.toarray()
531
- rng = np.random.RandomState(random_state)
532
- sample_idx = rng.choice(X_train_dense.shape[0], size=min(200, X_train_dense.shape[0]), replace=False)
533
- training_sample = X_train_dense[sample_idx]
534
- joblib.dump(training_sample, models_dir / 'training_sample.joblib')
535
-
536
- # Save training report as JSON (includes accuracy, precision, recall, F1 per spec)
537
- report = {
538
- 'random_forest': rf_metrics,
539
- 'logistic_regression': lr_metrics,
540
- 'svm': svm_metrics,
541
- 'voting_ensemble': voting_metrics,
542
- 'optimal_threshold': round(optimal_threshold, 4),
543
- 'best_single_model': max(
544
- [('random_forest', rf_metrics['f1']),
545
- ('logistic_regression', lr_metrics['f1']),
546
- ('svm', svm_metrics['f1'])],
547
- key=lambda x: x[1]
548
- )[0],
549
- }
550
- with open(models_dir / 'training_report.json', 'w') as f:
551
- json.dump(report, f, indent=2)
552
-
553
- print('\nAll models saved to models/')
554
- print('Training report: %s' % json.dumps(report, indent=2))
555
- ```
556
-
557
- - [ ] **Step 5: Run training**
558
-
559
- ```bash
560
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
561
- python train.py
562
- ```
563
-
564
- Expected output:
565
- - Loading messages for Kaggle and GitHub datasets
566
- - Classification reports for RF, LR, SVM, and Voting Ensemble
567
- - Optimal threshold printed
568
- - "All models saved to models/"
569
- - Files in `models/`: `voting_model.joblib`, `tfidf_vectorizer.joblib`, `meta_scaler.joblib`, `feature_names.joblib`, `optimal_threshold.joblib`, `training_sample.joblib`, `training_report.json`
570
-
571
- Verify:
572
- ```bash
573
- ls models/
574
- cat models/training_report.json
575
- ```
576
-
577
- - [ ] **Step 6: Commit**
578
-
579
- ```bash
580
- git add train.py
581
- git commit -m "feat: add train.py with RF/LR/SVM comparison and VotingClassifier ensemble"
582
- ```
583
-
584
- Note: Model artifacts are already in `.gitignore` from Task 1.
585
-
586
- ---
587
-
588
- ### Task 4: Gradio App — Basic Classification (`app.py`)
589
-
590
- **Files:**
591
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/app.py`
592
-
593
- - [ ] **Step 1: Write `app.py` — model loading and classification function**
594
-
595
- ```python
596
- # Spam Email Classifier with XAI Explanations — Gradio App
597
- # ENGT 375 Project - Spring 2026 - ODU
598
- # Run: python app.py
599
-
600
- import numpy as np
601
- import joblib
602
- import matplotlib
603
- matplotlib.use('Agg') # Non-interactive backend for Gradio
604
- import matplotlib.pyplot as plt
605
- import gradio as gr
606
- from pathlib import Path
607
- from scipy.sparse import hstack, csr_matrix
608
- from utils import (preprocess_text, compute_metadata_features,
609
- META_FEATURE_NAMES, FEATURE_DESCRIPTIONS)
610
-
611
- # Paths
612
- project_dir = Path(__file__).parent
613
- models_dir = project_dir / 'models'
614
-
615
- # Load trained model artifacts
616
- # These are created by running train.py first
617
- def load_models():
618
- """Load all saved model files. Returns None values if models not found."""
619
- try:
620
- model = joblib.load(models_dir / 'voting_model.joblib')
621
- vectorizer = joblib.load(models_dir / 'tfidf_vectorizer.joblib')
622
- scaler = joblib.load(models_dir / 'meta_scaler.joblib')
623
- feature_names = joblib.load(models_dir / 'feature_names.joblib')
624
- threshold = joblib.load(models_dir / 'optimal_threshold.joblib')
625
- training_sample = joblib.load(models_dir / 'training_sample.joblib')
626
- return model, vectorizer, scaler, feature_names, threshold, training_sample
627
- except FileNotFoundError:
628
- return None, None, None, None, None, None
629
-
630
- model, vectorizer, scaler, feature_names, threshold, training_sample = load_models()
631
-
632
-
- def classify_email(email_text):
-     """Classify a single email.
-
-     Returns (label, confidence, combined), where combined is the sparse
-     feature matrix. On error, label holds the message and combined is None.
-     """
-     if model is None:
-         return "Models not found. Run `python train.py` first.", 0.0, None
-
-     if not email_text or not email_text.strip():
-         return "Please enter email text or upload a file.", 0.0, None
-
-     # Step 1: Preprocess
-     clean = preprocess_text(email_text)
-
-     # Step 2: TF-IDF
-     tfidf_vec = vectorizer.transform([clean])
-
-     # Step 3: Metadata features
-     meta = compute_metadata_features([email_text])
-     meta_scaled = scaler.transform(meta)
-
-     # Step 4: Combine
-     combined = hstack([tfidf_vec, csr_matrix(meta_scaled)])
-
-     # Step 5: Predict
-     proba = model.predict_proba(combined)[0][1]  # probability of spam
-     is_spam = proba >= threshold
-     label = "SPAM" if is_spam else "HAM (Not Spam)"
-     confidence = proba if is_spam else (1 - proba)
-
-     return label, confidence, combined
661
- ```
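The last step of `classify_email` combines the spam probability with the saved threshold. The sketch below isolates that decision so its behaviour can be checked by hand; the function name and the example numbers are hypothetical.

```python
def sketch_decide_label(spam_proba, threshold):
    """Turn a spam probability into a label and a confidence value."""
    if spam_proba >= threshold:
        label = "SPAM"
        confidence = spam_proba
    else:
        label = "HAM (Not Spam)"
        confidence = 1 - spam_proba
    return label, confidence


print(sketch_decide_label(0.40, 0.35))
print(sketch_decide_label(0.20, 0.35))
```

One consequence worth knowing: if training produces a threshold below 0.5, an email can be labelled SPAM while its reported confidence is under 50%, as in the first call above. Whether that is acceptable for the UI is a design choice this sketch does not settle.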
662
-
663
- - [ ] **Step 2: Add plain-English summary function**
664
-
665
- ```python
666
- def generate_summary(label, confidence, email_text, lime_explanation=None):
667
- """Generate a plain-English explanation of the classification."""
668
- # Get metadata feature values for this email
669
- meta = compute_metadata_features([email_text])
670
- meta_values = meta[0]
671
-
672
- summary_lines = []
673
- summary_lines.append("This email was classified as **%s** (%.0f%% confidence).\n" % (label, confidence * 100))
674
- summary_lines.append("**Key factors:**\n")
675
-
676
- # If we have LIME results, use those for the top factors
677
- if lime_explanation is not None:
678
- feature_weights = lime_explanation.as_list()
679
- for feat_name, weight in feature_weights[:5]:
680
- direction = "toward spam" if weight > 0 else "toward ham"
681
- summary_lines.append("- **%s** pushes %s" % (feat_name, direction))
682
- else:
683
- # Fallback: report notable metadata values
684
- for i, name in enumerate(META_FEATURE_NAMES):
685
- val = meta_values[i]
686
- if val > 0:
687
- desc = FEATURE_DESCRIPTIONS.get(name, name)
688
- summary_lines.append("- %s: %.2f" % (desc, val))
689
-
690
- return "\n".join(summary_lines)
691
- ```
692
-
693
- - [ ] **Step 3: Add example emails**
694
-
695
- ```python
696
- # Example emails for quick testing (from the old project)
697
- EXAMPLE_EMAILS = [
698
- ["Subject: URGENT - You Have Won $5,000,000!!!\n\nDear Friend,\n\nCONGRATULATIONS!!! You have been selected as the winner of our international lottery program!!!\nTo claim your $5,000,000 USD prize, click the link below IMMEDIATELY and provide your bank details.\n\nACT NOW - This offer expires in 24 hours!!!\n\nClick here: http://totally-legit-prize.com/claim\nSend $500 processing fee to unlock your winnings.\n\nBest regards,\nDr. Prince Mohammed"],
699
- ["Subject: Team sync Thursday 2pm\n\nHi everyone,\n\nJust a reminder that we have our weekly team sync this Thursday at 2pm in Conference Room B.\n\nAgenda:\n- Sprint review\n- Q2 planning discussion\n- New hire onboarding update\n\nPlease come prepared with your status updates.\n\nThanks,\nSarah"],
700
- ["Subject: Your account has been compromised!\n\nDear Customer,\n\nWe detected suspicious activity on your account. Click here immediately to verify your identity: http://secure-bank-login.com/verify\n\nIf you do not verify within 24 hours, your account will be permanently locked.\n\nSecurity Team"],
701
- ["Subject: Thanksgiving dinner plans\n\nHi everyone!\n\nI wanted to start planning for Thanksgiving dinner. I'm thinking we could do it at my place this year. What does everyone think about 4pm?\n\nLet me know if you have any dietary restrictions or if you want to bring a dish.\n\nLove,\nMom"],
702
- ]
703
- ```
704
-
705
- - [ ] **Step 4: Add Gradio interface with basic Result tab**
706
-
707
- ```python
708
- def classify_and_explain(email_text, file_obj):
709
- """Main function called by Gradio. Returns result text, LIME plot, SHAP plot."""
710
- # Handle file upload
711
- if file_obj is not None:
712
- try:
713
- email_text = Path(file_obj.name).read_text(encoding='utf-8', errors='replace')
714
- except Exception:
715
- return "Could not read file. Please upload a .txt file.", None, None
716
-
717
- if not email_text or not email_text.strip():
718
- return "Please enter email text or upload a file.", None, None
719
-
720
- if model is None:
721
- return "Models not found. Run `python train.py` first.", None, None
722
-
723
- # Classify
724
- label, confidence, combined = classify_email(email_text)
725
-
726
- # Generate summary (LIME will be added in Task 5)
727
- summary = generate_summary(label, confidence, email_text)
728
-
729
- return summary, None, None # LIME and SHAP plots added in later tasks
730
-
731
-
732
- # Build the Gradio interface
733
- with gr.Blocks(title="Spam Email Classifier with XAI") as demo:
734
- gr.Markdown("# Spam Email Classifier with XAI Explanations")
735
- gr.Markdown("Paste an email below or upload a .txt file to classify it as spam or ham.")
736
-
737
- with gr.Row():
738
- with gr.Column(scale=1):
739
- email_input = gr.Textbox(
740
- label="Email Text",
741
- placeholder="Paste email content here...",
742
- lines=12,
743
- )
744
- file_input = gr.File(label="Or upload a .txt file", file_types=[".txt"])
745
- classify_btn = gr.Button("Classify Email", variant="primary")
746
- gr.Examples(
747
- examples=EXAMPLE_EMAILS,
748
- inputs=email_input,
749
- label="Example Emails",
750
- )
751
-
752
- with gr.Column(scale=1):
753
- with gr.Tab("Result"):
754
- result_output = gr.Markdown(label="Classification Result")
755
- with gr.Tab("LIME Explanation"):
756
- lime_output = gr.Plot(label="LIME")
757
- with gr.Tab("SHAP — Metadata Feature Importance"):
758
- shap_output = gr.Plot(label="SHAP")
759
-
760
- classify_btn.click(
761
- fn=classify_and_explain,
762
- inputs=[email_input, file_input],
763
- outputs=[result_output, lime_output, shap_output],
764
- )
765
-
766
- if __name__ == '__main__':
767
- demo.launch()
768
- ```
769
-
770
- - [ ] **Step 5: Test the basic app launches**
771
-
772
- ```bash
773
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
774
- python app.py
775
- ```
776
-
777
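- Expected: Gradio launches at `http://127.0.0.1:7860`. Open it in a browser, select the first example email (the lottery-prize scam signed "Dr. Prince Mohammed"), and click Classify Email. The plain-English summary should appear in the Result tab. The LIME and SHAP tabs stay empty for now because `classify_and_explain` returns `None` for both plots.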
778
-
779
- Stop the server with Ctrl+C after verifying.
780
-
781
- - [ ] **Step 6: Commit**
782
-
783
- ```bash
784
- git add app.py
785
- git commit -m "feat: add Gradio app with classification and plain-English summary"
786
- ```
787
-
- ---
-
- ### Task 5: LIME Explanation Tab
-
- **Files:**
- - Modify: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/app.py`
-
- - [ ] **Step 1: Add LIME imports and explainer setup**
-
- Add to the top of `app.py` after existing imports:
- ```python
- import lime
- import lime.lime_tabular
- ```
-
- After the `load_models()` call, add:
- ```python
- # Set up LIME explainer using the saved training sample
- # LIME needs a sample of training data to understand feature distributions
- if training_sample is not None:
-     lime_explainer = lime.lime_tabular.LimeTabularExplainer(
-         training_data=training_sample,
-         feature_names=feature_names,
-         class_names=['Ham', 'Spam'],
-         mode='classification',
-     )
- else:
-     lime_explainer = None
- ```
-
- - [ ] **Step 2: Add `generate_lime_plot` function**
-
- ```python
- def generate_lime_plot(combined_features):
-     """Generate a LIME explanation plot for the classified email.
-
-     Returns a (figure, explanation) pair, or (None, None) if LIME is unavailable.
-     """
-     if lime_explainer is None:
-         # Return a pair so callers can always unpack two values
-         return None, None
-
-     # Convert sparse matrix to dense array for LIME
-     # This is fine for a single email - only a problem with thousands
-     instance = combined_features.toarray()[0]
-
-     # LIME explains this single prediction
-     explanation = lime_explainer.explain_instance(
-         instance,
-         model.predict_proba,
-         num_features=10,
-     )
-
-     # Create matplotlib figure
-     fig = explanation.as_pyplot_figure()
-     fig.set_size_inches(10, 6)
-     fig.tight_layout()
-     return fig, explanation
- ```
-
- - [ ] **Step 3: Update `classify_and_explain` to include LIME**
-
- Replace the `classify_and_explain` function body to call `generate_lime_plot` and pass the LIME explanation to `generate_summary`:
-
- ```python
- def classify_and_explain(email_text, file_obj):
-     # ... (file handling and validation unchanged) ...
-
-     # Classify
-     label, confidence, combined = classify_email(email_text)
-
-     # LIME explanation
-     lime_fig = None
-     lime_exp = None
-     if lime_explainer is not None:
-         lime_fig, lime_exp = generate_lime_plot(combined)
-
-     # Generate summary using LIME results
-     summary = generate_summary(label, confidence, email_text, lime_explanation=lime_exp)
-
-     return summary, lime_fig, None  # SHAP added in Task 6
- ```
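The LIME explanation handed to `generate_summary` can be reduced to a short "top factors" list. The sketch below assumes the (feature, weight) pairs come from LIME's `explanation.as_list()`, where a positive weight supports the explained class (Spam by default). The helper names `top_factors` and `describe_factor` and the sample pairs are hypothetical and are not part of the plan's `app.py`.

```python
def top_factors(pairs, k=5):
    """Return the k (feature, weight) pairs with the largest absolute weight.

    `pairs` is assumed to look like the output of LIME's
    explanation.as_list(): a list of (feature_description, weight) tuples.
    """
    ranked = sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


def describe_factor(feature, weight):
    """Turn one pair into a plain-English phrase (positive weight = toward spam)."""
    direction = 'spam' if weight > 0 else 'ham'
    return '%s pushes toward %s (%.2f)' % (feature, direction, weight)


# Hypothetical pairs, for illustration only
sample_pairs = [
    ('exclamation_count > 3.00', 0.21),
    ('free', 0.12),
    ('meeting', -0.30),
    ('url_count <= 0.00', -0.05),
]

for feature, weight in top_factors(sample_pairs, k=2):
    print(describe_factor(feature, weight))
```

Ranking by absolute weight keeps strong ham signals in the summary alongside strong spam signals, which matters when the email is classified as ham.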
-
- - [ ] **Step 4: Test LIME tab works**
-
- ```bash
- python app.py
- ```
-
- Open the browser and classify an example email. The LIME tab should show a horizontal bar chart of feature names and their contributions. LIME's default plot explains class 1 (Spam), so green bars push toward spam and red bars push toward ham.
-
- Stop server with Ctrl+C.
-
- - [ ] **Step 5: Commit**
-
- ```bash
- git add app.py
- git commit -m "feat: add LIME explanation tab with feature importance plot"
- ```
-
- ---
-
- ### Task 6: SHAP Explanation Tab
-
- **Files:**
- - Modify: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/app.py`
-
- - [ ] **Step 1: Add SHAP import**
-
- Add at the top of `app.py`:
- ```python
- import shap
- ```
-
- - [ ] **Step 2: Add `generate_shap_plot` function**
-
- This uses KernelExplainer on metadata features only (24 features) for speed:
-
- ```python
- def generate_shap_plot(email_text):
-     """Generate a SHAP bar chart for metadata features only.
-
-     We only use the 24 metadata features (not 3000+ TF-IDF features)
-     because KernelExplainer would be too slow on the full feature set.
-     """
-     if model is None or training_sample is None:
-         return None
-
-     # Compute metadata features for this email
-     meta = compute_metadata_features([email_text])
-     meta_scaled = scaler.transform(meta)
-
-     # Get metadata columns from training sample for background
-     # Metadata features are the last 24 columns in the combined feature matrix
-     n_meta = len(META_FEATURE_NAMES)
-     background_meta = training_sample[:50, -n_meta:]
-
-     # We need a predict function that works on metadata-only input
-     # We'll create a wrapper that fills in zeros for TF-IDF features
-     n_tfidf = training_sample.shape[1] - n_meta
-
-     def predict_with_meta_only(meta_array):
-         """Predict using only metadata features (pad TF-IDF with zeros)."""
-         zeros = np.zeros((meta_array.shape[0], n_tfidf))
-         full = np.hstack([zeros, meta_array])
-         return model.predict_proba(full)
-
-     # Create SHAP explainer with small background sample
-     explainer = shap.KernelExplainer(predict_with_meta_only, background_meta)
-     shap_values = explainer.shap_values(meta_scaled, nsamples=100)
-
-     # shap_values format depends on SHAP version:
-     # - Older versions: list of arrays [ham_values, spam_values]
-     # - Newer versions: one array, shaped (n_samples, n_features) or
-     #   (n_samples, n_features, n_classes)
-     # We want the SHAP values for the spam class (class 1)
-     if isinstance(shap_values, list) and len(shap_values) == 2:
-         # List format: [ham_values, spam_values], each is (n_samples, n_features)
-         spam_shap = shap_values[1][0]
-     elif isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
-         # 3-D array (n_samples, n_features, n_classes): first sample, class 1
-         spam_shap = shap_values[0, :, 1]
-     elif isinstance(shap_values, np.ndarray) and shap_values.ndim == 2:
-         # Single array (n_samples, n_features): treated as the spam class values
-         spam_shap = shap_values[0]
-     else:
-         # Fallback: try to use as-is
-         spam_shap = np.array(shap_values).flatten()[:n_meta]
-
-     # Create bar chart
-     fig, ax = plt.subplots(figsize=(10, 6))
-     sorted_idx = np.argsort(np.abs(spam_shap))
-     top_idx = sorted_idx[-10:]  # top 10 features
-
-     colors = ['#ff4444' if v > 0 else '#4444ff' for v in spam_shap[top_idx]]
-     ax.barh(
-         [META_FEATURE_NAMES[i] for i in top_idx],
-         spam_shap[top_idx],
-         color=colors,
-     )
-     ax.set_xlabel('SHAP Value (impact on spam prediction)')
-     ax.set_title('SHAP — Metadata Feature Importance')
-     ax.axvline(x=0, color='gray', linestyle='--', linewidth=0.5)
-     fig.tight_layout()
-     return fig
- ```
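The version-dependent branching on `shap_values` is easier to check in isolation. The sketch below is a standalone helper, assuming the three layouts a binary classifier's output may take: a list with one array per class, a 3-D array of shape (n_samples, n_features, n_classes), or a 2-D array of shape (n_samples, n_features). The name `spam_class_values` and the fake numbers are hypothetical; which layout a given SHAP release produces should be confirmed against the installed version.

```python
import numpy as np


def spam_class_values(shap_values, sample=0, spam_class=1):
    """Return a 1-D array of SHAP values for one sample and the spam class."""
    if isinstance(shap_values, list):
        # List layout: one (n_samples, n_features) array per class
        return np.asarray(shap_values[spam_class])[sample]
    values = np.asarray(shap_values)
    if values.ndim == 3:
        # Array layout: (n_samples, n_features, n_classes)
        return values[sample, :, spam_class]
    if values.ndim == 2:
        # Single-output layout: (n_samples, n_features)
        return values[sample]
    raise ValueError('Unexpected SHAP output shape: %r' % (values.shape,))


# Fake values for 1 email, 3 features, 2 classes (illustration only)
ham = np.array([[0.1, -0.2, 0.0]])
spam = -ham
as_list = [ham, spam]
as_3d = np.stack([ham, spam], axis=-1)  # shape (1, 3, 2)

print(spam_class_values(as_list))
print(spam_class_values(as_3d))
```

Raising on an unknown shape, instead of flattening, makes a layout change show up as an error and not as a silently wrong chart.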
966
-
967
- - [ ] **Step 3: Update `classify_and_explain` to include SHAP**
968
-
969
- Replace the last return line:
970
- ```python
971
- # SHAP explanation (metadata features only)
972
- shap_fig = generate_shap_plot(email_text)
973
-
974
- return summary, lime_fig, shap_fig
975
- ```
976
-
977
- - [ ] **Step 4: Test SHAP tab works**
978
-
979
- ```bash
980
- python app.py
981
- ```
982
-
983
- Open browser, classify an email. The SHAP tab should show a horizontal bar chart with metadata feature names. Red bars = pushes toward spam, blue = pushes toward ham. May take a few seconds (KernelExplainer is slower than TreeExplainer).
984
-
985
- Stop server with Ctrl+C.
986
-
987
- - [ ] **Step 5: Commit**
988
-
989
- ```bash
990
- git add app.py
991
- git commit -m "feat: add SHAP metadata feature importance tab"
992
- ```
993
-
994
- ---
995
-
- ### Task 7: Update CHANGELOG and Final Polish
-
- **Files:**
- - Modify: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/CHANGELOG.md`
-
- - [ ] **Step 1: Update CHANGELOG.md with full v0.1.0 entry**
-
- ```markdown
- # Changelog
-
- All notable changes to this project will be documented in this file.
- This serves as a reference for writing the course paper's methodology section.
-
- ## v0.1.0 — 2026-03-23
- ### Initial Build
- - Created fresh project with Gradio UI (replacing old Streamlit version)
- - Ported preprocessing and 24 metadata features from old project's utils_student.py
- - Loaded Kaggle spam dataset (~190K emails, capped at 100K) + GitHub email-dataset
- - Trained and compared 3 models: Random Forest, Logistic Regression, SVM
- - Combined all 3 into a VotingClassifier (soft voting) for better accuracy
- - Built Gradio interface with:
-   - Text input + file upload
-   - Result tab with plain-English summary (top 5 factors)
-   - LIME explanation tab (full feature space, top 10 features)
-   - SHAP tab (metadata features only, KernelExplainer)
-   - 4 built-in example emails for quick testing
- - All paths cross-platform (macOS compatible, no Windows .bat files)
- - No LLM/Ollama dependency — pure scikit-learn
- ```
-
- - [ ] **Step 2: Run the app end-to-end and verify all tabs work**
-
- ```bash
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio"
- python app.py
- ```
-
- Verify in browser:
- 1. Paste Nigerian Prince email → Result tab shows SPAM with high confidence
- 2. LIME tab shows feature importance bar chart
- 3. SHAP tab shows metadata feature bar chart
- 4. Paste meeting invite email → Result tab shows HAM
- 5. Upload a .txt file → works
- 6. Click a built-in example → the email text box is filled in
-
- - [ ] **Step 3: Commit final state**
-
- ```bash
- git add CHANGELOG.md
- git commit -m "docs: update CHANGELOG with v0.1.0 initial build"
- ```
-
- ---
-
- ### Task 8: Retroactive Changelog for Old Project
-
- **Files:**
- - Create: `/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/CHANGELOG.md`
-
- This is a documentation-only task. No code changes to the old project.
-
- - [ ] **Step 1: Examine old project file timestamps and code comments**
-
- Check modification dates:
- ```bash
- ls -lt "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/"
- ```
-
- Read code comments with "Change" markers:
- ```bash
- grep -n "Change\|change\|--- " "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/app.py" | head -20
- grep -n "Change\|change\|--- " "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/retrain.py" | head -20
- ```
-
- - [ ] **Step 2: Write `CHANGELOG.md` for old project**
-
- Reconstruct the development timeline from file dates and code comments. The changelog should cover:
- - Initial Streamlit app (`app.py`) with basic spam classification
- - Feature engineering evolution (11 → 24 metadata features)
- - Addition of LIME, SHAP, ELI5 explanations
- - Ollama/Qwen LLM integration for AI explanations
- - Student version creation (`app_student.py`, `utils_student.py`, `retrain_student.py`)
- - Context-aware phrase lists (Change 4)
- - Domain whitelist (Change 2)
- - Newsletter augmentation (Change 9)
- - OCR support addition
- - Dark mode fixes
-
- Structure as dated version entries using file modification timestamps.
-
- - [ ] **Step 3: Commit to old project (if git initialized) or just save**
-
- If the old project has no git repo, just save the file. The user can decide whether to initialize git later.
-
- ---
-
- ### Task Summary
-
- | Task | Description | Depends On |
- |------|-------------|------------|
- | 1 | Project scaffolding (dirs, symlink, deps, git) | — |
- | 2 | `utils.py` with preprocessing + 24 features | Task 1 |
- | 3 | `train.py` with model comparison + ensemble | Task 2 |
- | 4 | `app.py` basic Gradio UI + classification | Task 3 |
- | 5 | LIME explanation tab | Task 4 |
- | 6 | SHAP explanation tab | Task 5 |
- | 7 | CHANGELOG update + final verification | Task 6 |
- | 8 | Retroactive changelog for old project | — (independent) |
-
- Tasks 1-7 are sequential. Task 8 can run in parallel with any task.
 
docs/superpowers/plans/2026-03-23-mlx-spam-classifier.md DELETED
@@ -1,848 +0,0 @@
- # MLX Spam Classifier Implementation Plan
-
- > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
- **Goal:** Fine-tune Qwen3.5-0.8B on spam classification using Apple MLX with LoRA, build a Gradio UI with classify + chat tabs, and create comprehensive documentation.
-
- **Architecture:** New standalone project (`spam-classifier-mlx/`). Three main scripts: `prepare_data.py` (generates training JSONL using local 9B model), `fine_tune.py` (LoRA fine-tuning wrapper), `app.py` (Gradio UI). A `docs/` folder contains all reference documentation. Models downloaded from HuggingFace.
-
- **Tech Stack:** Python, Apple MLX, mlx-lm, Gradio, pandas, numpy
-
- **Spec:** `docs/superpowers/specs/2026-03-23-mlx-spam-classifier-design.md`
-
- **Agent Team:**
- - **Implementer agents** — one per task, writes code
- - **QA agent** — dispatched after each task to verify: plan compliance, code quality, wiring correctness, beginner-level code style matching ENGT 375 lectures
-
- ---
-
- ### Task 1: Project Scaffolding + Documentation Folder
-
- **Files:**
- - Create: `spam-classifier-mlx/requirements.txt`
- - Create: `spam-classifier-mlx/.gitignore`
- - Create: `spam-classifier-mlx/CLAUDE.md`
- - Create: `spam-classifier-mlx/CHANGELOG.md`
- - Create: `spam-classifier-mlx/docs/README.md`
- - Create: `spam-classifier-mlx/docs/01-what-is-mlx.md`
- - Create: `spam-classifier-mlx/docs/02-what-is-lora.md`
- - Create: `spam-classifier-mlx/docs/03-training-guide.md`
- - Create: `spam-classifier-mlx/docs/04-mlx-lm-reference.md`
- - Create: `spam-classifier-mlx/docs/05-deployment-guide.md`
-
- - [ ] **Step 1: Create project directory, subdirectories, and data symlink**
-
- ```bash
- mkdir -p "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"/{models,adapters,training_data,docs}
- ln -s "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-xai-project/data" \
-   "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/data"
- ```
-
- Verify: `ls "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/data/spam_Emails_data.csv"`
-
- - [ ] **Step 2: Create requirements.txt**
-
- ```
- mlx-lm[train]>=0.31.0
- gradio==4.19.2
- numpy>=1.24.0
- pandas>=2.0.0
- ```
-
- Note: `mlx-lm[train]` installs `mlx`, `transformers`, `safetensors`, `sentencepiece`, `tiktoken`, and training deps. No need to list `mlx` separately.
-
- - [ ] **Step 3: Create .gitignore**
-
- ```
- __pycache__/
- *.pyc
- .pytest_cache/
- venv/
- models/
- adapters/
- fused_model/
- training_data/
- data/
- *.egg-info/
- .DS_Store
- ```
-
- - [ ] **Step 4: Create CLAUDE.md**
-
- ```markdown
- # CLAUDE.md
-
- ## Project Context
- Fine-tuned LLM spam classifier using Apple MLX for ENGT 375 (Applied Machine Learning, Spring 2026, ODU).
- Uses LoRA fine-tuning on Qwen3.5-0.8B-MLX-9bit for spam/ham classification with natural language explanations.
-
- ## Code Style
- - Beginner-level Python: explicit for-loops, clear variable names, comments explaining why
- - No advanced patterns (decorators, metaclasses, complex comprehensions)
- - Reference course concepts in comments where applicable
-
- ## How to Run
- 1. Create venv: `python3 -m venv venv && source venv/bin/activate`
- 2. Install deps: `pip install -r requirements.txt`
- 3. Generate training data: `python3 prepare_data.py` (~20-40 min, needs 9B model)
- 4. Fine-tune model: `python3 fine_tune.py` (~10-20 min)
- 5. Launch app: `python3 app.py`
-
- ## Key Files
- - `prepare_data.py` — Generates training JSONL using local Qwen3.5-9B model
- - `fine_tune.py` — Wrapper around mlx_lm.lora for LoRA fine-tuning
- - `app.py` — Gradio UI with Classify and Chat tabs
- - `docs/` — Reference documentation (MLX guide, LoRA explanation, training guide, etc.)
-
- ## Data
- - `data/` is a symlink to `../spam-xai-project/data/`
- - `training_data/` contains generated JSONL (created by prepare_data.py)
- - `models/` contains downloaded base model
- - `adapters/` contains LoRA weights (created by fine_tune.py)
- ```
-
- - [ ] **Step 5: Create initial CHANGELOG.md**
-
- ```markdown
- # Changelog
-
- All notable changes to this project will be documented in this file.
- This serves as a reference for writing the course paper's methodology section.
-
- ## v0.1.0 — 2026-03-23
- ### Initial Project Setup
- - Created project scaffold for MLX-based spam classifier
- - Set up documentation folder with MLX, LoRA, training, and deployment guides
- - Symlinked data from spam-xai-project
- ```
-
- - [ ] **Step 6: Create docs/README.md**
-
- ```markdown
- # Documentation
-
- Reference guides for the MLX spam classifier project.
-
- | Document | Description |
- |----------|-------------|
- | [01-what-is-mlx.md](01-what-is-mlx.md) | What is Apple MLX and why use it on Apple Silicon |
- | [02-what-is-lora.md](02-what-is-lora.md) | LoRA fine-tuning explained for beginners |
- | [03-training-guide.md](03-training-guide.md) | Step-by-step: preparing data, fine-tuning, evaluating |
- | [04-mlx-lm-reference.md](04-mlx-lm-reference.md) | mlx-lm CLI commands and Python API reference |
- | [05-deployment-guide.md](05-deployment-guide.md) | How to deploy to Hugging Face Spaces |
- ```
-
- - [ ] **Step 7: Create docs/01-what-is-mlx.md**
-
- Write a beginner-friendly guide (~200 words) covering:
- - MLX is Apple's ML framework built for Apple Silicon (M1/M2/M3/M4 chips)
- - Uses unified memory — GPU and CPU share the same RAM (no copying data between them)
- - Alternative to PyTorch/TensorFlow for Mac users
- - Why it matters: fine-tune LLMs on your laptop instead of needing a cloud GPU
- - Reference: https://github.com/ml-explore/mlx
-
- - [ ] **Step 8: Create docs/02-what-is-lora.md**
-
- Write a beginner-friendly guide (~300 words) covering:
- - Normal fine-tuning: update ALL model weights (billions of parameters, huge memory)
- - LoRA: freeze the original weights, add tiny "adapter" matrices alongside them
- - Only train the adapters (~1-5% of total parameters)
- - Result: same quality, fraction of the memory and time
- - QLoRA: base model is quantized (compressed) to save even more memory
- - Analogy: "Instead of rewriting a textbook, you add sticky notes with corrections"
- - Reference: https://arxiv.org/abs/2106.09685
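The "tiny adapter" point can be checked with simple arithmetic. For one weight matrix of shape d_out × d_in, LoRA adds two matrices of shapes d_out × r and r × d_in, so it trains r · (d_in + d_out) parameters where full fine-tuning trains d_in · d_out. The sketch below uses illustrative dimensions and rank; they are assumptions and not the actual layer sizes of Qwen3.5-0.8B.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, adapter) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out                # every weight is trainable
    adapter = rank * (d_in + d_out)    # only the two low-rank matrices are trainable
    return full, adapter


# Illustrative sizes only (assumed, not taken from any real model config)
full, adapter = lora_param_counts(d_in=1024, d_out=1024, rank=8)
print('full fine-tuning: %d parameters' % full)
print('LoRA adapter:     %d parameters' % adapter)
print('adapter share:    %.2f%%' % (100.0 * adapter / full))
```

With these assumed sizes the adapter is about 1.56% of the matrix. The overall share for a whole model also depends on how many layers receive adapters.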
-
- - [ ] **Step 9: Create docs/03-training-guide.md**
-
- Write a step-by-step guide (~400 words) covering:
- 1. Prepare your data as JSONL (chat format with system/user/assistant messages)
- 2. Download the base model from HuggingFace
- 3. Run `mlx_lm.lora --model <path> --train --data <dir>` with key flags
- 4. Monitor training loss (should decrease over iterations)
- 5. Evaluate with `--test` flag (lower perplexity = better)
- 6. Test with `mlx_lm.generate` to see real outputs
- 7. Fuse adapter into base model with `mlx_lm.fuse` for deployment
- 8. Memory tips: `--grad-checkpoint`, reduce `--batch-size`, reduce `--num-layers`
-
- - [ ] **Step 10: Create docs/04-mlx-lm-reference.md**
-
- Write a reference card with all mlx-lm commands:
- - `mlx_lm.lora` — all flags: `--model`, `--train`, `--test`, `--data`, `--iters`, `--batch-size`, `--learning-rate`, `--num-layers`, `--adapter-path`, `--mask-prompt`, `--grad-checkpoint`, `--fine-tune-type` (lora/dora/full)
- - `mlx_lm.generate` — `--model`, `--adapter-path`, `--prompt`, `--max-tokens`
- - `mlx_lm.fuse` — `--model`, `--adapter-path`, `--upload-repo`, `--export-gguf`
- - Python API: `from mlx_lm import load, generate`
- - Data format: JSONL chat, completions, text formats with examples
- - Source: https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/LORA.md
-
- - [ ] **Step 11: Create docs/05-deployment-guide.md**
-
- Write a guide (~200 words) covering:
- - Problem: HF Spaces runs Linux, not Apple Silicon. MLX won't work there.
- - Solution: Fuse adapter → convert to transformers format → deploy with torch
- - Step 1: `mlx_lm.fuse --model models/Qwen3.5-0.8B-MLX-9bit`
- - Step 2: Upload fused model to HuggingFace Hub
- - Step 3: Create `app_hf.py` using `transformers` instead of `mlx_lm`
- - Step 4: Create HF Space with `requirements.txt` listing `transformers`, `torch`, `gradio`
- - Reference: https://huggingface.co/docs/hub/spaces-sdks-gradio
-
- - [ ] **Step 12: Initialize git repo and commit**
-
- ```bash
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
- git init
- git add requirements.txt .gitignore CLAUDE.md CHANGELOG.md docs/
- git commit -m "chore: scaffold project with docs, requirements, CLAUDE.md, CHANGELOG"
- ```
-
- - [ ] **Step 13: Create venv and install dependencies**
-
- ```bash
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
- python3 -m venv venv
- source venv/bin/activate
- pip install -r requirements.txt
- ```
-
- Verify: `python3 -c "import mlx_lm; print('mlx-lm OK')"`
-
- - [ ] **Step 14: Download the base model**
-
- ```bash
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
- source venv/bin/activate
- huggingface-cli download inferencerlabs/Qwen3.5-0.8B-MLX-9bit --local-dir models/Qwen3.5-0.8B-MLX-9bit
- ```
-
- Verify: `ls models/Qwen3.5-0.8B-MLX-9bit/config.json` (config.json always exists; safetensors may be sharded into multiple files)
-
- **Fallback model:** If LoRA training on Qwen3.5-0.8B produces poor results (the model uses non-standard GatedDeltaNet attention layers), try `mlx-community/Qwen2.5-0.5B-Instruct-4bit` as an alternative — standard transformer architecture, explicitly supported by mlx-lm.
-
- Quick test:
- ```bash
- mlx_lm.generate --model models/Qwen3.5-0.8B-MLX-9bit --prompt "Hello, how are you?" --max-tokens 50
- ```
- Expected: Model generates a short text response.
-
- ---
-
- ### Task 2: Data Preparation (`prepare_data.py`)
-
- **Files:**
- - Create: `spam-classifier-mlx/prepare_data.py`
-
- - [ ] **Step 1: Write the COMPLETE prepare_data.py**
-
- This is the most complex script. The implementing agent MUST include all of the following in the file. Key technical details:
-
- **CRITICAL — Chat Template:** The `mlx_lm.generate()` Python API does NOT auto-apply the chat template. You MUST use `tokenizer.apply_chat_template()` before calling `generate()`. Qwen3.5 uses ChatML format (`<|im_start|>system\n...<|im_end|>`).
-
- **CRITICAL — Thinking Mode:** Qwen3.5 outputs `<think>...</think>` tags by default. Pass `enable_thinking=False` in `apply_chat_template` to suppress this. If thinking tokens leak into training data, the fine-tuned model will learn to produce them.
-
- **CRITICAL — Response Parsing:** Strip any `<think>...</think>` blocks from responses as a safety measure, then extract the first line (SPAM or HAM) for validation.
-
- The complete file must contain:
-
- ```python
- # Generate training data for the spam classifier
- # ENGT 375 Project - Spring 2026 - ODU
- #
- # This script uses the local Qwen3.5-9B model to generate
- # classification explanations for each email, then saves
- # them as JSONL files for fine-tuning the 0.8B model.
- #
- # Run: python3 prepare_data.py
- # Requires: ~/MLXModels/mlx-community/Qwen3.5-9B-OptiQ-4bit/
- # Time: ~30-60 minutes (600 emails through 9B model)
-
- import json
- import re
- import random
- import pandas as pd
- from pathlib import Path
- from mlx_lm import load, generate
-
- # Paths
- project_dir = Path(__file__).parent
- data_dir = project_dir / 'data'
- output_dir = project_dir / 'training_data'
- output_dir.mkdir(exist_ok=True)
-
- # The 9B model generates explanations (smarter than the 0.8B we'll fine-tune)
- MODEL_9B_PATH = str(Path.home() / 'MLXModels' / 'mlx-community' / 'Qwen3.5-9B-OptiQ-4bit')
-
- random.seed(42)
-
- SYSTEM_PROMPT = "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."
-
- CLASSIFY_PROMPT = """Classify this email as SPAM or HAM. Give your classification on the first line, then explain your reasoning in 2-3 sentences. Be specific about what words, patterns, or signals you noticed.
-
- Email:
- {email_text}"""
-
- # Hardcoded Q&A topics for conversational training data
- QA_PROMPTS = [
-     "Why do spam emails often use urgency language like 'act now' or 'limited time'?",
-     "What is the difference between spam and phishing emails?",
-     "How can you tell if a marketing email from a legitimate company is not spam?",
-     "Why do spam emails use dollar signs and large numbers?",
-     "What makes newsletters sometimes look like spam to filters?",
-     "What are common red flags in email headers that indicate spam?",
-     "Why do spam emails sometimes misspell words intentionally?",
-     "How do spammers try to bypass email filters?",
-     "What should I do if I receive a suspicious email?",
-     "What is a ham email?",
-     # ... (implement at least 50 diverse prompts covering spam patterns,
-     # email security, classification techniques, etc.)
- ]
- # The implementing agent should expand this to 50 prompts.
-
-
- def strip_thinking(text):
-     """Remove any <think>...</think> blocks from the model's response."""
-     cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
-     return cleaned.strip()
-
-
- def generate_with_chat_template(model, tokenizer, system_msg, user_msg, max_tokens=200):
-     """Generate a response using the proper chat template.
-
-     IMPORTANT: The mlx_lm.generate() Python API does NOT auto-apply the
-     chat template. We must format the prompt ourselves using
-     tokenizer.apply_chat_template(). Without this, the model gets raw
-     text and produces garbage output.
-     """
-     messages = [
-         {"role": "system", "content": system_msg},
-         {"role": "user", "content": user_msg},
-     ]
-     # apply_chat_template converts messages to the ChatML format the model expects
-     # enable_thinking=False suppresses the <think>...</think> chain-of-thought output
-     prompt = tokenizer.apply_chat_template(
-         messages,
-         tokenize=False,
-         add_generation_prompt=True,
-         enable_thinking=False,
-     )
-     response = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
-     # Safety: strip any thinking tags that slipped through
-     response = strip_thinking(response)
-     return response
-
-
- def parse_classification(response_text):
-     """Extract SPAM or HAM from the first line of the model's response."""
-     first_line = response_text.strip().split('\n')[0].upper()
-     if 'SPAM' in first_line:
-         return 'spam'
-     elif 'HAM' in first_line:
-         return 'ham'
-     return None
-
-
- def format_as_jsonl(system_prompt, user_content, assistant_content):
-     """Format one training example as a JSONL chat message dict."""
-     return json.dumps({
-         "messages": [
-             {"role": "system", "content": system_prompt},
-             {"role": "user", "content": user_content},
-             {"role": "assistant", "content": assistant_content},
-         ]
-     })
- ```
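To see how the two parsing helpers behave together, the sketch below repeats them and runs them on made-up responses. The sample strings are hypothetical, not real model output. It also shows one caveat of the substring check: a first line such as "Not spam" is still labelled spam.

```python
import re


def strip_thinking(text):
    """Remove any <think>...</think> blocks from the model's response."""
    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    return cleaned.strip()


def parse_classification(response_text):
    """Extract SPAM or HAM from the first line of the model's response."""
    first_line = response_text.strip().split('\n')[0].upper()
    if 'SPAM' in first_line:
        return 'spam'
    elif 'HAM' in first_line:
        return 'ham'
    return None


# Made-up response with a leaked thinking block (illustration only)
raw = '<think>\nLots of urgency words...\n</think>\nSPAM\nThe email pressures the reader to act now.'
clean = strip_thinking(raw)

print(clean.split('\n')[0])
print(parse_classification(clean))
print(parse_classification('HAM\nA normal meeting invite.'))
print(parse_classification('I am not sure.'))
print(parse_classification('Not spam'))  # caveat: substring match still says spam
```

Because mismatches against the ground-truth label are discarded later in the script, a mislabelled "Not spam" reply on a ham email would be dropped, not written to the training set.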
-
- Then the script body must:
- 1. Load Kaggle CSV, oversample to 350 spam + 350 ham (700 total, to account for mismatches after validation)
- 2. Load the 9B model with `load(MODEL_9B_PATH)`
- 3. For each email (printing progress every 10), truncate to 500 chars, call `generate_with_chat_template()`, parse classification, validate against ground truth
- 4. Keep matches, discard mismatches, print running success rate
- 5. Format matches as JSONL using `format_as_jsonl()`
- 6. Generate 50 conversational Q&A pairs using the same 9B model (with a conversational system prompt)
- 7. Combine classify + Q&A examples, shuffle
- 8. Split: first 500 → `train.jsonl`, remaining → `test.jsonl`
- 9. Print 10 random examples for manual inspection
- 10. Print final stats (total examples, train/test split, match rate)
- 11. Unload model (del model) to free memory
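The shuffle-and-split steps in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `split_examples` and `write_jsonl` are hypothetical, the 600 placeholder records stand in for the real classify and Q&A examples, and files are written to a temporary directory instead of `training_data/`.

```python
import json
import os
import random
import tempfile


def split_examples(examples, n_train=500, seed=42):
    """Shuffle a copy of the examples and split into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for example in examples:
            f.write(json.dumps(example) + '\n')


# Placeholder records standing in for the real training examples
examples = []
for i in range(600):
    examples.append({"messages": [{"role": "user", "content": "email %d" % i}]})

train, test = split_examples(examples)

with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, 'train.jsonl')
    write_jsonl(train_path, train)
    write_jsonl(os.path.join(tmp_dir, 'test.jsonl'), test)
    with open(train_path, encoding='utf-8') as f:
        n_train_lines = sum(1 for _ in f)

print(len(train), len(test), n_train_lines)
```

One thing to confirm before fine-tuning: the mlx-lm LoRA documentation describes a `valid.jsonl` next to `train.jsonl` for training runs. If the installed version requires it, a third slice can be carved out with the same helper.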
-
- - [ ] **Step 2: Run prepare_data.py**
-
- ```bash
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
- source venv/bin/activate
- python3 prepare_data.py
- ```
-
- Expected output:
- - "Loading 9B model..." (takes ~30s)
- - Progress: "Processing email 1/700..." through "700/700" (the script loads 350 spam + 350 ham)
- - "Matched: X/700, Mismatched: Y/700"
- - "Generated 50 conversational Q&A pairs"
- - "Saved 500 examples to training_data/train.jsonl"
- - "Saved N examples to training_data/test.jsonl" (N is whatever remains after the first 500, about 100 if roughly 550 emails match)
- - 10 sample examples printed for inspection
-
- Verify:
- ```bash
- wc -l training_data/train.jsonl training_data/test.jsonl
- python3 -c "import json; [json.loads(l) for l in open('training_data/train.jsonl')]; print('Valid JSONL')"
- ```
388
-
389
- - [ ] **Step 3: Commit**
390
-
391
- ```bash
392
- git add prepare_data.py
393
- git commit -m "feat: add prepare_data.py for training data generation with 9B model"
394
- ```
395
-
396
- ---
397
-
398
- ### Task 3: Fine-Tuning Script (`fine_tune.py`)
399
-
400
- **Files:**
401
- - Create: `spam-classifier-mlx/fine_tune.py`
402
-
403
- - [ ] **Step 1: Write fine_tune.py**
404
-
405
- ```python
- # Fine-tune Qwen3.5-0.8B on spam classification using LoRA
- # ENGT 375 Project - Spring 2026 - ODU
- # This is a wrapper around mlx_lm.lora that sets up the right
- # parameters for our spam classification task.
- #
- # Run: python3 fine_tune.py
- # Requires: models/Qwen3.5-0.8B-MLX-9bit/ and training_data/train.jsonl
- # Note: mlx_lm.lora may also look for training_data/valid.jsonl when --train is set
- # Time: ~10-20 minutes on M4 Pro
-
- import subprocess
- import sys
- from pathlib import Path
-
- project_dir = Path(__file__).parent
- model_path = project_dir / 'models' / 'Qwen3.5-0.8B-MLX-9bit'
- data_path = project_dir / 'training_data'
- adapter_path = project_dir / 'adapters'
-
- def check_prerequisites():
-     """Make sure the model and training data exist before we start."""
-     if not model_path.exists():
-         print('ERROR: Base model not found at %s' % model_path)
-         print('Download it first:')
-         print('  huggingface-cli download inferencerlabs/Qwen3.5-0.8B-MLX-9bit --local-dir models/Qwen3.5-0.8B-MLX-9bit')
-         sys.exit(1)
-
-     train_file = data_path / 'train.jsonl'
-     if not train_file.exists():
-         print('ERROR: Training data not found at %s' % train_file)
-         print('Generate it first: python3 prepare_data.py')
-         sys.exit(1)
-
-     print('Model found: %s' % model_path)
-     print('Training data found: %s' % train_file)
-
-
- def run_training():
-     """Run LoRA fine-tuning using mlx_lm.lora CLI."""
-     print('\nStarting LoRA fine-tuning...')
-     print('This will take about 10-20 minutes on M4 Pro.')
-     print('The model has 24 transformer layers. We are adding small')
-     print('LoRA adapter matrices to each layer and only training those.\n')
-
-     # Build the command
-     # mlx_lm.lora is the CLI tool from the mlx-lm package
-     cmd = [
-         sys.executable, '-m', 'mlx_lm.lora',
-         '--model', str(model_path),
-         '--train',
-         '--data', str(data_path),
-         '--iters', '600',
-         '--batch-size', '2',
-         '--learning-rate', '1e-5',
-         '--num-layers', '24',
-         '--adapter-path', str(adapter_path),
-         '--mask-prompt',
-         '--grad-checkpoint',
-     ]
-
-     print('Running: %s\n' % ' '.join(cmd))
-
-     # Run the training and show output in real time
-     result = subprocess.run(cmd)
-
-     if result.returncode != 0:
-         print('\nERROR: Training failed with exit code %d' % result.returncode)
-         sys.exit(1)
-
-     print('\nTraining complete!')
-     print('Adapter weights saved to: %s' % adapter_path)
-
-
- def run_evaluation():
-     """Evaluate the fine-tuned model on the test set."""
-     test_file = data_path / 'test.jsonl'
-     if not test_file.exists():
-         print('No test.jsonl found, skipping evaluation.')
-         return
-
-     print('\nEvaluating on test set...')
-     cmd = [
-         sys.executable, '-m', 'mlx_lm.lora',
-         '--model', str(model_path),
-         '--adapter-path', str(adapter_path),
-         '--data', str(data_path),
-         '--test',
-     ]
-     subprocess.run(cmd)
-
-
- def test_generation():
-     """Quick test: classify a sample email with the fine-tuned model."""
-     print('\n--- Quick Test ---')
-     test_prompt = 'Classify this email:\n\nSubject: You Won $5M!!!\nDear Friend, CONGRATULATIONS!!! Click here to claim your prize!'
-
-     system_prompt = "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."
-     cmd = [
-         sys.executable, '-m', 'mlx_lm.generate',
-         '--model', str(model_path),
-         '--adapter-path', str(adapter_path),
-         # Pass the same system prompt used for training
-         # (assumes the installed mlx-lm version supports --system-prompt)
-         '--system-prompt', system_prompt,
-         '--prompt', test_prompt,
-         '--max-tokens', '200',
-     ]
-     # Note: the CLI auto-applies the chat template, but we should verify
-     # the output looks like a proper classification response
-     subprocess.run(cmd)
-     print('\n--- End Test ---')
-
-
- if __name__ == '__main__':
-     check_prerequisites()
-     run_training()
-     run_evaluation()
-     test_generation()
-     print('\nAll done! You can now run: python3 app.py')
- ```
522
-
523
- - [ ] **Step 2: Run fine-tuning** (this takes ~10-20 minutes)
524
-
525
- ```bash
526
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
527
- source venv/bin/activate
528
- python3 fine_tune.py
529
- ```
530
-
531
- Expected:
532
- - Prerequisites check passes
533
- - Training output shows iteration number and loss (loss should decrease)
534
- - Evaluation prints perplexity on test set
535
- - Quick test generates a spam classification response
536
- - "All done!" at the end
537
-
538
- Verify:
539
- ```bash
540
- ls adapters/
541
- ```
542
- Expected: adapter config and weight files present
543
-
544
- - [ ] **Step 3: Commit**
545
-
546
- ```bash
547
- git add fine_tune.py
548
- git commit -m "feat: add fine_tune.py LoRA training wrapper for Qwen3.5-0.8B"
549
- ```
550
-
551
- ---
552
-
553
- ### Task 4: Gradio App (`app.py`)
554
-
555
- **Files:**
556
- - Create: `spam-classifier-mlx/app.py`
557
-
558
- - [ ] **Step 1: Write app.py — model loading and classify function**
559
-
560
- ```python
- # Spam Email Classifier - Fine-Tuned LLM with Gradio UI
- # ENGT 375 Project - Spring 2026 - ODU
- # Uses Qwen3.5-0.8B fine-tuned with LoRA on spam/ham data
- #
- # Run: python3 app.py
- # Requires: models/Qwen3.5-0.8B-MLX-9bit/ and adapters/
-
- import gradio as gr
- from pathlib import Path
- from mlx_lm import load, generate
-
- # Paths
- project_dir = Path(__file__).parent
- model_path = str(project_dir / 'models' / 'Qwen3.5-0.8B-MLX-9bit')
- adapter_path = str(project_dir / 'adapters')
-
- # System prompt tells the model what role to play
- SYSTEM_PROMPT = "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."
-
- CHAT_SYSTEM_PROMPT = "You are a spam email analysis expert. You can classify emails as spam or ham, explain your reasoning, and answer questions about email security and spam patterns."
-
- # Load model at startup (only happens once)
- print('Loading fine-tuned model...')
- try:
-     model, tokenizer = load(model_path, adapter_path=adapter_path)
-     print('Model loaded successfully!')
-     MODEL_LOADED = True
- except Exception as e:
-     print('Could not load model: %s' % str(e))
-     print('Run python3 fine_tune.py first.')
-     model, tokenizer = None, None
-     MODEL_LOADED = False
- ```
594
-
595
- Then add these functions (CRITICAL: they must use the chat template, not raw prompts):
-
- - `build_classify_prompt(email_text)`:
-   ```python
-   messages = [
-       {"role": "system", "content": SYSTEM_PROMPT},
-       {"role": "user", "content": "Classify this email:\n\n" + email_text[:500]},
-   ]
-   return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
-   ```
-
- - `classify_email(email_text, file_obj)`: handles file upload, calls `build_classify_prompt`, then `generate(model, tokenizer, prompt, max_tokens=300)`, strips `<think>` tags, returns a markdown result
-
- - `chat_respond(message, history)`:
-   ```python
-   # Gradio 4.19.2 ChatInterface passes history as [user, assistant] pairs.
-   # Newer Gradio versions (type="messages") pass {"role":..., "content":...} dicts.
-   # Handle both so the function works with either format.
-   import re
-   messages = [{"role": "system", "content": CHAT_SYSTEM_PROMPT}]
-   for turn in history:
-       if isinstance(turn, dict):
-           messages.append({"role": turn["role"], "content": turn["content"]})
-       else:
-           user_text, assistant_text = turn
-           messages.append({"role": "user", "content": user_text})
-           if assistant_text:
-               messages.append({"role": "assistant", "content": assistant_text})
-   messages.append({"role": "user", "content": message})
-   prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
-   response = generate(model, tokenizer, prompt, max_tokens=500)
-   # Strip any thinking tags
-   response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
-   return response
-   ```
- - Example emails (same 4 as sklearn project)
- - Gradio Blocks layout with two tabs:
-   - Tab 1 "Classify": Textbox + File + Examples → Markdown output
-   - Tab 2 "Chat": gr.ChatInterface with chat_respond function
- - `demo.launch()` at bottom
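
The two small pieces of `classify_email` that do not need the model (choosing between pasted text and an uploaded file, and stripping `<think>` tags) might look like the sketch below. The helper names `resolve_email_text` and `strip_think_tags` are hypothetical, and the file handling assumes Gradio hands back either a path string or an object with a `.name` path, which can vary between Gradio versions.

```python
import re


def strip_think_tags(response):
    """Remove any <think>...</think> block from the model output, then trim whitespace."""
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()


def resolve_email_text(email_text, file_obj):
    """Return the uploaded file's text if a file was given, otherwise the pasted text."""
    if file_obj is not None:
        # Assumption: file_obj is a path string or an object with a .name path
        if isinstance(file_obj, str):
            file_path = file_obj
        else:
            file_path = file_obj.name
        with open(file_path, 'r', encoding='utf-8', errors='replace') as handle:
            return handle.read()
    if email_text is None:
        return ''
    return email_text
```

Keeping these as separate top-level functions lets them be checked without loading the model, which is the slow part of the app.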
627
-
628
- - [ ] **Step 2: Test the app launches**
629
-
630
- ```bash
631
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
632
- source venv/bin/activate
633
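- # macOS has no built-in `timeout`, so run the app in the background and stop it after 15 seconds
- python3 -u app.py > app.log 2>&1 &
- APP_PID=$!
- sleep 15
- kill "$APP_PID"
- grep -E "Running|loaded|Error" app.log || echo "Check output"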
634
- ```
635
-
636
- Expected: "Model loaded successfully!" and "Running on local URL: http://127.0.0.1:7860"
637
-
638
- - [ ] **Step 3: Commit**
639
-
640
- ```bash
641
- git add app.py
642
- git commit -m "feat: add Gradio app with Classify and Chat tabs"
643
- ```
644
-
645
- ---
646
-
647
- ### Task 5: Launch Scripts
648
-
649
- **Files:**
650
- - Create: `spam-classifier-mlx/launch.command`
651
- - Create: `spam-classifier-mlx/launch-notebook.command`
652
-
653
- - [ ] **Step 1: Create launch.command**
654
-
655
- ```bash
656
- #!/bin/bash
657
- # Double-click this file in Finder to launch the Spam Classifier UI
658
- cd "$(dirname "$0")"
659
- source venv/bin/activate
660
- echo "Starting MLX Spam Classifier..."
661
- echo "Opening http://127.0.0.1:7860 in your browser..."
662
- sleep 2 && open http://127.0.0.1:7860 &
663
- python3 app.py
664
- ```
665
-
666
- ```bash
667
- chmod +x "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/launch.command"
668
- ```
669
-
670
- - [ ] **Step 2: Create launch-notebook.command**
671
-
672
- ```bash
673
- #!/bin/bash
674
- # Double-click this file in Finder to open the project notebook
675
- cd "$(dirname "$0")"
676
- source venv/bin/activate
677
- pip install jupyter -q 2>/dev/null
678
- echo "Opening notebook..."
679
- jupyter notebook spam_classifier_mlx.ipynb
680
- ```
681
-
682
- ```bash
683
- chmod +x "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/launch-notebook.command"
684
- ```
685
-
686
- - [ ] **Step 3: Commit**
687
-
688
- ```bash
689
- git add launch.command launch-notebook.command
690
- git commit -m "feat: add launch scripts for Gradio app and Jupyter notebook"
691
- ```
692
-
693
- ---
694
-
695
- ### Task 6: Project Notebook (`spam_classifier_mlx.ipynb`)
696
-
697
- **Files:**
698
- - Create: `spam-classifier-mlx/spam_classifier_mlx.ipynb`
699
-
700
- - [ ] **Step 1: Create the notebook**
701
-
702
- Create a Jupyter notebook with these sections (each as markdown + code cells):
703
-
704
- 1. **Title + Introduction** — "Fine-Tuning a Small LLM for Spam Classification with Apple MLX"
705
- - What is this project, why LLM approach vs traditional ML
706
- - Course context (ENGT 375)
707
-
708
- 2. **What is MLX?** — Markdown explaining Apple MLX (reference docs/01-what-is-mlx.md)
709
-
710
- 3. **What is LoRA?** — Markdown explaining LoRA (reference docs/02-what-is-lora.md)
711
- - Diagram concept: original weights frozen, small adapter matrices added
712
-
713
- 4. **Environment Setup** — Code cell: check mlx-lm version, check model exists
714
-
715
- 5. **Data Loading** — Code cells:
716
- - Load Kaggle CSV, show shape and class distribution
717
- - Sample 10 spam + 10 ham, display them
718
- - Explain the strategy: 600 emails → generate explanations → JSONL
719
-
720
- 6. **Inspecting Training Data** — Code cells:
721
- - Load train.jsonl, show 5 examples
722
- - Count label distribution in training data
723
- - Show average response length
724
-
725
- 7. **Fine-Tuning** — Code cells:
726
- - Show the mlx_lm.lora command that was run
727
- - Display training config (iters, batch_size, num_layers, etc.)
728
- - If adapters exist, show adapter file sizes
729
- - Explain what happened during training
730
-
731
- 8. **Evaluation** — Code cells:
732
- - Load model + adapter
733
- - Test on 5 example emails (3 from training set, 2 new)
734
- - Show the model's responses
735
- - Compare to ground truth labels
736
-
737
- 9. **Comparison with sklearn** — Markdown + code:
738
- - Table: sklearn VotingClassifier (97.4% accuracy) vs fine-tuned LLM
739
- - Test the Lenovo email — does the LLM handle it better?
740
- - Discussion: when does each approach win?
741
-
742
- 10. **Results and Conclusions** — Markdown:
743
- - Summary of findings
744
- - Limitations (0.8B model capacity, training data quality)
745
- - Future work (bigger model, more training data, RLHF)
746
-
747
- Code style: Beginner-friendly, `%%time` on slow cells, `print()` for results.
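
For section 6 of the notebook (Inspecting Training Data), a label count and average response length can be computed from the chat-format JSONL with plain loops. This is a sketch under assumptions: the function name `count_labels` is made up for the example, and it assumes the assistant reply is the last message and states SPAM or HAM on its first line, so conversational Q&A examples would land in the `other` bucket.

```python
import json


def count_labels(jsonl_lines):
    """Count SPAM vs HAM assistant replies and the average reply length."""
    counts = {'spam': 0, 'ham': 0, 'other': 0}
    total_length = 0
    for line in jsonl_lines:
        example = json.loads(line)
        # Assumption: the assistant reply is the last message in each example
        reply = example['messages'][-1]['content']
        total_length = total_length + len(reply)
        first_line = reply.split('\n')[0].upper()
        if 'SPAM' in first_line:
            counts['spam'] = counts['spam'] + 1
        elif 'HAM' in first_line:
            counts['ham'] = counts['ham'] + 1
        else:
            counts['other'] = counts['other'] + 1

    average_length = 0
    if len(jsonl_lines) > 0:
        average_length = total_length / len(jsonl_lines)
    return counts, average_length
```

In the notebook this could be called with the lines read from `training_data/train.jsonl`, followed by a `print()` of both return values.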
748
-
749
- - [ ] **Step 2: Commit**
750
-
751
- ```bash
752
- git add spam_classifier_mlx.ipynb
753
- git commit -m "feat: add project notebook for course submission"
754
- ```
755
-
756
- ---
757
-
758
- ### Task 7: CHANGELOG Update + Final Verification
759
-
760
- **Files:**
761
- - Modify: `spam-classifier-mlx/CHANGELOG.md`
762
-
763
- - [ ] **Step 1: Update CHANGELOG.md with full v0.1.0 entry**
764
-
765
- ```markdown
766
- # Changelog
767
-
768
- All notable changes to this project will be documented in this file.
769
- This serves as a reference for writing the course paper's methodology section.
770
-
771
- ## v0.1.0 — 2026-03-23
772
- ### Initial Build
773
- - Created project with comprehensive documentation (docs/ folder):
774
- - What is MLX guide
775
- - What is LoRA guide
776
- - Step-by-step training guide
777
- - mlx-lm CLI reference
778
- - Hugging Face deployment guide
779
- - Generated training data: 500 train + 100 test examples
780
- - Used local Qwen3.5-9B-OptiQ-4bit to generate classification explanations
781
- - 450 email classify examples + 50 conversational Q&A pairs
782
- - Validated against ground truth labels
783
- - Fine-tuned Qwen3.5-0.8B-MLX-9bit with LoRA:
784
- - 600 iterations, batch size 2, learning rate 1e-5
785
- - All 24 layers with LoRA adapters
786
- - QLoRA (automatic — base model is 9-bit quantized)
787
- - ~10-20 minutes on M4 Pro
788
- - Built Gradio interface with:
789
- - Classify tab: paste email → get SPAM/HAM + explanation
790
- - Chat tab: conversational Q&A about spam patterns
791
- - 4 built-in example emails
792
- - Project notebook for course submission
793
- - macOS native — runs on Apple Silicon via MLX
794
- ```
795
-
796
- - [ ] **Step 2: Verify end-to-end**
797
-
798
- ```bash
799
- cd "/Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx"
800
- source venv/bin/activate
801
-
802
- # Check all files exist
803
- ls prepare_data.py fine_tune.py app.py requirements.txt CLAUDE.md CHANGELOG.md
804
- ls docs/*.md
805
- ls training_data/train.jsonl training_data/test.jsonl
806
- ls adapters/
807
- ls models/Qwen3.5-0.8B-MLX-9bit/
808
-
809
- # Quick model test
810
- python3 -c "
811
- from mlx_lm import load, generate
812
- model, tok = load('models/Qwen3.5-0.8B-MLX-9bit', adapter_path='adapters')
813
- result = generate(model, tok, prompt='Is this spam? Hello, meeting at 3pm.', max_tokens=100)
814
- print(result)
815
- "
816
- ```
817
-
818
- - [ ] **Step 3: Commit final state**
819
-
820
- ```bash
821
- git add CHANGELOG.md
822
- git commit -m "docs: update CHANGELOG with v0.1.0 build details"
823
- ```
824
-
825
- ---
826
-
827
- ### Task Summary
828
-
829
- | Task | Description | Depends On | Estimated Time |
- |------|-------------|------------|----------------|
- | 1 | Scaffolding + docs + venv + model download | — | 15 min |
- | 2 | `prepare_data.py` (generate training JSONL) | Task 1 | 30-45 min (9B model generation) |
- | 3 | `fine_tune.py` (LoRA training wrapper) | Task 2 | 15-25 min (includes training time) |
- | 4 | `app.py` (Gradio UI with classify + chat) | Task 3 | 10 min |
- | 5 | Launch scripts (.command files) | Task 4 | 2 min |
- | 6 | Project notebook | Task 3 | 15 min |
- | 7 | CHANGELOG + final verification | Task 6 | 5 min |
838
-
839
- Tasks 1-5 are strictly sequential. Task 6 depends on Task 3 (needs adapters to exist) but can run in parallel with Tasks 4-5.
840
-
841
- ### QA Agent Checklist (run after each task)
842
-
843
- The QA agent verifies after each task:
844
- 1. **Plan compliance:** Does the implementation match what the plan specified?
845
- 2. **Code quality:** No syntax errors, no import errors, files run without crashing
846
- 3. **Wiring:** Do the pieces connect? (prepare_data output → fine_tune input → app.py loads result)
847
- 4. **Beginner-level code:** Explicit loops, clear variable names, comments explaining why, no advanced patterns
848
- 5. **Documentation:** CHANGELOG updated, docs accurate, CLAUDE.md correct
docs/superpowers/plans/2026-04-14-spam-xai-v2-simplify.md DELETED
@@ -1,383 +0,0 @@
1
- # Spam XAI Project v2 — Simplification Plan
2
-
3
- > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
4
-
5
- **Goal:** Create a `spam-xai-project-v2/` folder that is a simplified, beginner-friendly copy of `spam-xai-project/`, reducing `app.py` from 674 lines to roughly 400–450 lines without removing any features, and update the CHANGELOG to document v2.0.
6
-
7
- **Architecture:** Copy the entire project folder to `spam-xai-project-v2/`, then simplify `app.py` in place — breaking up repeated patterns, flattening nested functions, and merging near-duplicate functions. All existing features (LIME, SHAP, ELI5, Gradio UI, feedback logging) are kept. No model files are changed. Lecture patterns from Module_5A and Module_7B are referenced for plot style only.
8
-
9
- **Tech Stack:** Python 3.11, scikit-learn, gradio, lime, shap, eli5, joblib, matplotlib
10
-
11
- **Beginner-code rules (must follow throughout):**
12
- - Use plain `for` loops — no list comprehensions with lambdas
13
- - No decorators
14
- - No `functools`, no `lru_cache`, no `partial`
15
- - Variable names spell out what they hold (e.g., `spam_words` not `sw`)
16
- - Every non-obvious line gets a short plain-English comment
17
-
18
- ---
19
-
20
- ## File Structure
21
-
22
- | File | Action | Notes |
23
- |------|--------|-------|
24
- | `spam-xai-project-v2/app.py` | Simplify (main work) | Target ~420 lines |
25
- | `spam-xai-project-v2/CHANGELOG.md` | Update | Add v2.0 entry |
26
- | All other files in `spam-xai-project-v2/` | Copy unchanged | models/, data/, utils.py, retrain.py, etc. |
27
-
28
- ---
29
-
30
- ## Task 1: Copy the project folder
31
-
32
- **Files:**
33
- - Create: `spam-xai-project-v2/` (full copy of `spam-xai-project/`)
34
-
35
- - [ ] **Step 1: Copy the folder**
36
-
37
- Run from the `LLM Project/` directory:
38
- ```bash
39
- cp -r "spam-xai-project" "spam-xai-project-v2"
40
- ```
41
-
42
- - [ ] **Step 2: Verify the copy**
43
-
44
- Run:
45
- ```bash
46
- ls "spam-xai-project-v2/"
47
- ```
48
- Expected: Same files as `spam-xai-project/` — `app.py`, `utils.py`, `retrain.py`, `CHANGELOG.md`, `models/`, `data/`, etc.
49
-
50
- - [ ] **Step 3: Confirm app.py line count**
51
-
52
- Run:
53
- ```bash
54
- wc -l "spam-xai-project-v2/app.py"
55
- ```
56
- Expected: 674 lines (unchanged copy)
57
-
58
- - [ ] **Step 4: Commit the copy as a baseline**
59
-
60
- ```bash
61
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
62
- git add "LLM Project/spam-xai-project-v2/"
63
- git commit -m "feat: add spam-xai-project-v2 baseline copy (pre-simplification)"
64
- ```
65
-
66
- ---
67
-
68
- ## Task 2: Merge the two duplicate feedback handler functions
69
-
70
- **Files:**
71
- - Modify: `spam-xai-project-v2/app.py` lines 395–425
72
-
73
- **What exists now:** Two functions `handle_correct()` and `handle_wrong()` that share ~90% of their code. Both unpack the same hidden state string, both call `log_feedback()`, and both call `count_corrections()`. The only difference is that `handle_wrong()` also takes a `user_label` parameter.
74
-
75
- **What to do:** Replace both with a single `handle_feedback()` function that takes an extra `is_correct` flag and optional `user_label`. Then wire Gradio buttons to call this one function.
76
-
77
- - [ ] **Step 1: Read the current feedback handlers**
78
-
79
- Read `spam-xai-project-v2/app.py` lines 390–435 to see the exact current code before changing anything.
80
-
81
- - [ ] **Step 2: Replace both handlers with one function**
82
-
83
- Find both `handle_correct` and `handle_wrong` function definitions and replace them with a single `handle_feedback` function that does:
84
- 1. Parse the hidden state string (same as before)
85
- 2. If `is_correct` is True, log positive feedback
86
- 3. If `is_correct` is False, log a correction using `user_label`
87
- 4. Return the correction count string
88
-
89
- - [ ] **Step 3: Update the Gradio button wiring**
90
-
91
- In the Gradio UI section (around line 430+), find where `.click()` is called on the correct/wrong buttons and update both to call `handle_feedback` with the right arguments (`is_correct=True` or `is_correct=False`).
92
-
93
- - [ ] **Step 4: Run the app to verify feedback still works**
94
-
95
- ```bash
96
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING/LLM Project/spam-xai-project-v2"
97
- python3 app.py
98
- ```
99
- Expected: App launches without errors. Test the "That's correct" and "That's wrong" buttons with a sample email.
100
-
101
- - [ ] **Step 5: Commit**
102
-
103
- ```bash
104
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
105
- git add "LLM Project/spam-xai-project-v2/app.py"
106
- git commit -m "refactor(v2): merge duplicate feedback handlers into one handle_feedback function"
107
- ```
108
-
109
- ---
110
-
111
- ## Task 3: Flatten the nested SHAP function
112
-
113
- **Files:**
114
- - Modify: `spam-xai-project-v2/app.py` lines 119–165
115
-
116
- **What exists now:** `generate_shap_explanation()` contains an inner function `predict_with_meta_only()` defined inside it (lines 130–135). This pattern (function inside a function) is confusing for beginners.
117
-
118
- **What to do:** Move `predict_with_meta_only()` out to be a regular top-level function, defined before `generate_shap_explanation()`. No logic changes — just move it up.
119
-
120
- - [ ] **Step 1: Read lines 119–170 of app.py**
121
-
122
- Read the exact current code for `generate_shap_explanation` and its nested function.
123
-
124
- - [ ] **Step 2: Cut the inner function out and paste it as a top-level function**
125
-
126
- Place `predict_with_meta_only()` as a standalone function right above `generate_shap_explanation()`. Add a short comment explaining what it does in plain English.
127
-
128
- - [ ] **Step 3: Remove the inner definition from inside generate_shap_explanation**
129
-
130
- The body of `generate_shap_explanation` should now just call `predict_with_meta_only` as a normal function (it was already calling it this way — it just won't be defined inside anymore).
131
-
132
- - [ ] **Step 4: Run the app to verify SHAP still works**
133
-
134
- ```bash
135
- python3 app.py
136
- ```
137
- Expected: App launches. Classify a sample email and confirm the SHAP tab shows a chart.
138
-
139
- - [ ] **Step 5: Commit**
140
-
141
- ```bash
142
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
143
- git add "LLM Project/spam-xai-project-v2/app.py"
144
- git commit -m "refactor(v2): move nested SHAP predict function to top-level"
145
- ```
146
-
147
- ---
148
-
149
- ## Task 4: Simplify generate_comparison() repeated extraction logic
150
-
151
- **Files:**
152
- - Modify: `spam-xai-project-v2/app.py` lines 246–288
153
-
154
- **What exists now:** `generate_comparison()` extracts the top-3 features from LIME, SHAP, and ELI5 using three near-identical blocks. Each block does: get the values, sort them, take the top 3. This is the same operation written out three times.
155
-
156
- **What to do:** Write a plain helper function `get_top_features(explanation, method_name)` above `generate_comparison()` that handles one explanation object and returns a list of the top-3 feature names. Then call it three times in a simple loop inside `generate_comparison()`.
157
-
158
- Keep the output (the markdown comparison table) exactly the same — only the internal logic changes.
159
-
160
- - [ ] **Step 1: Read lines 246–295 of app.py**
161
-
162
- Read the exact current code.
163
-
164
- - [ ] **Step 2: Write the helper function**
165
-
166
- Add a plain `get_top_features(explanation, method_name)` function above `generate_comparison()`. It takes one explanation object and a string name, and returns a list of 3 feature name strings. Write it with a plain `for` loop, no comprehensions.
167
-
168
- - [ ] **Step 3: Rewrite generate_comparison() to use the helper**
169
-
170
- The function body should:
171
- 1. Call `get_top_features` once for LIME, once for SHAP, once for ELI5
172
- 2. Store the results in three plain lists
173
- 3. Find the overlap (features that appear in all three) with a plain `for` loop
174
- 4. Build and return the same markdown table as before
175
-
176
- - [ ] **Step 4: Run the app and verify comparison tab is unchanged**
177
-
178
- ```bash
179
- python3 app.py
180
- ```
181
- Expected: Classify an email. The "Compare" tab output looks identical to the original.
182
-
183
- - [ ] **Step 5: Commit**
184
-
185
- ```bash
186
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
187
- git add "LLM Project/spam-xai-project-v2/app.py"
188
- git commit -m "refactor(v2): extract top-feature helper to replace 3x repeated extraction in generate_comparison"
189
- ```
190
-
191
- ---
192
-
193
- ## Task 5: Simplify generate_plain_summary() badge logic
194
-
195
- **Files:**
196
- - Modify: `spam-xai-project-v2/app.py` lines 196–245
197
-
198
- **What exists now:** `generate_plain_summary()` is 50 lines. It has nested ternary operators to pick badge text, color, and icon based on label and confidence — all mixed into one dense block. This is hard to read at a beginner level.
199
-
200
- **What to do:** Pull the badge/icon selection out into a small helper function `get_result_badge(label, confidence)` that uses plain `if/elif/else` statements to return a dictionary with keys `color`, `icon`, and `text`. Then `generate_plain_summary()` just calls it and uses the returned values.
201
-
202
- - [ ] **Step 1: Read lines 196–250 of app.py**
203
-
204
- Read the exact current code.
205
-
206
- - [ ] **Step 2: Write get_result_badge()**
207
-
208
- Add a new function `get_result_badge(label, confidence)` above `generate_plain_summary()` using only `if/elif/else` — no ternaries. It returns a plain dictionary like:
209
- ```
210
- {"color": "red", "icon": "🚨", "text": "SPAM"}
211
- ```
212
-
213
- - [ ] **Step 3: Simplify generate_plain_summary()**
214
-
215
- Replace the nested ternary block with a single call to `get_result_badge()`. The rest of the markdown assembly stays the same. The function output (the markdown string returned) must be identical to the original.
216
-
217
- - [ ] **Step 4: Run the app and verify the Result tab looks identical**
218
-
219
- ```bash
220
- python3 app.py
221
- ```
222
- Expected: Classify an email. The summary card/badge looks exactly the same as in the original project.
223
-
224
- - [ ] **Step 5: Commit**
225
-
226
- ```bash
227
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
228
- git add "LLM Project/spam-xai-project-v2/app.py"
229
- git commit -m "refactor(v2): extract badge/icon logic into get_result_badge helper"
230
- ```
231
-
232
- ---
233
-
234
- ## Task 6: Add comments to classify_and_explain() orchestrator
235
-
236
- **Files:**
237
- - Modify: `spam-xai-project-v2/app.py` lines 339–394
238
-
239
- **What exists now:** `classify_and_explain()` is a 56-line function that calls 6 different functions in sequence. There are no section comments explaining the flow, so a beginner reading it for the first time cannot easily tell why 7 values are returned or what the hidden state string is for.
240
-
241
- **What to do:** Add short plain-English section comments (not docstrings, not multi-line blocks — just `# one-line comments`) at the start of each logical step: input handling, classification, each explainer call, hidden state packing, and return. Do not change any logic.
242
-
243
- - [ ] **Step 1: Read lines 339–400 of app.py**
244
-
245
- Read the exact current function.
246
-
247
- - [ ] **Step 2: Add section comments**
248
-
249
- Insert comments before each logical group:
250
- - Before the file vs text input check: `# Figure out if the user pasted text or uploaded a file`
251
- - Before classify_email(): `# Run the email through the model to get spam/ham prediction`
252
- - Before each explainer call: `# Generate [LIME/SHAP/ELI5] explanation`
253
- - Before the hidden state pack: `# Pack the email and prediction into one string so the feedback buttons can use it later`
254
- - Before return: `# Send all results back to the Gradio interface`
255
-
256
- - [ ] **Step 3: Verify app still runs**
257
-
258
- ```bash
259
- python3 app.py
260
- ```
261
- Expected: No errors. Classify a test email to confirm all tabs still populate.
262
-
263
- - [ ] **Step 4: Commit**
264
-
265
- ```bash
266
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
267
- git add "LLM Project/spam-xai-project-v2/app.py"
268
- git commit -m "docs(v2): add plain-English section comments to classify_and_explain orchestrator"
269
- ```
270
-
271
- ---
272
-
273
- ## Task 7: Final line count check and app.py header comment
274
-
275
- **Files:**
276
- - Modify: `spam-xai-project-v2/app.py` top of file
277
-
278
- - [ ] **Step 1: Check the final line count**
279
-
280
- ```bash
281
- wc -l "spam-xai-project-v2/app.py"
282
- ```
283
- Expected: ~420–450 lines (down from 674). If still above 460, re-read and identify any remaining duplicate blocks before continuing.
284
-
285
- - [ ] **Step 2: Add a short header comment to app.py**
286
-
287
- At the very top of `spam-xai-project-v2/app.py`, before the imports, add a 4-line block comment explaining what the file does, for a beginner reader:
288
-
289
- ```
290
- # app.py — Spam Email Classifier with Explanations
291
- # This file runs the Gradio web app.
292
- # It loads a trained model, classifies an email as spam or not spam,
293
- # and shows three different explanations of why it made that choice.
294
- ```
295
-
296
- - [ ] **Step 3: Run the app one final time end-to-end**
297
-
298
- ```bash
299
- python3 app.py
300
- ```
301
- Expected: Launches cleanly. Test all 7 tabs: Result, LIME, SHAP, ELI5, Compare, Summary, How It Works. Confirm feedback buttons work.
302
-
303
- - [ ] **Step 4: Commit**
304
-
305
- ```bash
306
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
307
- git add "LLM Project/spam-xai-project-v2/app.py"
308
- git commit -m "docs(v2): add top-of-file header comment to app.py"
309
- ```
310
-
311
- ---
312
-
313
- ## Task 8: Update CHANGELOG.md with v2.0 entry
314
-
315
- **Files:**
316
- - Modify: `spam-xai-project-v2/CHANGELOG.md`
317
-
318
- **What to do:** Add a new version entry at the top of the file for v2.0. It should document each simplification made, which lines changed, and what the beginner-friendliness goal was. Use the same markdown format as the existing v1.5 entry.
319
-
320
- - [ ] **Step 1: Read the top of CHANGELOG.md**
321
-
322
- Read the first 60 lines of `spam-xai-project-v2/CHANGELOG.md` to see the exact format of the existing version entries (headings, bullet style, date format).
323
-
324
- - [ ] **Step 2: Write the v2.0 CHANGELOG entry**
325
-
326
- Add this entry at the very top of the file, above any existing entries, following the format exactly as found in Step 1:
327
-
328
- ```markdown
329
- ## [v2.0] — 2026-04-14
330
-
331
- ### Summary
332
- Simplified `app.py` from 674 lines to ~420 lines for a beginner audience (ENGT 375, Spring 2026).
333
- No features were removed. All tabs, explanations, and feedback logging work identically.
334
- This version lives in `spam-xai-project-v2/`.
335
-
336
- ### Changes
337
-
338
- - **Merged duplicate feedback handlers** — `handle_correct()` and `handle_wrong()` (which shared 90% of their code) were combined into one `handle_feedback()` function with an `is_correct` flag. Saves ~20 lines and removes confusing duplication.
339
-
340
- - **Flattened nested SHAP function** — `predict_with_meta_only()` was defined inside `generate_shap_explanation()`. Moved it to the top level so it reads like a normal function. No logic change.
341
-
342
- - **Simplified comparison feature extraction** — `generate_comparison()` had three near-identical code blocks to get the top-3 features from LIME, SHAP, and ELI5 separately. Replaced with a single `get_top_features()` helper called three times.
343
-
344
- - **Simplified badge logic** — `generate_plain_summary()` used nested ternary operators to pick the spam/ham badge color, icon, and text. Replaced with a plain `get_result_badge()` function using `if/elif/else` statements.
345
-
346
- - **Added section comments to orchestrator** — `classify_and_explain()` (the main function that runs when you click Classify) had no comments explaining its steps. Added short plain-English comments so a student can follow the flow.
347
-
348
- - **Added file header comment** — Four-line comment at the top of `app.py` explaining what the file does in plain English.
349
-
350
- ### Files Changed
351
- - `app.py` — simplified (674 → ~420 lines)
352
- - `CHANGELOG.md` — this entry
353
-
354
- ### Files Unchanged
355
- - `utils.py`, `retrain.py`, `retrain_student.py`, `train_ensemble.py`
356
- - All model artifacts in `models/`
357
- - All data in `data/`
358
- - All notebooks in `notebooks/`
359
- ```
360
-
361
- - [ ] **Step 3: Verify CHANGELOG looks correct**
362
-
363
- Read the top 80 lines of `spam-xai-project-v2/CHANGELOG.md` to confirm the new entry is above the old entries and the formatting matches.
364
-
365
- - [ ] **Step 4: Commit**
366
-
367
- ```bash
368
- cd "/Users/dakwanbalfour/APPLIED MACHINE LEARNING"
369
- git add "LLM Project/spam-xai-project-v2/CHANGELOG.md"
370
- git commit -m "docs(v2): add v2.0 CHANGELOG entry documenting all simplifications"
371
- ```
372
-
373
- ---
374
-
375
- ## Self-Review Checklist
376
-
377
- - [x] All 5 simplification areas from the audit are covered (feedback handlers, SHAP nested fn, comparison loop, badge logic, orchestrator comments)
378
- - [x] No features removed — LIME, SHAP, ELI5, Gradio UI, feedback logging all stay
379
- - [x] Every new function uses plain if/else, plain for loops — no comprehensions, no decorators
380
- - [x] CHANGELOG entry is written out fully — no TBDs
381
- - [x] Each task ends with a working app run before committing
382
- - [x] v2 lives in a new folder — original `spam-xai-project/` is untouched
383
- - [x] No model files are touched
docs/superpowers/specs/2026-03-23-gradio-spam-classifier-design.md DELETED
@@ -1,298 +0,0 @@
1
- # Design: Spam Email Classifier with Gradio UI
2
-
3
- **Date:** 2026-03-23
4
- **Project:** ENGT 375 — Applied Machine Learning, Spring 2026, ODU
5
- **Goal:** Create a fresh, beginner-level spam classifier with a Gradio web interface, replacing the old Streamlit-based project. Runs on macOS. Includes LIME, SHAP, and plain-English explanations.
6
-
7
- ---
8
-
9
- ## 1. Project Structure
10
-
11
- ```
12
- /Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-gradio/
13
- ├── app.py # Gradio UI (main entry point)
14
- ├── train.py # Train & compare models, save best ensemble
15
- ├── utils.py # Shared preprocessing & feature engineering
16
- ├── requirements.txt # Dependencies
17
- ├── CHANGELOG.md # Version history for paper reference
18
- ├── CLAUDE.md # Claude Code project instructions
19
- ├── models/ # Saved trained models (generated by train.py)
20
- └── data/ # Symlinked from spam-xai-project/data
21
- ```
22
-
23
- **Location:** New folder `spam-classifier-gradio/` alongside `spam-xai-project/` in the Applied Machine Learning directory.
24
-
25
- **Data strategy:** Symlink `data/` from `spam-xai-project/data` to avoid duplicating ~500MB+ of email corpora.
26
-
27
- ---
28
-
29
- ## 2. Data Pipeline (`train.py`)
30
-
31
- ### Data Sources (matching old project's `retrain_student.py`)
32
- 1. **Kaggle spam dataset**: `data/spam_Emails_data.csv` — ~190K emails, stratified-sampled to 100K cap
33
- 2. **GitHub email-dataset**: `data/email-dataset-main/email-dataset-main/dataset/` — subfolder `1/` = ham, `2/` = spam (individual `.txt` files)
34
-
35
- Note: The old project's `retrain_student.py` uses these two sources (not `emails_raw.csv`). We match the same sources for comparable results.
36
-
37
- ### Steps
38
- 1. Load Kaggle CSV, normalize columns to `text` and `label` (lowercase: spam/ham)
39
- 2. Stratified-sample Kaggle to 100K cap (same as old project)
40
- 3. Load GitHub email-dataset by reading `.txt` files from `dataset/1/` (ham) and `dataset/2/` (spam)
41
- 4. Combine both datasets
42
- 5. Deduplicate by exact text match (after combining, before splitting)
43
- 6. Preprocess text using `utils.preprocess_text()`
44
- 7. TF-IDF vectorization: `TfidfVectorizer(max_features=3000, ngram_range=(1,3), min_df=2, max_df=0.90, sublinear_tf=True)` — exact same params as old project
45
- 8. Compute 24 metadata features using `utils.compute_metadata_features()`
46
- 9. Scale metadata features with `MinMaxScaler` (so they match TF-IDF range)
47
- 10. Combine TF-IDF + metadata via `scipy.sparse.hstack`
48
-
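Steps 2 and 5 can be sketched in plain Python, assuming each dataset is a list of `(text, label)` pairs. The function names `deduplicate` and `stratified_cap` are illustrative and not taken from `train.py`, which may use pandas for both steps.

```python
import random


def deduplicate(rows):
    # Keep only the first copy of each exact email text
    seen_texts = set()
    unique_rows = []
    for text, label in rows:
        if text not in seen_texts:
            seen_texts.add(text)
            unique_rows.append((text, label))
    return unique_rows


def stratified_cap(rows, cap, seed=42):
    # Nothing to do if the dataset is already small enough
    total = len(rows)
    if total <= cap:
        return list(rows)

    # Group the rows by label so each class can be sampled separately
    rows_by_label = {}
    for text, label in rows:
        if label not in rows_by_label:
            rows_by_label[label] = []
        rows_by_label[label].append((text, label))

    # Sample each class in proportion to its share of the full dataset
    rng = random.Random(seed)
    sampled = []
    for label in sorted(rows_by_label):
        group = rows_by_label[label]
        keep = round(cap * len(group) / total)
        sampled.extend(rng.sample(group, keep))
    return sampled
```

Because each class count is rounded separately, the result can differ from `cap` by a row or two; that is usually acceptable for a sampling cap.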
49
- ### Class Balancing Strategy
50
- - Use `class_weight='balanced'` on all classifiers (same as old project) rather than data-level undersampling
51
- - This lets sklearn adjust weights inversely proportional to class frequency without discarding training data
52
-
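For reference, scikit-learn documents the `'balanced'` weight of a class as `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that arithmetic in plain Python; the helper name `balanced_class_weights` is illustrative and is not a scikit-learn function.

```python
def balanced_class_weights(labels):
    # Count how many emails carry each label
    counts = {}
    for label in labels:
        if label not in counts:
            counts[label] = 0
        counts[label] += 1

    number_of_samples = len(labels)
    number_of_classes = len(counts)

    # Rarer classes get larger weights, common classes get smaller ones
    weights = {}
    for label in counts:
        weights[label] = number_of_samples / (number_of_classes * counts[label])
    return weights
```

With three ham emails and one spam email, spam gets weight 2.0 and ham about 0.67, so one spam mistake costs as much as three ham mistakes.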
53
- ### Train/Test Split
54
- - 70/30 split (`test_size=0.3`), stratified — same as old project for comparable metrics
55
- - `random_state=42` for reproducibility
56
-
57
- ### Model Comparison
58
- Train three classifiers individually and print classification reports for each:
59
- - **Random Forest**: `RandomForestClassifier(n_jobs=-1, class_weight='balanced', random_state=42)`
60
- - **Logistic Regression**: `LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)`
61
- - **SVM**: `SVC(kernel='linear', class_weight='balanced', probability=True, random_state=42)`
62
-
63
- Compare using: accuracy, precision, recall, F1-score on the test set.
64
-
65
- ### Ensemble
66
- 1. Train all three models individually, collect F1 scores
67
- 2. Construct a **new** `VotingClassifier(voting='soft')` using all three estimators (or top 2 if one significantly underperforms)
68
- 3. Fit the VotingClassifier on the training data (it retrains the sub-estimators internally)
69
- 4. **No separate `CalibratedClassifierCV` wrapper** — soft voting already averages probability outputs, and SVM with `probability=True` already uses Platt scaling internally
70
- 5. Find optimal classification threshold using precision-recall curve on test set
71
-
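Soft voting and threshold selection can be illustrated without scikit-learn. The sketch assumes each model has already produced a spam probability per email, and it picks the threshold with the highest F1 from a list of candidates. The F1 criterion is an assumption: the spec says only that the threshold comes from the precision-recall curve. Both function names are illustrative.

```python
def soft_vote(probabilities_per_model):
    # probabilities_per_model holds one list of spam probabilities per model
    number_of_models = len(probabilities_per_model)
    number_of_emails = len(probabilities_per_model[0])
    averaged = []
    for i in range(number_of_emails):
        total = 0.0
        for model_probabilities in probabilities_per_model:
            total += model_probabilities[i]
        averaged.append(total / number_of_models)
    return averaged


def best_f1_threshold(spam_probabilities, true_labels, candidates):
    # true_labels uses 1 for spam and 0 for ham
    best_threshold = candidates[0]
    best_f1 = -1.0
    for threshold in candidates:
        true_positives = 0
        false_positives = 0
        false_negatives = 0
        for probability, label in zip(spam_probabilities, true_labels):
            predicted_spam = probability >= threshold
            if predicted_spam and label == 1:
                true_positives += 1
            elif predicted_spam and label == 0:
                false_positives += 1
            elif not predicted_spam and label == 1:
                false_negatives += 1

        if true_positives == 0:
            f1 = 0.0
        else:
            precision = true_positives / (true_positives + false_positives)
            recall = true_positives / (true_positives + false_negatives)
            f1 = 2 * precision * recall / (precision + recall)

        # Strict comparison keeps the lowest threshold when scores tie
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    return best_threshold, best_f1
```

In the real pipeline, `VotingClassifier(voting='soft')` does the averaging internally; the loop above only shows what that averaging means.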
72
- ### Saved Artifacts (in `models/`)
73
- - `tfidf_vectorizer.joblib`
74
- - `meta_scaler.joblib`
75
- - `voting_model.joblib`
76
- - `feature_names.joblib` (list of all feature names: TF-IDF + metadata)
77
- - `optimal_threshold.joblib`
78
- - `training_sample.joblib` (200-row sample of training data, needed for LIME explainer)
79
- - `training_report.json` — schema: `{"random_forest": {"accuracy": float, "f1": float, "precision": float, "recall": float}, "logistic_regression": {...}, "svm": {...}, "voting_ensemble": {...}, "best_single_model": str}`
80
-
81
- ---
82
-
83
- ## 3. Feature Engineering (`utils.py`)
84
-
85
- ### Text Preprocessing (`preprocess_text`)
86
- Copied from `utils_student.py` logic:
87
- - Strip HTML tags
88
- - Remove URLs and email addresses
89
- - Remove non-alphabetic characters
90
- - Lowercase
91
- - Remove stopwords (NLTK English)
92
- - Porter stemming
93
-
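A rough sketch of the first five preprocessing steps, using only the standard library. The regular expressions and the tiny `SAMPLE_STOPWORDS` set are assumptions made for illustration; the real `preprocess_text()` uses the NLTK English stopword list and also applies Porter stemming, which is left out here.

```python
import re

# Illustrative subset only; the real code uses NLTK's English stopword list
SAMPLE_STOPWORDS = {"the", "a", "is", "to", "and", "at"}


def preprocess_text_sketch(text):
    # Strip HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove email addresses
    text = re.sub(r"\S+@\S+", " ", text)
    # Remove everything that is not a letter
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase
    text = text.lower()

    # Remove stopwords
    kept_words = []
    for word in text.split():
        if word not in SAMPLE_STOPWORDS:
            kept_words.append(word)
    return " ".join(kept_words)
```

URLs and email addresses are removed before the non-alphabetic step on purpose: once punctuation is gone, `spam.example` would turn into ordinary-looking words.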
94
- ### Metadata Features (`compute_metadata_features`)
95
- 24 features, copied from `utils_student.py`:
96
- 1. exclamation_density (per sentence)
97
- 2. dollar_sign_count
98
- 3. caps_word_ratio
99
- 4. spam_phrase_count (from phrase list)
100
- 5. ham_phrase_count (from phrase list)
101
- 6. net_spam_context (spam - ham phrase count)
102
- 7. url_count
103
- 8. html_tag_count
104
- 9. email_length
105
- 10. avg_sentence_length
106
- 11. capitalization_ratio
107
- 12. has_specific_date
108
- 13. has_specific_time
109
- 14. date_reference_count
110
- 15. has_unsubscribe
111
- 16. has_physical_address
112
- 17. has_proper_greeting
113
- 18. has_contact_info
114
- 19. registration_language_score
115
- 20. cta_to_info_ratio
116
- 21. shortener_url_ratio
117
- 22. legitimate_platform_count
118
- 23. gov_edu_url_count
119
- 24. question_mark_count
120
-
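To make the list concrete, here is a sketch of five of the 24 features. The exact definitions in `utils_student.py` are not shown in this spec, so the sentence splitting, the "at least two letters" rule for ALL-CAPS words, and the URL pattern are assumptions.

```python
import re


def compute_metadata_sketch(text):
    # Count sentences by splitting on ., ! and ? (a rough approximation)
    sentence_count = 0
    for part in re.split(r"[.!?]+", text):
        if part.strip() != "":
            sentence_count += 1
    if sentence_count == 0:
        sentence_count = 1

    # Count words written entirely in capitals (at least two letters)
    words = text.split()
    caps_words = 0
    for word in words:
        letters = re.sub(r"[^A-Za-z]", "", word)
        if len(letters) >= 2 and letters.isupper():
            caps_words += 1
    word_count = len(words)
    if word_count == 0:
        word_count = 1

    features = {}
    features["exclamation_density"] = text.count("!") / sentence_count
    features["dollar_sign_count"] = text.count("$")
    features["caps_word_ratio"] = caps_words / word_count
    features["url_count"] = len(re.findall(r"https?://\S+", text))
    features["question_mark_count"] = text.count("?")
    return features
```

The guards against zero sentences and zero words keep an empty email from raising `ZeroDivisionError`.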
121
- ### Phrase Lists
122
- Same lists from `utils_student.py`: `spam_context_phrases`, `ham_context_phrases`, `registration_phrases`, `url_shorteners`, `legitimate_platforms`.
123
-
124
- ### Human-Readable Feature Descriptions
125
- A dictionary mapping feature names to plain-English descriptions, used by the summary generator:
126
- ```python
127
- FEATURE_DESCRIPTIONS = {
128
- 'exclamation_density': 'Exclamation marks per sentence',
129
- 'dollar_sign_count': 'Dollar signs found',
130
- 'caps_word_ratio': 'ALL-CAPS word ratio',
131
- ...
132
- }
133
- ```
134
-
135
- ### Code Style
136
- - Beginner-friendly: explicit `for` loops instead of one-liner comprehensions
137
- - Comments explaining *why*, referencing course concepts (e.g., "scaling features so kNN/SVM treats them equally — Module 7A")
138
- - No advanced Python patterns
139
-
140
- ---
141
-
142
- ## 4. Gradio App (`app.py`)
143
-
144
- ### Interface Layout
145
- - **Input area:**
146
- - `gr.Textbox` — paste email text directly
147
- - `gr.File` — upload `.txt` file (reads content into the text box)
148
- - `gr.Examples` — 3-4 pre-loaded example emails
149
-
150
- - **Output tabs** (using `gr.Tab`):
151
- - **Result** — spam/ham label, confidence %, plain-English summary
152
- - **LIME** — LIME explanation plot (matplotlib figure)
153
- - **SHAP** — SHAP feature importance bar chart (matplotlib figure)
154
-
155
- Note: `.eml` file support is out of scope (MIME parsing is complex). Only plain `.txt` files accepted for upload.
156
-
157
- ### Example Emails (built-in)
158
- 1. Nigerian Prince spam (obvious spam)
159
- 2. Legitimate newsletter (ham)
160
- 3. Phishing attempt (subtle spam)
161
- 4. Normal personal email (ham)
162
-
163
- ### Plain-English Summary
164
- Uses the LIME explanation output (already computed for the LIME tab) to get the top 5 most influential features and their contributions. Maps feature indices to human-readable descriptions via `FEATURE_DESCRIPTIONS` dict, then formats as bullet points:
165
-
166
- > This email was classified as **SPAM** (92% confidence) because:
167
- > - High exclamation density (3.2 per sentence)
168
- > - Contains spam phrases: 'act now', 'you have won'
169
- > - 4 suspicious URLs detected
170
- > - High ALL-CAPS word ratio (18%)
171
-
172
- For TF-IDF features (word-level), the summary says "Contains word: '[word]'" with contribution direction.
173
-
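The summary step might look like the sketch below. It assumes the LIME output has already been reduced to `(feature_name, weight)` pairs with clean feature names, and that a positive weight pushes toward spam. LIME's tabular explainer actually returns condition strings such as `feature <= 0.5`, so the real code needs an extra parsing step that is not shown. `build_summary` and `contribution_size` are illustrative names.

```python
FEATURE_DESCRIPTIONS = {
    "exclamation_density": "Exclamation marks per sentence",
    "dollar_sign_count": "Dollar signs found",
    "caps_word_ratio": "ALL-CAPS word ratio",
}


def contribution_size(pair):
    # Sort key: how strongly the feature influenced the prediction
    return abs(pair[1])


def build_summary(label, confidence, weighted_features, top_n=5):
    ranked = sorted(weighted_features, key=contribution_size, reverse=True)
    percent = round(confidence * 100)
    lines = []
    lines.append(
        "This email was classified as **" + label + "** ("
        + str(percent) + "% confidence) because:"
    )
    for name, weight in ranked[:top_n]:
        # Metadata features get their description; anything else is a TF-IDF word
        if name in FEATURE_DESCRIPTIONS:
            description = FEATURE_DESCRIPTIONS[name]
        else:
            description = "Contains word: '" + name + "'"
        if weight > 0:
            direction = "pushes toward spam"
        else:
            direction = "pushes toward ham"
        lines.append("- " + description + " (" + direction + ")")
    return "\n".join(lines)
```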
174
- ### LIME Configuration
175
- - Use `lime.lime_tabular.LimeTabularExplainer`
176
- - Training data: loaded from `training_sample.joblib` (200-row dense sample saved during training)
177
- - Feature names: loaded from `feature_names.joblib`
178
- - `num_features=10` for explanation plots
179
- - The explainer works on the combined dense feature matrix (TF-IDF + metadata converted to dense for the single email being explained — this is fine for one sample at a time)
180
-
181
- ### SHAP Configuration
182
- - Use `shap.KernelExplainer` with a small background sample (50 rows from training_sample)
183
- - This is model-agnostic and works with the VotingClassifier
184
- - For performance: only compute SHAP on the metadata features (24 features) — not the full 3000+ TF-IDF features. This keeps SHAP fast (<5 seconds) and the bar chart readable.
185
- - The SHAP tab title should note: "SHAP — Metadata Feature Importance"
186
-
187
- ### Error Handling
188
- - **Models not trained**: Show a clear message "Models not found. Run `python train.py` first." and disable the classify button
189
- - **Empty input**: Show "Please enter email text or upload a file."
190
- - **Invalid file**: Show "Could not read file. Please upload a .txt file."
191
-
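The empty-input and invalid-file cases can be handled in one small helper. This is a sketch under the assumption that the upload arrives as a file path string (or `None`); `read_email_input` is a hypothetical name, and the missing-models check is left out because it depends on how `app.py` loads its artifacts.

```python
def read_email_input(pasted_text, file_path):
    # Returns (email_text, error_message); one of the two is always None
    if file_path is not None:
        # Only plain .txt uploads are supported
        if not file_path.lower().endswith(".txt"):
            return None, "Could not read file. Please upload a .txt file."
        try:
            with open(file_path, "r", encoding="utf-8") as handle:
                file_text = handle.read()
        except (OSError, UnicodeDecodeError):
            return None, "Could not read file. Please upload a .txt file."
        if file_text.strip() != "":
            return file_text, None

    # Fall back to the text box when there is no usable upload
    if pasted_text is None or pasted_text.strip() == "":
        return None, "Please enter email text or upload a file."
    return pasted_text, None
```

Returning the message instead of raising lets the Gradio callback show it in the Result tab without a traceback.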
192
- ### Classification Flow
193
- 1. Read email text from input (textbox or uploaded file)
194
- 2. Preprocess with `utils.preprocess_text()`
195
- 3. TF-IDF transform with saved vectorizer
196
- 4. Compute metadata features with `utils.compute_metadata_features()`
197
- 5. Scale metadata with saved scaler
198
- 6. Combine TF-IDF + metadata
199
- 7. Predict with voting model
200
- 8. Apply optimal threshold
201
- 9. Generate LIME explanation (full feature space)
202
- 10. Generate SHAP explanation (metadata features only)
203
- 11. Generate plain-English summary (from LIME output)
204
- 12. Return all outputs to Gradio tabs
205
-
206
- ---
207
-
208
- ## 5. macOS Compatibility
209
-
210
- ### Removed from old project
211
- - No `.bat` files — launch with `python app.py`
212
- - No hardcoded `C:\Users\balfo\...` paths
213
- - No `pytesseract` / OCR dependency
214
- - No `streamlit_js_eval` / browser localStorage
215
- - No Streamlit at all
216
-
217
- ### Path handling
218
- - All paths use `pathlib.Path(__file__).parent` (cross-platform)
219
- - Data accessed via symlink to existing corpus
220
-
221
- ---
222
-
223
- ## 6. Dependencies (`requirements.txt`)
224
-
225
- ```
226
- numpy>=1.24.0
227
- pandas>=2.0.0
228
- matplotlib>=3.7.0
229
- scikit-learn>=1.3.0
230
- scipy>=1.11.0
231
- nltk>=3.8.0
232
- lime>=0.2.0
233
- shap>=0.44.0
234
- gradio>=4.0.0
235
- joblib>=1.3.0
236
- tqdm>=4.65.0
237
- ```
238
-
239
- **Not included:** streamlit, eli5, wordcloud, seaborn, pytesseract, Pillow, streamlit-js-eval, requests (no Ollama).
240
-
241
- ---
242
-
243
- ## 7. Changelog (`CHANGELOG.md`)
244
-
245
- Maintained in the new project root. Format:
246
-
247
- ```markdown
248
- ## vX.Y.Z — YYYY-MM-DD
249
- ### Title
250
- - What changed and why
251
- ```
252
-
253
- Updated every time a change or improvement is made. Serves as the primary reference for writing the course paper's methodology section.
254
-
255
- ---
256
-
257
- ## 8. Retroactive Changelog for Old Project (separate task)
258
-
259
- Create `CHANGELOG.md` in `spam-xai-project/` by reconstructing the development history from:
260
- - File modification timestamps
261
- - Code comments (e.g., "Change 4: Context-aware phrase lists")
262
- - Progression from `app.py` → `app_student.py`, `retrain.py` → `retrain_student.py`
263
- - Feature additions visible in the code (11 features → 24 features, LLM integration, OCR support, etc.)
264
-
265
- This is a documentation task, separate from the new Gradio project implementation.
266
-
267
- ---
268
-
269
- ## 9. Accuracy Improvements Over Old Project
270
-
271
- ### Data Quality
272
- - **Deduplication**: Remove exact-duplicate emails that inflate metrics
273
- - **Same data sources**: Kaggle + GitHub email-dataset (matching old project)
274
-
275
- ### Model Comparison
276
- - Old project: Random Forest only
277
- - New project: Compare RF, Logistic Regression, SVM — pick best by F1
278
-
279
- ### Ensemble
280
- - Combine models via `VotingClassifier` (soft voting) — typically 2-5% better than any single model
281
- - Use `class_weight='balanced'` on all sub-estimators (same approach as old project)
282
-
283
- ### Same Feature Set
284
- - Keep the 24 metadata features from `utils_student.py` — they're well-designed
285
- - Keep 3000 TF-IDF features with identical vectorizer params — proven effective
286
-
287
- ---
288
-
289
- ## 10. Out of Scope
290
-
291
- - No LLM/Ollama integration
292
- - No OCR/image support
293
- - No ELI5
294
- - No `.eml` file parsing (only `.txt` uploads)
295
- - No deployment (Hugging Face Spaces, Vercel, etc.) — local only
296
- - No deep learning models
297
- - No fine-tuning
298
- - No database or persistent storage
docs/superpowers/specs/2026-03-23-mlx-spam-classifier-design.md DELETED
@@ -1,311 +0,0 @@
1
- # Design: Spam Classifier with Fine-Tuned LLM (MLX)
2
-
3
- **Date:** 2026-03-23
4
- **Project:** ENGT 375 — Applied Machine Learning, Spring 2026, ODU
5
- **Goal:** Fine-tune Qwen3.5-0.8B on spam/ham email classification using Apple MLX with LoRA, then build a Gradio UI with classify and chat modes. Optionally deploy to Hugging Face Spaces.
6
-
7
- ---
8
-
9
- ## 1. Project Structure
10
-
11
- ```
12
- /Volumes/Projects/Spring 2026/APPLIED MACHINE LEARNING/spam-classifier-mlx/
13
- ├── prepare_data.py # Generate training data using Qwen3.5-9B
14
- ├── fine_tune.py # Fine-tune Qwen3.5-0.8B with LoRA
15
- ├── app.py # Gradio UI (classify tab + chat tab)
16
- ├── requirements.txt # Dependencies
17
- ├── .gitignore # Exclude model weights, cache, etc.
18
- ├── CLAUDE.md # Project instructions
19
- ├── CHANGELOG.md # Version history for paper reference
20
- ├── launch.command # Double-click to run Gradio app
21
- ├── launch-notebook.command # Double-click to open notebook
22
- ├── spam_classifier_mlx.ipynb # Project notebook for submission
23
- ├── data/ # Symlink → spam-xai-project/data
24
- ├── training_data/ # Generated JSONL files
25
- │ ├── train.jsonl # 500 training examples
26
- │ └── test.jsonl # 100 test examples
27
- ├── adapters/ # LoRA adapter weights (fine-tuning output)
28
- ├── fused_model/ # (optional) Merged model for deployment
29
- └── models/ # Base model (downloaded from HuggingFace)
30
- └── Qwen3.5-0.8B-MLX-9bit/
31
- ```
32
-
33
- **Location:** New folder `spam-classifier-mlx/` created alongside `spam-xai-project/` and `spam-classifier-gradio/` in the Applied Machine Learning directory.
34
-
35
- **Data strategy:** Symlink `data/` from `spam-xai-project/data/` (same as the Gradio project).
36
-
37
- ---
38
-
39
- ## 2. Environment Setup
40
-
41
- ### Python Environment
42
- - Create a dedicated venv: `python3 -m venv venv`
43
- - Activate: `source venv/bin/activate`
44
- - Python 3.9+ required (system Python 3.9 on this Mac is fine; 3.11+ preferred if available)
45
-
46
- ### Install MLX-LM
47
- ```bash
48
- pip install "mlx-lm[train]"
49
- ```
50
- This installs: `mlx`, `mlx-lm`, `transformers`, `safetensors`, `sentencepiece`, `tiktoken` (transitive deps).
51
-
52
- ### Hardware
53
- - **Machine:** MacBook Pro M4 Pro, 24GB unified RAM
54
- - **Memory budget — data generation:** ~6-8GB (9B model at 4-bit). Close other large apps during this phase.
55
- - **Memory budget — fine-tuning:** ~3-5GB (0.8B model at 9-bit + LoRA + gradient checkpointing). Comfortable on 24GB.
56
- - **Memory budget — inference:** ~1GB (0.8B model + adapter). Very light.
57
-
58
- ### Models
59
- - **Base model for fine-tuning:** `inferencerlabs/Qwen3.5-0.8B-MLX-9bit` from Hugging Face
60
- - 9-bit quantized, 847MB on disk, ~0.84 GiB in memory, ~231 tokens/s
61
- - 24 transformer layers
62
- - Source: https://huggingface.co/inferencerlabs/Qwen3.5-0.8B-MLX-9bit
63
- - Download: `mlx_lm.generate --model inferencerlabs/Qwen3.5-0.8B-MLX-9bit --prompt "test"` (auto-downloads on first use, or `huggingface-cli download inferencerlabs/Qwen3.5-0.8B-MLX-9bit --local-dir models/Qwen3.5-0.8B-MLX-9bit`)
64
- - **Model for generating training data:** Local `Qwen3.5-9B-OptiQ-4bit` at `~/MLXModels/mlx-community/Qwen3.5-9B-OptiQ-4bit/`
65
-
66
- ### MLX-LM CLI Reference (from official docs at github.com/ml-explore/mlx-lm)
67
- - **Train:** `mlx_lm.lora --model <path> --train --data <dir> --iters 600`
68
- - **Evaluate:** `mlx_lm.lora --model <path> --adapter-path adapters/ --data <dir> --test`
69
- - **Generate:** `mlx_lm.generate --model <path> --adapter-path adapters/ --prompt "..."`
70
- - **Fuse:** `mlx_lm.fuse --model <path>` → saves to `fused_model/`
71
- - **CLI flags use kebab-case:** `--mask-prompt`, `--grad-checkpoint`, `--num-layers`, `--batch-size`
72
- - **YAML config uses underscores or same kebab names**
73
- - **Data format:** JSONL with chat format (see Section 3)
74
- - **QLoRA:** Automatic when base model is quantized
75
-
76
- ---
77
-
78
- ## 3. Data Preparation (`prepare_data.py`)
79
-
80
- ### Source Data
81
- - Kaggle spam dataset at `data/spam_Emails_data.csv` — **193,852 emails** (102,160 ham, 91,692 spam)
82
- - Stratified sample: 300 spam + 300 ham = 600 total for training data generation
83
-
84
- ### Generating Explanations
85
- Use the local Qwen3.5-9B-OptiQ-4bit model via `mlx_lm` Python API to create natural language explanations for each email.
86
-
87
- ```python
88
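- import os
-
- from mlx_lm import load, generate
-
- # load() does not expand "~" itself, so resolve the home directory first
- model_path = os.path.expanduser("~/MLXModels/mlx-community/Qwen3.5-9B-OptiQ-4bit")
- model, tokenizer = load(model_path)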
91
- response = generate(model, tokenizer, prompt=prompt_text, max_tokens=200)
92
- ```
93
-
94
- Prompt template:
95
- ```
96
- Classify this email as SPAM or HAM. Give your classification on the first line,
97
- then explain your reasoning in 2-3 sentences. Be specific about what words,
98
- patterns, or signals you noticed.
99
-
100
- Email:
101
- {email_text_truncated_to_500_chars}
102
- ```
103
-
104
- ### Synthetic Conversational Data
105
- In addition to the 600 classify examples, generate ~50 synthetic Q&A conversation examples for the chat mode. Topics: why spam uses dollar signs, how phishing differs from spam, what makes newsletters look like spam, common spam patterns, etc. Generated via the 9B model with diverse hardcoded prompts.
106
-
107
- ### Output Format (MLX-LM chat JSONL)
108
- ```json
109
- {"messages": [
110
- {"role": "system", "content": "You are an email spam classifier. Analyze the email and classify it as SPAM or HAM. Explain your reasoning."},
111
- {"role": "user", "content": "Classify this email:\n\nSubject: You Won $5M!!!..."},
112
- {"role": "assistant", "content": "SPAM\n\nThis email uses classic lottery scam tactics: a large prize claim ($5M), urgency language ('act now'), and requests for bank details. The all-caps subject line and excessive exclamation marks are strong spam indicators."}
113
- ]}
114
- ```
115
-
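The example above is pretty-printed for readability, but a JSONL file needs each record on a single line. A sketch of writing one record with the standard `json` module follows; `make_chat_record` is an illustrative name, and the system prompt is copied from the example.

```python
import json

SYSTEM_PROMPT = (
    "You are an email spam classifier. Analyze the email and classify it "
    "as SPAM or HAM. Explain your reasoning."
)


def make_chat_record(email_text, label, explanation):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Classify this email:\n\n" + email_text},
        {"role": "assistant", "content": label + "\n\n" + explanation},
    ]
    # json.dumps escapes newlines inside strings, so the record stays on one line
    return json.dumps({"messages": messages})
```

Each returned string is written to `train.jsonl` or `test.jsonl` followed by a single `"\n"`.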
116
- ### Data Quality Validation
117
- After generation, before saving:
118
- 1. Parse each response to extract the classification label (first line: SPAM or HAM)
119
- 2. Compare against ground truth from the Kaggle dataset
120
- 3. Discard mismatches (9B model classified differently than ground truth) — resample replacement emails
121
- 4. **Manual inspection:** Print 10 random examples for spot-checking quality and JSONL formatting
122
- 5. Verify all JSONL lines parse correctly with `json.loads()`
123
-
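Steps 1 to 3 of the validation might be sketched as follows, assuming each generated example is a dict with a `response` string and a `true_label` of `spam` or `ham`. Those key names and both function names are assumptions, and the resampling of replacement emails is not shown.

```python
def extract_label(response_text):
    # The label is expected on the first non-empty line of the response
    for line in response_text.splitlines():
        cleaned = line.strip().upper()
        if cleaned == "":
            continue
        if cleaned.startswith("SPAM"):
            return "SPAM"
        if cleaned.startswith("HAM"):
            return "HAM"
        return None
    return None


def keep_matching(examples):
    # Keep only examples where the 9B model agrees with the ground truth
    kept = []
    discarded = 0
    for example in examples:
        predicted = extract_label(example["response"])
        if predicted == example["true_label"].upper():
            kept.append(example)
        else:
            discarded += 1
    return kept, discarded
```

The `discarded` count tells `prepare_data.py` how many replacement emails to draw.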
124
- ### Split
125
- - `training_data/train.jsonl` — 500 examples (450 classify + 50 conversational)
126
- - `training_data/test.jsonl` — 100 examples (classify only, for perplexity evaluation)
127
-
128
- ### Expected Time
129
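- ~20-40 minutes for 600 emails, each needing up to 200 generated tokens from the 9B model. The ~231 tokens/s figure quoted in Section 2 applies to the 0.8B model, so expect the 9B model to generate more slowly.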
130
-
131
- ---
132
-
133
- ## 4. Fine-Tuning (`fine_tune.py`)
134
-
135
- ### CLI Command (what the script wraps)
136
- ```bash
137
- mlx_lm.lora \
138
- --model models/Qwen3.5-0.8B-MLX-9bit \
139
- --train \
140
- --data training_data \
141
- --iters 600 \
142
- --batch-size 2 \
143
- --learning-rate 1e-5 \
144
- --num-layers 24 \
145
- --adapter-path adapters \
146
- --mask-prompt \
147
- --grad-checkpoint
148
- ```
149
-
150
- Key decisions:
151
- - `--mask-prompt` — only compute loss on the assistant's response, not the user's prompt
152
- - `--grad-checkpoint` — saves memory by recomputing activations during backward pass
153
- - `--batch-size 2` — conservative for 24GB memory
154
- - `--iters 600` — standard for small datasets (~2.4 epochs over 500 examples at batch size 2, since 600 × 2 = 1,200 examples seen)
155
- - `--num-layers 24` — fine-tune all 24 transformer layers with LoRA (the model has exactly 24 layers)
156
-
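The number of passes over the training set implied by these flags is simple arithmetic: iterations times batch size, divided by the number of training examples. The helper name below is illustrative.

```python
def epochs_seen(iterations, batch_size, number_of_examples):
    # Each iteration processes one batch of examples
    examples_processed = iterations * batch_size
    return examples_processed / number_of_examples


print(epochs_seen(600, 2, 500))  # prints 2.4
```

Changing `--batch-size` or the size of `train.jsonl` changes this number, so it is worth recomputing before adjusting `--iters`.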
157
- ### Script Behavior
158
- `fine_tune.py` is a thin wrapper that:
159
- 1. Checks if base model exists locally, downloads if not
160
- 2. Checks if `training_data/train.jsonl` exists
161
- 3. Runs the `mlx_lm.lora` command via subprocess
162
- 4. Prints training loss progress
163
- 5. Runs evaluation on test set (prints perplexity)
164
- 6. Prints "Training complete! Adapter saved to adapters/"
165
-
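Steps 1 to 3 could be sketched as below. The flags are the ones listed in the CLI command above; `find_missing_inputs` and `build_lora_command` are hypothetical helper names. The sketch only reports a missing model instead of downloading it, and it builds the command without running it.

```python
from pathlib import Path


def find_missing_inputs(model_dir, train_file):
    # Collect every required path that does not exist yet
    missing = []
    if not Path(model_dir).exists():
        missing.append(model_dir)
    if not Path(train_file).exists():
        missing.append(train_file)
    return missing


def build_lora_command(model_dir, data_dir, adapter_dir):
    # A list of arguments avoids shell-quoting problems with paths that contain spaces
    command = [
        "mlx_lm.lora",
        "--model", model_dir,
        "--train",
        "--data", data_dir,
        "--iters", "600",
        "--batch-size", "2",
        "--learning-rate", "1e-5",
        "--num-layers", "24",
        "--adapter-path", adapter_dir,
        "--mask-prompt",
        "--grad-checkpoint",
    ]
    return command
```

The wrapper would then pass the list to `subprocess.run(command, check=True)`, which raises an error if training exits with a non-zero status.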
166
- ### Capacity Expectations
167
- The fine-tuned 0.8B model's explanations will be simpler than the 9B model's training data. This is expected — the 0.8B model has less capacity for nuanced reasoning. For the project, this is fine and actually demonstrates an interesting finding for the paper: how model size affects explanation quality.
168
-
169
- ### Estimated Time
170
- ~10-20 minutes on M4 Pro.
171
-
172
- ### Output
173
- - `adapters/` — LoRA adapter weights (small, ~10-50MB)
174
- - Training loss curve printed to terminal
175
-
176
- ---
177
-
178
- ## 5. Gradio App (`app.py`)
179
-
180
- ### Model Loading
181
- At startup, load the base model + LoRA adapter:
182
- ```python
183
- from mlx_lm import load, generate
184
-
185
- model, tokenizer = load("models/Qwen3.5-0.8B-MLX-9bit", adapter_path="adapters")
186
- ```
187
-
188
- ### Tab 1: Classify
189
- - **Input:** `gr.Textbox` (paste email, 12 lines) + `gr.File` (.txt upload) + `gr.Examples`
190
- - **Output:** `gr.Markdown` with classification result and explanation
191
- - **Flow:** Wrap email in system+user prompt template → `generate(model, tokenizer, prompt, max_tokens=300)` → display result
192
- - **Example emails:** Same 4 from the sklearn project:
193
- 1. Nigerian Prince spam
194
- 2. Team meeting invite (ham)
195
- 3. Phishing attempt
196
- 4. Family Thanksgiving email (ham)
197
-
198
- ### Tab 2: Chat
199
- - **Input:** `gr.ChatInterface` for conversational back-and-forth
200
- - **System prompt:** "You are a spam email analysis expert. You can classify emails as spam or ham, explain your reasoning, and answer questions about email security and spam patterns."
201
- - **Features:** Conversation history maintained via Gradio's built-in chat state
202
- - **Generation:** `generate(model, tokenizer, prompt, max_tokens=500)` — no streaming, to keep the code simple (`mlx_lm` also provides `stream_generate`, but standard `generate` is fast enough at ~231 tok/s on this model)
203
-
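Before generation, the chat history has to be turned into a message list. The sketch assumes Gradio passes `history` as a list of `[user_message, assistant_message]` pairs, which is how the pinned 4.x release is understood to behave; newer releases can use a different format. `build_messages` is an illustrative name.

```python
CHAT_SYSTEM_PROMPT = (
    "You are a spam email analysis expert. You can classify emails as spam "
    "or ham, explain your reasoning, and answer questions about email "
    "security and spam patterns."
)


def build_messages(user_message, history):
    # Start every conversation with the system prompt
    messages = [{"role": "system", "content": CHAT_SYSTEM_PROMPT}]

    # Replay earlier turns so the model sees the whole conversation
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})

    # Add the new question last
    messages.append({"role": "user", "content": user_message})
    return messages
```

The resulting list would typically be converted to a prompt string with the tokenizer's `apply_chat_template(messages, add_generation_prompt=True)` before calling `generate`.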
204
- ### Error Handling
205
- - Model/adapter not found → "Model not found. Run `python3 fine_tune.py` first."
206
- - Empty input → "Please enter email text or upload a file."
207
- - Generation: cap at 500 tokens max
208
-
209
- ---
210
-
211
- ## 6. Dependencies (`requirements.txt`)
212
-
213
- ```
214
- mlx>=0.22.0
215
- mlx-lm>=0.22.0
216
- gradio==4.19.2
217
- numpy>=1.24.0
218
- pandas>=2.0.0
219
- ```
220
-
221
- Notes:
222
- - `gradio==4.19.2` pinned because 4.44.1 has a bug with Python 3.9 on this Mac (confirmed in the sklearn project)
223
- - `mlx-lm` pulls in `transformers`, `safetensors`, `sentencepiece`, `tiktoken` as transitive deps
224
- - `huggingface-hub` comes as a transitive dep of `mlx-lm` — no need to pin separately
225
-
226
- ---
227
-
228
- ## 7. .gitignore
229
-
230
- ```
231
- __pycache__/
232
- *.pyc
233
- .pytest_cache/
234
- venv/
235
- models/
236
- adapters/
237
- fused_model/
238
- training_data/
239
- data/
240
- *.egg-info/
241
- .DS_Store
242
- ```
243
-
244
- Model weights, adapters, and generated training data are excluded from git (too large). The scripts that create them are tracked.
245
-
246
- ---
247
-
248
- ## 8. Notebook (`spam_classifier_mlx.ipynb`)
249
-
250
- Step-by-step guide for course submission:
251
-
252
- 1. **Introduction** — What is fine-tuning? What is MLX? Why Apple Silicon?
253
- 2. **What is LoRA?** — Explain low-rank adaptation in simple terms (frozen base + small trainable matrices)
254
- 3. **Environment Setup** — Installing mlx-lm, downloading the model
255
- 4. **Data Preparation** — Loading emails, generating explanations with 9B model, formatting as JSONL
256
- 5. **Inspecting Training Data** — Show 5 examples, discuss quality
257
- 6. **Fine-Tuning** — Running mlx_lm.lora, monitoring loss, understanding the process
258
- 7. **Evaluation** — Test perplexity, manual testing with example emails
259
- 8. **Building the Gradio Interface** — Code walkthrough
260
- 9. **Comparison with sklearn Approach** — Side-by-side: accuracy, edge cases (Lenovo email), explainability style
261
- 10. **Results and Conclusions**
262
-
263
- Code style: Beginner-friendly, explicit loops, comments referencing course concepts.
264
-
265
- ---
266
-
267
- ## 9. Optional: Hugging Face Spaces Deployment
268
-
269
- **Problem:** HF Spaces runs Linux servers (CPU/GPU), not Apple Silicon. MLX requires Apple Silicon.
270
-
271
- **Solution:** After fine-tuning locally:
272
- 1. Fuse LoRA adapter: `mlx_lm.fuse --model models/Qwen3.5-0.8B-MLX-9bit`
273
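- 2. The fused model can be uploaded to Hugging Face, but its weights are still MLX-quantized. They will likely need to be de-quantized during fusing (see `mlx_lm.fuse --help`) before `transformers` can load them on other hardware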
274
- 3. Create a separate `app_hf.py` that uses `transformers` + `torch` instead of `mlx_lm`
275
- 4. Deploy to HF Spaces with appropriate `requirements.txt`
276
-
277
- This is a stretch goal — the main project works locally with MLX.
278
-
279
- ---
280
-
281
- ## 10. CHANGELOG
282
-
283
- Same format as the sklearn project:
284
- ```markdown
285
- ## vX.Y.Z — YYYY-MM-DD
286
- ### Title
287
- - What changed and why
288
- ```
289
- Starting at v0.1.0. Updated with every change, improvement, bug fix, or finding.
290
-
291
- ---
292
-
293
- ## 11. Code Style
294
-
295
- - Beginner-level Python matching ENGT 375 lecture style
296
- - Explicit `for` loops instead of comprehensions
297
- - Comments explaining *why*, referencing course concepts
298
- - Variable names that read like English
299
- - No decorators, no metaclasses, no advanced patterns
300
- - Each file under ~300 lines
301
-
302
- ---
303
-
304
- ## 12. Out of Scope
305
-
306
- - No full fine-tuning (LoRA only)
307
- - No RLHF/DPO alignment
308
- - No sklearn/LIME/SHAP (that's the other project)
309
- - No Ollama dependency
310
- - No multi-GPU training
311
- - No custom tokenizer training