title: WSDC Speech Judge Assistant
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: gradio
python_version: '3.10'
sdk_version: 5.23.0
app_file: app.py
pinned: true
Space 3 — WSDC Speech Judge Assistant
Part of Prea Callahan's AI + Research Level 2 portfolio. See the full research journal and the research brief for context.
What this Space does
This is the full two-factor pipeline. You upload a short WSDC-style speech clip (10 seconds to about 4 minutes) and the Space returns three scores plus two kinds of feedback:
- Delivery score (0–100), derived from four prosodic features computed from Whisper-small word-level timestamps
- Content score (0–100), derived from SmolLM2-1.7B-Instruct's rubric evaluation of the transcript on three dimensions (claim clarity, evidence quality, rebuttal strength)
- Combined score (simple average of the two)
- "Moments that mattered" — the three longest pauses in the clip with timestamps, so a debater can listen back to those exact seconds
- Coaching note — a short paragraph from the LLM, constrained to 2–3 sentences by the rubric prompt
Three tabs
- Score. Just the three numbers. The use case Teammate B asked for — one glance, one number, one sentence.
- Breakdown. The four prosodic features listed individually, the LLM's rubric output on all three dimensions, and the full transcript. The use case for someone who wants to dig in.
- Coach. The longest-pause timestamps and the LLM's coaching paragraph. The use case Teammate A asked for — pointers back to specific moments in the clip.
- Raw JSON. For debugging. For me.
See research-journal.md, Week 9, for the interview notes that shaped this tab structure.
The architecture, in one diagram
audio file
│
▼
Hugging Face Inference API
openai/whisper-small ──► transcript + word-level timestamps
│ │
│ ▼
│ Python prosodic features:
│ - words per minute
│ - pause count (>400ms)
│ - pause-duration variance
│ - speaking-rate variance (thirds)
│ │
▼ ▼
Hugging Face Inference API Hand-tuned normalization
SmolLM2-1.7B-Instruct (features → 0–100 delivery score)
+ WSDC rubric prompt │
│ │
▼ ▼
rubric JSON Combined score (average)
- claim clarity
- evidence quality
- rebuttal strength
- coaching note
The whole pipeline is a thin client over the Hugging Face Inference API. No local model weights, no OOMs on free-tier CPU. The round trip for a 90-second clip is about 15 seconds end to end: 8 seconds for Whisper, 6 seconds for SmolLM2, and the Python feature extraction is effectively instant.
This is the free-tools translation of the Mistral Voxtral pipeline pattern described in Designing a speech-to-speech assistant (Mistral AI, 2025). That blog post is the architectural ancestor of this Space. See research-journal.md, Week 5, for the pivot story.
Results from Week 10
I tested Space 3 on 20 clips (10 TED Talk openings and 10 student WSDC practice speeches). I rated each clip on a 1–5 persuasiveness scale before running it through the tool, then computed Spearman rank correlations between my ratings and each of the three scores the tool produces.
| Clip set | Delivery ↔ rating | Content ↔ rating | Combined ↔ rating |
|---|---|---|---|
| TED (n=10) | 0.52 | 0.48 | 0.61 |
| Student WSDC (n=10) | 0.63 | 0.24 | 0.58 |
| Overall (n=20) | 0.57 | 0.38 | 0.63 |
The combined score was the best predictor of my intuitive rating. On student debate clips, the delivery score is substantially more useful than the content score — SmolLM2 is not a debate expert and the rubric prompt can only push it so far. On TED clips, the two modalities roughly match.
n=20 is a pilot study. These numbers would not survive a real evaluation. See the research brief for the honest limitations discussion.
Running this Space
You need a Hugging Face access token with read-level permissions. Add it as a Space secret:
- Settings → Variables and secrets → New secret
- Name:
HF_TOKEN - Value: a read token from your Hugging Face settings page
The free Inference API tier is rate-limited but works fine for demo use.
Known limitations
- Small test set (n=20). Everything above the Spearman table is either hand-tuning or extrapolation.
- Single rater. I rated all 20 clips myself. A real evaluation would need multiple raters and an inter-rater agreement metric.
- ASR bias on non-native speakers. Per Koenecke et al. (2020) and Li et al. (2024), Whisper has documented performance disparities across speaker groups. Two of my five teammate sources are non-native English speakers. I did not correct for this.
- Small-model content scoring. SmolLM2-1.7B-Instruct is not a debate expert. A better study would fine-tune a content-scoring model on actual judge ballots.
- Hand-tuned delivery normalization. The mapping from prosodic features to a 0–100 delivery score is not learned from data. It is hand-tuned to a small n and reflects my own sense of what "good delivery" looks like in WSDC, which is itself one cultural tradition among many (Kišiček 2018).
Files
app.py— Gradio Blocks interface with four tabs and the full pipeline.requirements.txt— Justgradioandrequests.
Course
Built for AI + Research Level 2, Youth Horizons Learning, Spring 2026.