---
title: WSDC Speech Judge Assistant
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: gradio
python_version: "3.10"
sdk_version: "5.23.0"
app_file: app.py
pinned: true
---

# Space 3 — WSDC Speech Judge Assistant

*Part of Prea Callahan's AI + Research Level 2 portfolio. See the full [research journal](./research-journal.md) and the [research brief](./research-brief.md) for context.*

## What this Space does

This is the full two-factor pipeline. You upload a short WSDC-style speech clip (10 seconds to about 4 minutes) and the Space returns three scores plus two kinds of feedback:

- **Delivery score** (0–100), derived from four prosodic features computed from Whisper-small word-level timestamps
- **Content score** (0–100), derived from SmolLM2-1.7B-Instruct's rubric evaluation of the transcript on three dimensions (claim clarity, evidence quality, rebuttal strength)
- **Combined score** (simple average of the two)
- **"Moments that mattered"** — the three longest pauses in the clip with timestamps, so a debater can listen back to those exact seconds
- **Coaching note** — a short paragraph from the LLM, constrained to 2–3 sentences by the rubric prompt
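
The "moments that mattered" list and the combined score can be sketched as below. This is a minimal illustration, not the code in `app.py`: the function names are hypothetical, and the word format (a list of dicts with `start` and `end` in seconds) is an assumption about how the Whisper timestamps are stored after parsing.

```python
def longest_pauses(words, top_n=3):
    """Return the top_n longest silent gaps between consecutive words.

    Assumption: `words` is a list of dicts with "start" and "end" keys
    (seconds), ordered by time. The real API response may be shaped
    differently and need converting first.
    """
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        duration = nxt["start"] - prev["end"]
        if duration > 0:
            gaps.append(
                {"start": prev["end"], "end": nxt["start"], "duration": duration}
            )
    # Longest gaps first, so a debater can jump to the biggest silences.
    gaps.sort(key=lambda g: g["duration"], reverse=True)
    return gaps[:top_n]


def combined_score(delivery, content):
    """Simple average of the two 0-100 scores."""
    return (delivery + content) / 2
```

Each returned gap carries its own start and end time, which is what lets the Coach tab point back to exact seconds in the clip.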

## Four tabs

1. **Score.** Just the three numbers. The use case Teammate B asked for — one glance, one number, one sentence.
2. **Breakdown.** The four prosodic features listed individually, the LLM's rubric output on all three dimensions, and the full transcript. The use case for someone who wants to dig in.
3. **Coach.** The longest-pause timestamps and the LLM's coaching paragraph. The use case Teammate A asked for — pointers back to specific moments in the clip.
4. **Raw JSON.** For debugging. For me.

See research-journal.md, Week 9, for the interview notes that shaped this tab structure.

## The architecture, in one diagram

```
  audio file
      │
      ▼
  Hugging Face Inference API
    openai/whisper-small         ──► transcript + word-level timestamps
      │                               │
      │                               ▼
      │                         Python prosodic features:
      │                           - words per minute
      │                           - pause count (>400ms)
      │                           - pause-duration variance
      │                           - speaking-rate variance (thirds)
      │                               │
      ▼                               ▼
  Hugging Face Inference API     Hand-tuned normalization
    SmolLM2-1.7B-Instruct         (features → 0–100 delivery score)
    + WSDC rubric prompt                  │
      │                                   │
      ▼                                   ▼
  rubric JSON                         Combined score (average)
    - claim clarity
    - evidence quality
    - rebuttal strength
    - coaching note
```
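
The four prosodic features in the diagram could be computed roughly as follows. This is a sketch under stated assumptions, not the exact code in `app.py`: the `prosodic_features` name, the word format (dicts with `start` and `end` in seconds), the use of population variance, and the choice to split the clip into three equal time windows are all assumptions. Only the 400 ms pause cutoff comes from the diagram.

```python
from statistics import pvariance

PAUSE_THRESHOLD_S = 0.4  # the ">400ms" cutoff from the diagram


def prosodic_features(words):
    """Compute four delivery features from word-level timestamps.

    Assumption: `words` is a time-ordered list of dicts with "start"
    and "end" keys in seconds.
    """
    if len(words) < 2:
        raise ValueError("need at least two timestamped words")
    clip_start, clip_end = words[0]["start"], words[-1]["end"]
    total = clip_end - clip_start
    if total <= 0:
        raise ValueError("timestamps must span a positive duration")

    words_per_minute = len(words) / total * 60

    # A pause is a gap between consecutive words longer than the threshold.
    gaps = [b["start"] - a["end"] for a, b in zip(words, words[1:])]
    pauses = [g for g in gaps if g > PAUSE_THRESHOLD_S]
    pause_variance = pvariance(pauses) if pauses else 0.0

    # Speaking rate in each third of the clip, then variance across thirds.
    third = total / 3
    rates = []
    for i in range(3):
        lo = clip_start + i * third
        hi = clip_start + (i + 1) * third
        count = sum(1 for w in words if lo <= w["start"] < hi)
        rates.append(count / third * 60)

    return {
        "words_per_minute": words_per_minute,
        "pause_count": len(pauses),
        "pause_duration_variance": pause_variance,
        "speaking_rate_variance": pvariance(rates),
    }
```

Words are assigned to a third by their start time, so a word that straddles a boundary is counted once, in the window where it begins.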

The whole pipeline is a thin client over the Hugging Face Inference API. No local model weights, no OOMs on free-tier CPU. The round trip for a 90-second clip is about 15 seconds end to end: roughly 8 seconds for Whisper and 6 seconds for SmolLM2, with the Python feature extraction effectively instant.

This is the free-tools translation of the Mistral Voxtral pipeline pattern described in *Designing a speech-to-speech assistant* (Mistral AI, 2025). That blog post is the architectural ancestor of this Space. See research-journal.md, Week 5, for the pivot story.

## Results from Week 10

I tested Space 3 on 20 clips (10 TED Talk openings and 10 student WSDC practice speeches). I rated each clip on a 1–5 persuasiveness scale before running it through the tool, then computed Spearman rank correlations between my ratings and each of the three scores the tool produces.

| Clip set                   | Delivery ↔ rating | Content ↔ rating | Combined ↔ rating |
|----------------------------|-------------------|------------------|-------------------|
| TED (n=10)                 | 0.52              | 0.48             | 0.61              |
| Student WSDC (n=10)        | 0.63              | 0.24             | 0.58              |
| **Overall (n=20)**         | **0.57**          | **0.38**         | **0.63**          |

Overall, the combined score was the best predictor of my intuitive rating (0.63). On student debate clips, the delivery score is substantially more useful than the content score (0.63 vs. 0.24): SmolLM2 is not a debate expert and the rubric prompt can only push it so far. On TED clips, the two scores track my ratings about equally well (0.52 vs. 0.48).
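
For readers who want to reproduce the correlation step, here is a minimal pure-Python sketch of Spearman rank correlation. It is not necessarily how the numbers in the table were computed: the function names are hypothetical, and handling ties with average ranks is an assumption (it matters here because a 1–5 rating scale over ten clips is bound to produce ties).

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Calling `spearman(my_ratings, delivery_scores)` on one clip set gives one cell of the table; `my_ratings` and `delivery_scores` are placeholder names for two equal-length lists.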

**n=20 is a pilot study.** These numbers would not survive a real evaluation. See the research brief for the honest limitations discussion.

## Running this Space

You need a Hugging Face access token with read-level permissions. Add it as a Space secret:

1. **Settings → Variables and secrets → New secret**
2. Name: `HF_TOKEN`
3. Value: a read token from your [Hugging Face settings page](https://huggingface.co/settings/tokens)

The free Inference API tier is rate-limited but works fine for demo use.
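
Inside the Space, the secret is exposed to the app as an environment variable. A minimal sketch of reading it and building the request header, assuming a hypothetical helper named `build_auth_headers` (the actual code in `app.py` may be organized differently):

```python
import os


def build_auth_headers(env=None):
    """Build the Authorization header from the HF_TOKEN secret.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it under Settings > Variables and secrets"
        )
    return {"Authorization": f"Bearer {token}"}
```

Failing loudly when the token is missing makes a misconfigured Space easier to diagnose than an opaque 401 from the API.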

## Known limitations

- **Small test set (n=20).** Everything beyond the Spearman table is either hand-tuning or extrapolation.
- **Single rater.** I rated all 20 clips myself. A real evaluation would need multiple raters and an inter-rater agreement metric.
- **ASR bias on non-native speakers.** Per [Koenecke et al. (2020)](https://www.pnas.org/doi/10.1073/pnas.1915768117) and [Li et al. (2024)](https://aclanthology.org/2024.naacl-long.246/), Whisper has documented performance disparities across speaker groups. Two of my five teammate sources are non-native English speakers. I did not correct for this.
- **Small-model content scoring.** SmolLM2-1.7B-Instruct is not a debate expert. A better study would fine-tune a content-scoring model on actual judge ballots.
- **Hand-tuned delivery normalization.** The mapping from prosodic features to a 0–100 delivery score is not learned from data. It is hand-tuned to a small n and reflects my own sense of what "good delivery" looks like in WSDC, which is itself one cultural tradition among many ([Kišiček 2018](https://doi.org/10.22329/i.v0i0.5098)).

## Files

- `app.py` — Gradio Blocks interface with four tabs and the full pipeline.
- `requirements.txt` — Just `gradio` and `requests`.

## Course

Built for AI + Research Level 2, Youth Horizons Learning, Spring 2026.