# M19: Speech I/O (STT + TTS)
**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs, for audio I/O), X04 (config), X03 (observability), `openai-whisper`, `TTS` (Coqui XTTS-v2), `edge-tts` libs
**Depended on by:** M08 UI (voice query button), M22 mobile (voice notes), M18 (STT can chain into translation)
---
## 1. Responsibility
Two capabilities:
- `stt.transcribe@1.0`: audio → text, with optional translate-to-English
- `tts.synthesize@1.0`: text → audio
The two services live in the same module because they share the speech domain and are often paired (voice query → STT → LLM → TTS).
---
## 2. File layout
```
hearthnet/services/speech/
├── __init__.py
├── stt_service.py
├── tts_service.py
└── backends/
    ├── __init__.py
    ├── base.py              # SttBackend, TtsBackend protocols
    ├── whisper.py           # OpenAI Whisper local
    ├── whisper_remote.py    # HF inference API alternative
    ├── xtts.py              # Coqui XTTS-v2 (cloned voices)
    └── edge_tts.py          # Microsoft Edge-TTS (Christof has existing pipeline)
```
---
## 3. STT: public API
### 3.1 `backends/base.py` (STT)
```python
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class SttSegment:
    start_seconds: float
    end_seconds: float
    text: str
    language: str
    speaker: str | None       # only if diarization enabled
    confidence: float | None


@dataclass(frozen=True)
class SttResult:
    segments: list[SttSegment]
    language: str
    duration_seconds: float
    ms: int


class SttBackend(Protocol):
    name: str
    models: list[str]               # "tiny" | "base" | "small" | "medium" | "large-v3"
    languages_supported: list[str]  # ISO 639-1
    supports_diarization: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def transcribe(
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: str | None,  # "auto" handled by caller
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        """Yields segments as they are produced. Backend may produce in big chunks
        or near-realtime depending on model + hardware."""

    def health(self) -> dict: ...
```
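To illustrate how a caller might consume the streaming `transcribe` interface, here is a minimal sketch with a test double. `EchoSttBackend` and `collect_text` are hypothetical names introduced for illustration only; the segment type is redeclared locally, with `None` defaults added, so the snippet is self-contained.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass(frozen=True)
class SttSegment:
    start_seconds: float
    end_seconds: float
    text: str
    language: str
    speaker: Optional[str] = None
    confidence: Optional[float] = None


class EchoSttBackend:
    """Hypothetical test double: ignores the audio and yields canned segments."""

    name = "echo"
    models = ["tiny"]
    languages_supported = ["de", "en"]
    supports_diarization = False

    async def transcribe(
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: Optional[str],
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        for i, text in enumerate(["Guten Morgen.", "Wie geht es dir?"]):
            await asyncio.sleep(0)  # hand control back to the loop between segments
            yield SttSegment(
                start_seconds=i * 2.0,
                end_seconds=i * 2.0 + 2.0,
                text=text,
                language=language or "de",
            )


async def collect_text(backend, audio: bytes) -> str:
    """Drain the segment stream and join the text parts."""
    parts = []
    async for seg in backend.transcribe(
        audio, model="tiny", language=None, diarize=False, translate_to_en=False
    ):
        parts.append(seg.text)
    return " ".join(parts)
```

A double like this could stand in for Whisper in unit tests that only exercise the service layer.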
### 3.2 `stt_service.py`
```python
from typing import AsyncIterator, Callable


class SttService:
    name = "stt"
    version = "1.0"

    def __init__(self, config: SpeechConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One stt.transcribe per (backend, model) combo."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_transcribe(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.11.
        1. Fetch audio blob by CID
        2. Verify duration ≤ STT_MAX_AUDIO_SECONDS
        3. Stream segments
        4. Emit done with total stats"""
```
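The four steps in the `handle_transcribe` docstring could be sketched roughly as follows. Everything here is an assumption made for illustration: the in-memory `BLOBS` dict stands in for the M07 blob store, `fake_backend` stands in for a real STT backend, `BadRequest` is a hypothetical exception, and the frame field names are not taken from CAP2.

```python
import asyncio
from typing import AsyncIterator

STT_MAX_AUDIO_SECONDS = 300  # value from section 5.2


class BadRequest(Exception):
    """Hypothetical exception that would map to the `bad_request` wire code."""


# Hypothetical blob store: CID -> (audio bytes, duration in seconds)
BLOBS = {
    "short": (b"\x00" * 16, 4.0),
    "long": (b"\x00" * 16, 301.0),
}


async def fake_backend(audio: bytes) -> AsyncIterator[str]:
    """Stand-in backend that yields two finalised segment texts."""
    for text in ("eins", "zwei"):
        yield text


async def transcribe_frames(cid: str, blobs: dict, backend) -> AsyncIterator[dict]:
    audio, duration = blobs[cid]            # 1. fetch audio blob by CID
    if duration > STT_MAX_AUDIO_SECONDS:    # 2. verify duration
        raise BadRequest(f"audio is {duration}s, limit is {STT_MAX_AUDIO_SECONDS}s")
    count = 0
    async for text in backend(audio):       # 3. stream segments
        count += 1
        yield {"event": "segment", "text": text}
    # 4. emit done with total stats
    yield {"event": "done", "segments": count, "duration_seconds": duration}


async def collect(cid: str) -> list:
    return [frame async for frame in transcribe_frames(cid, BLOBS, fake_backend)]
```

Because the function is an async generator, the duration check only runs once the caller starts iterating, so a `BadRequest` surfaces on the first `async for` step rather than at call time.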
### 3.3 Concrete STT backends
```python
from pathlib import Path


class WhisperBackend(SttBackend):
    """Local Whisper via openai-whisper or faster-whisper."""

    def __init__(self, models_dir: Path, default_model: str = "large-v3", device: str = "auto"):
        ...


class WhisperRemoteBackend(SttBackend):
    """HF Inference API. requires_internet=True. Used as fallback when local Whisper not available."""

    def __init__(self, model: str = "openai/whisper-large-v3", token_env: str = "HF_TOKEN"):
        ...
```
---
## 4. TTS: public API
### 4.1 `backends/base.py` (TTS)
```python
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class TtsResult:
    audio_format: str        # "ogg_vorbis" | "mp3" | "wav"
    sample_rate: int         # Hz
    duration_seconds: float
    total_bytes: int
    ms: int


class TtsBackend(Protocol):
    name: str
    voices: list[str]
    languages_supported: list[str]
    formats_supported: list[str]
    cloned_voices_supported: bool

    async def warm(self, voice: str) -> None: ...
    async def close(self) -> None: ...

    async def synthesize(
        self,
        text: str,
        *,
        voice: str,
        language: str,
        speed: float,          # 0.5..2.0; 1.0 default
        output_format: str,    # "ogg_vorbis" | "mp3" | "wav"
        chunk_size_bytes: int = 16384,
    ) -> AsyncIterator[bytes]:
        """Yields raw audio chunks."""

    def health(self) -> dict: ...
```
### 4.2 `tts_service.py`
```python
from typing import AsyncIterator, Callable


class TtsService:
    name = "tts"
    version = "1.0"

    def __init__(self, config: SpeechConfig):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One tts.synthesize per (backend, voice) pair (or backend-only if many voices)."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_synthesize(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.12.
        1. Validate text length ≤ TTS_MAX_TEXT_CHARS
        2. Pick backend and voice
        3. Stream chunks (base64 in 'chunk' frame)
        4. Emit done with metadata"""
```
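Step 3 of `handle_synthesize` sends audio as base64 inside `chunk` frames. The sketch below shows one way that framing and its client-side reassembly might look. The frame field names (`event`, `data`, `total_bytes`) and the helper names are assumptions, not taken from CAP2; the 16384-byte default matches `chunk_size_bytes` in the backend protocol.

```python
import base64
from typing import Iterable, Iterator


def chunk_frames(audio: bytes, chunk_size_bytes: int = 16384) -> Iterator[dict]:
    """Split raw audio into base64-encoded 'chunk' frames, then a 'done' frame."""
    for offset in range(0, len(audio), chunk_size_bytes):
        raw = audio[offset:offset + chunk_size_bytes]
        yield {"event": "chunk", "data": base64.b64encode(raw).decode("ascii")}
    yield {"event": "done", "total_bytes": len(audio)}


def reassemble(frames: Iterable[dict]) -> bytes:
    """Client side: decode and concatenate the chunk payloads in arrival order."""
    return b"".join(
        base64.b64decode(frame["data"]) for frame in frames if frame["event"] == "chunk"
    )
```

Base64 inflates each chunk by roughly a third, so a 16 KiB raw chunk travels as about 21.3 KiB of text.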
### 4.3 Concrete TTS backends
```python
from pathlib import Path


class XttsBackend(TtsBackend):
    """Coqui XTTS-v2 (Christof has the pipeline from his podcast generator).
    Supports voice cloning via reference audio."""

    def __init__(
        self,
        model: str = "tts_models/multilingual/multi-dataset/xtts_v2",
        voices_dir: Path = Path("~/.hearthnet/voices"),
        device: str = "auto",
    ):
        ...


class EdgeTtsBackend(TtsBackend):
    """Microsoft Edge-TTS: requires internet, many voices, very natural.
    Used as default when xtts is too slow on a node."""

    def __init__(self, default_voice: str = "de-DE-KatjaNeural"):
        ...
```
---
## 5. Behaviour
### 5.1 STT streaming
For long audio:
- Local Whisper produces segments incrementally (~real time on a 4090, slower on CPU)
- Service emits one SSE `segment` frame per finalised segment
- The final `done` frame includes the total duration and the language detected over the full audio
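For reference, a Server-Sent Events frame is an `event:` line, a `data:` line, and a blank line. A minimal sketch of serialising the `segment` and `done` frames that way follows; the helper name and the JSON payload fields are assumptions for illustration.

```python
import json


def sse_frame(event: str, payload: dict) -> str:
    """Serialise one SSE frame: event name, JSON data line, blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


# One frame per finalised segment, then a single closing frame.
stream = [
    sse_frame("segment", {"start_seconds": 0.0, "end_seconds": 2.0, "text": "hi"}),
    sse_frame("done", {"duration_seconds": 2.0, "language": "en"}),
]
```

`json.dumps` escapes newlines inside strings, so each payload stays on a single `data:` line.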
### 5.2 STT max length
`STT_MAX_AUDIO_SECONDS = 300`. For longer audio, the caller splits the input into chunks of at most 5 minutes and concatenates the results. Keeping speaker labels consistent across chunks is the caller's responsibility.
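Caller-side chunking comes down to computing time windows and shifting each chunk's segment timestamps by its window start. A minimal sketch, with hypothetical helper names and segments represented as plain dicts:

```python
STT_MAX_AUDIO_SECONDS = 300


def chunk_windows(total_seconds: float, window: float = STT_MAX_AUDIO_SECONDS):
    """Return (start, end) windows covering the audio, each at most `window` long."""
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        windows.append((start, end))
        start = end
    return windows


def merge_chunks(windows, per_chunk_segments):
    """Shift chunk-relative timestamps to absolute ones and concatenate in order."""
    merged = []
    for (offset, _end), segments in zip(windows, per_chunk_segments):
        for seg in segments:
            merged.append({
                **seg,
                "start_seconds": seg["start_seconds"] + offset,
                "end_seconds": seg["end_seconds"] + offset,
            })
    return merged
```

A 12-minute recording (720 s) yields three windows: 0 to 300, 300 to 600, and 600 to 720. Cutting at fixed offsets can split a word in half; cutting at silences instead would be a refinement this sketch does not attempt.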
### 5.3 Voice cloning (XTTS)
`XttsBackend` supports voice cloning when given a reference audio file:
```python
config.speech.cloned_voices = [
    ClonedVoiceConfig(name="hannes_v1", reference_path=Path("~/.hearthnet/voices/hannes-3s.wav")),
]
```
Each cloned voice is registered as a separate `voice` entry in the descriptor params. Cloning runs once at startup, so later requests for that voice are served without repeating it.
**Privacy note:** Voice cloning is powerful and risky. Communities SHOULD restrict by policy who can register cloned voices (suggested: `trust_required="anchor"` for voice cloning). The MVP allows any member to register one; this risk must be documented.
### 5.4 Audio format negotiation
- Input STT: any common format Whisper accepts (mp3, ogg, wav, m4a). Service normalises via `ffmpeg`.
- Output TTS: `ogg_vorbis` default (smallest), `mp3` widely-compatible, `wav` lossless.
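The input normalisation step could look roughly like the command builder below. The target of 16 kHz mono 16-bit PCM WAV is an assumption (it matches the sample rate Whisper resamples to internally), and the function name is hypothetical. File paths are used rather than pipes because some containers, such as m4a, generally need seekable input.

```python
from pathlib import Path


def ffmpeg_normalise_cmd(src: Path, dst: Path, sample_rate: int = 16000) -> list:
    """Build an ffmpeg command that converts `src` to mono PCM WAV at `sample_rate`."""
    return [
        "ffmpeg",
        "-nostdin",            # never wait on stdin inside a service process
        "-y",                  # overwrite dst if it already exists
        "-i", str(src),        # input format is auto-detected (mp3, ogg, wav, m4a)
        "-ac", "1",            # downmix to mono
        "-ar", str(sample_rate),
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; a non-zero exit status is a natural place to raise the "audio decode failed" error.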
### 5.5 Edge-TTS internet dependency
`EdgeTtsBackend` requires internet. [M09](../../modules/M09-emergency.md) deregisters it automatically when the node is offline; the local XTTS backend continues to work.
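The effect of that deregistration can be pictured as a filter over backends by their internet requirement. This is only an illustration: `BackendInfo` and `active_backends` are hypothetical, and the actual mechanism lives in M09.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendInfo:
    name: str
    requires_internet: bool


def active_backends(backends, online: bool):
    """Keep every backend while online; drop internet-bound ones while offline."""
    return [b for b in backends if online or not b.requires_internet]


TTS_BACKENDS = [
    BackendInfo(name="xtts", requires_internet=False),
    BackendInfo(name="edge_tts", requires_internet=True),
]
```

With the two backends above, going offline leaves only `xtts` registered, and coming back online restores both.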
### 5.6 STT → TTS chain (voice assistant pattern)
The voice query button in M08 UI ext drives this chain:
```
mic → audio blob via M07 → stt.transcribe → text
text → llm.chat → response text
response text → tts.synthesize → audio chunks → speaker
```
This is composed at the UI layer, not internally in the speech services.
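A UI-layer composition of the three capabilities might be sketched as below. The three `fake_*` coroutines are stand-ins for bus calls to `stt.transcribe`, `llm.chat`, and `tts.synthesize`; their signatures are assumptions chosen for illustration.

```python
import asyncio
from typing import AsyncIterator


async def fake_transcribe(audio_cid: str) -> AsyncIterator[str]:
    """Stand-in for stt.transcribe: yields segment texts."""
    for text in ("wie wird", "das Wetter?"):
        yield text


async def fake_chat(prompt: str) -> str:
    """Stand-in for llm.chat."""
    return f"Antwort auf: {prompt}"


async def fake_synthesize(text: str) -> AsyncIterator[bytes]:
    """Stand-in for tts.synthesize: yields the text bytes in 8-byte 'audio' chunks."""
    data = text.encode("utf-8")
    for i in range(0, len(data), 8):
        yield data[i:i + 8]


async def voice_query(audio_cid: str, transcribe, chat, synthesize) -> bytes:
    """Chain STT -> LLM -> TTS; the speech services never call each other."""
    question = " ".join([t async for t in transcribe(audio_cid)])
    answer = await chat(question)
    audio = bytearray()
    async for chunk in synthesize(answer):
        audio.extend(chunk)  # a real UI would feed each chunk to the speaker here
    return bytes(audio)
```

Passing the three callables in as parameters keeps the chain testable without a running bus.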
### 5.7 Christof's existing pipeline reuse
Christof has an established XTTS-v2 + Edge-TTS podcast generator pipeline. The `XttsBackend` and `EdgeTtsBackend` are designed to be drop-ins for that pipeline, sharing the same models directory.
---
## 6. Errors
| Condition | Wire code |
|-----------|-----------|
| Audio > STT_MAX_AUDIO_SECONDS | `bad_request` |
| Text > TTS_MAX_TEXT_CHARS | `bad_request` |
| Unknown voice | `not_found` |
| Audio decode failed (corrupt blob) | `bad_request` |
| Backend GPU OOM | `capacity_exceeded` |
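
One way a service might apply this table is a lookup from exception type to wire code. The exception classes, the `wire_code` helper, and the `internal_error` fallback are all hypothetical; only the wire codes in the mapping come from the table.

```python
class AudioTooLong(Exception): ...
class TextTooLong(Exception): ...
class UnknownVoice(Exception): ...
class AudioDecodeFailed(Exception): ...
class BackendOutOfMemory(Exception): ...


WIRE_CODES = {
    AudioTooLong: "bad_request",
    TextTooLong: "bad_request",
    UnknownVoice: "not_found",
    AudioDecodeFailed: "bad_request",
    BackendOutOfMemory: "capacity_exceeded",
}


def wire_code(exc: Exception) -> str:
    """Map a service exception to its wire code via the table above."""
    for exc_type, code in WIRE_CODES.items():
        if isinstance(exc, exc_type):
            return code
    return "internal_error"  # hypothetical fallback, not defined in this spec
```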
---
## 7. Configuration
```python
config.speech.enabled = True
config.speech.stt_backends = [
    SttBackendConfig(name="whisper", default_model="large-v3", device="auto"),
]
config.speech.tts_backends = [
    TtsBackendConfig(name="xtts", voices_dir=Path("~/.hearthnet/voices")),
    TtsBackendConfig(name="edge_tts", default_voice="de-DE-KatjaNeural"),
]
config.speech.cloned_voices = []  # list[ClonedVoiceConfig]
```
Constants: `STT_MAX_AUDIO_SECONDS`, `TTS_MAX_TEXT_CHARS`.
---
## 8. Tests
### Unit
- `test_stt_descriptor_per_model`
- `test_tts_descriptor_per_voice`
- `test_stt_max_duration_rejected`
- `test_tts_max_length_rejected`
### Integration
- `test_whisper_transcribes_de_audio` (test asset)
- `test_xtts_synthesises_then_decodes_to_correct_duration`
- `test_voice_chain_stt_llm_tts` (end-to-end)
- `test_edge_tts_deregistered_when_offline`
---
## 9. Cross-references
| What | Where |
|------|-------|
| `stt.transcribe@1.0` wire | [CAP2 §4.11](../CAPABILITY_CONTRACT_v2.md) |
| `tts.synthesize@1.0` wire | [CAP2 §4.12](../CAPABILITY_CONTRACT_v2.md) |
| Voice query UI | M08 ext |
| Mobile voice notes | [M22 §4](M22-mobile-native.md) |
| Translation chain | [M18](M18-translation.md) |
| Emergency dereg for internet-bound backends | [M09 §5.2](../../modules/M09-emergency.md) |
---
## 10. Open questions
1. **Streaming STT (mic input → live caption):** Phase 2.5. Requires WebSocket and a different backend init pattern.
2. **Real-time TTS (sub-100 ms first audio):** XTTS takes 500 ms or more; piper-tts is fast but has limited voices. Phase 3.
3. **Speaker enrollment:** an explicit "this is who I am" speech sample so diarization can label speakers by name. Phase 2.5.
4. **Audio at-rest privacy:** should voice notes be E2E? [M23](M23-e2e-encryption.md) supports it; default ON for chat attachments.