# M19: Speech I/O (STT + TTS)

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs, for audio I/O), X04 (config), X03 (observability), `openai-whisper`, `TTS` (Coqui XTTS-v2), `edge-tts` libs
**Depended on by:** M08 UI (voice query button), M22 mobile (voice notes), M18 (STT can chain into translation)

---

## 1. Responsibility

Two capabilities:

- `stt.transcribe@1.0`: audio → text, with optional translate-to-English
- `tts.synthesize@1.0`: text → audio

Both services live in the same module because they share the speech domain and are often paired (voice query → STT → LLM → TTS).

---

## 2. File layout

```
hearthnet/services/speech/
├── __init__.py
├── stt_service.py
├── tts_service.py
└── backends/
    ├── __init__.py
    ├── base.py            # SttBackend, TtsBackend protocols
    ├── whisper.py         # OpenAI Whisper local
    ├── whisper_remote.py  # HF inference API alternative
    ├── xtts.py            # Coqui XTTS-v2 (cloned voices)
    └── edge_tts.py        # Microsoft Edge-TTS (Christof has existing pipeline)
```

---

## 3. STT: public API

### 3.1 `backends/base.py` (STT)

```python
from dataclasses import dataclass
from typing import AsyncIterator, Protocol

@dataclass(frozen=True)
class SttSegment:
    start_seconds:  float
    end_seconds:    float
    text:           str
    language:       str
    speaker:        str | None     # only if diarization enabled
    confidence:     float | None

@dataclass(frozen=True)
class SttResult:
    segments:        list[SttSegment]
    language:        str
    duration_seconds: float
    ms:              int

class SttBackend(Protocol):
    name:        str
    models:      list[str]            # "tiny" | "base" | "small" | "medium" | "large-v3"
    languages_supported: list[str]    # ISO 639-1
    supports_diarization: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def transcribe(
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: str | None,         # "auto" handled by caller
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        """Yields segments as they are produced. Backend may produce in big chunks
        or near-realtime depending on model + hardware."""

    def health(self) -> dict: ...
```

### 3.2 `stt_service.py`

```python
class SttService:
    name    = "stt"
    version = "1.0"

    def __init__(self, config: SpeechConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One stt.transcribe per (backend, model) combo."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_transcribe(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.11.
        1. Fetch audio blob by CID
        2. Verify duration ≤ STT_MAX_AUDIO_SECONDS
        3. Stream segments
        4. Emit done with total stats"""
```
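
The four steps in the `handle_transcribe` docstring could be sketched as below. This is an illustration under assumptions, not the CAP2 wire format: the frame keys (`type`, `text`, `segments`, `ms`), the `BadRequest` exception, and the injected callables `fetch_blob`, `probe_duration` and `transcribe` are all hypothetical stand-ins for the blob store, the duration probe and the backend.

```python
import asyncio
import time
from typing import AsyncIterator, Awaitable, Callable

STT_MAX_AUDIO_SECONDS = 300


class BadRequest(Exception):
    """Hypothetical exception that the router would map to `bad_request`."""


async def transcribe_frames(
    cid: str,
    fetch_blob: Callable[[str], Awaitable[bytes]],
    probe_duration: Callable[[bytes], float],
    transcribe: Callable[[bytes], AsyncIterator[str]],
) -> AsyncIterator[dict]:
    # 1. Fetch audio blob by CID.
    audio = await fetch_blob(cid)

    # 2. Verify the duration before spending any model time on it.
    duration = probe_duration(audio)
    if duration > STT_MAX_AUDIO_SECONDS:
        raise BadRequest(f"audio is {duration:.0f}s, limit is {STT_MAX_AUDIO_SECONDS}s")

    # 3. Stream one frame per finalised segment.
    started = time.monotonic()
    count = 0
    async for text in transcribe(audio):
        count += 1
        yield {"type": "segment", "text": text}

    # 4. Emit a final frame with total stats.
    elapsed_ms = int((time.monotonic() - started) * 1000)
    yield {"type": "done", "segments": count,
           "duration_seconds": duration, "ms": elapsed_ms}
```

Because this is an async generator, the length check runs when the caller starts iterating, so an over-long blob fails before the first `segment` frame is produced.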

### 3.3 Concrete STT backends

```python
class WhisperBackend(SttBackend):
    """Local Whisper via openai-whisper or faster-whisper."""

    def __init__(self, models_dir: Path, default_model: str = "large-v3", device: str = "auto"):
        ...

class WhisperRemoteBackend(SttBackend):
    """HF Inference API. requires_internet=True. Used as fallback when local Whisper not available."""

    def __init__(self, model: str = "openai/whisper-large-v3", token_env: str = "HF_TOKEN"):
        ...
```

---

## 4. TTS: public API

### 4.1 `backends/base.py` (TTS)

```python
@dataclass(frozen=True)
class TtsResult:
    audio_format:    str         # "ogg_vorbis" | "mp3" | "wav"
    sample_rate:     int         # Hz
    duration_seconds: float
    total_bytes:     int
    ms:              int

class TtsBackend(Protocol):
    name:        str
    voices:      list[str]
    languages_supported: list[str]
    formats_supported: list[str]
    cloned_voices_supported: bool

    async def warm(self, voice: str) -> None: ...
    async def close(self) -> None: ...

    async def synthesize(
        self,
        text: str,
        *,
        voice: str,
        language: str,
        speed: float,                # 0.5..2.0; 1.0 default
        output_format: str,          # "ogg_vorbis"|"mp3"|"wav"
        chunk_size_bytes: int = 16384,
    ) -> AsyncIterator[bytes]:
        """Yields raw audio chunks."""

    def health(self) -> dict: ...
```

### 4.2 `tts_service.py`

```python
class TtsService:
    name    = "tts"
    version = "1.0"

    def __init__(self, config: SpeechConfig):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One tts.synthesize per (backend, voice) pair (or backend-only if many voices)."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_synthesize(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.12.
        1. Validate text length ≤ TTS_MAX_TEXT_CHARS
        2. Pick backend and voice
        3. Stream chunks (base64 in 'chunk' frame)
        4. Emit done with metadata"""
```
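
Step 3 of `handle_synthesize` sends binary audio inside text frames, which is why the chunks are base64-encoded. A minimal sketch of that framing and of the matching client-side reassembly follows. The frame keys, the helper names and the value `TTS_MAX_TEXT_CHARS = 5000` are assumptions for illustration; the spec names the constant but does not give its value here.

```python
import asyncio
import base64
from typing import AsyncIterator, Iterable

TTS_MAX_TEXT_CHARS = 5000  # assumed value, for illustration only


async def synthesize_frames(
    text: str,
    chunks: AsyncIterator[bytes],
    audio_format: str = "ogg_vorbis",
) -> AsyncIterator[dict]:
    # 1. Validate text length before streaming anything.
    if len(text) > TTS_MAX_TEXT_CHARS:
        raise ValueError("text exceeds TTS_MAX_TEXT_CHARS")

    # 3. Wrap each raw audio chunk as base64 so it survives a text transport.
    total = 0
    async for chunk in chunks:
        total += len(chunk)
        yield {"type": "chunk", "data": base64.b64encode(chunk).decode("ascii")}

    # 4. Final frame with metadata about what was sent.
    yield {"type": "done", "audio_format": audio_format, "total_bytes": total}


def reassemble_audio(frames: Iterable[dict]) -> bytes:
    """Client side: decode and concatenate the `chunk` frames."""
    return b"".join(
        base64.b64decode(frame["data"]) for frame in frames if frame["type"] == "chunk"
    )
```

Base64 inflates the payload by roughly a third, which is one reason a compact default output format matters for this path.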

### 4.3 Concrete TTS backends

```python
class XttsBackend(TtsBackend):
    """Coqui XTTS-v2 (Christof has the pipeline from his podcast generator).
       Supports voice cloning via reference audio."""

    def __init__(
        self,
        model: str = "tts_models/multilingual/multi-dataset/xtts_v2",
        voices_dir: Path = Path("~/.hearthnet/voices"),
        device: str = "auto",
    ):
        ...

class EdgeTtsBackend(TtsBackend):
    """Microsoft Edge-TTS: requires internet, many voices, very natural.
       Used as default when xtts is too slow on a node."""

    def __init__(self, default_voice: str = "de-DE-KatjaNeural"):
        ...
```

---

## 5. Behaviour

### 5.1 STT streaming

For long audio:
- Local Whisper produces segments incrementally (roughly real time on an RTX 4090, slower on CPU)
- The service emits one SSE `segment` frame per finalised segment
- The final `done` frame includes the total duration and the detected language

### 5.2 STT max length

`STT_MAX_AUDIO_SECONDS = 300`. For longer audio, the caller splits the input into chunks of at most 5 minutes, transcribes each chunk, and concatenates the results. Keeping speaker labels consistent across chunks is the caller's responsibility.
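
A caller-side sketch of that chunking: plan windows of at most `STT_MAX_AUDIO_SECONDS`, then shift each chunk's segment timestamps by the window's start so the merged transcript is on the original timeline. `plan_chunks` and `merge_results` are hypothetical helper names, and segments are simplified to `(start, end, text)` tuples.

```python
STT_MAX_AUDIO_SECONDS = 300


def plan_chunks(
    total_seconds: float, max_seconds: float = STT_MAX_AUDIO_SECONDS
) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into consecutive windows of at most max_seconds."""
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows


def merge_results(
    windows: list[tuple[float, float]],
    per_chunk_segments: list[list[tuple[float, float, str]]],
) -> list[tuple[float, float, str]]:
    """Re-base chunk-relative timestamps onto the original timeline."""
    merged = []
    for (offset, _end), segments in zip(windows, per_chunk_segments):
        for start, end, text in segments:
            merged.append((start + offset, end + offset, text))
    return merged


# A 12-minute recording needs three requests:
# plan_chunks(720) -> [(0.0, 300.0), (300.0, 600.0), (600.0, 720.0)]
```

Hard cuts at fixed offsets can split a word in two; a caller that cares may prefer to cut at silences, which this sketch does not attempt.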

### 5.3 Voice cloning (XTTS)

`XttsBackend` supports voice cloning when given a reference audio file:

```python
config.speech.cloned_voices = [
    ClonedVoiceConfig(name="hannes_v1", reference_path=Path("~/.hearthnet/voices/hannes-3s.wav"))
]
```

Each cloned voice is registered as a separate `voice` entry in the descriptor params. Cloning happens once at startup, so later synthesis requests are served quickly.

**Privacy note:** Voice cloning is powerful and risky. Communities SHOULD restrict by policy who can register cloned voices (suggested: `trust_required="anchor"` for voice cloning). The MVP allows any member to register one, so the risk must be documented.

### 5.4 Audio format negotiation

- Input STT: any common format Whisper accepts (mp3, ogg, wav, m4a). Service normalises via `ffmpeg`.
- Output TTS: `ogg_vorbis` default (smallest), `mp3` widely-compatible, `wav` lossless.
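
One way the `ffmpeg` normalisation could look is sketched below: decode whatever container arrives into 16 kHz mono 16-bit PCM, the sample rate Whisper models operate on. The helper names are hypothetical, and reading from a file path (rather than a pipe) is an assumption made because some containers, such as m4a, need seekable input; the blob would be spooled to a temporary file first.

```python
import subprocess
from pathlib import Path

WHISPER_SAMPLE_RATE = 16_000  # Hz
BYTES_PER_SAMPLE = 2          # signed 16-bit PCM


def ffmpeg_normalise_cmd(src: Path) -> list[str]:
    """Build the ffmpeg argv that decodes `src` to raw mono PCM on stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", str(src),
        "-f", "s16le", "-acodec", "pcm_s16le",   # raw little-endian 16-bit samples
        "-ac", "1",                              # downmix to mono
        "-ar", str(WHISPER_SAMPLE_RATE),         # resample to 16 kHz
        "pipe:1",
    ]


def normalise_audio(src: Path) -> bytes:
    """Run ffmpeg; raises CalledProcessError if the input cannot be decoded."""
    done = subprocess.run(ffmpeg_normalise_cmd(src), capture_output=True, check=True)
    return done.stdout


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of normalised mono PCM, usable for the max-length check."""
    return len(pcm) / (BYTES_PER_SAMPLE * WHISPER_SAMPLE_RATE)
```

A decode failure here is the natural place to raise the "audio decode failed" error, and the PCM length gives the duration without a second probe.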

### 5.5 Edge-TTS internet dependency

`EdgeTtsBackend` requires internet. It is deregistered automatically by [M09](../../modules/M09-emergency.md) when the node is offline. The local XTTS backend continues to work.
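
The effect of that deregistration can be illustrated with a small filter over backend metadata. This is only a sketch: the actual mechanism lives in M09, and `BackendInfo` and `available_backends` are hypothetical names. The `requires_internet` flag mirrors the one mentioned for `WhisperRemoteBackend`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendInfo:
    name: str
    requires_internet: bool


def available_backends(backends: list[BackendInfo], online: bool) -> list[BackendInfo]:
    """Keep every backend while online; keep only local ones while offline."""
    return [b for b in backends if online or not b.requires_internet]


TTS_BACKENDS = [
    BackendInfo("xtts", requires_internet=False),
    BackendInfo("edge_tts", requires_internet=True),
]
```

Order is preserved, so a preference list such as "XTTS first, Edge-TTS second" still holds after filtering.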

### 5.6 STT → TTS chain (voice assistant pattern)

The voice query button in M08 UI ext:
```
mic → audio blob via M07 → stt.transcribe → text
text → llm.chat → response text
response text → tts.synthesize → audio chunks → speaker
```

This is composed at the UI layer, not internally in the speech services.
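
A sketch of what that UI-layer composition might look like, with the three capabilities injected as plain async callables. The function name `voice_query` and the callable signatures are assumptions; real calls would go through the bus (M03) rather than direct function references.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def voice_query(
    audio_cid: str,
    transcribe: Callable[[str], AsyncIterator[str]],   # stands in for stt.transcribe
    chat: Callable[[str], Awaitable[str]],             # stands in for llm.chat
    synthesize: Callable[[str], AsyncIterator[bytes]], # stands in for tts.synthesize
) -> AsyncIterator[bytes]:
    # STT: the full question is needed before the LLM can answer,
    # so the segment stream is drained first.
    question = " ".join([segment async for segment in transcribe(audio_cid)])

    # LLM: one request, one response text.
    answer = await chat(question)

    # TTS: forward audio chunks as they arrive so playback can start early.
    async for chunk in synthesize(answer):
        yield chunk
```

Keeping the chain outside the speech services means either half can be swapped (for example, routing the transcript through M18 translation) without changing `SttService` or `TtsService`.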

### 5.7 Christof's existing pipeline reuse

Christof has an established XTTS-v2 + Edge-TTS podcast generator pipeline. The `XttsBackend` and `EdgeTtsBackend` are designed to be drop-ins for that pipeline, sharing the same models directory.

---

## 6. Errors

| Condition | Wire code |
|-----------|-----------|
| Audio > STT_MAX_AUDIO_SECONDS | `bad_request` |
| Text > TTS_MAX_TEXT_CHARS | `bad_request` |
| Unknown voice | `not_found` |
| Audio decode failed (corrupt blob) | `bad_request` |
| Backend GPU OOM | `capacity_exceeded` |

---

## 7. Configuration

```python
config.speech.enabled              = True
config.speech.stt_backends = [
    SttBackendConfig(name="whisper", default_model="large-v3", device="auto"),
]
config.speech.tts_backends = [
    TtsBackendConfig(name="xtts", voices_dir=Path("~/.hearthnet/voices")),
    TtsBackendConfig(name="edge_tts", default_voice="de-DE-KatjaNeural"),
]
config.speech.cloned_voices = []   # list[ClonedVoiceConfig]
```

Constants: `STT_MAX_AUDIO_SECONDS`, `TTS_MAX_TEXT_CHARS`.

---

## 8. Tests

### Unit
- `test_stt_descriptor_per_model`
- `test_tts_descriptor_per_voice`
- `test_stt_max_duration_rejected`
- `test_tts_max_length_rejected`

### Integration
- `test_whisper_transcribes_de_audio` (test asset)
- `test_xtts_synthesises_then_decodes_to_correct_duration`
- `test_voice_chain_stt_llm_tts` (end-to-end)
- `test_edge_tts_deregistered_when_offline`

---

## 9. Cross-references

| What | Where |
|------|-------|
| `stt.transcribe@1.0` wire | [CAP2 §4.11](../CAPABILITY_CONTRACT_v2.md) |
| `tts.synthesize@1.0` wire | [CAP2 §4.12](../CAPABILITY_CONTRACT_v2.md) |
| Voice query UI | M08 ext |
| Mobile voice notes | [M22 §4](M22-mobile-native.md) |
| Translation chain | [M18](M18-translation.md) |
| Emergency dereg for internet-bound backends | [M09 §5.2](../../modules/M09-emergency.md) |

---

## 10. Open questions

1. **Streaming STT (mic input → live caption):** Phase 2.5. Requires WebSocket and a different backend init pattern.
2. **Real-time TTS (sub-100ms first audio):** XTTS is 500ms+; piper-tts is fast but limited voices. Phase 3.
3. **Speaker enrollment:** explicit "this is who I am" speech sample so diarization can label by name. Phase 2.5.
4. **Audio at-rest privacy:** should voice notes be E2E? [M23](M23-e2e-encryption.md) supports it; default ON for chat attachments.