
M19: Speech I/O (STT + TTS)

Spec version: v1.0 (Phase 2)
Depends on: M03 (bus), M07 (blobs, for audio I/O), X04 (config), X03 (observability), openai-whisper, TTS (Coqui XTTS-v2), edge-tts libs
Depended on by: M08 UI (voice query button), M22 mobile (voice notes), M18 (STT can chain into translation)


1. Responsibility

Two capabilities:

  • stt.transcribe@1.0: audio → text, with optional translate-to-English
  • tts.synthesize@1.0: text → audio

The two services live in the same module because they share the speech domain and are often paired (voice query → STT → LLM → TTS).


2. File layout

hearthnet/services/speech/
├── __init__.py
├── stt_service.py
├── tts_service.py
└── backends/
    ├── __init__.py
    ├── base.py            # SttBackend, TtsBackend protocols
    ├── whisper.py         # OpenAI Whisper local
    ├── whisper_remote.py  # HF inference API alternative
    ├── xtts.py            # Coqui XTTS-v2 (cloned voices)
    └── edge_tts.py        # Microsoft Edge-TTS (Christof has existing pipeline)

3. STT β€” public API

3.1 backends/base.py (STT)

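from collections.abc import AsyncIterator
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)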
class SttSegment:
    start_seconds:  float
    end_seconds:    float
    text:           str
    language:       str
    speaker:        str | None     # only if diarization enabled
    confidence:     float | None

@dataclass(frozen=True)
class SttResult:
    segments:        list[SttSegment]
    language:        str
    duration_seconds: float
    ms:              int

class SttBackend(Protocol):
    name:        str
    models:      list[str]            # "tiny" | "base" | "small" | "medium" | "large-v3"
    languages_supported: list[str]    # ISO 639-1
    supports_diarization: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

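    # Declared with plain `def`: implementations are async generators
    # (`async def` with `yield`), which a type checker only accepts here
    # if the Protocol method itself is not a coroutine.
    def transcribe(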
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: str | None,         # "auto" handled by caller
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        """Yields segments as they are produced. Backend may produce in big chunks
        or near-realtime depending on model + hardware."""

    def health(self) -> dict: ...
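
To illustrate the streaming contract of transcribe, here is a minimal sketch of a backend that satisfies the signature, plus a consumer that drains it. FakeSttBackend, collect, and the "4 bytes equal one second" rule are hypothetical test scaffolding and not part of the spec; SttSegment mirrors the dataclass from 3.1.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import AsyncIterator, List, Optional, Tuple


@dataclass(frozen=True)
class SttSegment:
    """Mirrors the SttSegment dataclass defined in 3.1."""
    start_seconds: float
    end_seconds: float
    text: str
    language: str
    speaker: Optional[str] = None
    confidence: Optional[float] = None


class FakeSttBackend:
    """Hypothetical in-memory backend, intended as a test double."""

    name = "fake"
    models = ["tiny"]
    languages_supported = ["de", "en"]
    supports_diarization = False

    async def transcribe(
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: Optional[str],
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        # Demo-only assumption: every 4 input bytes represent one second of speech.
        for i in range(len(audio_bytes) // 4):
            await asyncio.sleep(0)  # hand control back to the event loop between segments
            yield SttSegment(float(i), float(i + 1), f"word{i}", language or "en")


async def collect(backend, audio: bytes) -> Tuple[List[SttSegment], int]:
    """Drain the segment stream and measure elapsed wall-clock milliseconds."""
    started = time.monotonic()
    segments = [
        seg
        async for seg in backend.transcribe(
            audio, model="tiny", language="de", diarize=False, translate_to_en=False
        )
    ]
    return segments, int((time.monotonic() - started) * 1000)
```

A caller would run `asyncio.run(collect(FakeSttBackend(), audio))` and build an SttResult from the returned segments and timing.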

3.2 stt_service.py

class SttService:
    name    = "stt"
    version = "1.0"

    def __init__(self, config: SpeechConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One stt.transcribe per (backend, model) combo."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_transcribe(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.11.
        1. Fetch audio blob by CID
        2. Verify duration ≤ STT_MAX_AUDIO_SECONDS
        3. Stream segments
        4. Emit done with total stats"""
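
The four steps in the docstring can be sketched as a single async generator. This is an illustration under assumptions, not the module's implementation: the frame field names, the WireError class, and the injected fetch_blob, probe_duration, and transcribe callables are all hypothetical stand-ins for M07, the ffmpeg probe, and the selected backend. Only the "segment" and "done" frame names, the 300-second limit, and the bad_request code come from this spec.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

STT_MAX_AUDIO_SECONDS = 300  # value stated in 5.2


class WireError(Exception):
    """Hypothetical exception that carries a wire code from the errors table."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


async def handle_transcribe(
    cid: str,
    *,
    fetch_blob: Callable[[str], Awaitable[bytes]],
    probe_duration: Callable[[bytes], float],
    transcribe: Callable[[bytes], AsyncIterator[dict]],
) -> AsyncIterator[dict]:
    # 1. Fetch the audio blob by CID.
    audio = await fetch_blob(cid)

    # 2. Reject audio that exceeds the configured maximum before doing any work.
    duration = probe_duration(audio)
    if duration > STT_MAX_AUDIO_SECONDS:
        raise WireError(
            "bad_request",
            f"audio is {duration:.0f}s, limit is {STT_MAX_AUDIO_SECONDS}s",
        )

    # 3. Forward each finalised segment as its own frame.
    count = 0
    async for segment in transcribe(audio):
        count += 1
        yield {"type": "segment", **segment}

    # 4. Close the stream with summary statistics.
    yield {"type": "done", "duration_seconds": duration, "segments": count}
```

Because the handler is an async generator, the duration check runs when the first frame is requested, so an oversized blob fails before any segment frame is sent.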

3.3 Concrete STT backends

class WhisperBackend(SttBackend):
    """Local Whisper via openai-whisper or faster-whisper."""

    def __init__(self, models_dir: Path, default_model: str = "large-v3", device: str = "auto"):
        ...

class WhisperRemoteBackend(SttBackend):
    """HF Inference API. requires_internet=True. Used as a fallback when local Whisper is not available."""

    def __init__(self, model: str = "openai/whisper-large-v3", token_env: str = "HF_TOKEN"):
        ...

4. TTS β€” public API

4.1 backends/base.py (TTS)

@dataclass(frozen=True)
class TtsResult:
    audio_format:    str         # "ogg_vorbis" | "mp3" | "wav"
    sample_rate:     int         # Hz
    duration_seconds: float
    total_bytes:     int
    ms:              int

class TtsBackend(Protocol):
    name:        str
    voices:      list[str]
    languages_supported: list[str]
    formats_supported: list[str]
    cloned_voices_supported: bool

    async def warm(self, voice: str) -> None: ...
    async def close(self) -> None: ...

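    # Declared with plain `def`: implementations are async generators
    # (`async def` with `yield`), which a type checker only accepts here
    # if the Protocol method itself is not a coroutine.
    def synthesize(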
        self,
        text: str,
        *,
        voice: str,
        language: str,
        speed: float,                # 0.5..2.0; 1.0 default
        output_format: str,          # "ogg_vorbis"|"mp3"|"wav"
        chunk_size_bytes: int = 16384,
    ) -> AsyncIterator[bytes]:
        """Yields raw audio chunks."""

    def health(self) -> dict: ...
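
The chunk_size_bytes parameter implies that a backend re-slices whatever its engine produces into chunks of a bounded size. A minimal sketch of that step follows; rechunk is a hypothetical helper name, and treating chunk_size_bytes as an exact size for every chunk except the last is an assumption, since the spec only names the parameter.

```python
import asyncio
from typing import AsyncIterator


async def rechunk(
    source: AsyncIterator[bytes], chunk_size_bytes: int = 16384
) -> AsyncIterator[bytes]:
    """Re-slice an audio byte stream into fixed-size chunks (the last may be shorter)."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk_size_bytes must be positive")
    buffer = bytearray()
    async for piece in source:
        buffer.extend(piece)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_size_bytes:
            yield bytes(buffer[:chunk_size_bytes])
            del buffer[:chunk_size_bytes]
    if buffer:
        yield bytes(buffer)  # flush the remainder
```

A backend's synthesize could wrap its engine output as `async for chunk in rechunk(engine_stream, chunk_size_bytes): yield chunk`.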

4.2 tts_service.py

class TtsService:
    name    = "tts"
    version = "1.0"

    def __init__(self, config: SpeechConfig):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One tts.synthesize per (backend, voice) pair (or backend-only if many voices)."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_synthesize(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.12.
        1. Validate text length ≤ TTS_MAX_TEXT_CHARS
        2. Pick backend and voice
        3. Stream chunks (base64 in 'chunk' frame)
        4. Emit done with metadata"""
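
A rough sketch of steps 1, 3, and 4 follows, with step 2 collapsed into an injected synthesize callable that stands in for the already-selected backend and voice. The "data" field name, the done-frame fields, and decode_frames are assumptions for illustration; the spec fixes only the 'chunk' and done frame names and the base64 encoding. The value of TTS_MAX_TEXT_CHARS is not stated in this document, so the limit is passed in as max_text_chars.

```python
import asyncio
import base64
from typing import AsyncIterator, Callable, List


async def handle_synthesize(
    text: str,
    *,
    synthesize: Callable[[str], AsyncIterator[bytes]],
    max_text_chars: int,
    audio_format: str = "ogg_vorbis",
) -> AsyncIterator[dict]:
    # 1. Validate the text length (maps to bad_request on the wire).
    if len(text) > max_text_chars:
        raise ValueError("bad_request: text exceeds max_text_chars")

    # 3. Stream raw audio chunks, base64-encoded so they fit in a JSON frame.
    total_bytes = 0
    async for chunk in synthesize(text):
        total_bytes += len(chunk)
        yield {"type": "chunk", "data": base64.b64encode(chunk).decode("ascii")}

    # 4. Finish with metadata about the whole stream.
    yield {"type": "done", "audio_format": audio_format, "total_bytes": total_bytes}


def decode_frames(frames: List[dict]) -> bytes:
    """Client-side counterpart: reassemble the audio from the chunk frames."""
    return b"".join(
        base64.b64decode(frame["data"]) for frame in frames if frame["type"] == "chunk"
    )
```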

4.3 Concrete TTS backends

class XttsBackend(TtsBackend):
    """Coqui XTTS-v2 (Christof has the pipeline from his podcast generator).
       Supports voice cloning via reference audio."""

    def __init__(
        self,
        model: str = "tts_models/multilingual/multi-dataset/xtts_v2",
        voices_dir: Path = Path("~/.hearthnet/voices"),
        device: str = "auto",
    ):
        ...

class EdgeTtsBackend(TtsBackend):
    """Microsoft Edge-TTS: requires internet, many voices, very natural.
       Used as default when xtts is too slow on a node."""

    def __init__(self, default_voice: str = "de-DE-KatjaNeural"):
        ...

5. Behaviour

5.1 STT streaming

For long audio:

  • Local Whisper produces segments incrementally (roughly real time on a 4090, slower on CPU)
  • The service emits one SSE segment frame per finalised segment
  • The final done frame includes the total duration and the language detected over the full audio

5.2 STT max length

STT_MAX_AUDIO_SECONDS = 300. For longer audio, the caller splits the input into chunks of at most 5 minutes and concatenates the results. Managing speaker continuity across chunks is the caller's responsibility.
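
On the caller side, the chunking rule could look like the sketch below. chunk_windows and merge_chunk_segments are hypothetical helper names, and segments are shown as plain dicts with the start_seconds and end_seconds fields from SttSegment. The sketch cuts at fixed offsets and does not try to avoid splitting mid-word, which a real caller may want to handle.

```python
from typing import Dict, List, Tuple

STT_MAX_AUDIO_SECONDS = 300.0


def chunk_windows(
    total_seconds: float, max_seconds: float = STT_MAX_AUDIO_SECONDS
) -> List[Tuple[float, float]]:
    """Return (start, end) windows that cover the audio in chunks of at most max_seconds."""
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows


def merge_chunk_segments(
    windows: List[Tuple[float, float]], per_chunk_segments: List[List[Dict]]
) -> List[Dict]:
    """Concatenate per-chunk results, shifting timestamps by each chunk's start offset."""
    merged = []
    for (offset, _end), segments in zip(windows, per_chunk_segments):
        for seg in segments:
            merged.append(
                {
                    **seg,
                    "start_seconds": seg["start_seconds"] + offset,
                    "end_seconds": seg["end_seconds"] + offset,
                }
            )
    return merged
```

Each window becomes one stt.transcribe request; the per-chunk timestamps are relative to the chunk, so they are shifted back onto the original timeline when merging.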

5.3 Voice cloning (XTTS)

XttsBackend supports voice cloning when given a reference audio file:

config.speech.cloned_voices = [
    ClonedVoiceConfig(name="hannes_v1", reference_path=Path("~/.hearthnet/voices/hannes-3s.wav"))
]

Each cloned voice is registered as a separate voice entry in the descriptor params. Cloning happens once at startup, so later synthesis requests for that voice are served quickly.

Privacy note: Voice cloning is powerful and risky. Communities SHOULD policy-restrict who can register cloned voices (suggested: trust_required="anchor" for voice cloning). MVP allows any member; document the risk.

5.4 Audio format negotiation

  • Input STT: any common format Whisper accepts (mp3, ogg, wav, m4a). Service normalises via ffmpeg.
  • Output TTS: ogg_vorbis default (smallest), mp3 widely-compatible, wav lossless.
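
The input normalisation step could be sketched as an ffmpeg invocation like the one below. The target format (16 kHz, mono, 16-bit PCM WAV) is an assumption chosen because Whisper models operate on 16 kHz mono audio; the spec states only that the service normalises via ffmpeg. ffmpeg_normalise_cmd and normalise are hypothetical helper names, and file paths are used rather than pipes because some containers (m4a, for example) may need seekable input.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_normalise_cmd(src: Path, dst: Path, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that converts any supported input to mono PCM WAV."""
    return [
        "ffmpeg",
        "-nostdin", "-hide_banner", "-loglevel", "error",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),
        "-vn",                    # ignore any video or cover-art stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        str(dst),
    ]


def normalise(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError when ffmpeg cannot decode src."""
    subprocess.run(ffmpeg_normalise_cmd(src, dst), check=True)
```

A failed conversion here is a natural place to raise the "Audio decode failed (corrupt blob)" error.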

5.5 Edge-TTS internet dependency

EdgeTtsBackend requires internet access. M09 deregisters it automatically when the node goes offline; the local XTTS backend continues to work.
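
The effect of that deregistration on backend choice can be illustrated with a small filter. This is only a local sketch: the actual mechanism lives in M09, and TtsBackendInfo, usable_backends, and pick_backend are hypothetical names. The requires_internet flag follows the wording used for WhisperRemoteBackend.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TtsBackendInfo:
    name: str
    requires_internet: bool


def usable_backends(backends: List[TtsBackendInfo], online: bool) -> List[TtsBackendInfo]:
    """Keep internet-bound backends only while the node is online."""
    return [b for b in backends if online or not b.requires_internet]


def pick_backend(
    backends: List[TtsBackendInfo], online: bool, preferred: Optional[str] = None
) -> TtsBackendInfo:
    """Return the preferred backend if it is usable, else the first usable one."""
    candidates = usable_backends(backends, online)
    if not candidates:
        raise LookupError("no TTS backend available")
    for backend in candidates:
        if backend.name == preferred:
            return backend
    return candidates[0]
```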

5.6 STT → TTS chain (voice assistant pattern)

The voice query button in the M08 UI ext runs the following chain:

mic → audio blob via M07 → stt.transcribe → text
text → llm.chat → response text
response text → tts.synthesize → audio chunks → speaker

This is composed at the UI layer, not internally in the speech services.
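
As a sketch of what that UI-layer composition might look like, the function below chains three injected callables. voice_query and the stt, llm, and tts parameters are hypothetical: in the real UI they would be bus calls to stt.transcribe, llm.chat, and tts.synthesize, whose request shapes are defined in CAP2 and not reproduced here.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def voice_query(
    audio_cid: str,
    *,
    stt: Callable[[str], AsyncIterator[dict]],
    llm: Callable[[str], Awaitable[str]],
    tts: Callable[[str], AsyncIterator[bytes]],
) -> AsyncIterator[bytes]:
    """Compose STT, LLM, and TTS; the speech services know nothing about each other."""
    # Collect the transcript first: the LLM needs the whole question.
    parts = [segment["text"] async for segment in stt(audio_cid)]
    question = " ".join(p.strip() for p in parts if p.strip())

    answer = await llm(question)

    # Audio chunks are passed through as they arrive so playback can start early.
    async for chunk in tts(answer):
        yield chunk
```

Keeping the chain in the caller means each service can be tested, replaced, or deregistered independently.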

5.7 Christof's existing pipeline reuse

Christof has an established XTTS-v2 + Edge-TTS podcast generator pipeline. The XttsBackend and EdgeTtsBackend are designed to be drop-ins for that pipeline, sharing the same models directory.


6. Errors

| Condition | Wire code |
| --- | --- |
| Audio > STT_MAX_AUDIO_SECONDS | bad_request |
| Text > TTS_MAX_TEXT_CHARS | bad_request |
| Unknown voice | not_found |
| Audio decode failed (corrupt blob) | bad_request |
| Backend GPU OOM | capacity_exceeded |
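
One way to keep this mapping in a single place is a lookup from exception type to wire code. The exception classes and the wire_code helper below are hypothetical names for illustration; only the conditions and codes come from the table. Unmapped exceptions return None here because this spec does not define a fallback code.

```python
from typing import Optional


class AudioTooLong(Exception): ...
class TextTooLong(Exception): ...
class UnknownVoice(Exception): ...
class AudioDecodeFailed(Exception): ...
class GpuOutOfMemory(Exception): ...


# Condition -> wire code, as listed in the errors table.
_WIRE_CODES = {
    AudioTooLong: "bad_request",
    TextTooLong: "bad_request",
    UnknownVoice: "not_found",
    AudioDecodeFailed: "bad_request",
    GpuOutOfMemory: "capacity_exceeded",
}


def wire_code(exc: BaseException) -> Optional[str]:
    """Return the wire code for a known speech error, or None if it is unmapped."""
    for exc_type, code in _WIRE_CODES.items():
        if isinstance(exc, exc_type):
            return code
    return None
```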

7. Configuration

config.speech.enabled              = True
config.speech.stt_backends = [
    SttBackendConfig(name="whisper", default_model="large-v3", device="auto"),
]
config.speech.tts_backends = [
    TtsBackendConfig(name="xtts", voices_dir=Path("~/.hearthnet/voices")),
    TtsBackendConfig(name="edge_tts", default_voice="de-DE-KatjaNeural"),
]
config.speech.cloned_voices = []   # list[ClonedVoiceConfig]

Constants: STT_MAX_AUDIO_SECONDS, TTS_MAX_TEXT_CHARS.
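
For orientation, the config objects used above might be modelled as the dataclasses below. The field names are taken from the snippets in this document; the defaults, the Optional fields, and the resolved_path helper are assumptions. The helper is included because pathlib does not expand "~" on its own, so a path such as ~/.hearthnet/voices has to be expanded before use.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class ClonedVoiceConfig:
    name: str
    reference_path: Path

    def resolved_path(self) -> Path:
        """Expand a leading '~' so the path can actually be opened."""
        return self.reference_path.expanduser()


@dataclass(frozen=True)
class SttBackendConfig:
    name: str
    default_model: str = "large-v3"
    device: str = "auto"


@dataclass(frozen=True)
class TtsBackendConfig:
    name: str
    voices_dir: Optional[Path] = None     # used by xtts
    default_voice: Optional[str] = None   # used by edge_tts


@dataclass
class SpeechConfig:
    enabled: bool = True
    stt_backends: List[SttBackendConfig] = field(default_factory=list)
    tts_backends: List[TtsBackendConfig] = field(default_factory=list)
    cloned_voices: List[ClonedVoiceConfig] = field(default_factory=list)
```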


8. Tests

Unit

  • test_stt_descriptor_per_model
  • test_tts_descriptor_per_voice
  • test_stt_max_duration_rejected
  • test_tts_max_length_rejected

Integration

  • test_whisper_transcribes_de_audio (test asset)
  • test_xtts_synthesises_then_decodes_to_correct_duration
  • test_voice_chain_stt_llm_tts (end-to-end)
  • test_edge_tts_deregistered_when_offline

9. Cross-references

| What | Where |
| --- | --- |
| stt.transcribe@1.0 wire | CAP2 §4.11 |
| tts.synthesize@1.0 wire | CAP2 §4.12 |
| Voice query UI | M08 ext |
| Mobile voice notes | M22 §4 |
| Translation chain | M18 |
| Emergency dereg for internet-bound backends | M09 §5.2 |

10. Open questions

  1. Streaming STT (mic input → live caption): Phase 2.5. Requires WebSocket and a different backend init pattern.
  2. Real-time TTS (sub-100 ms first audio): XTTS takes 500 ms or more; piper-tts is fast but offers limited voices. Phase 3.
  3. Speaker enrollment: an explicit "this is who I am" speech sample so diarization can label speakers by name. Phase 2.5.
  4. Audio at-rest privacy: should voice notes be E2E? M23 supports it; default ON for chat attachments.