HearthNet-Nemotron / docs /p2_p3 /M19-stt-tts.md

Chris4K

p2, p3

70650b7 17 days ago

M19 — Speech I/O (STT + TTS)

Spec version: v1.0 (Phase 2)
Depends on: M03 (bus), M07 (blobs, for audio I/O), X04 (config), X03 (observability), openai-whisper, TTS (Coqui XTTS-v2), edge-tts libs
Depended on by: M08 UI (voice query button), M22 mobile (voice notes), M18 (STT can chain into translation)

1. Responsibility

Two capabilities:

stt.transcribe@1.0 — audio → text, with optional translate-to-English
tts.synthesize@1.0 — text → audio

The two services live in the same module because they share the speech domain and are often paired (voice query → STT → LLM → TTS).

2. File layout

hearthnet/services/speech/
├── __init__.py
├── stt_service.py
├── tts_service.py
└── backends/
    ├── __init__.py
    ├── base.py            # SttBackend, TtsBackend protocols
    ├── whisper.py         # OpenAI Whisper local
    ├── whisper_remote.py  # HF inference API alternative
    ├── xtts.py            # Coqui XTTS-v2 (cloned voices)
    └── edge_tts.py        # Microsoft Edge-TTS (Christof has existing pipeline)

3. STT — public API

3.1 `backends/base.py` (STT)

from dataclasses import dataclass
from typing import AsyncIterator, Protocol

@dataclass(frozen=True)
class SttSegment:
    start_seconds:  float
    end_seconds:    float
    text:           str
    language:       str
    speaker:        str | None     # only if diarization enabled
    confidence:     float | None

@dataclass(frozen=True)
class SttResult:
    segments:        list[SttSegment]
    language:        str
    duration_seconds: float
    ms:              int

class SttBackend(Protocol):
    name:        str
    models:      list[str]            # "tiny" | "base" | "small" | "medium" | "large-v3"
    languages_supported: list[str]    # ISO 639-1
    supports_diarization: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

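    # Declared with plain `def`: implementations are async generators
    # (`async def` + `yield`), and type checkers read a yield-less
    # `async def` as a coroutine that returns an iterator.
    def transcribe(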
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: str | None,         # "auto" handled by caller
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        """Yields segments as they are produced. Backend may produce in big chunks
        or near-realtime depending on model + hardware."""

    def health(self) -> dict: ...
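
To make the streaming contract concrete, here is a minimal sketch of a test double that has the same shape as `SttBackend`. `EchoSttBackend` and `collect` are hypothetical names, not part of the spec, and the "transcription" simply treats each UTF-8 line of the input as one segment. The dataclass is repeated (with defaults added for brevity) so the example is self-contained.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass(frozen=True)
class SttSegment:  # mirrors the dataclass above; defaults added for brevity
    start_seconds: float
    end_seconds: float
    text: str
    language: str
    speaker: Optional[str] = None
    confidence: Optional[float] = None


class EchoSttBackend:
    """Hypothetical test double: each UTF-8 line of the input becomes one segment."""

    name = "echo"
    models = ["tiny"]
    languages_supported = ["en"]
    supports_diarization = False

    async def warm(self, model: str) -> None:
        return None

    async def close(self) -> None:
        return None

    async def transcribe(
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: Optional[str],
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        for i, line in enumerate(audio_bytes.decode("utf-8").splitlines()):
            await asyncio.sleep(0)  # hand control back to the loop between segments
            yield SttSegment(float(i), float(i + 1), line, language or "en")

    def health(self) -> dict:
        return {"backend": self.name, "ok": True}


async def collect(backend: EchoSttBackend, audio: bytes) -> list:
    # Consume the async generator the way a service handler would.
    return [
        seg
        async for seg in backend.transcribe(
            audio, model="tiny", language=None, diarize=False, translate_to_en=False
        )
    ]


segments = asyncio.run(collect(EchoSttBackend(), b"hello\nworld"))
```

A double like this may be enough for unit tests of the service layer that should not load a Whisper model.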

3.2 `stt_service.py`

class SttService:
    name    = "stt"
    version = "1.0"

    def __init__(self, config: SpeechConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One stt.transcribe per (backend, model) combo."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_transcribe(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.11.
        1. Fetch audio blob by CID
        2. Verify duration ≤ STT_MAX_AUDIO_SECONDS
        3. Stream segments
        4. Emit done with total stats"""
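
The four steps in the docstring could be sketched roughly as follows. The authoritative frame layout is defined in CAP2 §4.11, so the dictionary keys used here (`type`, `code`, `segments`) are assumptions, and `fetch_blob`, `probe_duration` and `transcribe` are hypothetical injected callables standing in for M07 and the backend.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

STT_MAX_AUDIO_SECONDS = 300  # value stated in section 5.2


async def handle_transcribe_sketch(
    cid: str,
    fetch_blob: Callable[[str], Awaitable[bytes]],
    probe_duration: Callable[[bytes], float],
    transcribe: Callable[[bytes], AsyncIterator[str]],
) -> AsyncIterator[dict]:
    audio = await fetch_blob(cid)                     # 1. fetch audio blob by CID
    duration = probe_duration(audio)                  # 2. verify duration
    if duration > STT_MAX_AUDIO_SECONDS:
        yield {"type": "error", "code": "bad_request"}
        return
    count = 0
    async for text in transcribe(audio):              # 3. stream segments
        count += 1
        yield {"type": "segment", "text": text}
    # 4. emit done with total stats
    yield {"type": "done", "duration_seconds": duration, "segments": count}


# Stubs for illustration only.
async def fake_fetch(cid: str) -> bytes:
    return b"x" * 10


def fake_duration(audio: bytes) -> float:
    return float(len(audio))  # pretend one byte is one second


async def fake_transcribe(audio: bytes) -> AsyncIterator[str]:
    for word in ("guten", "tag"):
        yield word


async def run(duration_fn: Callable[[bytes], float]) -> list:
    return [
        frame
        async for frame in handle_transcribe_sketch(
            "cid-1", fake_fetch, duration_fn, fake_transcribe
        )
    ]


ok_frames = asyncio.run(run(fake_duration))
too_long_frames = asyncio.run(run(lambda audio: 301.0))
```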

3.3 Concrete STT backends

class WhisperBackend(SttBackend):
    """Local Whisper via openai-whisper or faster-whisper."""

    def __init__(self, models_dir: Path, default_model: str = "large-v3", device: str = "auto"):
        ...

class WhisperRemoteBackend(SttBackend):
    """HF Inference API. requires_internet=True. Used as a fallback when local Whisper is not available."""

    def __init__(self, model: str = "openai/whisper-large-v3", token_env: str = "HF_TOKEN"):
        ...

4. TTS — public API

4.1 `backends/base.py` (TTS)

@dataclass(frozen=True)
class TtsResult:
    audio_format:    str         # "ogg_vorbis" | "mp3" | "wav"
    sample_rate:     int         # Hz
    duration_seconds: float
    total_bytes:     int
    ms:              int

class TtsBackend(Protocol):
    name:        str
    voices:      list[str]
    languages_supported: list[str]
    formats_supported: list[str]
    cloned_voices_supported: bool

    async def warm(self, voice: str) -> None: ...
    async def close(self) -> None: ...

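    # Declared with plain `def`: implementations are async generators
    # (`async def` + `yield`), and type checkers read a yield-less
    # `async def` as a coroutine that returns an iterator.
    def synthesize(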
        self,
        text: str,
        *,
        voice: str,
        language: str,
        speed: float,                # 0.5..2.0; 1.0 default
        output_format: str,          # "ogg_vorbis"|"mp3"|"wav"
        chunk_size_bytes: int = 16384,
    ) -> AsyncIterator[bytes]:
        """Yields raw audio chunks."""

    def health(self) -> dict: ...
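
As an illustration of the `chunk_size_bytes` and `speed` parameters, the sketch below slices an already-encoded audio buffer into fixed-size chunks and validates the speed range given in the comment above. `iter_audio_chunks` and `check_speed` are hypothetical helpers; a real backend might instead yield chunks as the encoder produces them.

```python
import asyncio
from typing import AsyncIterator


async def iter_audio_chunks(
    audio: bytes, chunk_size_bytes: int = 16384
) -> AsyncIterator[bytes]:
    """Yield `audio` in slices of at most `chunk_size_bytes` bytes."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk_size_bytes must be positive")
    for offset in range(0, len(audio), chunk_size_bytes):
        yield audio[offset : offset + chunk_size_bytes]


def check_speed(speed: float) -> float:
    """Reject speeds outside the documented 0.5..2.0 range."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed {speed} outside 0.5..2.0")
    return speed


async def _sizes(audio: bytes) -> list:
    return [len(chunk) async for chunk in iter_audio_chunks(audio)]


# 40000 bytes of silence-like padding, split with the default chunk size.
chunk_sizes = asyncio.run(_sizes(bytes(40000)))
```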

4.2 `tts_service.py`

class TtsService:
    name    = "tts"
    version = "1.0"

    def __init__(self, config: SpeechConfig):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One tts.synthesize per (backend, voice) pair (or backend-only if many voices)."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_synthesize(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.12.
        1. Validate text length ≤ TTS_MAX_TEXT_CHARS
        2. Pick backend and voice
        3. Stream chunks (base64 in 'chunk' frame)
        4. Emit done with metadata"""
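
A rough sketch of steps 1, 3 and 4 follows. The spec only says that chunks travel base64-encoded in a `chunk` frame; the exact keys are defined in CAP2 §4.12, so the ones below are assumptions. The value of `TTS_MAX_TEXT_CHARS` is not given in this document, so the limit is passed in as a parameter, and `synthesize` is a hypothetical stand-in for the chosen backend.

```python
import asyncio
import base64
from typing import AsyncIterator, Callable


async def handle_synthesize_sketch(
    text: str,
    synthesize: Callable[[str], AsyncIterator[bytes]],
    max_text_chars: int,
) -> AsyncIterator[dict]:
    if len(text) > max_text_chars:                    # 1. validate text length
        yield {"type": "error", "code": "bad_request"}
        return
    total_bytes = 0
    async for chunk in synthesize(text):              # 3. stream chunks as base64
        total_bytes += len(chunk)
        yield {"type": "chunk", "chunk": base64.b64encode(chunk).decode("ascii")}
    yield {"type": "done", "total_bytes": total_bytes}  # 4. emit done with metadata


async def fake_synthesize(text: str) -> AsyncIterator[bytes]:
    # Stub: the "audio" is just the UTF-8 text, emitted four bytes at a time.
    raw = text.encode("utf-8")
    for offset in range(0, len(raw), 4):
        yield raw[offset : offset + 4]


async def run(text: str, limit: int) -> list:
    return [f async for f in handle_synthesize_sketch(text, fake_synthesize, limit)]


frames = asyncio.run(run("hello world", limit=100))  # limit is a placeholder
rejected = asyncio.run(run("hello world", limit=5))
decoded = b"".join(
    base64.b64decode(f["chunk"]) for f in frames if f["type"] == "chunk"
)
```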

4.3 Concrete TTS backends

class XttsBackend(TtsBackend):
    """Coqui XTTS-v2 (Christof has the pipeline from his podcast generator).
       Supports voice cloning via reference audio."""

    def __init__(
        self,
        model: str = "tts_models/multilingual/multi-dataset/xtts_v2",
        voices_dir: Path = Path("~/.hearthnet/voices").expanduser(),  # Path() does not expand "~" on its own
        device: str = "auto",
    ):
        ...

class EdgeTtsBackend(TtsBackend):
    """Microsoft Edge-TTS — requires internet, many voices, very natural.
       Used as default when xtts is too slow on a node."""

    def __init__(self, default_voice: str = "de-DE-KatjaNeural"):
        ...

5. Behaviour

5.1 STT streaming

For long audio:

Local Whisper produces segments incrementally (roughly real time on an RTX 4090, slower on CPU).
The service emits one SSE segment frame per finalised segment.
The final done frame includes the total duration and the language detected over the full audio.
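
For reference, one way to serialise a segment as a Server-Sent Events frame is sketched below. The `event:` and `data:` field syntax is standard SSE; the event name and payload keys are assumptions, since the wire format is defined in CAP2 §4.11. `sse_frame` is a hypothetical helper.

```python
import json


def sse_frame(event: str, payload: dict) -> str:
    """Format one SSE frame; json.dumps keeps the payload on a single data line."""
    data = json.dumps(payload, ensure_ascii=False, sort_keys=True)
    return f"event: {event}\ndata: {data}\n\n"


frame = sse_frame(
    "segment", {"text": "Guten Tag", "start_seconds": 0.0, "end_seconds": 1.4}
)
```

The blank line at the end of each frame is what tells the client that the event is complete.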

5.2 STT max length

STT_MAX_AUDIO_SECONDS = 300. For longer audio, the caller splits the input into chunks of at most 5 minutes, transcribes each chunk separately, and concatenates the results. Keeping speaker labels consistent across chunks is the caller's responsibility.
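
A caller-side sketch of this chunking is shown below. It plans windows of at most `STT_MAX_AUDIO_SECONDS` and shifts each chunk's segment timestamps back onto the original timeline before concatenating. `plan_windows` and `shift_segments` are hypothetical helpers, and segments are reduced to `(start, end, text)` tuples for brevity.

```python
STT_MAX_AUDIO_SECONDS = 300


def plan_windows(total_seconds: float, max_seconds: float = STT_MAX_AUDIO_SECONDS) -> list:
    """Return (start, end) windows covering the audio, each at most max_seconds long."""
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows


def shift_segments(segments: list, offset: float) -> list:
    """Move chunk-relative (start, end, text) tuples onto the original timeline."""
    return [(start + offset, end + offset, text) for start, end, text in segments]


windows = plan_windows(720.0)  # a 12-minute recording

# Pretend every chunk produced one two-second segment at its beginning.
merged = []
for window_start, _window_end in windows:
    merged.extend(shift_segments([(0.0, 2.0, "...")], window_start))
```

Splitting at fixed offsets can cut a word in half; a caller that cares about this could split at silences instead, which this sketch does not attempt.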

5.3 Voice cloning (XTTS)

XttsBackend supports voice cloning when given a reference audio file:

config.speech.cloned_voices = [
    ClonedVoiceConfig(name="hannes_v1", reference_path=Path("~/.hearthnet/voices/hannes-3s.wav"))
]

Each cloned voice is registered as a separate voice entry in the descriptor params. Cloning happens once at startup, so later synthesis requests for that voice are served quickly.

Privacy note: Voice cloning is powerful and risky. Communities SHOULD restrict by policy who can register cloned voices (suggested: trust_required="anchor" for voice cloning). The MVP allows any member to register one, so the risk must be documented.

5.4 Audio format negotiation

STT input: any common format Whisper accepts (mp3, ogg, wav, m4a). The service normalises it via ffmpeg.
TTS output: ogg_vorbis by default (smallest), mp3 for wide compatibility, wav for lossless audio.
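
The normalisation step might look like the sketch below, which only builds the ffmpeg argument list. The target of 16 kHz mono 16-bit PCM is an assumption, chosen because it matches what Whisper resamples to internally; `ffmpeg_normalise_cmd` and the file names are hypothetical.

```python
from pathlib import Path


def ffmpeg_normalise_cmd(src: Path, dst: Path, sample_rate: int = 16000) -> list:
    """Build an ffmpeg command that converts `src` to mono PCM WAV at `sample_rate`."""
    return [
        "ffmpeg",
        "-nostdin",            # never wait for interactive input
        "-y",                  # overwrite dst if it already exists
        "-i", str(src),        # input container/codec is auto-detected
        "-ac", "1",            # downmix to mono
        "-ar", str(sample_rate),
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]


cmd = ffmpeg_normalise_cmd(Path("note.m4a"), Path("note.wav"))
```

The list can be passed to `subprocess.run(cmd, check=True)`; a non-zero exit status is then a natural place to surface the "audio decode failed" condition.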

5.5 Edge-TTS internet dependency

EdgeTtsBackend requires internet. M09 deregisters it automatically when the node is offline. The local XTTS backend continues to work.
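
The effect of that deregistration can be pictured with the small sketch below. The real mechanism lives in M09; `BackendInfo` and `available_backends` are hypothetical, and only the `requires_internet` flag comes from this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendInfo:
    name: str
    requires_internet: bool


def available_backends(backends: list, online: bool) -> list:
    """Names of the backends that can serve requests in the current connectivity state."""
    return [b.name for b in backends if online or not b.requires_internet]


tts_backends = [
    BackendInfo("xtts", requires_internet=False),
    BackendInfo("edge_tts", requires_internet=True),
]

offline_names = available_backends(tts_backends, online=False)
online_names = available_backends(tts_backends, online=True)
```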

5.6 STT → TTS chain (voice assistant pattern)

The voice query button in the M08 UI ext runs the following chain:

mic → audio blob via M07 → stt.transcribe → text
text → llm.chat → response text
response text → tts.synthesize → audio chunks → speaker

This is composed at the UI layer, not internally in the speech services.
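
A UI-layer composition could be as small as the sketch below. The three callables are hypothetical stand-ins for bus calls to `stt.transcribe`, `llm.chat` and `tts.synthesize`; the stubs exist only so the example runs.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def voice_query(
    audio_cid: str,
    stt: Callable[[str], Awaitable[str]],
    llm: Callable[[str], Awaitable[str]],
    tts: Callable[[str], AsyncIterator[bytes]],
) -> list:
    text = await stt(audio_cid)                    # audio blob -> text
    reply = await llm(text)                        # text -> response text
    return [chunk async for chunk in tts(reply)]   # response text -> audio chunks


# Stubs for illustration only.
async def fake_stt(cid: str) -> str:
    return "hallo"


async def fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"


async def fake_tts(text: str) -> AsyncIterator[bytes]:
    yield text.encode("utf-8")


audio_chunks = asyncio.run(voice_query("cid-1", fake_stt, fake_llm, fake_tts))
```

Collecting all chunks before playback keeps the sketch short; a real UI would more likely feed each chunk to the speaker as it arrives.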

5.7 Christof's existing pipeline reuse

Christof has an established XTTS-v2 + Edge-TTS podcast generator pipeline. The XttsBackend and EdgeTtsBackend are designed to be drop-ins for that pipeline, sharing the same models directory.

6. Errors

| Condition | Wire code |
|---|---|
| Audio > STT_MAX_AUDIO_SECONDS | `bad_request` |
| Text > TTS_MAX_TEXT_CHARS | `bad_request` |
| Unknown voice | `not_found` |
| Audio decode failed (corrupt blob) | `bad_request` |
| Backend GPU OOM | `capacity_exceeded` |
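
One way to keep this mapping in a single place is a lookup from exception type to wire code, sketched below. The exception classes and `wire_code` are hypothetical; only the conditions and codes come from the table.

```python
from typing import Optional


class AudioTooLong(Exception): ...
class TextTooLong(Exception): ...
class UnknownVoice(Exception): ...
class AudioDecodeError(Exception): ...
class GpuOutOfMemory(Exception): ...


WIRE_CODES = {
    AudioTooLong: "bad_request",
    TextTooLong: "bad_request",
    UnknownVoice: "not_found",
    AudioDecodeError: "bad_request",
    GpuOutOfMemory: "capacity_exceeded",
}


def wire_code(exc: Exception) -> Optional[str]:
    """Return the wire code for a known speech error, or None if it is unmapped."""
    for exc_type, code in WIRE_CODES.items():
        if isinstance(exc, exc_type):
            return code
    return None


codes = [wire_code(e) for e in (UnknownVoice("hannes_v2"), GpuOutOfMemory(), KeyError("x"))]
```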

7. Configuration

config.speech.enabled              = True
config.speech.stt_backends = [
    SttBackendConfig(name="whisper", default_model="large-v3", device="auto"),
]
config.speech.tts_backends = [
    TtsBackendConfig(name="xtts", voices_dir=Path("~/.hearthnet/voices")),
    TtsBackendConfig(name="edge_tts", default_voice="de-DE-KatjaNeural"),
]
config.speech.cloned_voices = []   # list[ClonedVoiceConfig]

Constants: STT_MAX_AUDIO_SECONDS, TTS_MAX_TEXT_CHARS.

8. Tests

Unit

test_stt_descriptor_per_model
test_tts_descriptor_per_voice
test_stt_max_duration_rejected
test_tts_max_length_rejected

Integration

test_whisper_transcribes_de_audio (test asset)
test_xtts_synthesises_then_decodes_to_correct_duration
test_voice_chain_stt_llm_tts — end-to-end
test_edge_tts_deregistered_when_offline

9. Cross-references

| What | Where |
|---|---|
| `stt.transcribe@1.0` wire | CAP2 §4.11 |
| `tts.synthesize@1.0` wire | CAP2 §4.12 |
| Voice query UI | M08 ext |
| Mobile voice notes | M22 §4 |
| Translation chain | M18 |
| Emergency dereg for internet-bound backends | M09 §5.2 |

10. Open questions

Streaming STT (mic input → live caption): Phase 2.5. Requires WebSocket and a different backend init pattern.
Real-time TTS (sub-100 ms first audio): XTTS takes 500 ms or more; piper-tts is fast but has a limited set of voices. Phase 3.
Speaker enrollment: an explicit "this is who I am" speech sample so that diarization can label speakers by name. Phase 2.5.
Audio at-rest privacy: should voice notes be end-to-end encrypted? M23 supports it; default ON for chat attachments.