Models

This module explains how to run the full OpenTalking model chain, not just the talking-head backend. A usable digital-human session depends on five model-side parts, with WebRTC delivering the result to the browser:

flowchart LR
    STT[Speech recognition<br/>optional voice input]
    LLM[LLM<br/>decides what to say]
    TTS[TTS<br/>text to audio]
    Avatar[Avatar assets<br/>image / frames / template]
    Head[Talking-head backend<br/>audio to video]
    WebRTC[WebRTC<br/>browser delivery]

    STT --> LLM --> TTS --> Head --> WebRTC
    Avatar --> Head

| Layer | Default for first run | When to change it |
| --- | --- | --- |
| LLM | DashScope OpenAI-compatible endpoint | Use OpenAI, vLLM, Ollama, or DeepSeek when those are already standard in your environment. |
| STT | DashScope Paraformer realtime | Keep it unless you need a different realtime STT provider. |
| TTS | Edge TTS | Use DashScope, CosyVoice, or ElevenLabs for production voices and voice cloning. |
| Avatar assets | Built-in examples | Use shared visual assets; models generate caches, templates, or preprocessing artifacts as needed. |
| Talking-head backend | mock first, then the Wav2Lip local path | Use QuickTalk / FlashTalk through OmniRT, FlashHead direct WS, or another model service. |
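
To make the data flow between these layers concrete, here is a minimal sketch of the chain as plain Python callables. Every name in it (`Chain`, `mock_chain`, the stage signatures) is hypothetical and chosen only for illustration; none of it is an OpenTalking API. It only shows the order in which the stages hand data to each other, and that STT is optional when text input is available.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


# Hypothetical illustration only: these are not OpenTalking classes or APIs.
@dataclass
class Chain:
    llm: Callable[[str], str]            # user text -> reply text
    tts: Callable[[str], bytes]          # reply text -> audio bytes
    head: Callable[[bytes, str], List]   # audio + avatar id -> video frames
    avatar: str                          # avatar asset identifier
    stt: Optional[Callable[[bytes], str]] = None  # optional voice input

    def run(self, text: Optional[str] = None,
            voice: Optional[bytes] = None) -> List:
        # STT is only needed when the user speaks instead of typing.
        if text is None:
            if voice is None or self.stt is None:
                raise ValueError("need text, or voice plus an STT stage")
            text = self.stt(voice)
        reply = self.llm(text)                 # LLM decides what to say
        audio = self.tts(reply)                # TTS turns text into audio
        return self.head(audio, self.avatar)   # backend turns audio into video


def mock_chain() -> Chain:
    # Stand-in stages that need no weights, similar in spirit to a mock run.
    return Chain(
        llm=lambda text: f"echo: {text}",
        tts=lambda reply: reply.encode("utf-8"),
        head=lambda audio, avatar: [f"{avatar}:{len(audio)}"],
        avatar="demo",
        stt=lambda voice: voice.decode("utf-8"),
    )
```

Swapping one layer then means replacing a single callable, which mirrors how each row of the table can be changed independently.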

Setup order

  1. Run Quickstart with mock.
  2. Check the Support Matrix to choose the right path.
  3. Configure LLM and STT.
  4. Choose and verify TTS.
  5. Prepare Avatar assets.
  6. Start a talking-head model.
  7. Verify /models, create a session, and test through the browser.
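
The `/models` check in the last step can be scripted. The sketch below is an assumption-heavy example: the base URL is a placeholder, and the response shape (either a bare list, or an object with a `models` key, whose entries are strings or objects with a `name` field) is a guess rather than the documented schema. Adjust `model_names` to whatever your deployment actually returns.

```python
import json
from urllib.request import urlopen


def fetch_models(base_url: str, timeout: float = 5.0):
    """GET {base_url}/models and return the decoded JSON body."""
    url = f"{base_url.rstrip('/')}/models"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def model_names(payload) -> set:
    """Collect model names from an ASSUMED payload shape.

    Accepts a bare list or {"models": [...]}; each entry may be a string
    or a dict with a "name" key. Anything else is ignored.
    """
    items = payload.get("models", []) if isinstance(payload, dict) else payload
    names = set()
    for item in items:
        if isinstance(item, str):
            names.add(item)
        elif isinstance(item, dict) and "name" in item:
            names.add(item["name"])
    return names
```

For example, `"mock" in model_names(fetch_models(base_url))` would confirm that the mock backend is registered before you create a session, where `base_url` is the address of your own OpenTalking server.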

Model Shortcuts

| Goal | Entry |
| --- | --- |
| End-to-end self-test with no weights | Mock |
| First real lip-sync model | Wav2Lip Local |
| Local STT/TTS + QuickTalk | Local STT/TTS + QuickTalk |
| Existing MuseTalk runtime | MuseTalk |
| Local realtime adapter | QuickTalk |
| Single-GPU realtime portrait with pasteback | FasterLivePortrait |
| High-quality heavy model | FlashTalk |
| Standalone FlashHead service | FlashHead |

Keep model execution decoupled from OpenTalking itself: lightweight models should use local or direct_ws where possible, while OmniRT remains the recommended backend for heavyweight, multi-card, remote, or NPU deployments.
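
The rule of thumb above can be written as a small decision helper. This is a sketch under stated assumptions: the function name, its flags, and the returned label `"omnirt"` are hypothetical, and the choice between `local` and `direct_ws` based on whether the model already runs as its own WebSocket service is one interpretation of the guidance, not a documented rule.

```python
def recommend_backend(heavyweight: bool = False,
                      multi_card: bool = False,
                      remote: bool = False,
                      npu: bool = False,
                      has_ws_service: bool = False) -> str:
    """Return a suggested backend kind for a talking-head model.

    Hypothetical helper: encodes the guidance that heavyweight, multi-card,
    remote, or NPU deployments go through OmniRT, while lightweight models
    stay on local or direct_ws.
    """
    if heavyweight or multi_card or remote or npu:
        return "omnirt"
    # Assumption: use direct_ws when the model already runs as a standalone
    # WebSocket service, otherwise run it in-process as a local adapter.
    return "direct_ws" if has_ws_service else "local"
```

Under these assumptions, a small single-GPU model maps to `local`, a standalone lightweight service maps to `direct_ws`, and anything heavy or distributed maps to `omnirt`.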

Speech Generation Model Deployment

This section covers TTS model deployment and weight preparation only. For combined flows, see Local Audio + QuickTalk.

| Model | Entry | Notes |
| --- | --- | --- |
| Edge TTS | Speech Generation Models | First-run default, good for pipeline validation. |
| DashScope Qwen TTS | Speech Generation Models | Chinese realtime TTS and voice cloning. |
| CosyVoice3 | CosyVoice Deployment | Local Chinese TTS with built-in and cloned voices. |
| IndexTTS | IndexTTS Deployment | Controllable dubbing, emotion control, and voice cloning. |
| ElevenLabs | Speech Generation Models | Hosted multilingual voices. |