Models
This module explains how to make the full OpenTalking model chain runnable, not only the talking-head backend. A usable digital-human session depends on the following parts, of which speech recognition is optional:
flowchart LR
STT[Speech recognition<br/>optional voice input]
LLM[LLM<br/>decides what to say]
TTS[TTS<br/>text to audio]
Avatar[Avatar assets<br/>image / frames / template]
Head[Talking-head backend<br/>audio to video]
WebRTC[WebRTC<br/>browser delivery]
STT --> LLM --> TTS --> Head --> WebRTC
Avatar --> Head
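The data flow in the diagram can be sketched in a few lines of Python. This is only an illustration of the stage order, not OpenTalking's actual code: the `ModelChain` class, its field names, and the stub stages are all hypothetical, and real stages stream data rather than passing whole byte strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ModelChain:
    """Hypothetical container for the stages shown in the flowchart."""

    llm: Callable[[str], str]            # user text -> reply text
    tts: Callable[[str], bytes]          # reply text -> audio
    head: Callable[[bytes, str], bytes]  # audio + avatar id -> video
    deliver: Callable[[bytes], None]     # video -> browser (WebRTC in the real system)
    stt: Optional[Callable[[bytes], str]] = None  # optional voice input

    def run_turn(self, avatar: str, text: Optional[str] = None,
                 voice: Optional[bytes] = None) -> None:
        # STT is only needed when the user speaks instead of typing.
        if text is None:
            if voice is None or self.stt is None:
                raise ValueError("need text, or voice plus an STT stage")
            text = self.stt(voice)
        reply = self.llm(text)             # LLM decides what to say
        audio = self.tts(reply)            # TTS turns text into audio
        video = self.head(audio, avatar)   # talking-head backend renders video
        self.deliver(video)                # delivery to the browser


# Stub stages, standing in for real services.
sent: List[bytes] = []
chain = ModelChain(
    llm=lambda text: f"echo: {text}",
    tts=lambda reply: reply.encode("utf-8"),
    head=lambda audio, avatar: avatar.encode("utf-8") + b"|" + audio,
    deliver=sent.append,
)
chain.run_turn(avatar="demo", text="hello")
```

The point of the sketch is that a failure in any one stage stops the whole turn, which is why the rest of this page configures each layer before starting a real talking-head model.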
Recommended defaults
| Layer | Default for first run | When to change it |
|---|---|---|
| LLM | DashScope OpenAI-compatible endpoint | Use OpenAI, vLLM, Ollama, or DeepSeek when those are already standard in your environment. |
| STT | DashScope Paraformer realtime | Keep it unless you need a different realtime STT provider. |
| TTS | Edge TTS | Use DashScope, CosyVoice, or ElevenLabs for production voices and voice cloning. |
| Avatar assets | Built-in examples | Use shared visual assets; models generate caches, templates, or preprocessing artifacts as needed. |
| Talking-head backend | `mock` first, then the Wav2Lip local path | Use QuickTalk / FlashTalk through OmniRT, FlashHead direct WS, or another model service. |
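One way to keep track of these choices is a small profile that starts from the first-run defaults and overrides one layer at a time. The sketch below is an assumption for illustration: the dictionary keys and the `resolve_layers` helper are hypothetical and do not reflect OpenTalking's real configuration schema.

```python
from typing import Dict, Optional

# First-run defaults, copied from the table above. The keys are illustrative only.
FIRST_RUN_DEFAULTS: Dict[str, str] = {
    "llm": "DashScope OpenAI-compatible endpoint",
    "stt": "DashScope Paraformer realtime",
    "tts": "Edge TTS",
    "avatar": "Built-in examples",
    "talking_head": "mock",
}


def resolve_layers(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the defaults with the given per-layer overrides applied."""
    overrides = overrides or {}
    unknown = set(overrides) - set(FIRST_RUN_DEFAULTS)
    if unknown:
        # Reject misspelled layer names instead of silently ignoring them.
        raise KeyError(f"unknown layer(s): {sorted(unknown)}")
    return {**FIRST_RUN_DEFAULTS, **overrides}


# Example: keep everything at the default except the talking-head backend.
profile = resolve_layers({"talking_head": "Wav2Lip local"})
```

Changing one layer at a time makes it easier to tell which layer caused a regression.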
Setup order
- Run Quickstart with `mock`.
- Check the Support Matrix to choose the right path.
- Configure LLM and STT.
- Choose and verify TTS.
- Prepare Avatar assets.
- Start a talking-head model.
- Verify `/models`, create a session, and test through the browser.
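The `/models` check in the last step can be scripted. The following is a minimal sketch that assumes `/models` answers a plain HTTP GET with a JSON body; the response shape is not specified here, so the helper simply returns whatever JSON it receives. The `fetch_models` name and its `opener` parameter are hypothetical and exist only to make the helper easy to test.

```python
import json
import urllib.request
from typing import Any, Callable


def fetch_models(base_url: str,
                 opener: Callable[..., Any] = urllib.request.urlopen,
                 timeout: float = 5.0) -> Any:
    """GET <base_url>/models and return the decoded JSON body."""
    url = base_url.rstrip("/") + "/models"
    # urlopen returns a response object usable as a context manager.
    with opener(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Call it with whatever base URL your OpenTalking API server listens on and confirm that the backend you started appears in the output before creating a session.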
Model Shortcuts
| Goal | Entry |
|---|---|
| End-to-end self-test with no weights | Mock |
| First real lip-sync model | Wav2Lip Local |
| Local STT/TTS + QuickTalk | Local STT/TTS + QuickTalk |
| Existing MuseTalk runtime | MuseTalk |
| Local realtime adapter | QuickTalk |
| Single-GPU realtime portrait with pasteback | FasterLivePortrait |
| High-quality heavy model | FlashTalk |
| Standalone FlashHead service | FlashHead |
Keep model execution decoupled from OpenTalking itself: lightweight models should use `local` or `direct_ws` where possible, while OmniRT remains the recommended backend for heavyweight, multi-card, remote, or NPU deployments.
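The rule of thumb above can be written as a small decision helper. This is a sketch under stated assumptions: `choose_backend_mode` is a hypothetical function, the `"omnirt"` label is illustrative, and splitting `local` from `direct_ws` by whether the model already runs as its own service is one reading of the guidance, not something this page defines.

```python
def choose_backend_mode(*, heavyweight: bool = False, multi_card: bool = False,
                        remote: bool = False, npu: bool = False,
                        standalone_service: bool = False) -> str:
    """Map deployment traits to a backend mode, following the guidance above."""
    # Heavyweight, multi-card, remote, or NPU deployments go through OmniRT.
    if heavyweight or multi_card or remote or npu:
        return "omnirt"
    # Assumption: a lightweight model that already runs as its own service
    # is reached with direct_ws; otherwise it runs in the local path.
    return "direct_ws" if standalone_service else "local"
```

For example, `choose_backend_mode()` returns `"local"` for a lightweight model on the same machine, and `choose_backend_mode(multi_card=True)` returns `"omnirt"`.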
Speech Generation Model Deployment
This section covers TTS model deployment and weight preparation only. For combined flows, see Local Audio + QuickTalk.
| Model | Entry | Notes |
|---|---|---|
| Edge TTS | Speech Generation Models | First-run default, good for pipeline validation. |
| DashScope Qwen TTS | Speech Generation Models | Chinese realtime TTS and voice cloning. |
| CosyVoice3 | CosyVoice Deployment | Local Chinese TTS with built-in and cloned voices. |
| IndexTTS | IndexTTS Deployment | Controllable dubbing, emotion control, and voice cloning. |
| ElevenLabs | Speech Generation Models | Hosted multilingual voices. |