docs: reorganize deployment guides (#127)

2026-07-05 00:25:28 +08:00 · 2026-06-28 13:22:46 +08:00
parent cd313ce866
commit 5516cd5675
174 changed files with 4993 additions and 3818 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -78,6 +78,7 @@ examples/avatars/*/[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a
 machine.md

 # Internal dev specs (not for public repo)
+AGENTS.md
 docs/superpowers/
 .worktrees/

--- a/docs/assets/images/model-support/fasterliveportrait-video-clone.png
+++ b/docs/assets/images/model-support/fasterliveportrait-video-clone.png
--- a/docs/assets/images/quick-start/mock-first-session.png
+++ b/docs/assets/images/quick-start/mock-first-session.png
--- a/docs/assets/images/quick-start/mock-webui-home.png
+++ b/docs/assets/images/quick-start/mock-webui-home.png
--- a/docs/assets/images/usage/video-clone-workspace.png
+++ b/docs/assets/images/usage/video-clone-workspace.png
--- a/docs/assets/images/usage/webui-first-screen.png
+++ b/docs/assets/images/usage/webui-first-screen.png
--- a/docs/assets/stylesheets/navigation.css
+++ b/docs/assets/stylesheets/navigation.css
@@ -6,14 +6,13 @@
  white-space: nowrap;
 }

-.md-sidebar--primary
-  .md-nav__item--nested:not(.md-nav__item--section):not(.md-nav__item--active):not(
-    :has(> .md-nav__toggle:checked)
-  )
-  > .md-nav {
-  height: 0;
-  margin-bottom: 0;
-  overflow: hidden;
+.md-sidebar--primary .md-nav__item--nested > .md-nav {
+  margin-left: 0.65rem;
+}
+
+.md-sidebar--primary .md-nav__item--section > .md-nav__link,
+.md-sidebar--primary .md-nav__item--nested > .md-nav__link {
+  font-weight: 700;
 }

 .ot-figure-placeholder {
@@ -42,29 +41,3 @@
 .ot-figure-placeholder span {
  font-size: 0.8rem;
 }
-
-:root {
-  --ot-sidebar-child-indent: 0.85rem;
-}
-
-.md-sidebar--primary
-  .md-nav__item--section
-  > .md-nav
-  > .md-nav__list
-  > .md-nav__item:not(.md-nav__item--section)
-  > .md-nav__link {
-  box-sizing: border-box;
-  padding-left: var(--ot-sidebar-child-indent);
-}
-
-.md-sidebar--primary
-  .md-nav__item--section
-  > .md-nav
-  > .md-nav__list
-  > .md-nav__item:not(.md-nav__item--section)
-  > .md-nav
-  > .md-nav__list
-  > .md-nav__item
-  > .md-nav__link {
-  padding-left: var(--ot-sidebar-child-indent);
-}
--- a/docs/en/avatar_models/avatar.md
+++ b/docs/en/avatar_models/avatar.md
@@ -0,0 +1,51 @@
+# Avatar Assets
+
+Avatar assets define the visual identity of a digital human. OpenTalking now treats
+avatars as shared session assets: the same avatar can be reused by different
+talking-head models, and each model creates its own cache, template, or preprocessing
+artifacts when needed.
+
+## Minimal Rules
+
+A usable avatar bundle should include:
+
+- `manifest.json`: basic identity, display name, size, frame rate, and sample rate.
+- `preview.png`: image used by the WebUI avatar library.
+- Optional assets: a reference image, extracted frames, a template video, or
+  model-generated cache.
+
+Avatars are not QuickTalk-, MuseTalk-, or Wav2Lip-only assets, and the docs should not
+describe them that way. Model-specific derivatives such as QuickTalk templates, Wav2Lip
+reference frames, or MuseTalk `prepared/` files should be generated by preparation
+scripts, upload flows, or deployment commands.
+
+## Example manifest
+
+```json title="examples/avatars/demo-avatar/manifest.json"
+{
+  "id": "demo-avatar",
+  "name": "Demo Avatar",
+  "fps": 25,
+  "sample_rate": 16000,
+  "width": 512,
+  "height": 512,
+  "metadata": {}
+}
+```
+
+## Prepare and validate
+
+Use the existing avatar guide for the complete schema and preparation scripts:
+
+- [Avatar Format](../docs/avatar-format.md)
+- [Models → Talking-Head Models](talking-head.md)
+
+Verify the server sees the avatar:
+
+```bash title="terminal"
+curl -s http://127.0.0.1:8000/avatars | jq
+```
+
+When troubleshooting, check three values together: the session `model`, whether the
+avatar is readable by the service, and whether the matching `/models` backend is
+connected.
--- a/docs/en/avatar_models/deployment/musetalk-local.md
+++ b/docs/en/avatar_models/deployment/musetalk-local.md
@@ -0,0 +1,68 @@
+# MuseTalk Local Deployment
+
+Local mode starts the MuseTalk adapter from OpenTalking and runs official preprocessing before session creation. Use it when you want MuseTalk quality without deploying OmniRT yet.
+
+## Use Cases
+
+- Single-machine CUDA deployment with Web/API and MuseTalk runtime together.
+- OpenTalking should generate avatar `prepared/` assets automatically.
+- Extra first-session time for DWPose, face parsing, and VAE loading is acceptable.
+
+## Weight Preparation
+
+MuseTalk local mode needs model weights, the official source checkout, and a dedicated Python interpreter for preprocessing:
+
+```bash title="Terminal"
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OPENTALKING_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OPENTALKING_MODEL_ROOT" "$DIGITAL_HUMAN_HOME/model-repos"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download TMElyralab/MuseTalk \
+  --local-dir "$OPENTALKING_MODEL_ROOT"
+
+git clone https://github.com/TMElyralab/MuseTalk.git \
+  "$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+```
+
+The model root must contain directories such as `musetalk/`, `sd-vae-ft-mse/`, `whisper/`, `dwpose/`, and `face-parse-bisenet/`. Use the repository script to check the preprocessing environment:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/quickstart/prepare_local_musetalk.sh
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --python 3.11
+
+export OPENTALKING_MUSETALK_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+export OPENTALKING_MUSETALK_REPO="$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+export OPENTALKING_MUSETALK_PREPROCESS_PYTHON="$DIGITAL_HUMAN_HOME/runtimes/musetalk-preprocess/venv/bin/python"
+
+bash scripts/start_unified.sh --backend local --model musetalk --api-port 18000 --web-port 18173
+```
+
+When creating a session, OpenTalking runs official MuseTalk preprocessing first if the avatar does not already have `prepared/prepared_info.json`.
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:18000/health
+curl -s http://127.0.0.1:18000/models | jq '.statuses[] | select(.id=="musetalk")'
+```
+
+Expect `backend=local` and `connected=true`.
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| `No module named 'mmcv._ext'` | Install full `mmcv` in the preprocessing Python environment; `mmcv-lite` does not include the compiled `mmcv._ext` ops. |
+| Preprocessing fails | Check `OPENTALKING_MUSETALK_REPO`, `dwpose`, and `face-parse-bisenet`. |
+| First session is slow | Preprocessing and VAE loading are expected; pre-generate `prepared/` for common avatars. |
+| Avatar asset unavailable | Check that the avatar is uploaded, readable, and the session configuration is complete. |
--- a/docs/en/avatar_models/deployment/musetalk-omnirt.md
+++ b/docs/en/avatar_models/deployment/musetalk-omnirt.md
@@ -0,0 +1,70 @@
+# MuseTalk OmniRT Deployment
+
+OmniRT mode lets an external MuseTalk service own weight loading, the official runtime, and GPU scheduling. OpenTalking connects through `/v1/audio2video/musetalk`.
+
+## Use Cases
+
+- MuseTalk dependencies are heavy and should be isolated from the OpenTalking process.
+- Web/API and inference GPU run separately.
+- Wav2Lip, QuickTalk, and MuseTalk should share one OmniRT entrypoint.
+
+## Weight Preparation
+
+```bash title="Terminal"
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OMNIRT_MODEL_ROOT" "$DIGITAL_HUMAN_HOME/model-repos"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download TMElyralab/MuseTalk \
+  --local-dir "$OMNIRT_MODEL_ROOT"
+
+git clone https://github.com/TMElyralab/MuseTalk.git \
+  "$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+```
+
+Confirm `musetalk/`, `sd-vae-ft-mse/`, `whisper/`, `dwpose/`, and `face-parse-bisenet/` exist under `$OMNIRT_MODEL_ROOT`.
+
+## Start Command
+
+Use the quickstart script to prepare and start the MuseTalk runtime. `OMNIRT_HOME` must point to your OmniRT checkout:
+
+```bash title="Terminal"
+cd "$OMNIRT_HOME"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+export OMNIRT_MUSETALK_REPO="$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+export OMNIRT_MUSETALK_DEVICE=cuda
+export OMNIRT_MUSETALK_PORT=8766
+
+bash scripts/quickstart/start_omnirt_musetalk.sh
+```
+
+Then start OpenTalking:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model musetalk \
+  --omnirt http://127.0.0.1:9000 \
+  --api-port 8310 \
+  --web-port 5380
+```
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:9000/v1/audio2video/models | jq
+curl -s http://127.0.0.1:8310/models | jq '.statuses[] | select(.id=="musetalk")'
+```
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| OmniRT does not list `musetalk` | Check `OMNIRT_MUSETALK_REPO`, model directories, and startup logs. |
+| `reason=omnirt_unavailable` | Check the OpenTalking `--omnirt` URL and OmniRT port. |
+| MuseTalk child-service port conflict | Change `OMNIRT_MUSETALK_PORT`. |
+| Slow first load | MuseTalk preload and avatar preprocessing are expensive; prewarm in production. |
--- a/docs/en/avatar_models/deployment/quicktalk-apple-silicon.md
+++ b/docs/en/avatar_models/deployment/quicktalk-apple-silicon.md
@@ -0,0 +1,66 @@
+# QuickTalk Apple Silicon Deployment
+
+Apple Silicon is useful for configuration, avatar, and WebUI flow validation. For realtime production inference, prefer CUDA or OmniRT; treat Mac usage as a development path.
+
+## Use Cases
+
+- Prepare weights, inspect manifests, and validate WebUI flow on an M-series Mac.
+- Reuse the same QuickTalk directory layout without CUDA access.
+- Prepare assets that will later be synced to a Linux GPU or OmniRT service.
+
+## Weight Preparation
+
+Keep the same layout as Linux local mode:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+mkdir -p models/quicktalk/checkpoints
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download datascale-ai/quicktalk \
+  quicktalk.pth \
+  repair.npy \
+  chinese-hubert-large/config.json \
+  chinese-hubert-large/preprocessor_config.json \
+  chinese-hubert-large/pytorch_model.bin \
+  --local-dir models/quicktalk/checkpoints
+```
+
+If this machine is only used for documentation and asset checks, you can skip CUDA-specific dependencies and only verify weights, shared avatars, and optional template assets.
+
+## Start Command
+
+Validate the API and WebUI with the `mock` backend first, then start QuickTalk on MPS for asset checks:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --extra quicktalk-cpu --python 3.11
+
+export OPENTALKING_TORCH_DEVICE=mps
+export OPENTALKING_QUICKTALK_ASSET_ROOT="$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk"
+export OPENTALKING_QUICKTALK_WORKER_CACHE=0
+
+bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8210 --web-port 5280
+```
+
+If a dependency or operator does not support MPS, use `--backend mock` for product-flow checks, or sync the same `models/quicktalk/` directory to a CUDA machine.
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:8210/health
+curl -s http://127.0.0.1:8210/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
+On Apple Silicon, `connected=false` does not always mean the assets are wrong. Read `reason` to distinguish missing dependencies, missing weights, and unsupported devices.
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| MPS operator is unsupported | Use a CUDA machine or OmniRT for real inference; keep Mac for asset validation. |
+| ONNX Runtime provider mismatch | Use `quicktalk-cpu` dependencies or switch to Linux CUDA. |
+| Template video not found | If a fixed template video is configured, use a reachable absolute path or a repository asset path. |
+| Slow downloads | Set `HF_ENDPOINT`, or download on another network and sync the files. |
--- a/docs/en/avatar_models/deployment/quicktalk-local.md
+++ b/docs/en/avatar_models/deployment/quicktalk-local.md
@@ -0,0 +1,88 @@
+# QuickTalk Local Deployment
+
+Local mode loads the QuickTalk adapter inside the OpenTalking process. Use it for single-machine CUDA validation, avatar-cache debugging, and confirming the Web/API pipeline before introducing OmniRT.
+
+## Use Cases
+
+- You have already validated `mock` and now need real talking-head output.
+- GPU, WebUI, and API run on the same machine.
+- You need to prewarm QuickTalk cache for commonly used shared avatars with
+  `opentalking-prepare-cache`.
+
+## Weight Preparation
+
+Place weights under `models/quicktalk/` in the repository root. Set `HF_ENDPOINT` when Hugging Face access is slow.
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+mkdir -p models/quicktalk/checkpoints
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download datascale-ai/quicktalk \
+  quicktalk.pth \
+  repair.npy \
+  chinese-hubert-large/config.json \
+  chinese-hubert-large/preprocessor_config.json \
+  chinese-hubert-large/pytorch_model.bin \
+  --local-dir models/quicktalk/checkpoints
+```
+
+Prepare InsightFace `buffalo_l` separately:
+
+```bash title="Terminal"
+mkdir -p /tmp/opentalking-insightface models/quicktalk/checkpoints/auxiliary/models
+curl -L \
+  -o /tmp/opentalking-insightface/buffalo_l.zip \
+  https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_l.zip
+unzip -q -o /tmp/opentalking-insightface/buffalo_l.zip \
+  -d /tmp/opentalking-insightface/buffalo_l
+rsync -a /tmp/opentalking-insightface/buffalo_l/ \
+  models/quicktalk/checkpoints/auxiliary/models/buffalo_l/
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --extra quicktalk-cuda --python 3.11
+
+export OPENTALKING_TORCH_DEVICE=cuda:0
+export OPENTALKING_QUICKTALK_ASSET_ROOT="$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk"
+export OPENTALKING_QUICKTALK_WORKER_CACHE=1
+
+bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8210 --web-port 5280
+```
+
+Open `http://localhost:5280`, select a shared avatar, and choose the `quicktalk`
+model. If a fixed template video is required, confirm the template asset is reachable
+from the session or deployment configuration.
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:8210/health
+curl -s http://127.0.0.1:8210/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
+Expect `backend=local` and `connected=true`. To prepare cache ahead of time:
+
+```bash title="Terminal"
+opentalking-prepare-cache \
+  --model quicktalk \
+  --avatars-root examples/avatars \
+  --quicktalk-model-root models/quicktalk \
+  --device cuda:0 \
+  --model-backend pth \
+  --verify
+```
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| `connected=false` | Check `OPENTALKING_QUICKTALK_ASSET_ROOT`, the CUDA device, and `models/quicktalk/checkpoints`. |
+| Long first turn | Enable `OPENTALKING_QUICKTALK_WORKER_CACHE=1` or run `opentalking-prepare-cache` in advance. |
+| Avatar load failure | Check that the avatar is readable; if a fixed template video is configured, confirm that path is reachable. |
+| Hugging Face download fails | Configure `HF_ENDPOINT`, or download offline and sync into the same directory. |
--- a/docs/en/avatar_models/deployment/quicktalk-omnirt.md
+++ b/docs/en/avatar_models/deployment/quicktalk-omnirt.md
@@ -0,0 +1,82 @@
+# QuickTalk OmniRT Deployment
+
+OmniRT mode runs QuickTalk inference outside the OpenTalking process. Use it when multiple models share one service endpoint, GPU dependencies need isolation, or inference runs on a separate machine.
+
+## Use Cases
+
+- OpenTalking owns sessions, TTS, and WebRTC while QuickTalk is served externally.
+- One OmniRT endpoint needs to expose `quicktalk`, `wav2lip`, and other models.
+- Web-service resources and inference GPU resources need separate scaling.
+
+## Weight Preparation
+
+OmniRT reads `$OMNIRT_MODEL_ROOT/quicktalk` by default:
+
+```bash title="Terminal"
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OMNIRT_MODEL_ROOT/quicktalk/checkpoints"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download datascale-ai/quicktalk \
+  quicktalk.pth \
+  repair.npy \
+  chinese-hubert-large/config.json \
+  chinese-hubert-large/preprocessor_config.json \
+  chinese-hubert-large/pytorch_model.bin \
+  --local-dir "$OMNIRT_MODEL_ROOT/quicktalk/checkpoints"
+```
+
+Confirm `quicktalk.pth`, `repair.npy`, HuBERT, and InsightFace `buffalo_l` all exist under the QuickTalk model directory. Prepare InsightFace as shown in [QuickTalk Local Deployment](quicktalk-local.md), adjusting the destination to the OmniRT model root.
+
+## Start Command
+
+Start OmniRT first. `OMNIRT_HOME` must point to your OmniRT checkout:
+
+```bash title="Terminal"
+cd "$OMNIRT_HOME"
+uv sync --extra server --extra quicktalk-cuda --python 3.11
+source .venv/bin/activate
+
+export OMNIRT_QUICKTALK_RUNTIME=1
+export OMNIRT_QUICKTALK_MODEL_ROOT="$OMNIRT_MODEL_ROOT/quicktalk"
+export OMNIRT_QUICKTALK_CHECKPOINT="$OMNIRT_MODEL_ROOT/quicktalk/checkpoints/quicktalk.pth"
+export OMNIRT_QUICKTALK_DEVICE=cuda:0
+export OMNIRT_QUICKTALK_HUBERT_DEVICE=cuda:0
+export OMNIRT_QUICKTALK_MAX_LONG_EDGE=900
+export OMNIRT_QUICKTALK_MAX_TEMPLATE_SECONDS=1
+
+omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
+```
+
+Then start OpenTalking:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model quicktalk \
+  --omnirt http://127.0.0.1:9000 \
+  --api-port 8310 \
+  --web-port 5380
+```
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:9000/v1/audio2video/models | jq
+curl -s http://127.0.0.1:8310/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
+OpenTalking should report `backend=omnirt` and `connected=true`.
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| `reason=omnirt_unavailable` | Check the OmniRT port, `OMNIRT_ENDPOINT`, and `/v1/audio2video/models`. |
+| OmniRT does not list `quicktalk` | Check `OMNIRT_QUICKTALK_RUNTIME=1`, checkpoint paths, and startup logs. |
+| Slow first frame or high VRAM | Tune `OMNIRT_QUICKTALK_MAX_LONG_EDGE`, HuBERT device, or prewarm strategy. |
+| Avatar asset unavailable | Check that the selected avatar is uploaded, readable, and the session configuration is complete. |
--- a/docs/en/avatar_models/deployment/wav2lip-local.md
+++ b/docs/en/avatar_models/deployment/wav2lip-local.md
@@ -0,0 +1,66 @@
+# Wav2Lip Local Deployment
+
+Local mode uses OpenTalking's built-in Wav2Lip adapter. It is the lightest path for validating real lip sync and works well for single-GPU demos and avatar-asset checks.
+
+## Use Cases
+
+- First move from `mock` to a real talking-head model.
+- Run inference inside the OpenTalking process without deploying OmniRT.
+- Use built-in or custom shared avatars, and let the Wav2Lip flow consume reference
+  images or frame assets as needed.
+
+## Weight Preparation
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+mkdir -p models/wav2lip
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download Pypa/wav2lip384 \
+  wav2lip384.pth \
+  --local-dir models/wav2lip
+hf download rippertnt/wav2lip \
+  s3fd.pth \
+  --local-dir models/wav2lip
+
+stat models/wav2lip/wav2lip384.pth
+stat models/wav2lip/s3fd.pth
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --python 3.11
+
+export OPENTALKING_WAV2LIP_MODEL_ROOT="$DIGITAL_HUMAN_HOME/opentalking/models/wav2lip"
+export OPENTALKING_WAV2LIP_DEVICE=cuda
+export OPENTALKING_WAV2LIP_BATCH_SIZE=16
+export OPENTALKING_WAV2LIP_MAX_LONG_EDGE=832
+export OPENTALKING_WAV2LIP_FACE_DET_DEVICE=cpu
+
+bash scripts/start_unified.sh --backend local --model wav2lip --api-port 8210 --web-port 5280
+```
+
+Open `http://localhost:5280`, choose an available avatar, and select the `wav2lip`
+model.
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:8210/health
+curl -s http://127.0.0.1:8210/models | jq '.statuses[] | select(.id=="wav2lip")'
+```
+
+Expect `backend=local` and `connected=true`. The first load initializes the checkpoint, S3FD, and avatar cache, which can take tens of seconds.
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| Checkpoint not found | Check `OPENTALKING_WAV2LIP_MODEL_ROOT` and both `.pth` files. |
+| Out of GPU memory | Lower `OPENTALKING_WAV2LIP_BATCH_SIZE` or `OPENTALKING_WAV2LIP_MAX_LONG_EDGE`. |
+| Slow first frame | Prewarm a commonly used avatar, for example `OPENTALKING_PREWARM_AVATARS=singer`. |
+| Enhancement mode fails | `easy_enhanced` requires GFPGAN and `OPENTALKING_WAV2LIP_GFPGAN_CHECKPOINT`. |
--- a/docs/en/avatar_models/deployment/wav2lip-omnirt.md
+++ b/docs/en/avatar_models/deployment/wav2lip-omnirt.md
@@ -0,0 +1,74 @@
+# Wav2Lip OmniRT Deployment
+
+OmniRT mode serves Wav2Lip outside OpenTalking. Use it to decouple model dependencies from the web/API process, or to expose multiple talking-head models from one OmniRT endpoint.
+
+## Use Cases
+
+- Web/API and inference GPU run separately.
+- Models are managed through `/v1/audio2video/{model}`.
+- You want OmniRT preloading, batching, and device controls.
+
+## Weight Preparation
+
+```bash title="Terminal"
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OMNIRT_MODEL_ROOT/wav2lip"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download Pypa/wav2lip384 \
+  wav2lip384.pth \
+  --local-dir "$OMNIRT_MODEL_ROOT/wav2lip"
+hf download rippertnt/wav2lip \
+  s3fd.pth \
+  --local-dir "$OMNIRT_MODEL_ROOT/wav2lip"
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$OMNIRT_HOME"
+uv sync --extra server --extra wav2lip-cuda --python 3.11
+source .venv/bin/activate
+
+export OMNIRT_WAV2LIP_RUNTIME=1
+export OMNIRT_WAV2LIP_MODELS_DIR="$OMNIRT_MODEL_ROOT"
+export OMNIRT_WAV2LIP_CHECKPOINT="$OMNIRT_MODEL_ROOT/wav2lip/wav2lip384.pth"
+export OMNIRT_WAV2LIP_DEVICE=cuda
+export OMNIRT_WAV2LIP_FACE_DET_DEVICE=cpu
+export OMNIRT_WAV2LIP_BATCH_SIZE=16
+export OMNIRT_WAV2LIP_MAX_LONG_EDGE=832
+export OMNIRT_WAV2LIP_PRELOAD=1
+
+omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
+```
+
+Start OpenTalking from another terminal:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model wav2lip \
+  --omnirt http://127.0.0.1:9000 \
+  --api-port 8310 \
+  --web-port 5380
+```
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:9000/v1/audio2video/models | jq
+curl -s http://127.0.0.1:8310/models | jq '.statuses[] | select(.id=="wav2lip")'
+```
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| OmniRT does not load Wav2Lip | Check `OMNIRT_WAV2LIP_RUNTIME=1` and `OMNIRT_WAV2LIP_CHECKPOINT`. |
+| `reason=omnirt_unavailable` | Check the OpenTalking `--omnirt` URL and OmniRT health. |
+| End-to-end latency is high | Lower `OMNIRT_WAV2LIP_BATCH_SIZE`, limit `OMNIRT_WAV2LIP_MAX_LONG_EDGE`, and enable `OMNIRT_WAV2LIP_PRELOAD=1`. |
+| Avatar asset unavailable | Confirm the avatar asset is readable and the session configuration is complete. |
--- a/docs/en/model-deployment/fasterliveportrait.md
+++ b/docs/en/model-deployment/fasterliveportrait.md
@@ -6,7 +6,7 @@
 | Model ID | `fasterliveportrait` |
 | Backend | `omnirt` |
 | Evidence level | Documented; realtime path exposed through the OmniRT runtime |
-| Best for | Single-GPU realtime audio-driven portrait avatars, original-image pasteback, video clone, frontend amplitude hot updates |
+| Best for | Single-GPU realtime audio-driven portrait avatars, original-image pasteback, frontend amplitude hot updates |

 ## Common Errors

@@ -18,35 +18,19 @@
 | Browser sees the model but session creation fails | Select an avatar whose `model_type` matches `fasterliveportrait`, or prepare a matching avatar bundle. |


-FasterLivePortrait also runs through the OmniRT `audio2video` compatibility path. OpenTalking owns sessions, TTS/audio streaming, WebRTC playback, and frontend parameter updates. OmniRT keeps FasterLivePortrait and JoyVASA resident and exposes `/v1/audio2video/fasterliveportrait`. This repository does not include an in-process `local` backend for FasterLivePortrait; even for single-machine deployments, start OmniRT on the same host and point OpenTalking at `http://127.0.0.1:9000`.
+FasterLivePortrait also runs through the OmniRT `audio2video` compatibility path. OpenTalking owns sessions, TTS/audio streaming, WebRTC playback, and frontend parameter updates. OmniRT keeps FasterLivePortrait and JoyVASA resident and exposes `/v1/audio2video/fasterliveportrait`.

 This path is intended for single-GPU realtime avatars. The default live profile uses 25fps, one-second audio chunks, a 448px width, and pasteback into the original avatar image. Full-body uploads are still driven through the detected face region; body motion is not synthesized by this runtime.

-The same runtime can also serve the Video Clone workflow. OpenTalking keeps an avatar-library image as the source, streams browser camera frames or uploaded-video frames as driving input, and forwards them to OmniRT `/v1/avatar/video-clone/fasterliveportrait`. This path does not call LLM, STT, or TTS and does not reuse the realtime conversation `speak` queue.
-
 ## 1. Prepare code and weights

-Prepare the shared directory variables first. `FASTERLIVEPORTRAIT_HOME` is the FasterLivePortrait source checkout; `OMNIRT_MODEL_ROOT` is the model-weight root. Do not put model weights inside the OpenTalking or OmniRT repository.
-
-```bash title="terminal"
-export DIGITAL_HUMAN_HOME="${DIGITAL_HUMAN_HOME:-/path/to/digital_human}"
-export OPENTALKING_HOME="${OPENTALKING_HOME:-$DIGITAL_HUMAN_HOME/opentalking}"
-export OMNIRT_REPO="${OMNIRT_REPO:-$DIGITAL_HUMAN_HOME/omnirt}"
-export FASTERLIVEPORTRAIT_HOME="${FASTERLIVEPORTRAIT_HOME:-$DIGITAL_HUMAN_HOME/FasterLivePortrait}"
-export OMNIRT_MODEL_ROOT="${OMNIRT_MODEL_ROOT:-/path/to/model}"
-export FASTERLIVEPORTRAIT_REF="${FASTERLIVEPORTRAIT_REF:-5dcf03aa2e6b2eb2a55b971efdc28fc0afdb1494}"
-```
-
-The current OpenTalking video-clone path and OmniRT runtime depend on FasterLivePortrait patches for fine-grained motion controls, TensorRT output ordering, and new PyTorch checkpoint loading behavior. For now, deploy the pinned `zyairehhh/FasterLivePortrait` fork. Switch to the upstream package only after those patches are available in an official stable package.
+You need a FasterLivePortrait source checkout and a real checkpoint directory. Set `FASTERLIVEPORTRAIT_HOME` and `OMNIRT_MODEL_ROOT` before running the commands in this section. If you do not want symlinks, copy or download the files directly into the model root.

 ```bash title="terminal"
 if [ ! -d "$FASTERLIVEPORTRAIT_HOME/.git" ]; then
-  git clone https://github.com/zyairehhh/FasterLivePortrait.git "$FASTERLIVEPORTRAIT_HOME"
+  git clone https://github.com/KlingAIResearch/LivePortrait.git "$FASTERLIVEPORTRAIT_HOME"
 fi

-git -C "$FASTERLIVEPORTRAIT_HOME" fetch origin master
-git -C "$FASTERLIVEPORTRAIT_HOME" checkout "$FASTERLIVEPORTRAIT_REF"
-
 mkdir -p "$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints"
 ```

@@ -81,12 +65,9 @@ test -f "$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints/chinese-hubert-base/p

 ## 2. Prepare the OmniRT environment

-On servers, keep the `uv` cache on a data disk and use a PyPI mirror to speed up dependency installation. `PIP_INDEX_URL` is a fallback for build steps that still read pip settings.
-
 ```bash title="terminal"
-cd "$OMNIRT_REPO"
+cd "$OMNIRT_HOME"
 export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export PIP_INDEX_URL="${PIP_INDEX_URL:-$UV_DEFAULT_INDEX}"
 export UV_CACHE_DIR="${UV_CACHE_DIR:-$DIGITAL_HUMAN_HOME/.uv-cache}"
 uv sync --extra server --extra fasterliveportrait --python 3.11
 ```
@@ -95,36 +76,24 @@ The realtime FasterLivePortrait path uses TensorRT by default. The `fasterlivepo

 Before deployment, verify that `uv run python -c "import tensorrt as trt; print(trt.__version__)"` prints a version.

-The TensorRT wheel places `libnvinfer.so.10` under the OmniRT `.venv` `site-packages/tensorrt_libs` directory. Add that directory to the dynamic library search path before starting the TRT runtime; otherwise `libgrid_sample_3d_plugin.so` fails with `libnvinfer.so.10: cannot open shared object file`:
-
-```bash title="terminal"
-export TRT_LIB_DIR="$OMNIRT_REPO/.venv/lib/python3.11/site-packages/tensorrt_libs"
-export LD_LIBRARY_PATH="$TRT_LIB_DIR:${LD_LIBRARY_PATH:-}"
-```
-
-
 ## 3. Start the OmniRT FasterLivePortrait runtime

 ```bash title="terminal"
-cd "$OMNIRT_REPO"
-mkdir -p "$DIGITAL_HUMAN_HOME/logs"
-nohup env \
-  OMNIRT_FASTLIVEPORTRAIT_RUNTIME=1 \
-  OMNIRT_FASTLIVEPORTRAIT_LOAD_MODELS=1 \
-  OMNIRT_FASTLIVEPORTRAIT_ROOT="$FASTERLIVEPORTRAIT_HOME" \
-  OMNIRT_FASTLIVEPORTRAIT_CHECKPOINTS_DIR="$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints" \
-  OMNIRT_FASTLIVEPORTRAIT_CFG=configs/trt_infer.yaml \
-  OMNIRT_FASTLIVEPORTRAIT_DEVICE=cuda:0 \
-  OMNIRT_FASTLIVEPORTRAIT_JPEG_QUALITY=85 \
-  uv run omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda \
-  > "$DIGITAL_HUMAN_HOME/logs/omnirt-fasterliveportrait-9000.log" 2>&1 &
-echo $! > "$DIGITAL_HUMAN_HOME/logs/omnirt-fasterliveportrait-9000.pid"
+cd "$OMNIRT_HOME"
+OMNIRT_FASTLIVEPORTRAIT_RUNTIME=1 \
+OMNIRT_FASTLIVEPORTRAIT_LOAD_MODELS=1 \
+OMNIRT_FASTLIVEPORTRAIT_ROOT="$FASTERLIVEPORTRAIT_HOME" \
+OMNIRT_FASTLIVEPORTRAIT_CHECKPOINTS_DIR="$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints" \
+OMNIRT_FASTLIVEPORTRAIT_CFG=configs/trt_infer.yaml \
+OMNIRT_FASTLIVEPORTRAIT_DEVICE=cuda:0 \
+OMNIRT_FASTLIVEPORTRAIT_JPEG_QUALITY=85 \
+uv run omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
 ```

 Verify OmniRT reports the model:

 ```bash title="terminal"
-curl -s http://127.0.0.1:9000/v1/audio2video/models | python3 -m json.tool
+curl -s http://127.0.0.1:9000/v1/audio2video/models | jq '.statuses[] | select(.id=="fasterliveportrait")'
 ```

 Expected status:
@@ -135,16 +104,6 @@ Expected status:

 ## 4. Configure and start OpenTalking

-Sync the OpenTalking environment first. Use the same `uv` mirror and cache directory as OmniRT.
-
-```bash title="terminal"
-cd "$OPENTALKING_HOME"
-export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export PIP_INDEX_URL="${PIP_INDEX_URL:-$UV_DEFAULT_INDEX}"
-export UV_CACHE_DIR="${UV_CACHE_DIR:-$DIGITAL_HUMAN_HOME/.uv-cache}"
-uv sync --extra dev --python 3.11
-```
-
 OpenTalking configures `fasterliveportrait` as `backend: omnirt` by default. The realtime profile lives in `configs/synthesis/fasterliveportrait.yaml`; common defaults are:

 ```yaml title="configs/synthesis/fasterliveportrait.yaml"
@@ -169,30 +128,27 @@ flag_stitching: true
 head_only_pasteback: false
 ```

-Start OpenTalking against OmniRT. `scripts/start_unified.sh` sets `OPENTALKING_FASTLIVEPORTRAIT_BACKEND=omnirt`, `OPENTALKING_DEFAULT_MODEL=fasterliveportrait`, and `OMNIRT_ENDPOINT`, then starts the WebUI after the API is ready:
+Start OpenTalking against OmniRT:

 ```bash title="terminal"
 cd "$OPENTALKING_HOME"
-bash scripts/start_unified.sh \
-  --backend omnirt \
-  --model fasterliveportrait \
-  --omnirt http://127.0.0.1:9000 \
-  --api-port 8000 \
-  --web-port 5173 \
-  --host 0.0.0.0
+OMNIRT_ENDPOINT=http://127.0.0.1:9000 \
+OPENTALKING_OMNIRT_ENDPOINT=http://127.0.0.1:9000 \
+uv run opentalking-unified --host 0.0.0.0 --port 8000
 ```

-The previous command already starts the WebUI. To restart only the frontend while the API is already running on port `8000`, use a second terminal:
+Start the frontend in a second terminal:

 ```bash title="terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
+cd "$OPENTALKING_HOME/apps/web"
+npm ci
+VITE_BACKEND_PORT=8000 npm run dev -- --host 0.0.0.0 --port 5173
 ```

 Verify OpenTalking sees the model:

 ```bash title="terminal"
-curl -s http://127.0.0.1:8000/models | python3 -m json.tool
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="fasterliveportrait")'
 ```

 Expected status:
@@ -201,18 +157,6 @@ Expected status:
 {"id":"fasterliveportrait","backend":"omnirt","connected":true,"reason":"omnirt"}
 ```

-Also verify the video-clone entry:
-
-```bash title="terminal"
-curl -s http://127.0.0.1:8000/video-clone/status | python3 -m json.tool
-```
-
-Expected:
-
-```json
-{"model":"fasterliveportrait","connected":true,"reason":"omnirt"}
-```
-
 ## 5. Frontend controls and hot updates

 After selecting `FasterLivePortrait`, the frontend shows a parameter panel. Before a session starts, clicking Apply stores values for the next session. During a session, clicking Apply sends a hot update and takes effect on the next audio chunk without restarting the conversation.
@@ -234,39 +178,10 @@ After selecting `FasterLivePortrait`, the frontend shows a parameter panel. Befo

 Start with `head_motion_multiplier=0.3`, `pose_motion_multiplier=0.35`, `yaw_multiplier=0.85`, `roll_multiplier=0.85`, `animation_region=lip`, `expression_multiplier=1.0`, `mouth_open_multiplier=1.25`, `mouth_corner_multiplier=0.85`, `cheek_jaw_multiplier=0.9`, `cfg_scale=4.0`, and keep `flag_relative_motion=true`. If the head sways left/right, lower `yaw_multiplier` to `0.7`. If the mouth looks pursed or the smile is too strong, lower `mouth_corner_multiplier` to `0.75`. Switch the region from `lip` to `all` only when you need richer facial expression. Do not improve speed by dropping mouth-open frames.

-## 6. Video Clone Mode
-
-Video Clone is shown in the WebUI top navigation next to “Realtime Conversation”. After entering it:
-
- Source: select an existing avatar on the left, or upload a new source image. The source is the digital-human asset being driven.
- Driving: select a camera on the right, or upload a driving video. Driving only provides expression, head motion, and mouth motion.
- Output: inspect realtime output in the center, with sent frames, received frames, dropped frames, and latency.
-
-The frontend connects to OpenTalking:
-
-```text
-ws://<opentalking-host>/video-clone/fasterliveportrait/ws
-```
-
-OpenTalking then forwards the source image and driving frame stream to OmniRT:
-
-```text
-ws://<omnirt-host>/v1/avatar/video-clone/fasterliveportrait
-```
-
-Common tuning notes:
-
- Enable pasteback when you want to preserve the original source composition.
- If uploaded driving video does not open the mouth enough, raise mouth opening first. If motion collapses into simple vertical mouth opening, lower lip retargeting.
- If the mouth looks puffy or misaligned, first disable driving-face crop and confirm the driving input is not over-cropped.
- If camera permission fails, open the page from `localhost`, `127.0.0.1`, or HTTPS. You can also upload a driving video first to validate the backend.
-
-When stopped or when the page changes, the frontend releases the camera track, WebSocket, and current video-clone session.
-
-## 7. Performance check
+## 6. Performance check

 ```bash title="terminal"
-cd "$OMNIRT_REPO"
+cd "$OMNIRT_HOME"
 uv run python scripts/bench_fasterliveportrait_ws.py \
  --url ws://127.0.0.1:9000/v1/audio2video/fasterliveportrait \
  --duration 30 \
--- a/docs/en/model-deployment/flashhead.md
+++ b/docs/en/model-deployment/flashhead.md
@@ -51,7 +51,7 @@ bash scripts/quickstart/start_all.sh
 ## `/models` Verification

 ```bash title="Terminal"
-curl -s http://127.0.0.1:8000/models | python3 -m json.tool
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="flashhead")'
 ```

 After configuring the WebSocket URL, expected:
@@ -66,15 +66,4 @@ After configuring the WebSocket URL, expected:
 |---------|--------|
 | `reason=not_configured` | Set `OPENTALKING_FLASHHEAD_WS_URL`. |
 | WebSocket handshake fails | Check FlashHead service path, port, and cross-host network. |
-| Avatar mismatch | Use an avatar with `model_type: flashhead`. |
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
+| Avatar issue | Confirm that the avatar is readable and that the FlashHead service can access the required reference image. |
--- a/docs/en/model-deployment/flashtalk.md
+++ b/docs/en/model-deployment/flashtalk.md
@@ -71,7 +71,7 @@ bash scripts/quickstart/start_omnirt_flashtalk.sh --device npu --nproc 8
 ## `/models` Verification

 ```bash title="Terminal"
-curl -s http://127.0.0.1:8000/models | python3 -m json.tool
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="flashtalk")'
 ```

 Expected:
@@ -88,14 +88,3 @@ Expected:
 | CUDA OOM | Lower `OPENTALKING_FLASHTALK_FRAME_NUM`, `OPENTALKING_FLASHTALK_SAMPLE_STEPS`, or resolution. |
 | NPU import failure | Confirm CANN is sourced and `torch_npu`, driver, and CANN versions match. |
 | `reason=not_configured` | Configure `OMNIRT_ENDPOINT` or run `start_all.sh --omnirt ...`. |
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/avatar_models/index.md
+++ b/docs/en/avatar_models/index.md
@@ -0,0 +1,3 @@
+# Avatar Models
+
+This directory collects avatar-related model overviews, weight preparation, runtime parameters, and verification steps.
--- a/docs/en/model-deployment/mock.md
+++ b/docs/en/model-deployment/mock.md
@@ -44,7 +44,7 @@ bash scripts/quickstart/start_mock.sh
 ## `/models` Verification

 ```bash title="Terminal"
-curl -s http://127.0.0.1:8000/models | python3 -m json.tool
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="mock")'
 ```

 Expected:
@@ -60,14 +60,3 @@ Expected:
 | LLM returns 401 | Check `OPENTALKING_LLM_API_KEY` and `OPENTALKING_STT_DASHSCOPE_API_KEY` separately. |
 | No browser video | Use a Chromium-based browser and inspect WebRTC/CORS errors. |
 | Port conflict | Run `bash scripts/quickstart/start_mock.sh --api-port 8010 --web-port 5180`. |
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/avatar_models/musetalk.md
+++ b/docs/en/avatar_models/musetalk.md
@@ -0,0 +1,42 @@
+# MuseTalk
+
+MuseTalk is the higher-quality video-avatar lip-sync path in OpenTalking. Compared with Wav2Lip, it has heavier dependencies and preprocessing; compared with QuickTalk, it is more quality-oriented and useful when you already have a MuseTalk runtime. This page explains when to choose MuseTalk and which deployment mode to use.
+
+## Support Status
+
+| Item | Value |
+|------|-------|
+| Model ID | `musetalk` |
+| Backend | `local` / `omnirt` / `direct_ws` |
+| Evidence level | Local adapter is wired; local mode runs official MuseTalk preprocessing before session initialization |
+| Best for | Higher-quality lip sync, video avatars, existing MuseTalk runtimes |
+
+## Benchmark Reference
+
+The numbers below are summarized from [Benchmark](../reference/benchmark.md). `Steady FPS` is model-generation throughput, not full user-perceived latency; STT, LLM, TTS, queueing, and WebRTC still affect the complete experience.
+
+| Hardware | Backend | Output | Steady FPS | First-turn total/ms | TTFV/ms | Peak inference VRAM/GB |
+|----------|---------|--------|------------|---------------------|---------|------------------------|
+| RTX 3090 | OmniRT | 512×512 / 25fps | 28.868 | 3235.518 | 1769.484 | 5.078 |
+| RTX 4090 | OmniRT | 512×512 / 25fps | 24.767 | 3605.564 | 2095.522 | 5.203 |
+| NPU 910B2 | OmniRT | 512×512 / 25fps | 12.276 | 5781.453 | 4211.721 | 8.754 |
+
+## Choose a Deployment Mode
+
+| Mode | Best for | Entry |
+|------|----------|-------|
+| Local | Single-machine CUDA, with OpenTalking running official avatar preprocessing | [MuseTalk Local Deployment](deployment/musetalk-local.md) |
+| OmniRT | Isolating MuseTalk dependencies from the main OpenTalking process | [MuseTalk OmniRT Deployment](deployment/musetalk-omnirt.md) |
+| Direct WebSocket | Connecting an existing MuseTalk-compatible service directly | See [Runtime Backends](../model-support/runtime-backends/direct-websocket.md) |
+
+## When to Choose Another Model
+
+- Need the lightest real lip-sync validation: see [Wav2Lip](wav2lip.md).
+- Need lower-latency realtime speaking: see [QuickTalk](quicktalk.md).
+- Need a high-quality heavyweight service path: see [FlashTalk](flashtalk.md).
+
+## Related Pages
+
+- [Support Matrix](../deployment/support-matrix.md)
+- [Avatar Assets](avatar.md)
+- [Talking-head Model Deployment](index.md)
--- a/docs/en/avatar_models/quicktalk.md
+++ b/docs/en/avatar_models/quicktalk.md
@@ -0,0 +1,37 @@
+# QuickTalk
+
+QuickTalk is the realtime-oriented talking-head model path in OpenTalking. Use it for low-latency digital-human conversations and fast local GPU trials. This page is a mode-selection overview; weights, startup commands, and verification live in the deployment-mode pages below.
+
+## Support Status
+
+| Item | Value |
+|------|-------|
+| Model ID | `quicktalk` |
+| Backend | `local` / `omnirt` |
+| Evidence level | Local adapter is built in and verified; OmniRT service path is documented |
+| Best for | Realtime speaking avatars, low-latency validation, local or service-hosted inference |
+
+## Benchmark Reference
+
+The numbers below are summarized from [Benchmark](../reference/benchmark.md). `Steady FPS` is model-generation throughput, not full user-perceived latency; STT, LLM, TTS, queueing, and WebRTC still affect the complete experience.
+
+| Hardware | Backend | Output | Steady FPS | First-turn total/ms | TTFV/ms | Peak inference VRAM/GB |
+|----------|---------|--------|------------|---------------------|---------|------------------------|
+| RTX 3090 | OmniRT | 540×900 / 25fps | 29.23 | 3356.019 | 1800.524 | 1.662 |
+| RTX 4090 | OmniRT | 540×900 / 25fps | 46.921 | 2561.146 | 1064.825 | 1.838 |
+| NPU 910B2 | OmniRT | 540×900 / 25fps | 29.66 | 3212.053 | 1782.861 | 2.473 |
+| RTX 3050 Laptop | OmniRT | 306×512 / 25fps | 20.695 | 4243.26 | 2661 | 1.396 |
+
+## Choose a Deployment Mode
+
+| Mode | Best for | Entry |
+|------|----------|-------|
+| Local | Single-machine CUDA, in-process adapter, fastest real-chain validation | [QuickTalk Local Deployment](deployment/quicktalk-local.md) |
+| Apple Silicon | Weight, manifest, and WebUI flow checks on macOS | [QuickTalk Apple Silicon Deployment](deployment/quicktalk-apple-silicon.md) |
+| OmniRT | Isolating inference from OpenTalking, or sharing one model endpoint across runtimes | [QuickTalk OmniRT Deployment](deployment/quicktalk-omnirt.md) |
+
+## Related Pages
+
+- [Support Matrix](../deployment/support-matrix.md): compare QuickTalk with other model-chain backends.
+- [Avatar Assets](avatar.md): understand shared avatar assets and session selection.
+- [Local Audio + QuickTalk](../recipes/local-quicktalk-audio.md): full local SenseVoice, CosyVoice, and QuickTalk chain.
--- a/docs/en/avatar_models/support-matrix.md
+++ b/docs/en/avatar_models/support-matrix.md
@@ -0,0 +1,9 @@
+# Support Matrix Moved
+
+The support matrix now lives under the Deployment Guide, where it serves as the
+global selection page to read before deploying:
+
+- [Deployment Support Matrix](../deployment/support-matrix.md)
+
+Local / OmniRT / Direct WebSocket choices for talking-head models remain linked from
+each model overview page.
--- a/docs/en/avatar_models/talking-head.md
+++ b/docs/en/avatar_models/talking-head.md
@@ -0,0 +1,164 @@
+# Talking-head Models
+
+This page is the selection overview for talking-head backends. OpenTalking owns session
+orchestration, TTS, events, and WebRTC; model weight loading, GPU/NPU scheduling, and
+inference throughput belong to the selected backend.
+
+## Recommended Paths
+
+| Model | Backend | Best for | Evidence level | Details |
+|-------|---------|----------|----------------|---------|
+| `mock` | `mock` | First run, CI, API/WebRTC debugging | Built in, verified | [Mock](mock.md) |
+| `wav2lip` | `local` / `omnirt` | First real lip-sync model | Local adapter is built in; OmniRT path verified | [Local](deployment/wav2lip-local.md) / [OmniRT](deployment/wav2lip-omnirt.md) |
+| `musetalk` | `local` / `omnirt` / `direct_ws` | MuseTalk quality with either in-process startup or an external service | Local adapter is built in; OmniRT/direct_ws paths documented | [Local](deployment/musetalk-local.md) / [OmniRT](deployment/musetalk-omnirt.md) |
+| `quicktalk` | `local` / `omnirt` | Local realtime adapter and service deployment reference | Local is built in; OmniRT path documented | [Local](deployment/quicktalk-local.md) / [Apple Silicon](deployment/quicktalk-apple-silicon.md) / [OmniRT](deployment/quicktalk-omnirt.md) |
+| `fasterliveportrait` | `omnirt` | Single-GPU realtime audio-driven portrait with pasteback | Documented | [FasterLivePortrait](fasterliveportrait.md) |
+| `flashtalk` | `omnirt` | High-quality private GPU/NPU deployment | OmniRT/Ascend path verified | [FlashTalk](flashtalk.md) |
+| `flashhead` | `direct_ws` | Existing standalone FlashHead service | Documented | [FlashHead](flashhead.md) |
+
+## Backend Behavior
+
+| Backend | What OpenTalking expects | Typical models |
+|---------|--------------------------|----------------|
+| `mock` | No external runtime; always available. | `mock` |
+| `local` | Adapter can be imported in-process and dependencies are satisfied. | `wav2lip`, `quicktalk`, `musetalk` |
+| `direct_ws` | The model service exposes its own WebSocket URL. | `flashhead`, custom single-model services |
+| `omnirt` | OmniRT exposes `/v1/audio2video/{model}`. | `wav2lip`, `musetalk`, `quicktalk`, `fasterliveportrait`, `flashtalk` |
+
+## Common Setup
+
+```bash title="Terminal"
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+export OPENTALKING_HOME="${OPENTALKING_HOME:-$DIGITAL_HUMAN_HOME/opentalking}"
+export OMNIRT_HOME="${OMNIRT_HOME:-$DIGITAL_HUMAN_HOME/omnirt}"
+export FASTERLIVEPORTRAIT_HOME="${FASTERLIVEPORTRAIT_HOME:-$DIGITAL_HUMAN_HOME/FasterLivePortrait}"
+
+mkdir -p "$DIGITAL_HUMAN_HOME" "$OMNIRT_MODEL_ROOT"
+cd "$DIGITAL_HUMAN_HOME"
+```
+
+Recommended layout:
+
+```text
+$DIGITAL_HUMAN_HOME/
+├── opentalking/
+├── omnirt/                  # Optional, only for backend: omnirt
+├── models/
+│   ├── wav2lip/
+│   ├── SoulX-FlashTalk-14B/
+│   ├── chinese-wav2vec2-base/
+│   ├── quicktalk/
+│   └── FasterLivePortrait/
+├── logs/
+└── run/
+```
+
+Download tools:
+
+```bash title="Terminal"
+uv pip install -U "huggingface_hub[cli]" modelscope
+```
+
+Common model sources:
+
+- [ModelScope](https://modelscope.cn/models)
+- [Modelers](https://modelers.cn/models)
+- [Hugging Face](https://huggingface.co/models)
+
+## Common Startup Combinations
+
+The commands below use only existing repository entrypoints. No additional scripts are
+required.
+
+### OpenTalking local: QuickTalk and Wav2Lip in one frontend
+
+In the default configuration, `wav2lip` already uses the `local` backend. The command
+below only overrides `quicktalk` to `local`, so the same frontend can select both
+`quicktalk` and `wav2lip`:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+uv sync --extra dev --extra models --python 3.11
+
+export OPENTALKING_TORCH_DEVICE=cuda:0
+export OPENTALKING_QUICKTALK_ASSET_ROOT="$OPENTALKING_HOME/models/quicktalk"
+export OPENTALKING_QUICKTALK_WORKER_CACHE=1
+export OPENTALKING_WAV2LIP_MODEL_ROOT="$OPENTALKING_HOME/models/wav2lip"
+export OPENTALKING_WAV2LIP_DEVICE=cuda
+export OPENTALKING_WAV2LIP_BATCH_SIZE=16
+export OPENTALKING_WAV2LIP_MAX_LONG_EDGE=832
+export OPENTALKING_WAV2LIP_FACE_DET_DEVICE=cpu
+
+bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8210 --web-port 5280
+```
+
+### OmniRT: QuickTalk and Wav2Lip behind one endpoint
+
+OpenTalking configures a single `OMNIRT_ENDPOINT`. To use both `quicktalk` and
+`wav2lip` through OmniRT from the same frontend, enable both runtimes in the same
+OmniRT process:
+
+```bash title="Terminal"
+cd "$OMNIRT_HOME"
+uv sync --extra server --extra wav2lip-cuda --extra quicktalk-cuda --python 3.11
+source .venv/bin/activate
+
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+export OMNIRT_ALLOWED_FRAME_ROOTS="$OPENTALKING_HOME/examples/avatars"
+
+export OMNIRT_WAV2LIP_RUNTIME=1
+export OMNIRT_WAV2LIP_MODELS_DIR="$OMNIRT_MODEL_ROOT"
+export OMNIRT_WAV2LIP_CHECKPOINT="$OMNIRT_MODEL_ROOT/wav2lip/wav2lip384.pth"
+export OMNIRT_WAV2LIP_DEVICE=cuda
+export OMNIRT_WAV2LIP_FACE_DET_DEVICE=cpu
+export OMNIRT_WAV2LIP_BATCH_SIZE=16
+export OMNIRT_WAV2LIP_MAX_LONG_EDGE=832
+export OMNIRT_WAV2LIP_PRELOAD=1
+
+export OMNIRT_QUICKTALK_RUNTIME=1
+export OMNIRT_QUICKTALK_MODEL_ROOT="$OMNIRT_MODEL_ROOT/quicktalk"
+export OMNIRT_QUICKTALK_CHECKPOINT="$OMNIRT_MODEL_ROOT/quicktalk/checkpoints/quicktalk.pth"
+export OMNIRT_QUICKTALK_DEVICE=cuda:0
+export OMNIRT_QUICKTALK_HUBERT_DEVICE=cuda:0
+export OMNIRT_QUICKTALK_MAX_LONG_EDGE=900
+export OMNIRT_QUICKTALK_MAX_TEMPLATE_SECONDS=1
+
+omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
+```
+
+Then start OpenTalking from another terminal. In the default configuration,
+`quicktalk` already uses the `omnirt` backend. The command below only overrides
+`wav2lip` to `omnirt`:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model wav2lip \
+  --omnirt http://127.0.0.1:9000 \
+  --api-port 8310 \
+  --web-port 5380
+```
+
+## Common Verification
+
+```bash title="Terminal"
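+# Port 8000 is the default API port; use your --api-port value instead
+# (8210 and 8310 in the startup examples above).
+curl -fsS http://127.0.0.1:8000/health
+curl -s http://127.0.0.1:8000/models | jq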
+```
+
+For OmniRT-backed models:
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:9000/v1/audio2video/models | jq
+```
+
+## Common Status Values
+
+| Status | Meaning | Action |
+|--------|---------|--------|
+| `connected=true` | The backend is usable for sessions. | Choose a matching avatar and model in the browser. |
+| `reason=not_configured` | Endpoint or WebSocket URL is empty. | Configure `OMNIRT_ENDPOINT` or the model-specific `WS_URL`. |
+| `reason=omnirt_unavailable` | OmniRT is unreachable, or the model is not registered with it. | Check OmniRT `/v1/audio2video/models`, model list, and logs. |
+| `reason=local_adapter_missing` | Configured as `local`, but no local adapter is registered. | Switch backend or add a local adapter. |
--- a/docs/en/avatar_models/wav2lip-local.md
+++ b/docs/en/avatar_models/wav2lip-local.md
@@ -0,0 +1,83 @@
+# Wav2Lip Local Single-Machine Deployment
+
+Use this path to validate lightweight lip sync on a single consumer GPU without introducing a standalone inference service up front. OpenTalking includes the `wav2lip` local adapter and runtime, so you only need the local model dependencies and the Wav2Lip weights.
+
+#### 1. Install Local Model Dependencies
+
+```bash
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --python 3.11
+source .venv/bin/activate
+```
+
+#### 2. Prepare Wav2Lip Weights
+
+Place the weights under repository-root `models/wav2lip/`:
+
+```bash
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+mkdir -p models/wav2lip
+
+# Install the Hugging Face CLI if it is not already installed.
+uv pip install -U "huggingface_hub[cli]"
+
+# Wav2Lip 384 main checkpoint.
+hf download Pypa/wav2lip384 \
+  wav2lip384.pth \
+  --local-dir models/wav2lip
+
+# S3FD face detector checkpoint.
+hf download rippertnt/wav2lip \
+  s3fd.pth \
+  --local-dir models/wav2lip
+```
+
+The final layout should look like this:
+
+```text
+models/
+  wav2lip/
+    wav2lip384.pth
+    s3fd.pth
+```
+
+Check key files:
+
+```bash
+stat models/wav2lip/wav2lip384.pth
+stat models/wav2lip/s3fd.pth
+```
+
+If the server cannot access Hugging Face directly, download the files on a machine with network access first, then sync the same files into `models/wav2lip/` with `rsync` or an offline package.
+
+#### 3. Start OpenTalking With Wav2Lip
+
+```bash
+export OPENTALKING_WAV2LIP_MODEL_ROOT="$DIGITAL_HUMAN_HOME/opentalking/models/wav2lip"
+export OPENTALKING_WAV2LIP_DEVICE=cuda
+export OPENTALKING_WAV2LIP_BATCH_SIZE=16
+export OPENTALKING_WAV2LIP_MAX_LONG_EDGE=832
+export OPENTALKING_WAV2LIP_FACE_DET_DEVICE=cpu
+
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/start_unified.sh --backend local --model wav2lip --api-port 8210 --web-port 5280
+```
+
+Open `http://localhost:5280`, select an available avatar, select the `wav2lip` model,
+and start a conversation. If you omit `--web-port`, the default frontend URL is
+`http://localhost:5173`. The first load initializes the Wav2Lip checkpoint, S3FD face
+detector, and avatar cache, which may take tens of seconds.
+
+Local Wav2Lip defaults to `easy_improved` post-processing. The frontend exposes `auto`, `basic`, `opentalking_improved`, and `easy_improved`. The backend also accepts `easy_enhanced` for API- or environment-driven tests, but that mode requires GFPGAN to be installed and `OPENTALKING_WAV2LIP_GFPGAN_CHECKPOINT` to point to a checkpoint.
+
+#### 4. Wav2Lip Single-Machine Tuning
+
+If GPU memory is tight or first-frame latency is high, tune these parameters first:
+
+| Parameter | Recommended default | Purpose |
+| --- | --- | --- |
+| `OPENTALKING_WAV2LIP_DEVICE` | `cuda` | Select the Wav2Lip runtime device; use `cpu` for debugging. |
+| `OPENTALKING_WAV2LIP_BATCH_SIZE` | `16` | Matches the OmniRT CUDA quickstart default; lower it if GPU memory is tight. |
+| `OPENTALKING_WAV2LIP_MAX_LONG_EDGE` | `832` | Matches the OmniRT CUDA quickstart default and keeps render latency closer to realtime; set `0` only when prioritizing full source resolution over latency. |
+| `OPENTALKING_WAV2LIP_JPEG_QUALITY` | `85` | Output-frame JPEG quality; higher values improve visuals but increase bandwidth. |
+| `OPENTALKING_PREWARM_AVATARS` | `singer` | Prewarm commonly used avatars when the service starts. |
--- a/docs/en/avatar_models/wav2lip.md
+++ b/docs/en/avatar_models/wav2lip.md
@@ -0,0 +1,41 @@
+# Wav2Lip
+
+Wav2Lip is the recommended first real lip-sync model path in OpenTalking. It is lighter than heavyweight talking-head models and is useful when moving from `mock` to real video output and testing the end-to-end audio-driven video chain.
+
+## Support Status
+
+| Item | Value |
+|------|-------|
+| Model ID | `wav2lip` |
+| Backend | `local` / `omnirt` |
+| Evidence level | Local adapter is built in; OmniRT compatibility path is documented |
+| Best for | First real lip-sync model, lightweight demos, low-cost pipeline validation |
+
+## Benchmark Reference
+
+The numbers below are summarized from [Benchmark](../reference/benchmark.md). `Steady FPS` is model-generation throughput, not full user-perceived latency; STT, LLM, TTS, queueing, and WebRTC still affect the complete experience.
+
+| Hardware | Backend | Output | Steady FPS | First-turn total/ms | TTFV/ms | Peak inference VRAM/GB |
+|----------|---------|--------|------------|---------------------|---------|------------------------|
+| RTX 3090 | OmniRT | 498×832 / 30fps | 37.269 | 3002.526 | 1625.962 | 7.928 |
+| RTX 4090 | OmniRT | 498×832 / 30fps | 31.542 | 3689.764 | 1955.629 | 8.133 |
+| NPU 910B2 | OmniRT | 498×832 / 30fps | 23.945 | 4019.564 | 2615.322 | 9.113 |
+
+## Choose a Deployment Mode
+
+| Mode | Best for | Entry |
+|------|----------|-------|
+| Local | Single-machine deployment, minimal moving parts, first real lip-sync validation | [Wav2Lip Local Deployment](deployment/wav2lip-local.md) |
+| OmniRT | Isolated inference service, OmniRT preloading, and device configuration | [Wav2Lip OmniRT Deployment](deployment/wav2lip-omnirt.md) |
+
+## When to Choose Another Model
+
+- Need lower-latency realtime speaking: see [QuickTalk](quicktalk.md).
+- Need higher quality or official MuseTalk preprocessing: see [MuseTalk](musetalk.md).
+- Need a heavyweight high-quality private deployment: see [FlashTalk](flashtalk.md).
+
+## Related Pages
+
+- [Support Matrix](../deployment/support-matrix.md)
+- [Avatar Assets](avatar.md)
+- [Talking-head Model Deployment](index.md)
--- a/docs/en/cases/customer-support.md
+++ b/docs/en/cases/customer-support.md
@@ -84,8 +84,8 @@ See [Sessions API](../docs/api/sessions.md) and [Events and Streaming](../docs/a

 | Goal | Recommended path |
 |------|------------------|
-| Quick lip-sync on a consumer GPU | [QuickTalk](../model-deployment/quicktalk.md) or [Wav2Lip](../model-deployment/wav2lip.md) |
-| Higher quality through a remote model service | [FlashTalk](../model-deployment/flashtalk.md) + [OmniRT](../model-deployment/deployment.md) |
+| Quick lip-sync on a consumer GPU | [QuickTalk](../avatar_models/quicktalk.md) or [Wav2Lip](../avatar_models/wav2lip.md) |
+| Higher quality through a remote model service | [FlashTalk](../avatar_models/flashtalk.md) + [OmniRT](../deployment/index.md) |
 | API or frontend development | Keep `mock` until the business flow is stable |

 The frontend and API flow remain the same. Select the new `model` and a matching avatar
--- a/docs/en/cases/index.md
+++ b/docs/en/cases/index.md
@@ -6,7 +6,7 @@ goal, recommended pipeline, configuration points, validation steps, and integrat

 If you are new to the project, finish the [Quickstart](../tutorials/quickstart.md) first.
 If a case uses a real talking-head model, make sure the corresponding backend and weights
-are ready in [Model Deployment](../model-deployment/index.md).
+are ready in [Model Deployment](../deployment/index.md).

 ## Pick a Case

@@ -47,7 +47,7 @@ are ready in [Model Deployment](../model-deployment/index.md).
 |------|------|
 | Understand what business scenarios are possible | This Use Cases section |
 | Run the project for the first time | [Quickstart](../tutorials/quickstart.md) |
-| Deploy Wav2Lip, QuickTalk, or FlashTalk | [Model Deployment](../model-deployment/index.md) |
+| Deploy Wav2Lip, QuickTalk, or FlashTalk | [Model Deployment](../deployment/index.md) |
 | Call an API endpoint | [API Interfaces](../docs/api/index.md) |
 | Add a new model backend | [Model Adapter](../docs/model-adapter.md) |

--- a/docs/en/cases/private-deployment.md
+++ b/docs/en/cases/private-deployment.md
@@ -38,7 +38,7 @@ flowchart TB
 ## Prerequisites

 - Finish [Quickstart](../tutorials/quickstart.md).
- Read [Configuration](../tutorials/configuration.md) and [Deployment](../model-deployment/deployment.md).
+- Read [Configuration](../tutorials/configuration.md) and [Deployment](../deployment/index.md).
 - Prefer a private LLM with an OpenAI-compatible `/v1/chat/completions` endpoint.
 - Use OmniRT, `direct_ws`, or a local adapter for avatar inference.

--- a/docs/en/cases/product-demo-live-sales.md
+++ b/docs/en/cases/product-demo-live-sales.md
@@ -37,7 +37,7 @@ flowchart LR

 - Finish [AI Customer Support](customer-support.md) or at least validate `mock`.
 - Prepare a display-friendly avatar. See [Custom Avatar](../tutorials/cases/custom-avatar.md).
- For real lip-sync video, deploy at least one talking-head backend in [Model Deployment](../model-deployment/index.md).
+- For real lip-sync video, deploy at least one talking-head backend in [Model Deployment](../deployment/index.md).

 ## 1. Prepare Product Facts

@@ -112,5 +112,5 @@ bash scripts/start_unified.sh --backend omnirt --model flashtalk --omnirt http:/
 | Speech sounds mechanical | Shorten each text segment and make the prompt more conversational. |
 | The LLM drifts away from the product | Pass structured product facts from the business layer and require grounded answers. |
 | First frame is slow | Warm up the real model or create the session before the live segment starts. |
-| The live room needs concurrency | Use API/Worker split and external Redis; see [Deployment](../model-deployment/deployment.md). |
+| The live room needs concurrency | Use API/Worker split and external Redis; see [Deployment](../deployment/index.md). |

--- a/docs/en/community/index.md
+++ b/docs/en/community/index.md
@@ -26,7 +26,7 @@ GitHub, the QQ group, and documentation feedback.
 ## Good First Tasks

 - Add screenshots, common errors, and environment notes to tutorials.
- Add mirror links and checksum notes to `model-deployment` pages.
+- Add mirror links and checksum notes to `deployment` pages.
 - Add reproducible records to the Benchmark section.
 - Improve request and response examples in API docs.
 - Add avatar asset validation examples.
--- a/docs/en/deployment/index.md
+++ b/docs/en/deployment/index.md
@@ -0,0 +1,3 @@
+# Deployment
+
+This section covers the OpenTalking orchestration runbook, topologies, and reverse proxy setup.
--- a/docs/en/model-deployment/support-matrix.md
+++ b/docs/en/model-deployment/support-matrix.md
@@ -26,9 +26,9 @@ Use it as the decision page before following the deeper setup guides.
 | TTS | Local CosyVoice3 0.5B | Local CosyVoice service / adapter | Local voice and cloning path | Built-in, Validated | Uses `local_cosyvoice`; the standalone service is recommended. |
 | TTS | CosyVoice service | Provider adapter / remote service | Use for custom voice service deployments | Built-in, Documented | Requires a reachable CosyVoice service and, in some flows, `OPENTALKING_PUBLIC_BASE_URL`. |
 | TTS | ElevenLabs | Provider adapter | Use for hosted multilingual voices | Built-in, Documented | Requires API key and voice id. |
-| Avatar | Built-in example avatars | Local asset bundles | Default first-run path | Built-in, Validated | Good for `mock`, Wav2Lip, and other documented flows. |
-| Avatar | Custom uploaded portraits | `/avatars/custom` | Use when you want quick custom avatars | Built-in, Documented | Best compatibility today is with Wav2Lip-style image avatars. |
-| Avatar | Model-specific manifests | Local asset bundles | Required for QuickTalk / FlashHead / FlashTalk matching | Built-in, Documented | `model_type` must match the selected synthesis model. |
+| Avatar | Built-in example avatars | Local asset bundles | Default first-run path | Built-in, Validated | Reusable shared visual assets for different models. |
+| Avatar | Custom uploaded portraits | `/avatars/custom` | Use when you want quick custom avatars | Built-in, Documented | Model flows generate caches, templates, or preprocessing artifacts when needed. |
+| Avatar | Model-derived artifacts | Preparation scripts / first session | Generated when a model needs extra assets | Built-in, Documented | The avatar manifest does not need to be bound to QuickTalk, MuseTalk, or Wav2Lip. |

 ## Talking-Head Model Matrix

@@ -69,9 +69,9 @@ Use it as the decision page before following the deeper setup guides.

 1. Use `mock` to validate the browser, API, LLM, STT, TTS, and WebRTC path.
 2. Use local `wav2lip` when you want the lightest talking-head validation path.
-3. Use [Local STT/TTS + QuickTalk](recipes/local-quicktalk-audio.md) when you want local speech input, local speech synthesis, and QuickTalk realtime video.
+3. Use [Local STT/TTS + QuickTalk](../recipes/local-quicktalk-audio.md) when you want local speech input, local speech synthesis, and QuickTalk realtime video.
 4. Use local `musetalk` when you want MuseTalk quality on one CUDA machine and can install the preprocessing dependencies.
-5. Use [QuickTalk Local](quicktalk/local.md) for single-machine realtime audio2video on CUDA, or [QuickTalk with OmniRT](quicktalk/omnirt.md) for service isolation.
+5. Use `quicktalk` when you want realtime audio2video and can run CUDA.
 6. Use `fasterliveportrait` when you want realtime audio-driven portrait pasteback on a single CUDA GPU.
 7. Use `flashtalk` when quality matters more than deployment weight.
 8. Use `flashhead` only when you already operate a FlashHead service.
@@ -79,18 +79,7 @@ Use it as the decision page before following the deeper setup guides.
 ## Next Pages

 - [Overview](index.md)
-- [LLM and STT](llm-stt.md)
-- [Text-to-Speech](tts.md)
-- [Avatar Assets](avatar.md)
-- [Talking-Head Models](talking-head/index.md)
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
+- [LLM and STT](../speech_models/llm-stt.md)
+- [Text-to-Speech](../speech_models/tts.md)
+- [Avatar Assets](../avatar_models/avatar.md)
+- [Talking-Head Models](../avatar_models/talking-head.md)
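
Many hunks in this change rewrite relative links after pages moved between directories (for example `llm-stt.md` becoming `../speech_models/llm-stt.md`). A small checker can confirm that each rewritten target still resolves. The sketch below is illustrative only and is not part of the repository: `extract_links` and `broken_links` are hypothetical helper names, and it handles only inline Markdown links with relative `.md` targets.

```python
import posixpath
import re

# Matches inline Markdown links such as [Text](../path/page.md#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def extract_links(markdown: str) -> list[str]:
    """Return relative .md link targets, with any #anchor removed."""
    targets = []
    for raw in LINK_RE.findall(markdown):
        path = raw.split("#", 1)[0]
        # Skip absolute URLs and non-Markdown targets (images, videos).
        if path.endswith(".md") and "://" not in path:
            targets.append(path)
    return targets


def broken_links(page: str, markdown: str, known_pages: set[str]) -> list[str]:
    """List link targets in `page` that do not resolve to a known page.

    `page` and `known_pages` are POSIX-style paths relative to the docs root
    (an assumption of this sketch, not a project convention).
    """
    base = posixpath.dirname(page)
    missing = []
    for target in extract_links(markdown):
        resolved = posixpath.normpath(posixpath.join(base, target))
        if resolved not in known_pages:
            missing.append(target)
    return missing
```

Calling `broken_links` once per changed page, with `known_pages` built from the post-move file list, would surface any link that still points at an old location.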
--- a/docs/en/docs/api/avatars.md
+++ b/docs/en/docs/api/avatars.md
@@ -21,7 +21,7 @@ Body type: `list[AvatarSummary]`.
 |-------|------|-------------|
 | `id` | string | Globally unique avatar identifier. |
 | `name` | string \| null | Display name. Defaults to `id`. |
-| `model_type` | string | One of `wav2lip`, `musetalk`, `quicktalk`, `flashtalk`, `flashhead`, `mock`. |
+| `model_type` | string | Legacy manifest type field; new avatar flows should not use it as a model-binding requirement. |
 | `width` | integer | Output video width in pixels. |
 | `height` | integer | Output video height in pixels. |
 | `is_custom` | boolean | `true` when the avatar was created via `POST /avatars/custom` and may be deleted. |
@@ -43,7 +43,7 @@ curl -s http://localhost:8000/avatars | jq
  {
    "id": "custom-alice-20260513-153012-001",
    "name": "Alice",
-    "model_type": "wav2lip",
+    "model_type": "generic",
    "width": 1024,
    "height": 1024,
    "is_custom": true
@@ -124,7 +124,7 @@ user-supplied portrait image. The newly created avatar is tagged
 | Field | Type | Required | Description |
 |-------|------|----------|-------------|
 | `name` | string | Yes | Display name for the new avatar. |
-| `base_avatar_id` | string | Yes | Identifier of an existing avatar to use as manifest template. The base avatar must have `model_type=wav2lip`. |
+| `base_avatar_id` | string | Yes | Identifier of an existing avatar to use as manifest template. The base avatar does not need to be bound to a specific talking-head model. |
 | `image` | file | Yes | Portrait image, maximum 10 MB. Acceptable formats: JPEG, PNG, WebP. |

 **Behavior**
@@ -142,7 +142,7 @@ Body type: `AvatarSummary` of the newly created avatar.
 ```bash title="curl"
 curl -X POST http://localhost:8000/avatars/custom \
  -F name="Alice" \
-  -F base_avatar_id=demo-wav2lip \
+  -F base_avatar_id=demo-avatar \
  -F image=@portrait.jpg
 ```

@@ -150,7 +150,7 @@ curl -X POST http://localhost:8000/avatars/custom \
 {
  "id": "custom-alice-20260513-153012-001",
  "name": "Alice",
-  "model_type": "wav2lip",
+  "model_type": "generic",
  "width": 1024,
  "height": 1024,
  "is_custom": true
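
With `model_type` demoted to a legacy field, a client reading `GET /avatars` can select avatars by the remaining fields instead. The following is a minimal sketch, assuming the response body is the JSON list of `AvatarSummary` objects documented in this file; `custom_avatars` is a hypothetical helper, not an OpenTalking API.

```python
import json


def custom_avatars(payload: str) -> list[dict]:
    """Return deletable (custom) avatars from a GET /avatars response body.

    `model_type` is deliberately ignored: the docs describe it as a legacy
    field that should not bind an avatar to a talking-head model.
    """
    avatars = json.loads(payload)
    return [
        # The display name falls back to the id, as the field table states.
        {"id": a["id"], "name": a.get("name") or a["id"]}
        for a in avatars
        if a.get("is_custom", False)
    ]
```

Filtering on `is_custom` rather than `model_type` keeps the client working whether a manifest reports `wav2lip`, `generic`, or omits the field.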
--- a/docs/en/docs/architecture.md
+++ b/docs/en/docs/architecture.md
@@ -72,7 +72,7 @@ flowchart LR
    end
 ```

-Command-line specifics are documented in [Deployment](../model-deployment/deployment.md).
+Command-line specifics are documented in [Deployment](../deployment/index.md).

 ## Session lifecycle

--- a/docs/en/docs/avatar-format.md
+++ b/docs/en/docs/avatar-format.md
@@ -2,9 +2,9 @@

 An avatar bundle defines the visual identity of a digital human together with the
 metadata required to align mouth motion with audio. OpenTalking reads avatar bundles
-when a session is created, pairs them with the configured synthesis model (`wav2lip`,
-`musetalk`, `flashtalk`, `flashhead`, or `quicktalk`), and drives video generation
-from streaming audio input.
+when a session is created and treats them as shared visual assets for the active
+talking-head model; model-specific caches, templates, or preprocessing artifacts are
+created by the corresponding deployment flow.

 This page documents the directory layout, the `manifest.json` schema, the scripts that
 generate avatar bundles, and the validation endpoints.
@@ -32,15 +32,16 @@ examples/avatars/
        └── ...
 ```

-Per-model conventions:
+Common layout conventions:

-| `model_type` | Required subdirectory | Contents |
-|--------------|----------------------|----------|
-| `wav2lip` | `frames/` | Ordered image sequence (PNG or JPG), sorted by filename. |
-| `musetalk` | `full_frames/` | Ordered image sequence. Future extensions may include `mask/` and `latent/`. |
-| `quicktalk` | _none_ | External assets referenced via `metadata.asset_root` and `metadata.template_video`. |
-| `flashtalk` | _none_ | Reference identity managed by the model service. |
-| `flashhead` | _none_ | Reference identity managed by the model service. |
+| Content | Required | Description |
+|---------|----------|-------------|
+| `manifest.json` | Yes | Basic avatar information and optional metadata. |
+| `preview.png` | Recommended | Preview image for the WebUI avatar library. |
+| `frames/` | Optional | Ordered image sequence, commonly used by Wav2Lip-style reference-frame flows. |
+| `full_frames/` | Optional | Video frame sequence, commonly used by MuseTalk preprocessing. |
+| `prepared/` | Optional | Preprocessing artifacts generated by models such as MuseTalk. |
+| Template video | Optional | Derived or external asset that models such as QuickTalk may use at runtime. |

 A `preview.png` file is recommended; the frontend uses it to populate the avatar picker.

@@ -50,17 +51,17 @@ A `preview.png` file is recommended; the frontend uses it to populate the avatar
 |-------|------|----------|-------------|
 | `id` | string | Yes | Globally unique identifier referenced by the client. |
 | `name` | string | No | Display name. Defaults to `id`. |
-| `model_type` | enum | Yes | One of `wav2lip`, `musetalk`, `quicktalk`, `flashtalk`, `flashhead`. |
+| `model_type` | string | No | Legacy manifest type field; do not rely on it to bind an avatar to a model. |
 | `fps` | number | Yes | Target output frame rate. Typical value: 25. |
 | `sample_rate` | number | Yes | Audio sample rate aligned with the TTS output. Typical value: 16000. |
 | `width` | number | Yes | Output video width in pixels. |
 | `height` | number | Yes | Output video height in pixels. |
 | `version` | string | No | Asset version string. |
-| `metadata` | object | No | Arbitrary additional fields, including per-model conventions documented below. |
+| `metadata` | object | No | Arbitrary additional fields for upload provenance, derivatives, or runtime metadata. |

-## Wav2Lip `metadata`
+## Mouth Metadata

-Wav2lip avatars should populate the `metadata` field with mouth localization data:
+When an avatar includes mouth localization data, store it under `metadata.animation`:

 ```json
 {
@@ -75,30 +76,26 @@ Wav2lip avatars should populate the `metadata` field with mouth localization dat
 }
 ```

-Coordinates are normalized to the image dimensions. When a single-image wav2lip avatar
-is uploaded through `/avatars/custom`, OpenTalking attempts mouth detection using
+Coordinates are normalized to the image dimensions. When a single-image avatar is
+uploaded through `/avatars/custom`, OpenTalking attempts mouth detection using
 MediaPipe locally. If detection fails, the upload succeeds without an `animation`
-field; in this case, OmniRT's wav2lip backend falls back to its built-in alignment.
+field; model backends fall back to their own built-in alignment when possible.
 The `wav2lip_postprocess_mode` flag controls the server-side post-processing mode.
 OpenTalking local Wav2Lip defaults to `easy_improved`; `easy_enhanced` is accepted
 by the backend/API but requires GFPGAN dependencies and checkpoint assets.

-## QuickTalk manifest example
+## Generic manifest example

 ```json
 {
-  "id": "quicktalk-daytime",
-  "name": "QuickTalk Daytime",
-  "model_type": "quicktalk",
+  "id": "demo-avatar",
+  "name": "Demo Avatar",
  "fps": 25,
  "sample_rate": 16000,
  "width": 512,
  "height": 512,
  "version": "1.0",
-  "metadata": {
-    "asset_root": "/path/to/quicktalk/assets",
-    "template_video": "/path/to/template.mp4"
-  }
+  "metadata": {}
 }
 ```

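
The `manifest.json` field table in this file can be turned into a quick structural check. The sketch below is an assumption-based illustration, not the validator OpenTalking ships: `validate_manifest` is a hypothetical name, and the required/optional split is copied from the updated table (`model_type` is now optional).

```python
NUMBER = (int, float)

# Field -> accepted Python type(s), taken from the manifest schema table.
REQUIRED = {
    "id": str,
    "fps": NUMBER,
    "sample_rate": NUMBER,
    "width": NUMBER,
    "height": NUMBER,
}
OPTIONAL = {
    "name": str,
    "model_type": str,  # legacy field; presence is allowed but not required
    "version": str,
    "metadata": dict,
}


def _has_type(value, types) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return not isinstance(value, bool) and isinstance(value, types)


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape is valid."""
    errors = []
    for field, types in REQUIRED.items():
        if field not in manifest:
            errors.append(f"missing required field: {field}")
        elif not _has_type(manifest[field], types):
            errors.append(f"wrong type for field: {field}")
    for field, types in OPTIONAL.items():
        if field in manifest and not _has_type(manifest[field], types):
            errors.append(f"wrong type for field: {field}")
    return errors
```

Run against the generic manifest example, this returns an empty list; a manifest that drops `fps` or stores `metadata` as a string is reported.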
--- a/docs/en/docs/developing.md
+++ b/docs/en/docs/developing.md
@@ -91,7 +91,7 @@ bash scripts/quickstart/start_all.sh

 The frontend model selector lists `wav2lip` after OmniRT is reachable.
 For model-specific weight downloads and startup commands, see
-[Models](../model-deployment/index.md).
+[Models](../deployment/index.md).

 ### API and Worker split with local Redis

--- a/docs/en/docs/index.md
+++ b/docs/en/docs/index.md
@@ -4,7 +4,7 @@ This section is for developers and integrators. It explains OpenTalking concepts
 boundaries, and extension points. If the goal is to run the project first, start with
 [Tutorials](../tutorials/index.md). If the goal is to understand business scenarios, start
 with [Use Cases](../cases/index.md). If the goal is model serving, start with
-[Model Deployment](../model-deployment/index.md).
+[Model Deployment](../deployment/index.md).

 ## Understand Concepts

--- a/docs/en/index.md
+++ b/docs/en/index.md
@@ -32,7 +32,7 @@ OpenTalking is not a single talking-head model. It sits between product experien
 model services, organizing LLM, speech recognition, speech synthesis, avatar rendering,
 event streaming, and browser playback into a unified runtime. Developers can start with
 Mock validation and then move to real models and inference backends such as Wav2Lip,
-QuickTalk, MuseTalk, FlashTalk, or OmniRT.
+QuickTalk, FasterLivePortrait, MuseTalk, FlashTalk, or OmniRT.

 It is designed for scenarios such as AI customer support, product demos, course presenters,
 news anchors, companion characters, and private digital-human deployments. If you are new to
@@ -44,24 +44,33 @@ with [Model Support](model-support/index.md).

 <video src="https://github.com/user-attachments/assets/a3abce76-12c0-4b8b-844f-bbc5c3227dc7" controls width="100%"></video>

+
+## Get Started Fast
+
+- [Quick Start](quick-start/index.md): first run and mock validation.
+- [Model Support](model-support/index.md): choose models, backends, and deployment paths.
+- [Deployment](deployment/index.md): model deployment and TTS weight preparation.
+- [Avatar Models](avatar_models/index.md): Wav2Lip, QuickTalk, MuseTalk, FlashTalk, and more.
+- [Speech Generation Models](speech_models/index.md): LLM, STT, and TTS deployment.
+- [Deployment Recipes](recipes/index.md): combined setups such as local audio + QuickTalk.
+
 ## Key Features

 - **Real-time conversation pipeline**: coordinates speech input, LLM response, TTS synthesis, subtitle events, avatar rendering, and WebRTC playback.
 - **Pluggable model backends**: supports backend modes such as `mock`, `local`, `direct_ws`, and `omnirt`, from local validation to remote inference services.
-- **Multiple model paths**: provides an evolving integration plan for Wav2Lip, QuickTalk, MuseTalk, FlashTalk, FlashHead, and related talking-head models.
+- **Multiple model paths**: provides an evolving integration plan for Wav2Lip, QuickTalk, FasterLivePortrait, MuseTalk, FlashTalk, FlashHead, and related talking-head models.
+- **Video Clone workflow**: uses camera frames or an uploaded video as the driving input in the WebUI to animate a source digital-human avatar.
 - **Open LLM/TTS configuration**: supports OpenAI-compatible LLM endpoints, including DashScope, DeepSeek, Ollama, vLLM, or internal model services.
 - **WebUI and command-line tools**: use WebUI for session validation, avatar selection, voice configuration, and model status; use CLI entrypoints for service startup and debugging.
 - **Production-oriented runtime modes**: supports local development, Mock validation, Docker, API / Worker split, and external inference-service integration.

 ## User Guide

-- [Quick Start](quick-start/index.md): run OpenTalking for the first time with the `mock` backend.
-- [Usage](usage/index.md): learn command-line startup, WebUI usage, avatar configuration, and voice/TTS settings.
-- [Persona Package](usage/persona-package.md): import, validate, and run portable digital-human Agent bundles.
-- [Examples](examples/index.md): understand how OpenTalking applies to customer support, product demos, course presenters, and similar scenarios.
-- [Model Support](model-support/index.md): review models, runtime backends, and production topology such as Wav2Lip, QuickTalk, FlashTalk, and OmniRT.
-- [Reference Materials](reference/index.md): review benchmark metrics and changelog entries.
-- [FAQ](faq.md): troubleshoot installation, configuration, WebRTC, model backend, and runtime issues.
+- [Usage](usage/index.md): command-line startup, WebUI usage, Video Clone, avatar configuration, and voice/TTS settings.
+- [Examples](examples/index.md): customer support, product demos, course presenters, and similar scenarios.
+- [Model Support](model-support/index.md): model and backend selection, plus production topology.
+- [Reference Materials](reference/index.md): benchmark metrics and changelog entries.
+- [FAQ](faq.md): installation, configuration, WebRTC, model backend, and runtime issues.

 ## License Information

--- a/docs/en/model-deployment/avatar.md
+++ b/docs/en/model-deployment/avatar.md
@@ -1,63 +0,0 @@
-# Avatar Assets
-
-Avatar assets bind a visual identity to the selected talking-head backend. A model may
-be connected, but session creation still fails or looks wrong if the avatar bundle does
-not match that model.
-
-## Minimal rule
-
-The avatar `manifest.json` must declare a `model_type` compatible with the selected
-session model:
-
-| Model | Typical avatar requirement |
-|-------|----------------------------|
-| `mock` | Preview/reference image only. |
-| `wav2lip` | Reference frames or prepared Wav2Lip frame assets. |
-| `quicktalk` | `metadata.asset_root` and `metadata.template_video`. |
-| `flashhead` | Reference image for the FlashHead session. |
-| `flashtalk` | Portrait/reference image compatible with the backend service. |
-
-## Example manifest
-
-```json title="examples/avatars/quicktalk-demo/manifest.json"
-{
-  "id": "quicktalk-demo",
-  "name": "QuickTalk Demo",
-  "model_type": "quicktalk",
-  "fps": 25,
-  "sample_rate": 16000,
-  "width": 512,
-  "height": 512,
-  "metadata": {
-    "asset_root": "/absolute/path/to/models/quicktalk/hdModule",
-    "template_video": "/absolute/path/to/template.mp4"
-  }
-}
-```
-
-## Prepare and validate
-
-Use the existing avatar guide for the complete schema and preparation scripts:
-
-- [Avatar Format](../docs/avatar-format.md)
-- [Models → Talking-Head Models](talking-head/index.md)
-
-Verify the server sees the avatar:
-
-```bash title="terminal"
-curl -s http://127.0.0.1:8000/avatars | python3 -m json.tool
-```
-
-When troubleshooting, check three values together: session `model`, avatar
-`model_type`, and `/models` `backend`.
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-deployment/backends/index.md
+++ b/docs/en/model-deployment/backends/index.md
@@ -42,7 +42,7 @@ curl -fsS http://127.0.0.1:9000/v1/audio2video/models | python3 -m json.tool
 - [Local Adapter](local.md)
 - [OmniRT](omnirt.md)
 - [Talking-Head Models](../talking-head/index.md)
-- [Support Matrix](../support-matrix.md)
+- [Support Matrix](../../deployment/support-matrix.md)

 ## Frontend Entry

--- a/docs/en/model-deployment/backends/omnirt.md
+++ b/docs/en/model-deployment/backends/omnirt.md
@@ -38,8 +38,8 @@ curl -s http://127.0.0.1:8000/models | python3 -m json.tool
 - [QuickTalk with OmniRT](../quicktalk/omnirt.md)
 - [Wav2Lip with OmniRT](../wav2lip/omnirt.md)
 - [MuseTalk with OmniRT](../musetalk/omnirt.md)
-- [FasterLivePortrait](../fasterliveportrait.md)
-- [FlashTalk](../flashtalk.md)
+- [FasterLivePortrait](../../avatar_models/fasterliveportrait.md)
+- [FlashTalk](../../avatar_models/flashtalk.md)

 ## Frontend Entry

--- a/docs/en/model-deployment/deployment.md
+++ b/docs/en/model-deployment/deployment.md
@@ -1,340 +0,0 @@
-# Deployment
-
-This page is the runbook for deploying the OpenTalking orchestration layer. Model
-weights and model-server startup remain in [Models](../model-deployment/index.md);
-use this page to decide how the API, Worker, Web UI, Redis, reverse proxy, and
-external inference services should be wired together.
-
-## Choose a Topology
-
-| Topology | Command shape | Best for | Notes |
-|----------|---------------|----------|-------|
-| Single-process `unified` | `opentalking-unified` | Local demos, small internal trials, fast debugging | One process owns API, Worker, sessions, and an in-memory event bus. Do not run multiple `unified` workers behind a load balancer. |
-| Split API + Worker | `opentalking-api` + `opentalking-worker` + Redis | Standard single-host or small production deployment | Recommended production baseline. Worker can be restarted or scaled separately from API. |
-| Docker Compose | `docker compose up` | Reproducible deployment, CI, container-first teams | Convenient, but heavier than native source installs for CPU and single-GPU evaluation. |
-| Remote model backend | OpenTalking + `OMNIRT_ENDPOINT` or `direct_ws` | Heavy models, multi-GPU, remote GPU/NPU hosts | Keep OpenTalking near users; run model servers where accelerators live. |
-| Ascend 910B | Source install + CANN + OmniRT/model service | NPU evaluation | Prefer host-native source deployment; Docker is optional and environment-specific. |
-
-## Prerequisites
-
-Prepare these before choosing a topology:
-
-- Python 3.10 or later (3.11 recommended), Node.js 18 or later, Redis 7, and FFmpeg.
-- A completed `.env` copied from `.env.example`.
-- LLM/STT/TTS credentials configured as described in [Configuration](../tutorials/configuration.md).
-- Avatar assets and model backend configuration selected from [Models](../model-deployment/index.md).
-- For public access, a domain name, TLS certificate, and a TURN server if browsers are often behind symmetric NAT.
-
-## Native Single-Host Runbook
-
-Use this path for a machine that runs OpenTalking from source. It is the clearest
-deployment for debugging and for CPU or single-GPU evaluation because there is no
-container layer between the process and the host.
-
-### 1. Install
-
-```bash title="terminal"
-git clone https://github.com/datascale-ai/opentalking.git
-cd opentalking
-uv sync --extra dev --python 3.11
-source .venv/bin/activate
-
-cd apps/web
-npm ci
-cd ../..
-cp .env.example .env
-```
-
-If you need the compatibility fallback instead:
-
-```bash title="terminal"
-python3 -m venv .venv
-source .venv/bin/activate
-pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple -e ".[dev]"
-```
-
-Set the minimum runtime configuration:
-
-```env title=".env"
-OPENTALKING_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
-OPENTALKING_LLM_API_KEY=<your-key>
-OPENTALKING_STT_DEFAULT_PROVIDER=dashscope
-OPENTALKING_STT_DASHSCOPE_API_KEY=<your-key>
-OPENTALKING_TTS_DEFAULT_PROVIDER=edge
-OPENTALKING_AVATARS_DIR=./examples/avatars
-OPENTALKING_VOICES_DIR=./var/voices
-OPENTALKING_SQLITE_PATH=./data/opentalking.sqlite3
-OPENTALKING_CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:5173
-```
-
-### 2. Run `unified`
-
-For development, a private demo, or a single machine without horizontal scaling:
-
-```bash title="terminal"
-source .venv/bin/activate
-OPENTALKING_REDIS_MODE=memory opentalking-unified --host 0.0.0.0 --port 8000
-```
-
-#### Frontend Entry
-
-Start the WebUI in another terminal after the API is listening on `8000`:
-
-```bash title="terminal"
-cd apps/web
-VITE_BACKEND_PORT=8000 npm run dev -- --host 0.0.0.0 --port 5173
-```
-
-Open <http://127.0.0.1:5173>, select a built-in avatar, then start with `mock`.
-
-### 3. Run Split API + Worker {#api-and-worker-split}
-
-Use this as the production baseline.
-
-```bash title="terminal: redis"
-redis-server --port 6379 --appendonly yes
-```
-
-```bash title="terminal: api"
-source .venv/bin/activate
-export OPENTALKING_REDIS_URL=redis://127.0.0.1:6379/0
-export OPENTALKING_WORKER_URL=http://127.0.0.1:9001
-opentalking-api
-```
-
-```bash title="terminal: worker"
-source .venv/bin/activate
-export OPENTALKING_REDIS_URL=redis://127.0.0.1:6379/0
-opentalking-worker
-```
-
-```bash title="terminal: web"
-cd apps/web
-VITE_API_BASE=/api npm run build
-# Serve apps/web/dist with nginx, Caddy, or another static server.
-```
-
-The split topology looks like this:
-
-```mermaid
-flowchart LR
-    Browser[Browser] --> Proxy[nginx / Caddy]
-    Proxy --> Web[Static Web UI]
-    Proxy --> API[opentalking-api]
-    API --> Redis[(Redis)]
-    Worker[opentalking-worker] --> Redis
-    Worker --> Backend[(local / direct_ws / OmniRT)]
-    API -. WebRTC and session control .-> Worker
-```
-
-### 4. Connect a Model Backend
-
-For `mock`, no model service is required. For real models, configure only the selected
-backend:
-
-```env title=".env"
-# OmniRT for wav2lip / musetalk / flashtalk when those models use backend: omnirt.
-OMNIRT_ENDPOINT=http://<model-host>:9000
-
-# FlashHead remains a direct WebSocket backend.
-OPENTALKING_FLASHHEAD_WS_URL=ws://<flashhead-host>:8766/v1/avatar/realtime
-OPENTALKING_FLASHHEAD_BASE_URL=http://<flashhead-host>:8766
-```
-
-Verify backend visibility:
-
-```bash title="terminal"
-curl -fsS http://127.0.0.1:8000/models | python3 -m json.tool
-```
-
-## Docker Compose
-
-Docker Compose is useful when reproducibility matters more than startup weight. For
-light CPU or single-GPU evaluation, native source deployment is usually easier to
-inspect.
-
-### CPU / Mock Stack
-
-```bash title="terminal"
-cp .env.example .env
-docker compose up -d --build
-docker compose ps
-curl -fsS http://127.0.0.1:8000/health
-curl -fsS http://127.0.0.1:8000/models
-```
-
-Open <http://127.0.0.1:5173>. This stack starts `redis`, `api`, `worker`, and `web`.
-It is suitable for UI validation and pipeline testing with `mock`.
-
-### GPU / OmniRT Stack
-
-Install the NVIDIA driver and NVIDIA Container Toolkit first. Then run:
-
-```bash title="terminal"
-cp .env.example .env
-docker compose --profile gpu \
-  -f docker-compose.yml \
-  -f docker-compose.gpu.yml \
-  up -d --build
-docker compose ps
-curl -fsS http://127.0.0.1:9000/v1/audio2video/models
-curl -fsS http://127.0.0.1:8000/models
-```
-
-Use this path only for models configured with `backend: omnirt`. Model weights and
-OmniRT-specific startup details are documented under [Models](../model-deployment/index.md).
-
-Useful operations:
-
-```bash title="terminal"
-docker compose logs -f api worker web
-docker compose restart api worker
-docker compose down
-```
-
-Persist production data by mounting the avatar, voice, SQLite, Redis, and model
-directories instead of relying on container-local files.
-
-## Reverse Proxy
-
-For production, terminate TLS at nginx, Caddy, or an ingress controller. The proxy
-must support normal HTTP requests, WebSocket upgrades, and SSE without buffering.
-
-Minimal nginx shape:
-
-```nginx title="/etc/nginx/conf.d/opentalking.conf"
-map $http_upgrade $connection_upgrade {
-  default upgrade;
-  '' close;
-}
-
-server {
-  listen 443 ssl http2;
-  server_name demo.example.com;
-
-  ssl_certificate /etc/letsencrypt/live/demo.example.com/fullchain.pem;
-  ssl_certificate_key /etc/letsencrypt/live/demo.example.com/privkey.pem;
-
-  root /srv/opentalking/web/dist;
-  index index.html;
-
-  location /api/ {
-    proxy_pass http://127.0.0.1:8000/;
-    proxy_http_version 1.1;
-    proxy_set_header Host $host;
-    proxy_set_header X-Forwarded-Proto $scheme;
-    proxy_set_header X-Real-IP $remote_addr;
-    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-    proxy_set_header Upgrade $http_upgrade;
-    proxy_set_header Connection $connection_upgrade;
-    proxy_buffering off;
-    proxy_cache off;
-    proxy_read_timeout 3600s;
-  }
-
-  location / {
-    try_files $uri /index.html;
-  }
-}
-```
-
-Production `.env` should include the browser origin:
-
-```env title=".env"
-OPENTALKING_CORS_ORIGINS=https://demo.example.com
-OPENTALKING_PUBLIC_BASE_URL=https://demo.example.com
-```
-
-## Multi-Host and Heavy Models
-
-For heavy talking-head models, keep OpenTalking stateless where possible and run model
-services on accelerator hosts:
-
-```mermaid
-flowchart LR
-    Browser --> Proxy
-    Proxy --> API[OpenTalking API]
-    API --> Redis[(Redis)]
-    Worker[OpenTalking Worker] --> Redis
-    Worker --> OmniRT[OmniRT on GPU/NPU host]
-    Worker --> DirectWS[FlashHead / other direct_ws service]
-```
-
-Recommended rules:
-
-- Use `local` for lightweight adapters that fit on the OpenTalking host.
-- Use `direct_ws` when a single model already exposes its own WebSocket protocol.
-- Use `omnirt` for heavy, multi-card, remote, or NPU-backed inference.
-- Do not set `OMNIRT_ENDPOINT` as a blanket requirement for every model; only models
-  configured with `backend: omnirt` need it.
-
-## Ascend 910B
-
-For NPU evaluation, prefer host-native source deployment so the process can inherit
-the CANN environment:
-
-```bash title="terminal"
-source /usr/local/Ascend/ascend-toolkit/set_env.sh
-bash scripts/deploy_ascend_910b.sh
-```
-
-Prerequisites:
-
-- CANN 8.0 or later.
-- Prefer setting `UV_INDEX_URL` / `PIP_INDEX_URL` to a domestic mirror before installing OpenTalking and OmniRT in China-friendly environments.
-- OmniRT checked out alongside OpenTalking when using `backend: omnirt`.
-- Model checkpoints under `$DIGITAL_HUMAN_HOME/models/`.
-
-Verify:
-
-```bash title="terminal"
-curl -fsS http://127.0.0.1:9000/v1/audio2video/models
-curl -fsS http://127.0.0.1:8000/models
-```
-
-## Health Checks
-
-Use these checks during rollout and after restarts:
-
-| Check | Command | Expected |
-|-------|---------|----------|
-| API liveness | `curl -fsS http://127.0.0.1:8000/healthz` | HTTP 200 |
-| API readiness | `curl -fsS http://127.0.0.1:8000/health` | JSON service status |
-| Queue status | `curl -fsS http://127.0.0.1:8000/queue/status` | Queue and slot state |
-| Models | `curl -fsS http://127.0.0.1:8000/models` | Each model has `backend`, `connected`, and `reason` |
-| Web UI | Open `http://127.0.0.1:5173` or production URL | UI loads and model selector is populated |
-
-## Production Checklist {#production-checklist}
-
-Recommended production defaults:
-
-- Run API and Worker under systemd, supervisor, Docker Compose, or Kubernetes.
-- Keep Redis persistent with `appendonly yes`.
-- Mount `OPENTALKING_AVATARS_DIR`, `OPENTALKING_VOICES_DIR`, and
-  `OPENTALKING_SQLITE_PATH` on durable storage.
-- Forward logs to the platform logger and set `OPENTALKING_LOG_LEVEL=INFO`.
-- For multiple Workers, isolate model GPU assignments with environment variables such
-  as `CUDA_VISIBLE_DEVICES` or vendor-specific NPU visibility controls.
-- Use sticky routing or a shared Redis-backed setup for long-lived browser sessions.
-
-Quickstart helper scripts remain useful for development:
-
-| Script | Purpose |
-|--------|---------|
-| `scripts/quickstart/start_all.sh` | Starts `unified` and the frontend. |
-| `scripts/quickstart/start_omnirt_wav2lip.sh` | Starts OmniRT serving Wav2Lip. |
-| `scripts/quickstart/start_omnirt_flashtalk.sh` | Starts OmniRT serving FlashTalk. |
-| `scripts/quickstart/status.sh` | Reports helper-managed process and endpoint status. |
-| `scripts/quickstart/stop_all.sh` | Stops helper-managed processes. |
-
-## Troubleshooting
-
-| Symptom | Likely cause | Fix |
-|---------|--------------|-----|
-| Web UI loads but API calls fail | `VITE_API_BASE`, nginx `/api` proxy, or CORS mismatch | Confirm `/api/health` reaches API through the same origin; update `OPENTALKING_CORS_ORIGINS`. |
-| Event stream connects then stalls | Reverse proxy buffers SSE | Set `proxy_buffering off` and keep `Cache-Control: no-transform`. |
-| WebRTC fails only for remote users | NAT traversal problem | Deploy TURN, then expose the TURN config through the frontend/runtime integration used by your deployment. |
-| `/models` shows `connected=false` | Backend is unavailable or misconfigured | Read the `reason` field. `local_adapter_missing`, missing WS URL, and missing OmniRT model list are different fixes. |
-| `mock` works but real model fails | Model service, weights, or avatar type mismatch | Check [Models](../model-deployment/index.md), verify `/models`, then match avatar `model_type` to the selected model. |
-| Worker starts but sessions stay queued | Redis URL mismatch or Worker cannot reach backend | Compare `OPENTALKING_REDIS_URL` in API and Worker; check Worker logs. |
-| Docker web port is reachable but API is not | nginx proxy or Compose service health | Run `docker compose logs -f web api worker` and test `curl http://127.0.0.1:8000/health`. |
--- a/docs/en/model-deployment/index.md
+++ b/docs/en/model-deployment/index.md
@@ -23,48 +23,45 @@ flowchart LR
 | LLM | DashScope OpenAI-compatible endpoint | Use OpenAI, vLLM, Ollama, or DeepSeek when those are already standard in your environment. |
 | STT | DashScope Paraformer realtime | Keep it unless you need a different realtime STT provider. |
 | TTS | Edge TTS | Use DashScope, CosyVoice, or ElevenLabs for production voices and voice cloning. |
-| Avatar assets | Built-in examples | Prepare model-specific assets before selecting Wav2Lip, QuickTalk, FlashHead, or FlashTalk. |
-| Talking-head backend | `mock` first, then the Wav2Lip local path | Use QuickTalk local/OmniRT, FlashTalk through OmniRT, FlashHead direct WS, or another model service. |
+| Avatar assets | Built-in examples | Use shared visual assets; models generate caches, templates, or preprocessing artifacts as needed. |
+| Talking-head backend | `mock` first, then the Wav2Lip local path | Use QuickTalk or FlashTalk through OmniRT, FlashHead over direct WebSocket, or another model service. |

 ## Setup order

 1. Run [Quickstart](../tutorials/quickstart.md) with `mock`.
-2. Check the [Support Matrix](support-matrix.md) to choose the right path.
-3. Configure [LLM and STT](llm-stt.md).
-4. Choose and verify [TTS](tts.md).
-5. Prepare [Avatar assets](avatar.md).
-6. Start a [talking-head model](talking-head/index.md).
+2. Check the [Support Matrix](../deployment/support-matrix.md) to choose the right path.
+3. Configure [LLM and STT](../speech_models/llm-stt.md).
+4. Choose and verify [TTS](../speech_models/tts.md).
+5. Prepare [Avatar assets](../avatar_models/avatar.md).
+6. Start a [talking-head model](../avatar_models/talking-head.md).
 7. Verify `/models`, create a session, and test through the browser.

 ## Model Shortcuts

 | Goal | Entry |
 |------|-------|
-| End-to-end self-test with no weights | [Mock](mock.md) |
-| First real lip-sync model | [Wav2Lip Local](wav2lip/local.md) |
-| Local STT/TTS + QuickTalk | [Local STT/TTS + QuickTalk](recipes/local-quicktalk-audio.md) |
-| V100 single-host FasterLivePortrait + FlashHead | [V100 + FasterLivePortrait + FlashHead](recipes/v100-fasterliveportrait-flashhead.md) |
-| Existing MuseTalk runtime | [MuseTalk with OmniRT](musetalk/omnirt.md) |
-| Local realtime adapter | [QuickTalk Local](quicktalk/local.md) |
-| Single-GPU realtime portrait with pasteback | [FasterLivePortrait](fasterliveportrait.md) |
-| High-quality heavy model | [FlashTalk](flashtalk.md) |
-| Standalone FlashHead service | [FlashHead](flashhead.md) |
+| End-to-end self-test with no weights | [Mock](../avatar_models/mock.md) |
+| First real lip-sync model | [Wav2Lip Local](../avatar_models/deployment/wav2lip-local.md) |
+| Local STT/TTS + QuickTalk | [Local STT/TTS + QuickTalk](../recipes/local-quicktalk-audio.md) |
+| Existing MuseTalk runtime | [MuseTalk](../avatar_models/musetalk.md) |
+| Local realtime adapter | [QuickTalk](../avatar_models/quicktalk.md) |
+| Single-GPU realtime portrait with pasteback | [FasterLivePortrait](../avatar_models/fasterliveportrait.md) |
+| High-quality heavy model | [FlashTalk](../avatar_models/flashtalk.md) |
+| Standalone FlashHead service | [FlashHead](../avatar_models/flashhead.md) |

 Keep model execution decoupled from OpenTalking itself: lightweight models should use
 `local` or `direct_ws` where possible, while OmniRT remains the recommended backend
 for heavyweight, multi-card, remote, or NPU deployments.

-## Frontend Entry
+## Speech Generation Model Deployment

-After the model or backend service is running, use the OpenTalking WebUI:
+This section covers TTS model deployment and weight preparation only. For combined
+flows, see [Local Audio + QuickTalk](../recipes/local-quicktalk-audio.md).

-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
-
-## Documentation Versions
-
-The published site uses versioned docs: `/latest/` points at the current main docs, and formal releases are kept under frozen paths such as `/vX.Y.Z/`. For production deployments, prefer the docs version that matches the Python package, GitHub Release, or Docker image tag you installed.
+| Model | Entry | Notes |
+| --- | --- | --- |
+| Edge TTS | [Speech Generation Models](../speech_models/tts.md) | First-run default, good for pipeline validation. |
+| DashScope Qwen TTS | [Speech Generation Models](../speech_models/tts.md) | Chinese realtime TTS and voice cloning. |
+| CosyVoice3 | [CosyVoice Deployment](../speech_models/tts/cosyvoice.md) | Local Chinese TTS with built-in and cloned voices. |
+| IndexTTS | [IndexTTS Deployment](../speech_models/tts/indextts.md) | Controllable dubbing, emotion control, and voice cloning. |
+| ElevenLabs | [Speech Generation Models](../speech_models/tts.md) | Hosted multilingual voices. |
--- a/docs/en/model-deployment/llm-stt.md
+++ b/docs/en/model-deployment/llm-stt.md
@@ -1,117 +0,0 @@
-# LLM and STT
-
-The LLM decides what the digital human says. STT is required only when users speak
-through the microphone; text-only `speak` requests do not need STT.
-
-## LLM
-
-OpenTalking uses an OpenAI-compatible chat-completions interface. DashScope is the
-default because it works with the default Chinese demo settings.
-
-```env title=".env"
-OPENTALKING_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
-OPENTALKING_LLM_API_KEY=<dashscope-api-key>
-OPENTALKING_LLM_MODEL=qwen-flash
-```
-
-Common alternatives:
-
-| Provider | Configuration notes |
-|----------|---------------------|
-| OpenAI | Set `OPENTALKING_LLM_BASE_URL=https://api.openai.com/v1` and use an OpenAI model id. |
-| vLLM | Point `OPENTALKING_LLM_BASE_URL` to the vLLM OpenAI-compatible server. |
-| Ollama | Use the Ollama OpenAI-compatible endpoint, usually `http://localhost:11434/v1`. |
-| DeepSeek | Use the provider's OpenAI-compatible base URL and model id. |
-| Atlas Cloud | OpenAI-compatible gateway hosting DeepSeek, Qwen, and other models. Set `OPENTALKING_LLM_BASE_URL=https://api.atlascloud.ai/v1`. See [Atlas Cloud](https://www.atlascloud.ai/?utm_source=github&utm_medium=link&utm_campaign=opentalking). |
-
-Atlas Cloud is OpenAI-compatible, so it works as a drop-in LLM backend:
-
-```env title=".env"
-OPENTALKING_LLM_BASE_URL=https://api.atlascloud.ai/v1
-OPENTALKING_LLM_API_KEY=<atlascloud-api-key>
-OPENTALKING_LLM_MODEL=deepseek-ai/deepseek-v4-pro
-```
-
-`deepseek-ai/deepseek-v4-pro` is a reasoning model, so allow enough `max_tokens`
-(≥ 512) or the reply may be truncated before any visible text is produced.
-
-<details>
-<summary>All Atlas Cloud chat models (59)</summary>
-
- **Anthropic (Claude):** `anthropic/claude-haiku-4.5-20251001`, `anthropic/claude-opus-4.8`, `anthropic/claude-sonnet-4.6`
- **OpenAI (GPT):** `openai/gpt-5.4`, `openai/gpt-5.5`
- **Google (Gemini):** `google/gemini-3.1-flash-lite`, `google/gemini-3.1-pro-preview`, `google/gemini-3.5-flash`
- **Qwen:** `qwen/qwen2.5-7b-instruct`, `Qwen/Qwen3-235B-A22B-Instruct-2507`, `qwen/qwen3-235b-a22b-thinking-2507`, `qwen/qwen3-30b-a3b`, `Qwen/Qwen3-30B-A3B-Instruct-2507`, `qwen/qwen3-30b-a3b-thinking-2507`, `qwen/qwen3-32b`, `qwen/qwen3-8b`, `Qwen/Qwen3-Coder`, `qwen/qwen3-coder-next`, `qwen/qwen3-max-2026-01-23`, `Qwen/Qwen3-Next-80B-A3B-Instruct`, `Qwen/Qwen3-Next-80B-A3B-Thinking`, `Qwen/Qwen3-VL-235B-A22B-Instruct`, `qwen/qwen3-vl-235b-a22b-thinking`, `qwen/qwen3-vl-30b-a3b-instruct`, `qwen/qwen3-vl-30b-a3b-thinking`, `qwen/qwen3-vl-8b-instruct`, `qwen/qwen3.5-122b-a10b`, `qwen/qwen3.5-27b`, `qwen/qwen3.5-35b-a3b`, `qwen/qwen3.5-397b-a17b`, `qwen/qwen3.6-35b-a3b`, `qwen/qwen3.6-plus`
- **DeepSeek:** `deepseek-ai/deepseek-ocr`, `deepseek-ai/deepseek-r1-0528`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-V3.1`, `deepseek-ai/DeepSeek-V3.1-Terminus`, `deepseek-ai/deepseek-v3.2`, `deepseek-ai/DeepSeek-V3.2-Exp`, `deepseek-ai/deepseek-v4-flash`, `deepseek-ai/deepseek-v4-pro`
- **Moonshot (Kimi):** `moonshotai/Kimi-K2-Instruct`, `moonshotai/Kimi-K2-Instruct-0905`, `moonshotai/Kimi-K2-Thinking`, `moonshotai/kimi-k2.5`, `moonshotai/kimi-k2.6`
- **Zhipu (GLM):** `zai-org/GLM-4.6`, `zai-org/glm-4.7`, `zai-org/glm-5`, `zai-org/glm-5-turbo`, `zai-org/glm-5.1`, `zai-org/glm-5v-turbo`
- **MiniMax:** `MiniMaxAI/MiniMax-M2`, `minimaxai/minimax-m2.1`, `minimaxai/minimax-m2.5`, `minimaxai/minimax-m2.7`
- **xAI:** `xai/grok-4.3`
- **Kwaipilot:** `kwaipilot/kat-coder-pro-v2`
- **Other:** `owl`
-
-</details>
-
-Verify the API key and endpoint by starting OpenTalking and sending a text `speak`
-request after creating a `mock` session.
-
-## STT
-
-Select the STT provider with `OPENTALKING_STT_DEFAULT_PROVIDER`. The frontend can also select local STT or API STT before a session starts. When API STT is selected, the provider-specific key must be configured; it is not populated from the LLM key.
-
-### DashScope Paraformer realtime
-
-```env title=".env"
-OPENTALKING_STT_DEFAULT_PROVIDER=dashscope
-OPENTALKING_STT_DASHSCOPE_API_KEY=<dashscope-api-key>
-OPENTALKING_STT_DASHSCOPE_MODEL=paraformer-realtime-v2
-```
-
-For DashScope-based deployments, LLM and STT may use the same actual key, but it
-must be written explicitly to `OPENTALKING_LLM_API_KEY` and
-`OPENTALKING_STT_DASHSCOPE_API_KEY`. If microphone input fails but text `speak` works, verify
-the STT module key first.
-
-### Local SenseVoiceSmall
-
-```env title=".env"
-OPENTALKING_STT_DEFAULT_PROVIDER=sensevoice
-OPENTALKING_STT_ENABLED_PROVIDERS=sensevoice,dashscope
-OPENTALKING_STT_SENSEVOICE_MODEL=iic/SenseVoiceSmall
-OPENTALKING_STT_SENSEVOICE_MODEL_DIR=./models/local-audio/iic__SenseVoiceSmall
-OPENTALKING_STT_SENSEVOICE_DEVICE=cpu
-```
-
-SenseVoiceSmall uses the local FunASR adapter and supports both uploaded audio and WebSocket PCM microphone input. CPU inference is usually enough for short realtime utterances, which makes it a good match for QuickTalk local.
-
-Download the weights:
-
-```bash title="terminal"
-uv sync --extra dev --extra models --extra local-audio --python 3.11
-python scripts/download_local_audio_models.py \
-  --root ./models/local-audio \
-  --model sensevoice-small
-```
-
-## Verification
-
-```bash title="terminal"
-curl -fsS http://127.0.0.1:8000/health
-curl -s -X POST http://127.0.0.1:8000/sessions \
-  -H 'content-type: application/json' \
-  -d '{"avatar_id":"demo-avatar","model":"mock"}'
-```
-
-Then use the frontend microphone flow to confirm STT events and LLM responses appear
-in the session event stream.
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-deployment/local-quicktalk-audio.md
+++ b/docs/en/model-deployment/local-quicktalk-audio.md
@@ -1,20 +0,0 @@
-# Local Audio + QuickTalk
-
-The full guide has moved to [Model Deployment / Recipes / Local Audio + QuickTalk](recipes/local-quicktalk-audio.md).
-
-This compatibility page keeps old links working. The new deployment structure separates model pages from combined recipes:
-
- [QuickTalk Overview](quicktalk.md)
- [QuickTalk Local](quicktalk/local.md)
- [Local Audio + QuickTalk](recipes/local-quicktalk-audio.md)
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-deployment/musetalk.md
+++ b/docs/en/model-deployment/musetalk.md
@@ -1,27 +0,0 @@
-# MuseTalk
-
-MuseTalk supports `local`, `omnirt`, and advanced `direct_ws` integration. This section separates the two common deployment modes.
-
-| Item | Value |
-|------|-------|
-| Model ID | `musetalk` |
-| Backends | `local`, `omnirt`, `direct_ws` |
-| Repo default | `omnirt` |
-| Recommended start | Use `local` when OpenMMLab dependencies are ready; use `omnirt` for service isolation. |
-
-## Guides
-
- [MuseTalk Local](musetalk/local.md)
- [MuseTalk with OmniRT](musetalk/omnirt.md)
- [Support Matrix](support-matrix.md)
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-deployment/quicktalk.md
+++ b/docs/en/model-deployment/quicktalk.md
@@ -1,57 +0,0 @@
-# QuickTalk
-
-QuickTalk supports both `local` and `omnirt` modes in the current repository. The commit history includes QuickTalk routed through OmniRT audio2video, and the current scripts include `scripts/quickstart/start_omnirt_quicktalk.sh`; it is not a local-only model.
-
-| Item | Value |
-|------|-------|
-| Model ID | `quicktalk` |
-| Backends | `local`, `omnirt` |
-| Repo default | `omnirt` |
-| Recommended start | Use `local` for single-machine validation; use `omnirt` for service isolation. |
-
-## Asset Layouts
-
-Both modes need QuickTalk weights, HuBERT files, and InsightFace assets, but they read different roots. The OmniRT quickstart script reads top-level files under `$OMNIRT_MODEL_ROOT/quicktalk`:
-
-```text
-$OMNIRT_MODEL_ROOT/quicktalk/
-  quicktalk.pth
-  repair.npy
-  chinese-hubert-large/
-    config.json
-    preprocessor_config.json
-    pytorch_model.bin
-  auxiliary/models/buffalo_l/
-    det_10g.onnx
-```
-
-The local adapter reads an asset root that contains `checkpoints/`:
-
-```text
-$OPENTALKING_QUICKTALK_ASSET_ROOT/
-  checkpoints/
-    quicktalk.pth
-    repair.npy
-    chinese-hubert-large/
-      pytorch_model.bin
-    auxiliary/models/buffalo_l/ or auxiliary_min/
-      det_10g.onnx
-```
-
-## Guides
-
- [QuickTalk Local](quicktalk/local.md)
- [QuickTalk with OmniRT](quicktalk/omnirt.md)
- [Local Audio + QuickTalk](recipes/local-quicktalk-audio.md)
- [Support Matrix](support-matrix.md)
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-deployment/recipes/local-quicktalk-audio.md
+++ b/docs/en/model-deployment/recipes/local-quicktalk-audio.md
@@ -70,7 +70,7 @@ python scripts/download_local_audio_models.py \
 Use the main `.venv` for OpenTalking, SenseVoice, and QuickTalk. Create a
 separate CosyVoice sidecar venv after the runtime checkout.

-For CosyVoice3 model sources and the optional fp16 TensorRT ONNX files, see [TTS deployment](../tts.md#local-cosyvoice3-05b).
+For CosyVoice3 model sources and the optional fp16 TensorRT ONNX files, see [CosyVoice Deployment](../../speech_models/tts/cosyvoice.md).

 Prepare QuickTalk weights as described in [QuickTalk Local](../quicktalk/local.md). Put the CosyVoice runtime under the model directory:

--- a/docs/en/model-deployment/talking-head/index.md
+++ b/docs/en/model-deployment/talking-head/index.md
@@ -8,13 +8,13 @@ inference throughput belong to the selected backend.

 | Model | Backend | Best for | Evidence level | Details |
 |-------|---------|----------|----------------|---------|
-| `mock` | `mock` | First run, CI, API/WebRTC debugging | Built in, verified | [Mock](../mock.md) |
+| `mock` | `mock` | First run, CI, API/WebRTC debugging | Built in, verified | [Mock](../../avatar_models/mock.md) |
 | `wav2lip` | `local` / `omnirt` | First real lip-sync model | Local adapter is built in; OmniRT path verified | [Local](../wav2lip/local.md) / [OmniRT](../wav2lip/omnirt.md) |
 | `musetalk` | `local` / `omnirt` / `direct_ws` | MuseTalk quality with either in-process startup or an external service | Local adapter is built in; OmniRT/direct_ws paths documented | [Local](../musetalk/local.md) / [OmniRT](../musetalk/omnirt.md) |
 | `quicktalk` | `local` / `omnirt` | Local realtime adapter or OmniRT-hosted deployment | Local path is built in; OmniRT path is integrated | [Local](../quicktalk/local.md) / [OmniRT](../quicktalk/omnirt.md) |
-| `fasterliveportrait` | `omnirt` | Single-GPU realtime audio-driven portrait with pasteback | Documented | [FasterLivePortrait](../fasterliveportrait.md) |
-| `flashtalk` | `omnirt` | High-quality private GPU/NPU deployment | OmniRT/Ascend path verified | [FlashTalk](../flashtalk.md) |
-| `flashhead` | `direct_ws` | Existing standalone FlashHead service | Documented | [FlashHead](../flashhead.md) |
+| `fasterliveportrait` | `omnirt` | Single-GPU realtime audio-driven portrait with pasteback | Documented | [FasterLivePortrait](../../avatar_models/fasterliveportrait.md) |
+| `flashtalk` | `omnirt` | High-quality private GPU/NPU deployment | OmniRT/Ascend path verified | [FlashTalk](../../avatar_models/flashtalk.md) |
+| `flashhead` | `direct_ws` | Existing standalone FlashHead service | Documented | [FlashHead](../../avatar_models/flashhead.md) |

 ## Backend Behavior

--- a/docs/en/model-deployment/tts.md
+++ b/docs/en/model-deployment/tts.md
@@ -1,396 +0,0 @@
-# Text-to-Speech
-
-TTS converts LLM output into audio that drives the talking-head backend. Start with Edge TTS for the lightest evaluation, then switch providers when you need production voices, cloning, or higher quality.
-
-## Provider Options
-
-| Provider | Best for | Required configuration |
-|----------|----------|----------|
-| `edge` | First run, CPU evaluation, no API key | `OPENTALKING_TTS_DEFAULT_PROVIDER=edge` |
-| `dashscope` | Chinese realtime TTS and voice cloning | `OPENTALKING_TTS_DASHSCOPE_API_KEY` and DashScope TTS settings |
-| `local_cosyvoice` | Local Chinese TTS, built-in voices, and cloned voices | CosyVoice3 weights and local service URL |
-| `indextts` | IndexTTS2 controllable voice, emotion control, and cloned voices | `OPENTALKING_TTS_INDEXTTS_BACKEND=local` or `omnirt` |
-| `cosyvoice` | Custom CosyVoice service | CosyVoice WebSocket URL/settings |
-| `elevenlabs` | Hosted multilingual voices | ElevenLabs API key and voice id |
-
-`indextts` is the only provider name exposed by OpenTalking. Deployment can route it to a same-host HTTP sidecar through the `local` backend or to an OmniRT resident backend. This is similar to avatar video model backend selection: OpenTalking always selects `IndexTTS`, while operators switch the runtime backend through environment variables.
-
-## Edge TTS Default
-
-```env title=".env"
-OPENTALKING_TTS_DEFAULT_PROVIDER=edge
-OPENTALKING_TTS_EDGE_VOICE=zh-CN-XiaoxiaoNeural
-```
-
-Edge TTS still needs `ffmpeg` because OpenTalking decodes provider audio into PCM before feeding the avatar backend.
-
-## DashScope Qwen Realtime TTS
-
-```env title=".env"
-OPENTALKING_TTS_DEFAULT_PROVIDER=dashscope
-OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-api-key>
-OPENTALKING_TTS_DASHSCOPE_MODEL=qwen3-tts-flash-realtime
-OPENTALKING_QWEN_TTS_REUSE_WS=1
-```
-
-DashScope TTS does not read `OPENTALKING_LLM_API_KEY` or `DASHSCOPE_API_KEY`; even if you reuse the same actual key, write it explicitly to `OPENTALKING_TTS_DASHSCOPE_API_KEY`.
-
-## Local CosyVoice3 0.5B
-
-The recommended shape is a standalone local CosyVoice service. OpenTalking uses the `local_cosyvoice` provider to consume its PCM stream over HTTP.
-
-```env title=".env"
-OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
-OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
-OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL=FunAudioLLM/Fun-CosyVoice3-0.5B-2512
-OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL_DIR=./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512
-OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR=./models/local-audio/runtime/CosyVoice
-OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL=http://127.0.0.1:19090/synthesize
-OPENTALKING_TTS_LOCAL_COSYVOICE_DEVICE=cuda:0
-OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
-OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=0
-OPENTALKING_TTS_LOCAL_COSYVOICE_MAX_TOKEN_TEXT_RATIO=6
-OPENTALKING_TTS_LOCAL_COSYVOICE_MASK_STOP_TOKENS=1
-```
-
-Download local audio weights:
-
-```bash title="Terminal"
-export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export UV_LINK_MODE=copy
-uv sync --extra dev --extra models --extra local-audio --python 3.11
-.venv/bin/python scripts/download_local_audio_models.py \
-  --root ./models/local-audio \
-  --model fun-cosyvoice3-0.5b-2512
-```
-
-This downloads the base CosyVoice3 model from ModelScope:
-
-| Asset | Source | Destination |
-|---|---|---|
-| Base CosyVoice3 weights | ModelScope `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | `./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/` |
-
-The base model directory must include the files used by the sidecar runtime,
-including `cosyvoice3.yaml`, `llm.pt`, `flow.pt`, `hift.pt`,
-`speech_tokenizer_v3.onnx`, `speech_tokenizer_v3.batch.onnx`, `campplus.onnx`,
-and `flow.decoder.estimator.fp32.onnx`. The built-in zero-shot voice also needs
-a prompt wav configured by `OPENTALKING_TTS_LOCAL_COSYVOICE_PROMPT_AUDIO`; cloned
-voices store their own prompt wav under the local voice directory.
-
-For fp16 TensorRT, download the extra ONNX assets from Hugging Face and place
-them in the same base model directory:
-
-| Asset | Source | Required for |
-|---|---|---|
-| `flow.decoder.estimator.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | `FP16 + LOAD_TRT=1`; OpenTalking builds `flow.decoder.estimator.autocast_fp16.mygpu.plan` from it. |
-| `flow.decoder.estimator.streaming.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | Optional streaming fp16 ONNX asset; keep beside the estimator ONNX for runtime compatibility. |
-
-The generated `*.mygpu.plan` files are machine-specific TensorRT engines. Do not
-copy them between different GPU / TensorRT / CUDA environments; rebuild them on
-the target host from the ONNX files.
-
-This main `.venv` is for OpenTalking, SenseVoice, and the video backend. Keep
-CosyVoice in its own sidecar venv so its `transformers==4.51.3` runtime does not
-conflict with OpenTalking's `transformers>=4.57,<6`.
-
-Prepare the CosyVoice runtime:
-
-```bash title="Terminal"
-mkdir -p ./models/local-audio/runtime
-git clone https://github.com/FunAudioLLM/CosyVoice.git ./models/local-audio/runtime/CosyVoice
-cd ./models/local-audio/runtime/CosyVoice
-git submodule update --init --recursive
-```
-
-Create or update the CosyVoice sidecar venv:
-
-```bash title="Terminal"
-OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
-  bash scripts/prepare_cosyvoice_venv.sh
-```
-
-If you need TensorRT, install the TRT dependencies into the CosyVoice sidecar venv, not into OpenTalking's main `.venv`:
-
-```bash title="Terminal"
-PIP_EXTRA_INDEX_URL=https://pypi.nvidia.com/ OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 \
-  OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice bash scripts/prepare_cosyvoice_venv.sh
-```
-
-Start the local TTS service:
-
-```bash title="Terminal"
-bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
-```
-
-In prior GPU validation, the main CosyVoice3 issue was not a single TTFA number but seed-dependent output-length drift. The local CosyVoice service therefore keeps two stability guards on by default: `OPENTALKING_TTS_LOCAL_COSYVOICE_MASK_STOP_TOKENS=1` masks every stop token exposed by the CosyVoice LLM, and `OPENTALKING_TTS_LOCAL_COSYVOICE_MAX_TOKEN_TEXT_RATIO=6` bounds the token/text ratio so long prompts do not occasionally produce runaway audio. Keep these guards enabled for realtime use.
-
-TensorRT is optional. Enable it only after the current CosyVoice runtime, CUDA, onnxruntime-gpu/TensorRT engines, and model directory are compatible:
-
-```bash title="Terminal"
-.venv-cosyvoice/bin/python -c "import tensorrt as trt; print(trt.__version__)"
-test -f ./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/flow.decoder.estimator.fp32.onnx
-```
-
-For CosyVoice3 fp16 TRT, prefer the official autocast fp16 ONNX asset. A TRT engine can be built from `flow.decoder.estimator.fp32.onnx`, but some GPU/TensorRT combinations can produce NaNs or silent audio. Before enabling `FP16 + LOAD_TRT`, place `flow.decoder.estimator.autocast_fp16.onnx` in the same model directory. If the server needs a proxy for Hugging Face, inject proxy variables only for the download command; do not add them to the OpenTalking service env:
-
-```bash title="Terminal"
-env ALL_PROXY=socks5h://127.0.0.1:7890 HTTPS_PROXY=socks5h://127.0.0.1:7890 \
-  HF_ENDPOINT=https://huggingface.co .venv-cosyvoice/bin/python - <<'PY'
-from huggingface_hub import hf_hub_download
-repo = "yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX"
-target = "./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512"
-for name in ["flow.decoder.estimator.autocast_fp16.onnx", "flow.decoder.estimator.streaming.autocast_fp16.onnx"]:
-    hf_hub_download(repo_id=repo, filename=name, repo_type="model", local_dir=target)
-PY
-```
-
-```env title="scripts/quickstart/env"
-OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
-OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1
-OPENTALKING_TTS_LOCAL_COSYVOICE_TRT_CONCURRENT=1
-OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_HOP_LEN=8
-OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_MAX_HOP_LEN=16
-OPENTALKING_TTS_LOCAL_COSYVOICE_STREAM_SCALE_FACTOR=1
-```
-
-`start_local_cosyvoice.sh` automatically adds the sidecar venv's `site-packages/tensorrt_libs` directory to `LD_LIBRARY_PATH`. On first startup with `FP16 + LOAD_TRT=1`, if `flow.decoder.estimator.autocast_fp16.onnx` exists in the model directory, OpenTalking builds the GPU-specific `flow.decoder.estimator.autocast_fp16.mygpu.plan` from it; this can take longer than a normal startup. SenseVoice still runs in the OpenTalking main `.venv` and should not follow the CosyVoice TRT settings.
-
-After startup, check the sidecar health payload and verify `runtime_flags.load_trt`, `runtime.trt_autocast_fp16`, `streaming`, `llm_token_ratio`, and `llm_stop_token_patch`:
-
-```bash title="Terminal"
-curl -fsS http://127.0.0.1:19090/health | python3 -m json.tool
-```
-
-Measured on a Linux server with an NVIDIA RTX 3090, CosyVoice3 sidecar venv,
-`FP16 + LOAD_TRT=1`, and the autocast fp16 TensorRT plan loaded. The benchmark
-called the sidecar `/synthesize` endpoint directly and measured first PCM byte
-arrival as TTFB:
-
-| Text length | TTFB | Wall time | Audio duration | RTF |
-|---:|---:|---:|---:|---:|
-| 43 chars | 0.683 s | 6.215 s | 7.200 s | 0.863 |
-| 42 chars | 0.642 s | 5.858 s | 6.960 s | 0.842 |
-| 29 chars | 0.639 s | 5.771 s | 6.520 s | 0.885 |
-| **Average** | **0.655 s** | **5.948 s** | **6.893 s** | **0.863** |
-
-For the full local speech input, speech synthesis, and QuickTalk video chain, see [Local STT/TTS + QuickTalk](recipes/local-quicktalk-audio.md).
-
-## IndexTTS Deployment (provider = indextts)
-
-OpenTalking always uses `provider=indextts` for IndexTTS. `OPENTALKING_TTS_INDEXTTS_BACKEND` only selects the runtime topology: `local` means a same-host HTTP sidecar, and `omnirt` means an OmniRT resident service. The frontend, API payloads, and cloned voice metadata do not split this into two providers.
-
-### Option A: Same-host HTTP sidecar (backend = local)
-
-The local backend runs IndexTTS2 in a separate same-host HTTP sidecar. The OpenTalking API process only consumes the sidecar's `audio/L16` PCM stream through `OPENTALKING_TTS_LOCAL_INDEXTTS_SERVICE_URL`. Do not install `index-tts` directly into the OpenTalking `.venv`; the upstream package pins `torch`, `transformers`, `protobuf`, and related dependencies and can break QuickTalk / STT dependencies in the main environment.
-
-Install the OpenTalking main environment and local-audio download dependencies first. Do not install `index-tts` into this environment:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export UV_LINK_MODE=copy
-export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
-
-uv sync --extra dev --extra models --extra local-audio --python 3.11
-.venv/bin/python scripts/download_local_audio_models.py \
-  --root ./models/local-audio \
-  --model indextts2 \
-  --model indextts2-w2v-bert \
-  --model indextts2-maskgct \
-  --model indextts2-campplus \
-  --model indextts2-bigvgan
-```
-Large Hugging Face/Xet files may print `Read timed out` or `SSL record layer failure` while resuming. Continue if the command exits with status 0. If it fails, keep the partial model directory and rerun the same command; the downloader reuses the existing cache and resumes the remaining files.
-
-
-If the machine already has a compatible model root, you can skip the download and set `OPENTALKING_LOCAL_AUDIO_MODEL_ROOT` to that directory. The root must contain at least `IndexTeam__IndexTTS-2/config.yaml`, `facebook__w2v-bert-2.0`, `funasr__campplus`, and `nvidia__bigvgan_v2_22khz_80band_256x`. MaskGCT prefers `amphion__MaskGCT`; an existing `amphion__MaskGCT-ms` directory is also supported as long as it contains `semantic_codec/model.safetensors`.
-
-Create the IndexTTS sidecar runtime next. The upstream repository includes LFS example audio files and some environments may hit LFS quota, so skip smudge; use your own clear 3-15 second prompt audio instead.
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-mkdir -p ./models/local-audio/runtime
-INDEXTTS_RUNTIME_REPO="${INDEXTTS_RUNTIME_REPO:-https://github.com/index-tts/index-tts.git}"
-if [ ! -d ./models/local-audio/runtime/index-tts/.git ]; then
-  for i in 1 2 3; do
-    GIT_LFS_SKIP_SMUDGE=1 git clone "$INDEXTTS_RUNTIME_REPO" ./models/local-audio/runtime/index-tts && break
-    rm -rf ./models/local-audio/runtime/index-tts
-    sleep 3
-  done
-fi
-test -d ./models/local-audio/runtime/index-tts/.git
-cd ./models/local-audio/runtime/index-tts
-export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export UV_LINK_MODE=copy
-uv sync --python 3.11
-uv pip install fastapi "uvicorn[standard]" soundfile
-```
-
-IndexTTS needs a clear 3-15 second human voice prompt. Prepare a default system voice first, or upload a reference audio file in the WebUI voice clone flow.
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-mkdir -p ./models/local-audio/voices/system/indextts-default
-cp /path/to/reference.wav ./models/local-audio/voices/system/indextts-default/prompt.wav
-cat > ./models/local-audio/voices/system/indextts-default/meta.json <<'JSON'
-{"voice_id":"indextts-default","display_label":"IndexTTS Default Voice","provider":"indextts","target_model":"IndexTeam/IndexTTS-2","source":"system"}
-JSON
-```
-
-Start the sidecar service:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-OPENTALKING_LOCAL_AUDIO_MODEL_ROOT=./models/local-audio \
-OPENTALKING_TTS_LOCAL_INDEXTTS_MODEL_DIR=./models/local-audio/IndexTeam__IndexTTS-2 \
-OPENTALKING_TTS_LOCAL_INDEXTTS_CFG_PATH=./models/local-audio/IndexTeam__IndexTTS-2/config.yaml \
-OPENTALKING_TTS_LOCAL_INDEXTTS_PROMPT_AUDIO=./models/local-audio/voices/system/indextts-default/prompt.wav \
-OPENTALKING_TTS_LOCAL_INDEXTTS_W2V_BERT_DIR=./models/local-audio/facebook__w2v-bert-2.0 \
-OPENTALKING_TTS_LOCAL_INDEXTTS_MASKGCT_DIR=./models/local-audio/amphion__MaskGCT \
-OPENTALKING_TTS_LOCAL_INDEXTTS_CAMPPLUS_DIR=./models/local-audio/funasr__campplus \
-OPENTALKING_TTS_LOCAL_INDEXTTS_BIGVGAN_DIR=./models/local-audio/nvidia__bigvgan_v2_22khz_80band_256x \
-OPENTALKING_TTS_LOCAL_INDEXTTS_DEVICE=cuda:0 \
-./models/local-audio/runtime/index-tts/.venv/bin/python scripts/local_indextts_service.py --host 127.0.0.1 --port 19092
-```
-
-Select `indextts` in OpenTalking `.env` and set the backend to `local`:
-
-```env title=".env"
-OPENTALKING_TTS_DEFAULT_PROVIDER=indextts
-OPENTALKING_TTS_ENABLED_PROVIDERS=edge,dashscope,local_cosyvoice,indextts
-OPENTALKING_TTS_INDEXTTS_BACKEND=local
-OPENTALKING_LOCAL_AUDIO_MODEL_ROOT=./models/local-audio
-OPENTALKING_TTS_LOCAL_INDEXTTS_MODEL=IndexTeam/IndexTTS-2
-OPENTALKING_TTS_LOCAL_INDEXTTS_MODEL_DIR=./models/local-audio/IndexTeam__IndexTTS-2
-OPENTALKING_TTS_LOCAL_INDEXTTS_CFG_PATH=./models/local-audio/IndexTeam__IndexTTS-2/config.yaml
-OPENTALKING_TTS_LOCAL_INDEXTTS_SERVICE_URL=http://127.0.0.1:19092/synthesize
-OPENTALKING_TTS_LOCAL_INDEXTTS_PROMPT_AUDIO=./models/local-audio/voices/system/indextts-default/prompt.wav
-OPENTALKING_TTS_LOCAL_INDEXTTS_W2V_BERT_DIR=./models/local-audio/facebook__w2v-bert-2.0
-OPENTALKING_TTS_LOCAL_INDEXTTS_MASKGCT_DIR=./models/local-audio/amphion__MaskGCT
-OPENTALKING_TTS_LOCAL_INDEXTTS_CAMPPLUS_DIR=./models/local-audio/funasr__campplus
-OPENTALKING_TTS_LOCAL_INDEXTTS_BIGVGAN_DIR=./models/local-audio/nvidia__bigvgan_v2_22khz_80band_256x
-OPENTALKING_TTS_LOCAL_INDEXTTS_DEVICE=auto
-```
-
-These `LOCAL_INDEXTTS_*_DIR` variables can be written in the OpenTalking `.env` or passed to the sidecar process. The OpenTalking main process needs the `SERVICE_URL` and voice prompt path; the sidecar needs the model directory, prompt, and local w2v / MaskGCT / campplus / BigVGAN asset directories so it does not reach Hugging Face at runtime.
-
-Start the OpenTalking API and WebUI. This example uses QuickTalk local as the video backend; use `--mock` instead if you only want to validate TTS preview first.
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8000 --web-port 5173
-```
-
-If `8000` or `5173` is already occupied, choose free values for `--api-port` / `--web-port`; update the API port in the following `curl` commands accordingly.
-
-Verify the sidecar first, then verify the OpenTalking API:
-
-```bash title="Terminal"
-curl -fsS http://127.0.0.1:19092/health
-curl -fsS -X POST http://127.0.0.1:19092/synthesize \
-  -H 'content-type: application/json' \
-  -d '{"text":"Hello, this is a local IndexTTS service test.","sample_rate":16000}' \
-  --output /tmp/indextts-local.pcm
-
-.venv/bin/python - <<'PY'
-from pathlib import Path
-pcm = Path("/tmp/indextts-local.pcm").read_bytes()
-assert len(pcm) > 0 and len(pcm) % 2 == 0, len(pcm)
-print("pcm_bytes", len(pcm), "sample_rate", 16000, "channels", 1)
-PY
-
-curl -fsS -X POST http://127.0.0.1:8000/tts/preview \
-  --max-time 300 \
-  -H 'content-type: application/json' \
-  -d '{"text":"Hello, this is an OpenTalking IndexTTS test.","tts_provider":"indextts","tts_voice":"indextts-default","tts_model":"IndexTeam/IndexTTS-2"}' \
-  --output /tmp/opentalking-indextts-preview.wav
-
-.venv/bin/python - <<'PY'
-import wave
-with wave.open("/tmp/opentalking-indextts-preview.wav", "rb") as wf:
-    print("wav", wf.getframerate(), wf.getnchannels(), wf.getsampwidth(), wf.getnframes())
-PY
-```
-
-You can also inspect runtime status to confirm OpenTalking still exposes `indextts` while routing it to the local backend:
-
-```bash title="Terminal"
-curl -fsS http://127.0.0.1:8000/runtime/status | python3 -m json.tool
-```
-
-Expected: `tts_provider` is `indextts`, `tts_providers.indextts.backend` is `local`, `tts_providers.indextts.resolved_provider` is `local_indextts`, and `service_url_set=true`.
-
-
-### Option B: OmniRT resident service (backend = omnirt)
-
-The OmniRT backend keeps IndexTTS resident in a separate service. OpenTalking only consumes the HTTP `audio/L16` PCM stream. The OpenTalking provider is still `indextts`; only the backend changes to `omnirt`.
-
-Start the OmniRT text2audio service first:
-
-```bash title="Terminal"
-cd "$OMNIRT_HOME"
-OMNIRT_INDEXTTS_RUNTIME=1 \
-OMNIRT_INDEXTTS_MODEL_DIR=/data2/zhongyi/model/local-audio/IndexTeam__IndexTTS-2 \
-OMNIRT_INDEXTTS_CFG_PATH=/data2/zhongyi/model/local-audio/IndexTeam__IndexTTS-2/config.yaml \
-OMNIRT_INDEXTTS_PROMPT_AUDIO=/data2/zhongyi/model/local-audio/voices/system/indextts-default/prompt.wav \
-OMNIRT_INDEXTTS_DEVICE=cuda:0 \
-uv run omnirt serve-text2audio --host 127.0.0.1 --port 9012
-```
-
-Then select the provider and backend in OpenTalking `.env`:
-
-```env title=".env"
-OPENTALKING_TTS_DEFAULT_PROVIDER=indextts
-OPENTALKING_TTS_ENABLED_PROVIDERS=edge,dashscope,local_cosyvoice,indextts
-OPENTALKING_TTS_INDEXTTS_BACKEND=omnirt
-OPENTALKING_TTS_OMNIRT_INDEXTTS_SERVICE_URL=http://127.0.0.1:9012/v1/text2audio/indextts
-OPENTALKING_TTS_OMNIRT_INDEXTTS_MODEL=IndexTeam/IndexTTS-2
-OPENTALKING_TTS_OMNIRT_INDEXTTS_STREAMING=1
-OPENTALKING_TTS_OMNIRT_INDEXTTS_STREAMING_MODE=token_window
-OPENTALKING_TTS_OMNIRT_INDEXTTS_MAX_TEXT_TOKENS_PER_SEGMENT=80
-OPENTALKING_TTS_OMNIRT_INDEXTTS_QUICK_STREAMING_TOKENS=4
-OPENTALKING_TTS_OMNIRT_INDEXTTS_INTERVAL_SILENCE_MS=0
-OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_SIZE=40
-OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_HOP=96
-OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_CONTEXT=8
-OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_OVERLAP_MS=60
-```
-
-`token_window` is token-window streaming at the model-token level: OmniRT decodes and emits PCM once enough speech tokens are available, without waiting for the full text segment. It is not 20 ms waveform-level streaming; short-utterance first-byte latency still depends on IndexTTS GPT token generation and vocoder decoding.
-
-## ElevenLabs
-
-```env title=".env"
-OPENTALKING_TTS_DEFAULT_PROVIDER=elevenlabs
-OPENTALKING_TTS_ELEVENLABS_API_KEY=<elevenlabs-api-key>
-OPENTALKING_TTS_ELEVENLABS_VOICE_ID=<voice-id>
-OPENTALKING_TTS_ELEVENLABS_MODEL_ID=eleven_flash_v2_5
-```
-
-## Verification
-
-Create a `mock` session first, then call `/speak`. This verifies TTS without depending on a real talking-head model.
-
-```bash title="Terminal"
-SID=<session-id>
-curl -s -X POST "http://127.0.0.1:8000/sessions/$SID/speak" \
-  -H 'content-type: application/json' \
-  -d '{"text":"你好，这是一次 OpenTalking 语音合成测试。","tts_provider":"indextts","tts_voice":"indextts-default","tts_model":"IndexTeam/IndexTTS-2"}'
-```
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-deployment/wav2lip-local.md
+++ b/docs/en/model-deployment/wav2lip-local.md
@@ -1,20 +0,0 @@
-# Wav2Lip Local
-
-The Wav2Lip local guide has moved to [Wav2Lip / Local](wav2lip/local.md).
-
-This compatibility page keeps old links working. Use the new deployment structure:
-
-- [Wav2Lip Overview](wav2lip.md)
-- [Wav2Lip Local](wav2lip/local.md)
-- [Wav2Lip with OmniRT](wav2lip/omnirt.md)
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-deployment/wav2lip.md
+++ b/docs/en/model-deployment/wav2lip.md
@@ -1,27 +0,0 @@
-# Wav2Lip
-
-Wav2Lip supports both `local` and `omnirt` modes. It is the recommended first real talking-head model because the weights are small and the startup path is clear.
-
-| Item | Value |
-|------|-------|
-| Model ID | `wav2lip` |
-| Backends | `local`, `omnirt`, compatible `direct_ws` |
-| Repo default | `local` |
-| Recommended start | [Wav2Lip Local](wav2lip/local.md) |
-
-## Guides
-
-- [Wav2Lip Local](wav2lip/local.md)
-- [Wav2Lip with OmniRT](wav2lip/omnirt.md)
-- [Talking-Head Models](talking-head/index.md)
-
-## Frontend Entry
-
-After the model or backend service is running, use the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server, forward your local browser port to the server `5173`, then open `http://127.0.0.1:5173`.
--- a/docs/en/model-support/index.md
+++ b/docs/en/model-support/index.md
@@ -24,10 +24,10 @@ runtime backend.
 | Mock | `mock` | Install and WebUI flow validation |
 | Wav2Lip | `local` / `omnirt` | Lightweight lip sync and avatar asset validation |
 | QuickTalk | `local` / `omnirt` | Realtime talking-head validation |
+| FasterLivePortrait | `omnirt` | Single-GPU realtime portrait driving; supports JoyVASA audio driving and camera / uploaded-video Video Clone |
 | MuseTalk | `local` / `omnirt` / `direct_ws` | Higher-quality lip sync; local mode runs official avatar preprocessing before session initialization |
 | FlashTalk | `omnirt` | High-quality realtime digital human, better as a service |
 | FlashHead | `direct_ws` / HTTP adapter | Clip-style generation or existing FlashHead service |
-| FasterLivePortrait | `omnirt` | Single-GPU portrait pasteback, audio-driven realtime conversation, and video clone |

 Actual availability depends on weights, hardware, backend services, and installed
 dependencies. Model-specific pages describe the supported parameters and asset
@@ -36,7 +36,7 @@ requirements.
 ## Next Steps

 - Not sure which model to choose: start with [Model and Backend Selection](./selection.md).
-- Need local STT/TTS with QuickTalk: read [Model Deployment / Local Audio + QuickTalk](../model-deployment/recipes/local-quicktalk-audio.md).
+- Need local STT/TTS with QuickTalk: read [Local Audio + QuickTalk](../recipes/local-quicktalk-audio.md).
+- Need camera or uploaded-video driving: read [FasterLivePortrait](../../avatar_models/fasterliveportrait.md) and [Video Clone](../usage/webui/video-clone.md).
 - Need local runtime details: read [Local Adapter](./runtime-backends/local-adapter.md).
-- Need MuseTalk local setup: read [MuseTalk](./models/musetalk.md).
-- Need camera- or selfie-video-driven cloning: read [FasterLivePortrait](./models/fasterliveportrait.md).
+- Need MuseTalk local setup: read [MuseTalk](../../avatar_models/musetalk.md).
--- a/docs/en/model-support/local-audio-quicktalk.md
+++ b/docs/en/model-support/local-audio-quicktalk.md
@@ -1,10 +1,119 @@
 # Local Audio + QuickTalk

-The full guide has moved to [Model Deployment / Local Audio + QuickTalk](../model-deployment/recipes/local-quicktalk-audio.md).
+This is the local media path for private validation:

-This page remains only to preserve old links. In the current documentation structure:
+```mermaid
+flowchart LR
+  Mic["Microphone / uploaded audio"] --> STT["SenseVoiceSmall<br/>local CPU"]
+  Text["Text input"] --> LLM["LLM<br/>OpenAI-compatible"]
+  STT --> LLM
+  LLM --> TTS["Fun-CosyVoice3-0.5B-2512<br/>local_cosyvoice service"]
+  TTS --> QT["QuickTalk local<br/>CUDA"]
+  QT --> RTC["WebRTC / browser playback"]
+```

-- `Model Support` explains model capabilities, backend choices, and when to use each path.
-- `Model Deployment` contains weights, dependencies, environment variables, startup commands, and verification steps.
+The LLM remains a separate module. It can point to DashScope, OpenAI, vLLM, Ollama, or your own local OpenAI-compatible service. STT, TTS, and video can all run on the same machine.

-To deploy local SenseVoiceSmall, local CosyVoice3, and the QuickTalk local path, read [Model Deployment / Local Audio + QuickTalk](../model-deployment/recipes/local-quicktalk-audio.md).
+## When to Use It
+
+- You want speech input and speech synthesis to run locally.
+- You want QuickTalk driven by OpenTalking's local adapter before introducing OmniRT.
+- You need to validate custom avatars, cloned voices, and the realtime digital-human chain.
+
+This path is not the first choice for machines with 8 GB of VRAM when local TTS and QuickTalk share the GPU. If VRAM is tight, keep SenseVoiceSmall on CPU with QuickTalk local, and use Edge or DashScope TTS first.
+
+## Provider Configuration
+
+```env title=".env"
+OPENTALKING_LLM_PROVIDER=openai_compatible
+OPENTALKING_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
+OPENTALKING_LLM_API_KEY=<llm-key>
+OPENTALKING_LLM_MODEL=qwen-flash
+
+OPENTALKING_STT_DEFAULT_PROVIDER=sensevoice
+OPENTALKING_STT_ENABLED_PROVIDERS=sensevoice,dashscope
+OPENTALKING_STT_SENSEVOICE_MODEL=iic/SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_MODEL_DIR=./avatar_models/local-audio/iic__SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_DEVICE=cpu
+
+OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
+OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
+OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL=FunAudioLLM/Fun-CosyVoice3-0.5B-2512
+OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL_DIR=./avatar_models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512
+OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR=./avatar_models/local-audio/runtime/CosyVoice
+OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL=http://127.0.0.1:19090/synthesize
+OPENTALKING_TTS_LOCAL_COSYVOICE_DEVICE=cuda:0
+
+OPENTALKING_QUICKTALK_BACKEND=local
+OPENTALKING_QUICKTALK_ASSET_ROOT=./avatar_models/quicktalk
+OPENTALKING_QUICKTALK_WORKER_CACHE=1
+OPENTALKING_TORCH_DEVICE=cuda:0
+```
+
+`*_DEFAULT_PROVIDER` only controls the default selection. It is not a fallback chain. If the frontend lets users choose API STT/TTS, configure provider-specific keys explicitly:
+
+```env title=".env"
+OPENTALKING_STT_DASHSCOPE_API_KEY=<dashscope-stt-key>
+OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-tts-key>
+```
+
+## Install and Models
+
+```bash title="terminal"
+uv sync --extra dev --extra models --extra local-audio --extra quicktalk-cuda --python 3.11
+python scripts/download_local_audio_models.py \
+  --root ./avatar_models/local-audio \
+  --model sensevoice-small \
+  --model fun-cosyvoice3-0.5b-2512
+```
+
+Use the main `.venv` for OpenTalking, SenseVoice, and QuickTalk. Create a
+separate CosyVoice sidecar venv after the runtime checkout.
+
+Prepare QuickTalk weights as described in [QuickTalk Local Deployment](../avatar_models/deployment/quicktalk-local.md). Put the CosyVoice runtime under the model directory:
+
+```bash title="terminal"
+mkdir -p ./avatar_models/local-audio/runtime
+git clone https://github.com/FunAudioLLM/CosyVoice.git ./avatar_models/local-audio/runtime/CosyVoice
+cd ./avatar_models/local-audio/runtime/CosyVoice
+git submodule update --init --recursive
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
+```
+
+## Start
+
+Start the local TTS service first:
+
+```bash title="terminal"
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
+```
+
+Then start OpenTalking:
+
+```bash title="terminal"
+bash scripts/start_unified.sh --backend local --model quicktalk
+```
+
+## Verify
+
+```bash title="terminal"
+curl -fsS http://127.0.0.1:19090/health
+curl -fsS http://127.0.0.1:8000/api/runtime/status
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
+Expected:
+
+- `stt_provider=sensevoice`
+- `tts_provider=local_cosyvoice`
+- `quicktalk_backend=local`
+- `quicktalk.connected=true`
+
+In the frontend, select local STT, local CosyVoice3, a shared avatar, and the
+`quicktalk` model. Test text input, microphone input, and TTS preview.
+
+## Mixing API Providers
+
+The local path is not mandatory. Users can choose API STT or API TTS in the frontend, but the backend will not implicitly reuse the LLM key or `DASHSCOPE_API_KEY`. Missing API provider keys are blocked before session startup. API errors during a conversation are shown in the digital-human chat view.
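The key check described under Mixing API Providers can be sketched as follows. This is a minimal illustration, not OpenTalking's actual startup code: the `REQUIRED_KEYS` mapping and the `missing_provider_keys` helper are hypothetical, and only the environment variable names are taken from the `.env` examples above.

```python
"""Preflight sketch: report API providers that are enabled but have no key."""
import os

# Assumption: DashScope is the only provider here that needs its own key.
# The variable names come from the .env examples; the mapping is illustrative.
REQUIRED_KEYS = {
    ("STT", "dashscope"): "OPENTALKING_STT_DASHSCOPE_API_KEY",
    ("TTS", "dashscope"): "OPENTALKING_TTS_DASHSCOPE_API_KEY",
}


def missing_provider_keys(env):
    """Return the key variables that an enabled provider needs but env lacks."""
    missing = []
    for (kind, provider), key_name in REQUIRED_KEYS.items():
        enabled = env.get(f"OPENTALKING_{kind}_ENABLED_PROVIDERS", "")
        providers = {p.strip() for p in enabled.split(",") if p.strip()}
        # No implicit reuse of other keys: each provider needs its own variable.
        if provider in providers and not env.get(key_name):
            missing.append(key_name)
    return missing


if __name__ == "__main__":
    for name in missing_provider_keys(os.environ):
        print(f"missing: {name}")
```

Running it in the shell that will start OpenTalking lists any provider-specific key that still needs to be set. The check works on any mapping, so it can also be pointed at a parsed `.env` file.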
--- a/docs/en/model-support/models/fasterliveportrait.md
+++ b/docs/en/model-support/models/fasterliveportrait.md
@@ -1,81 +1,193 @@
 # FasterLivePortrait

-## Model Introduction
+## When to Use FasterLivePortrait

-FasterLivePortrait is integrated through OmniRT. OpenTalking currently supports two paths:
+FasterLivePortrait is a good fit for realtime portrait driving on a single CUDA GPU. In OpenTalking it is integrated through OmniRT and currently supports two workflows:

-- Realtime conversation: OpenTalking generates speech and OmniRT drives the avatar through `/v1/audio2video/fasterliveportrait`.
-- Video clone: OpenTalking keeps a digital-human asset as the `source`, while browser camera frames or an uploaded video act as the `driving` input. Frames are streamed through an independent video-clone WebSocket.
+- **Realtime conversation**: the selected digital-human avatar is the source; TTS/audio is converted into motion by JoyVASA, then FasterLivePortrait pastes the animated face back into the original avatar image.
+- **Video Clone**: the selected avatar remains the source; browser camera frames or an uploaded selfie video act as the driving video for expression, mouth, and head motion.

-Video clone does not enter the LLM, STT, or TTS conversation pipeline. It is a visual driving workflow for testing camera-driven facial expression and head motion.
+Realtime conversation still uses the normal OpenTalking LLM / STT / TTS / WebRTC session path. Video Clone is video-driven only and does not enter the LLM, TTS, or STT conversation chain.

-## Suitable Scenarios
+![FasterLivePortrait Video Clone workspace](../../../assets/images/model-support/fasterliveportrait-video-clone.png)

-- Realtime preview of expression, head motion, and mouth motion.
-- Use a clear frontal or half-body image as `source`, then drive it with a camera or selfie video.
-- Use “Realtime Conversation” and “Video Clone” side by side in the same WebUI.
+## Concept Boundaries
+
+| Concept | Meaning |
+| --- | --- |
+| `source` | A digital-human image or video asset from the OpenTalking avatar library. This is the character shown in the output. |
+| `driving` | Browser camera frames or an uploaded selfie video. This only provides expression, mouth, and head motion. |
+| Pasteback | Paste the animated face region back into the original source composition to preserve body, background, and aspect ratio. |
+| Crop driving | Crop the driving video's face region before driving. In Video Clone it can be disabled when full-frame detection or preview is preferable. |
+
+Video Clone does not use the camera user as the source. The camera or uploaded video is only the driving input.

 ## Recommended Runtime Backend

-Use `omnirt`. OpenTalking owns assets, WebUI, parameters, and browser frame streaming. OmniRT keeps FasterLivePortrait loaded and provides inference WebSockets.
+Use `omnirt` for FasterLivePortrait. OpenTalking owns WebUI, avatar selection, session or Video Clone bridge, playback, and parameter updates. OmniRT owns FasterLivePortrait, JoyVASA, TensorRT/ONNXRuntime components, and model weights.

-| OpenTalking entry | OmniRT entry | Purpose |
+| Capability | OpenTalking entry | OmniRT entry |
 | --- | --- | --- |
-| `/sessions` with a FasterLivePortrait session | `/v1/audio2video/fasterliveportrait` | Audio-driven realtime conversation |
-| `/video-clone/fasterliveportrait/ws` | `/v1/avatar/video-clone/fasterliveportrait` | Video-clone frame stream |
+| Realtime audio-driven conversation | Normal session creation and `/sessions/{id}/speak` | `/v1/audio2video/fasterliveportrait` |
+| Video-driven Video Clone | WebUI Video Clone workspace | `/v1/video2video/fasterliveportrait` |
+| Model status | `/models` or WebUI status | `/v1/audio2video/models` |

-## Weights and Startup
+## Weights and Source Requirements

-First follow [FasterLivePortrait deployment](../../model-deployment/fasterliveportrait.md) to prepare the FasterLivePortrait source checkout, JoyVASA weights, TensorRT/ONNXRuntime dependencies, OmniRT, and OpenTalking.
+OmniRT needs a FasterLivePortrait source checkout and a checkpoint directory. The examples below reference these paths through environment variables so that model files stay outside the OpenTalking repository:

-Check whether OpenTalking sees the video-clone service:
-
-```bash
-curl -s http://127.0.0.1:8000/video-clone/status | python3 -m json.tool
+```bash title="terminal"
+export DIGITAL_HUMAN_HOME=/opt/digital_human
+export OPENTALKING_HOME="$DIGITAL_HUMAN_HOME/opentalking"
+export OMNIRT_HOME="$DIGITAL_HUMAN_HOME/omnirt"
+export FASTERLIVEPORTRAIT_HOME="$DIGITAL_HUMAN_HOME/FasterLivePortrait"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
 ```

-`connected` should be `true`. If it is `false`, check whether OmniRT started the FasterLivePortrait runtime and whether `OMNIRT_ENDPOINT` points to that service.
+The checkpoint directory must include at least:
+
+```text
+$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints/
+  JoyVASA/
+    motion_generator/motion_generator_hubert_chinese.pt
+    motion_template/motion_template.pkl
+  chinese-hubert-base/
+    config.json
+    preprocessor_config.json
+    pytorch_model.bin
+  liveportrait/ or appearance_feature_extractor.onnx and other FasterLivePortrait ONNX/TRT files
+```
+
+Preflight check:
+
+```bash title="terminal"
+test -f "$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints/JoyVASA/motion_generator/motion_generator_hubert_chinese.pt"
+test -f "$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints/JoyVASA/motion_template/motion_template.pkl"
+test -f "$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints/chinese-hubert-base/pytorch_model.bin"
+```
+
+## Start OmniRT
+
+```bash title="terminal"
+cd "$OMNIRT_HOME"
+uv sync --extra server --extra fasterliveportrait --python 3.11
+
+OMNIRT_FASTLIVEPORTRAIT_RUNTIME=1 \
+OMNIRT_FASTLIVEPORTRAIT_LOAD_MODELS=1 \
+OMNIRT_FASTLIVEPORTRAIT_ROOT="$FASTERLIVEPORTRAIT_HOME" \
+OMNIRT_FASTLIVEPORTRAIT_CHECKPOINTS_DIR="$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints" \
+OMNIRT_FASTLIVEPORTRAIT_CFG=configs/trt_infer.yaml \
+OMNIRT_FASTLIVEPORTRAIT_DEVICE=cuda:0 \
+OMNIRT_FASTLIVEPORTRAIT_JPEG_QUALITY=85 \
+uv run omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
+```
+
+Verify OmniRT model status:
+
+```bash title="terminal"
+curl -s http://127.0.0.1:9000/v1/audio2video/models | jq '.statuses[] | select(.id=="fasterliveportrait")'
+```
+
+The expected result has `connected=true`.
+
+!!! tip "Next step"
+    After OmniRT returns `connected=true`, continue with [Start OpenTalking WebUI](#start-opentalking-webui). That section starts OpenTalking API and the frontend page. For general startup script details, see [Command Line Tools](../../usage/cli.md). For ports, host binding, and OmniRT endpoint parameters, see [Advanced CLI Arguments](../../usage/cli-advanced.md).
+
+## Start OpenTalking WebUI
+
+OpenTalking uses FasterLivePortrait through OmniRT. The command below starts the OpenTalking API / unified backend, and `start_unified.sh` also starts WebUI:
+
+```bash title="terminal"
+cd "$OPENTALKING_HOME"
+export OPENTALKING_OMNIRT_ENDPOINT=http://127.0.0.1:9000
+export OMNIRT_ENDPOINT=http://127.0.0.1:9000
+bash scripts/start_unified.sh --backend omnirt --model fasterliveportrait
+```
+
+After startup, the terminal prints the WebUI URL. The default is `http://127.0.0.1:5173`. After opening the page:
+
+- For audio-driven realtime conversation: open Realtime Conversation, select FasterLivePortrait, and select a compatible avatar.
+- For camera or uploaded-video driving: open Video Clone and follow the [Video Clone guide](../../usage/webui/video-clone.md).
+
+For Video Clone validation, OpenTalking API still needs to reach OmniRT and `/models` should report `fasterliveportrait` as connected.
+
+## Realtime Conversation Parameters
+
+Realtime conversation defaults live in `configs/synthesis/fasterliveportrait.yaml`. Common fields:
+
+| Parameter | Effect | Common setting |
+| --- | --- | --- |
+| `width` / `height` | Output shape | Start from `448` width for realtime |
+| `fps` | Output frame rate | Default `25` |
+| `animation_region` | Driven region | Conversation default is `lip` to reduce exaggerated full-face motion |
+| `head_motion_multiplier` | Overall head motion | `0.2-0.8` |
+| `pose_motion_multiplier` | Pose motion | `0.2-0.5` |
+| `mouth_open_multiplier` | Mouth opening | `1.0-1.4` |
+| `mouth_corner_multiplier` | Mouth-corner movement | `0.7-1.0` |
+| `driving_multiplier` | Overall keypoint amplitude | `0.8-1.2` |
+| `cfg_scale` | JoyVASA audio-following strength | `3.5-4.5` |
+| `flag_stitching` | Stabilize face boundary | Keep enabled |
+| `flag_normalize_lip` | Reduce initial mouth-shape offset | Keep enabled |
+| `flag_relative_motion` | Preserve source base pose | Enabled by default for conversation |
+| `flag_lip_retargeting` | Improve mouth following | Enable if it improves the result |
+
+After selecting FasterLivePortrait in the frontend, these amplitude controls can be updated live. During a running session, updates usually take effect on the next audio chunk.
+
+## Video Clone Parameters
+
+The Video Clone workspace provides source selection, driving input, and realtime controls. Live camera driving is the primary path; an uploaded driving video is useful for validation and near-realtime testing.
+
+| Control | Effect | Suggestion |
+| --- | --- | --- |
+| Camera | Select browser input device | Allow browser camera permission on first use |
+| FPS | Camera sampling rate | Start with `12` or `15` |
+| Resolution | Driving frame sampling size | Start with `448px` |
+| Mirror preview | Local preview only | Usually enabled for selfie camera |
+| Driving region | `all` / `exp` / `pose` / `lip` / `eye` | Use `lip` or `exp` for mouth tests; `all` for richer expression |
+| Pasteback | Paste output into source image | Keep enabled to avoid an over-zoomed head-only result |
+| Crop driving face | Crop driving input | Disable when uploaded-video aspect ratio or face position looks wrong |
+| Lip retargeting | Improve mouth following | Try when the mouth looks puffy or does not open enough |
+| Relative motion | Preserve source pose offset | Usually disable this when lip retargeting is enabled |
+
+If uploaded video driving makes the mouth look puffy or too closed, debug in this order:
+
+1. Disable `Crop driving face` to make sure the driving video is not cropped too tightly.
+2. Enable `Pasteback` so the output keeps the original source composition.
+3. Enable `Lip retargeting` and disable `Relative motion`.
+4. Change `animation_region` from `lip` to `exp` or `all` and check whether mouth corners and cheeks recover.
+5. Tune `mouth_open_multiplier` around `0.8-1.3` and `mouth_corner_multiplier` around `1.0-1.3`.
+
+Lip retargeting can improve mouth following, but combined with relative motion it may reduce the mouth to mostly vertical open/close. Treat those two switches as a pair during Video Clone tuning.

 ## Avatar Requirements

-`source` is the digital-human image in the OpenTalking avatar library. Recommended assets:
+- Use a clear frontal or half-body source when possible.
+- Enable `Pasteback` when you want to preserve a half-body composition.
+- If aspect ratio or crop is wrong, inspect the avatar preview and source image before changing only driving parameters.
+- Video Clone can reuse existing avatar-library assets. You do not need to upload the camera user as an avatar.

-- Clear frontal or half-body image.
-- Unblocked face and stable lighting.
-- Full head-and-shoulder composition. With pasteback enabled, output is pasted back into the original source image instead of showing only a cropped head.
+## Verify

-The Video Clone page can upload a new `source` image directly. OpenTalking reuses the existing `/avatars/custom` asset API, adds the image to the avatar library, and selects it as the current source.
+1. Start OmniRT and confirm `fasterliveportrait` is connected.
+2. Start OpenTalking WebUI.
+3. Open Realtime Conversation, select FasterLivePortrait, send a short text message, and confirm the audio-driven path still works.
+4. Open Video Clone, select a digital-human source, allow camera permission, click Start, and confirm the source avatar follows camera expression.
+5. Stop or leave the page and confirm camera tracks, WebSocket session, and OmniRT session are released.

-## Driving Input
+## Troubleshooting

-`driving` controls expression and head motion. It is not the source identity:
+### `/models` shows `runtime_not_enabled`

-- Live camera frames are the primary path.
-- Uploaded driving video is useful for offline or near-realtime testing.
-- Driving video is not cropped by default. If the face is too small or detection is unstable, try enabling “crop driving face”.
+Start OmniRT with `OMNIRT_FASTLIVEPORTRAIT_RUNTIME=1` and check checkpoint paths.

-## Frontend Controls
+### Audio driving has no lip motion

-The Video Clone page exposes driving controls:
+Check JoyVASA motion generator, motion template, and `chinese-hubert-base/pytorch_model.bin`.

-| Control | Effect |
-| --- | --- |
-| Motion / expression / head-motion amplitude | Overall motion and expression strength |
-| Mouth opening | First knob to raise when uploaded videos do not open the mouth enough |
-| Animation region | Full face, expression, pose, mouth, or eyes |
-| Pasteback | Preserve the original source composition instead of showing only a cropped head |
-| Relative motion | Preserve relative motion differences between source and driving |
-| Lip normalization / lip retargeting | Can improve mouth shape, but aggressive retargeting may reduce motion to simple vertical opening |
+### Video Clone cannot start the camera

-If the mouth looks puffy or misaligned, first check whether the driving video is being cropped. Then try disabling crop, keeping pasteback enabled, and balancing mouth opening with lip retargeting.
+Open the page from `localhost` / `127.0.0.1` or HTTPS, allow browser camera permission, and confirm OpenTalking API can connect to OmniRT.

-## WebUI Flow
+### Uploaded-video driving differs from camera driving

-1. Start the OmniRT FasterLivePortrait runtime.
-2. Start OpenTalking API and WebUI.
-3. Open WebUI and select “Video Clone” in the top navigation.
-4. Select or upload the source avatar on the left.
-5. Select a camera, or upload a driving video on the right.
-6. Click Start and inspect the cloned output in the center.
-
-When stopped or when the page is switched, WebUI releases the camera track, WebSocket, and current video-clone session.
+Uploaded videos are sensitive to resolution, face position, crop, and scaling. Disable driving crop first, then tune lip retargeting, relative motion, and mouth multipliers.
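The checkpoint checks listed under Weights and Source Requirements and repeated under Troubleshooting can also be scripted. A minimal sketch, assuming the directory layout shown in the guide above; `missing_checkpoints` is a hypothetical helper, not an OmniRT or OpenTalking API.

```python
import os
from pathlib import Path

# Relative paths taken from the checkpoint tree in the guide.
REQUIRED_FILES = (
    "JoyVASA/motion_generator/motion_generator_hubert_chinese.pt",
    "JoyVASA/motion_template/motion_template.pkl",
    "chinese-hubert-base/pytorch_model.bin",
)


def missing_checkpoints(checkpoints_dir):
    """Return the required files that are absent under checkpoints_dir."""
    root = Path(checkpoints_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumption: OMNIRT_MODEL_ROOT is exported as in the guide.
    model_root = Path(os.environ.get("OMNIRT_MODEL_ROOT", "."))
    for rel in missing_checkpoints(model_root / "FasterLivePortrait" / "checkpoints"):
        print(f"missing: {rel}")
```

An empty result means the three JoyVASA and HuBERT files named in the guide are present. It does not validate the FasterLivePortrait ONNX/TRT files, whose exact names the guide leaves open.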
--- a/docs/en/model-support/models/flashhead.md
+++ b/docs/en/model-support/models/flashhead.md
@@ -2,37 +2,55 @@

 ## Model Introduction

-TODO
+FlashHead is integrated in OpenTalking primarily through an external generative service. The adapter writes audio chunks as WAV, calls the FlashHead HTTP generation endpoint, then decodes the returned video fragments back into frames for the existing WebRTC playback path.

 ## Suitable Scenarios

-TODO
+- Higher visual quality where fragment-style generation latency is acceptable.
+- HTTP service integration instead of WebSocket streaming.
+- Saving generation results as video clips.

 ## Recommended Runtime Backend

-TODO
+Use a standalone FlashHead HTTP service and connect it through the `direct_ws` / adapter path. If OmniRT later provides a unified FlashHead audio2video endpoint, the backend can be switched there.

 ## Hardware Requirements

-TODO
+Hardware requirements are driven mainly by the FlashHead service. OpenTalking handles audio upload, result retrieval, and video decoding, so the heavy model should stay outside the API process.

 ## Weights and Asset Requirements

-TODO
+Weights live on the FlashHead service side. OpenTalking needs:
+
+- A reachable FlashHead base URL.
+- A shared directory or downloadable output URL.
+- An avatar reference image.
+- Input audio chunks.

 ## Avatar Requirements

-TODO
+Use a clear front-facing reference image. The FlashHead HTTP client writes the reference image into the shared directory and passes it to the model service in the generation request.

 ## OpenTalking Configuration

-TODO
+```bash
+export OPENTALKING_FLASHHEAD_BACKEND=direct_ws
+export OPENTALKING_FLASHHEAD_BASE_URL=http://127.0.0.1:8766
+```

 ## Start and Verify

-TODO
+1. Start the FlashHead HTTP service.
+2. Configure `OPENTALKING_FLASHHEAD_BASE_URL`.
+3. Start OpenTalking.
+4. In the WebUI, select the `flashhead` model and create a session.

 ## Common Issues

-TODO
+### OpenTalking receives a video but cannot read it

+Configure `OPENTALKING_FLASHHEAD_OUTPUT_LOCAL_DIR` / `OPENTALKING_FLASHHEAD_OUTPUT_REMOTE_DIR`, or provide `OPENTALKING_FLASHHEAD_OUTPUT_BASE_URL`.
+
+### Latency is higher than streaming models
+
+FlashHead currently uses fragment-style HTTP generation, which is fundamentally different from frame-by-frame streaming WebSocket models. To reduce latency, lower `frame_num`, tune the preset, or deploy stronger hardware.
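The FlashHead introduction above says the adapter writes audio chunks as WAV before calling the generation endpoint. A minimal sketch of that step using Python's standard `wave` module; it assumes 16-bit mono PCM at 16 kHz, and `pcm16_to_wav_bytes` is a hypothetical helper rather than the adapter's real code.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm, sample_rate=16000, channels=1):
    """Wrap raw 16-bit little-endian PCM in a WAV container held in memory."""
    if len(pcm) % (2 * channels) != 0:
        raise ValueError("PCM length is not a whole number of 16-bit frames")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

The returned bytes can be written to the shared directory or sent in a request body, depending on how the FlashHead service expects to receive audio.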
--- a/docs/en/model-support/models/flashtalk.md
+++ b/docs/en/model-support/models/flashtalk.md
@@ -2,55 +2,87 @@

 ## When to Use FlashTalk

-TODO
+FlashTalk is suited to high-quality realtime digital humans, livestream hosts, customer-support avatars, and other scenarios that need stronger expressiveness. It is heavier than Wav2Lip / QuickTalk and should usually run as an OmniRT service instead of inside the OpenTalking API process.

 ## OpenTalking and OmniRT Boundary

-TODO
+OpenTalking owns the WebUI, sessions, TTS, WebRTC, recording, and status management. FlashTalk weight loading, GPU scheduling, and actual inference are handled by OmniRT or a dedicated FlashTalk service.

 ## Requirements

 ### GPU

-TODO
+Multiple GPUs are recommended; a single GPU works only if it has enough VRAM. FlashTalk has stricter requirements for memory, throughput, and service stability than Wav2Lip or QuickTalk.

 ### NPU

-TODO
+If the FlashTalk backend is already adapted for NPU, expose it through OmniRT. OpenTalking should not manage the NPU runtime directly.

 ### Memory

-TODO
+If memory is tight, prefer quantization, lower resolution, fewer concurrent sessions, shorter cache windows, or splitting the model service.

 ### Disk

-TODO
+You need space for weights, quantized weights, temporary audio/video files, and logs. Production deployments should keep weights and runtime caches on fast disks.

 ## Prepare Weights

-TODO
+FlashTalk weights usually live on the OmniRT service side. OpenTalking keeps the following default paths:
+
+```bash
+OPENTALKING_FLASHTALK_CKPT_DIR=./avatar_models/SoulX-FlashTalk-14B
+OPENTALKING_FLASHTALK_WAV2VEC_DIR=./avatar_models/chinese-wav2vec2-base
+```
+
+For production, let OmniRT manage these paths.

 ## Start OmniRT

-TODO
+```bash
+bash scripts/quickstart/start_omnirt_flashtalk.sh
+```

 ## Configure OpenTalking

-TODO
+```bash
+export OPENTALKING_OMNIRT_ENDPOINT=http://127.0.0.1:9000
+export OPENTALKING_FLASHTALK_BACKEND=omnirt
+```

 ## Start OpenTalking

-TODO
+```bash
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model flashtalk \
+  --omnirt http://127.0.0.1:9000
+```

 ## Verify

-TODO
+1. Run `bash scripts/quickstart/status.sh`.
+2. Confirm that `flashtalk` appears in the model list.
+3. Select a FlashTalk-compatible avatar in the WebUI.
+4. Send a short prompt and observe first-frame latency, audio boundaries, and stability.

 ## Performance Notes

-TODO
+- Do not load the heavy FlashTalk model inside the OpenTalking API process; run it as a separate service.
+- Limit session duration and queue length in production.
+- TTS chunking affects first-frame latency and continuity.
+- For multi-model deployments, give FlashTalk a dedicated GPU or host.

 ## Troubleshooting

-TODO
+### Queue is blocked

+Check the slot timeout, the maximum session limit, and the number of active sessions. Production deployments should have explicit session release rules.
+
+### First-frame latency is too high
+
+Check for cold starts, long TTS waits, overly large `frame_num`, or heavy sampling settings.
+
+### Out of memory
+
+Consider quantization, lower resolution, fewer concurrent sessions, service splitting, or a larger VRAM device.
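+
+To see how close each device is to its limit while a session runs, a query like the one below can help. It is a sketch that assumes an NVIDIA host where `nvidia-smi` is on the path of the machine running the FlashTalk service.
+
+```bash
+# Per-GPU memory in use versus total, refreshed every 2 seconds (Ctrl+C to stop).
+nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 2
+```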
--- a/docs/en/model-support/models/musetalk.md
+++ b/docs/en/model-support/models/musetalk.md
@@ -127,4 +127,4 @@ When OmniRT or the local runtime provides the model, it should report
 | `reason=omnirt_unavailable` | Check that OmniRT reports `/v1/audio2video/musetalk`. |
 | `No module named 'mmcv._ext'` | The preprocessing Python lacks full OpenMMLab dependencies; use an `OPENTALKING_MUSETALK_PREPROCESS_PYTHON` environment with full `mmcv`. |
 | Session fails during preprocessing | Check that `OPENTALKING_MUSETALK_REPO` points to the official MuseTalk source and that `dwpose` and `face-parse-bisenet` weights exist. |
-| Avatar mismatch | Use an avatar with `model_type: musetalk`. |
+| Avatar asset unavailable | Check that the avatar is uploaded, readable, and the session configuration is complete. |
--- a/docs/en/model-support/models/quicktalk.md
+++ b/docs/en/model-support/models/quicktalk.md
@@ -2,33 +2,78 @@

 ## When to Use QuickTalk

-TODO
+QuickTalk fits realtime lip-sync and low-latency validation. It is a good path when
+you want a real local model instead of Mock.

 ## Requirements

-TODO
+- NVIDIA GPU is recommended.
+- The local asset root must contain `checkpoints/`.
+- You need `quicktalk.pth`, `repair.npy`, `chinese-hubert-large/`, and InsightFace `auxiliary/models/buffalo_l/`.
+- Avatars use OpenTalking's shared avatar flow; templates or caches needed by
+  QuickTalk are generated by deployment commands, upload flows, or session startup.

 ## Prepare Weights

-TODO
+The full download commands live in [QuickTalk Local Deployment](../../avatar_models/deployment/quicktalk-local.md#weight-preparation). This page keeps only the layout and configuration essentials.
+
+```bash
+export OPENTALKING_QUICKTALK_ASSET_ROOT=/path/to/quicktalk
+```
+
+```text
+quicktalk/
+  checkpoints/
+    quicktalk.pth
+    repair.npy
+    chinese-hubert-large/
+      pytorch_model.bin
+    auxiliary/models/buffalo_l/
+      det_10g.onnx
+```

 ## Prepare Avatar

-TODO
+Use the shared flow in [Avatar Assets](../../avatar_models/avatar.md). QuickTalk does
+not require the avatar manifest to be bound to a dedicated type; if the runtime needs a
+fixed template video, make sure that asset is reachable from deployment configuration
+or session initialization.

 ## Configure Backend

-TODO
+Local backend:
+
+```bash
+bash scripts/start_unified.sh --backend local --model quicktalk
+```
+
+OmniRT backend:
+
+```bash
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model quicktalk \
+  --omnirt http://127.0.0.1:9000
+```

 ## Start Service

-TODO
+```bash
+bash scripts/start_unified.sh --backend local --model quicktalk
+```

 ## Verify

-TODO
+```bash
+uv run opentalking-quicktalk-bench \
+  --asset-root ./examples/avatars/quicktalk-daytime \
+  --template-video ./examples/avatars/quicktalk-daytime/quicktalk/template_900.mp4 \
+  --audio ./assets/test.wav \
+  --output ./outputs/quicktalk-bench.mp4 \
+  --device cuda:0
+```
+
+Or verify in the WebUI by selecting the `quicktalk` model and sending a short prompt.

 ## Troubleshooting

-TODO
-
+- `connected=false`: check the asset path, QuickTalk dependencies, and `OPENTALKING_TORCH_DEVICE`.
+- Slow first turn: enable `OPENTALKING_QUICKTALK_WORKER_CACHE=1`.
+- Avatar load failure: make sure the avatar is readable and any configured template
+  asset is reachable.
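+
+When `connected=false` points at the asset path, a quick file check can narrow it down. The loop below is a sketch that only tests for the files named in the layout under Prepare Weights; it assumes `OPENTALKING_QUICKTALK_ASSET_ROOT` is exported in the current shell.
+
+```bash
+# Sketch only: print every expected QuickTalk file that is missing.
+root="${OPENTALKING_QUICKTALK_ASSET_ROOT:?set OPENTALKING_QUICKTALK_ASSET_ROOT first}"
+for rel in \
+  checkpoints/quicktalk.pth \
+  checkpoints/repair.npy \
+  checkpoints/chinese-hubert-large/pytorch_model.bin \
+  checkpoints/auxiliary/models/buffalo_l/det_10g.onnx; do
+  [ -e "$root/$rel" ] || echo "missing: $root/$rel"
+done
+```
+
+No output means all four files exist; it does not verify that they are complete or loadable.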
--- a/docs/en/model-support/models/wav2lip.md
+++ b/docs/en/model-support/models/wav2lip.md
@@ -2,39 +2,102 @@

 ## When to Use Wav2Lip

-TODO
+Wav2Lip is a good fit for quick lip-sync validation, image avatars, short-video avatars, and lightweight local demos. It has a relatively low deployment cost and a clear asset path, which makes it a strong first real model for OpenTalking.

 ## Requirements

-TODO
+- Python dependencies must include the Wav2Lip extra.
+- The model weights must include `wav2lip384.pth` or a compatible checkpoint.
+- `s3fd.pth` is required for face detection.
+- NVIDIA GPU is recommended; CPU is only for functional checks.
+- Avatars use OpenTalking's shared avatar flow; Wav2Lip consumes a reference image,
+  preprocessed frames, or a detectable face region when it runs.

 ## Prepare Weights

-TODO
+Default model directory:

-## Prepare Avatar
+```text
+./avatar_models/wav2lip
+```

-TODO
+Configurable paths:
+
+```bash
+export OPENTALKING_WAV2LIP_MODEL_ROOT=./avatar_models/wav2lip
+export OPENTALKING_WAV2LIP_CHECKPOINT=./avatar_models/wav2lip/wav2lip384.pth
+```
+
+`s3fd.pth` can live at:
+
+```text
+./avatar_models/wav2lip/s3fd.pth
+```
+
+Full download commands are in [Wav2Lip Local](../../avatar_models/deployment/wav2lip-local.md).
+
+## Prepare Avatar Derivatives
+
+To pre-generate image-frame assets for Wav2Lip:
+
+```bash
+uv run python scripts/prepare_wav2lip_image_asset.py \
+  --source-image ./assets/avatar.png \
+  --out ./examples/avatars/my-wav2lip \
+  --avatar-id my-wav2lip \
+  --name "My Wav2Lip Avatar"
+```
+
+To pre-generate video-frame assets for Wav2Lip:
+
+```bash
+uv run python scripts/prepare_wav2lip_video_asset.py \
+  --source-video ./assets/avatar.mp4 \
+  --out ./examples/avatars/my-wav2lip-video \
+  --avatar-id my-wav2lip-video \
+  --name "My Wav2Lip Video Avatar"
+```

 ## Configure Backend

 ### local

-TODO
+```bash
+bash scripts/start_unified.sh --backend local --model wav2lip
+```

 ### omnirt

-TODO
+```bash
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model wav2lip \
+  --omnirt http://127.0.0.1:9000
+```

 ## Start Service

-TODO
+```bash
+bash scripts/start_unified.sh --backend local --model wav2lip
+```

 ## Verify

-TODO
+1. Open the WebUI.
+2. Select an available avatar.
+3. Select the `wav2lip` model.
+4. Send a short sentence and confirm first frame, audio, and lip output.

 ## Troubleshooting

-TODO
+### Missing checkpoint

+Check `OPENTALKING_WAV2LIP_MODEL_ROOT` and `OPENTALKING_WAV2LIP_CHECKPOINT`.
+
+### Missing `s3fd.pth`
+
+Put `s3fd.pth` under the model directory, `./avatar_models/wav2lip/` by default.
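+
+Both missing-file cases can be checked in one pass. The snippet below is a sketch: it falls back to the default paths from Prepare Weights when the variables are unset, and it assumes `s3fd.pth` sits directly under the model root.
+
+```bash
+# Sketch only: print whichever of the two required files is missing.
+root="${OPENTALKING_WAV2LIP_MODEL_ROOT:-./avatar_models/wav2lip}"
+ckpt="${OPENTALKING_WAV2LIP_CHECKPOINT:-$root/wav2lip384.pth}"
+for f in "$ckpt" "$root/s3fd.pth"; do
+  [ -f "$f" ] || echo "missing: $f"
+done
+```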
+
+### Mouth region looks unnatural
+
+Adjust `OPENTALKING_WAV2LIP_PADS`, the postprocess mode, and the avatar reference image.
--- a/docs/en/model-support/selection.md
+++ b/docs/en/model-support/selection.md
@@ -11,7 +11,9 @@ Choose `mock` when you only need to verify WebUI, API, TTS, events, and WebRTC.
 Choose `wav2lip` or `quicktalk` with `backend=local`. They are the lightest
 paths for validating a real avatar and talking-head output.

-Use [Model Deployment / Local Audio + QuickTalk](../model-deployment/recipes/local-quicktalk-audio.md) when you also need to validate local STT, local TTS, and QuickTalk together.
+Use [Local Audio + QuickTalk](../recipes/local-quicktalk-audio.md) when you also need to validate local STT, local TTS, and QuickTalk together.
+
+Choose `fasterliveportrait` with `backend=omnirt` when you want camera or uploaded-video driving for avatar expression and head motion.

 ### High-quality Model

@@ -31,7 +33,7 @@ inference.
 | Hardware | Recommended path |
 | --- | --- |
 | CPU | `mock` only, or non-realtime experiments |
-| Single NVIDIA GPU | Wav2Lip local, QuickTalk local, MuseTalk local, or one OmniRT model service |
+| Single NVIDIA GPU | Wav2Lip local, QuickTalk local, MuseTalk local, or one OmniRT model service such as FasterLivePortrait |
 | Multi-GPU | Split heavyweight model services or bind different models to different GPUs |
 | Ascend NPU | Use OmniRT for models that have an Ascend runtime |
 | Remote inference service | Use `omnirt` or `direct_ws` so OpenTalking does not own model weights |
@@ -51,6 +53,7 @@ inference.
 | Install validation | Mock | `mock` | Confirm environment and page flow |
 | First real path | Wav2Lip / QuickTalk | `local` | Validate avatar and lip sync |
 | Local audio validation | SenseVoiceSmall + CosyVoice3 + QuickTalk | `local` | Validate local STT/TTS/Video |
+| Camera video driving | FasterLivePortrait | `omnirt` | Drive a source avatar with camera frames or uploaded driving video |
 | Single-machine quality validation | MuseTalk | `local` | Evaluate MuseTalk quality with official preprocessing |
 | High-quality service demo | FlashTalk / FlashHead | `omnirt` / `direct_ws` | Validate heavyweight output |
 | Production | Multi-model stack | `omnirt` + worker | Stable, scalable, observable deployment |
--- a/docs/en/quick-start/index.md
+++ b/docs/en/quick-start/index.md
@@ -1,26 +1,12 @@
 # Quick Start

-This page helps you quickly run OpenTalking. Choose one of two paths first: use the published **Compshare image** for the fastest hosted trial, or use **self deployment** when you want to run and customize the repo on your own machine or server.
+This page helps you quickly run OpenTalking. Start with **Mock mode** to validate the orchestration layer, LLM, TTS, subtitle events, and WebRTC playback. Then use the real model **QuickTalk** to validate real digital-human video rendering.

- Compshare image: no local dependency installation or model download; use the published instance image and open port `5173`.
- Self deployment: clone the repo, configure providers, start Mock mode first, then move to local QuickTalk or remote OmniRT when needed.
+- Mock mode: no model weights, no GPU, uses built-in static frames to validate the full interaction flow.
+- QuickTalk mode: uses a local CUDA GPU and QuickTalk weights to validate the real digital-human rendering path.
 - WebUI validation: select avatar, model, and voice in the page, then start a real-time conversation.

-## 1. Compshare Image
-
-If you want to skip local dependency installation and model downloads, deploy our published Compshare community image:
-
- Image URL: <https://www.compshare.cn/images/TdDwmKZUZebI>
- Exposed port: `5173`
- Guide: [Compshare image quick experience](compshare-image.md)
-
-The image already includes OpenTalking, OmniRT, the QuickTalk runtime environment, and model files. Use it to try the real digital-human path first; continue with the source-based steps below when you need local installation or development.
-
-## 2. Self Deployment
-
-Use this path when you want to run OpenTalking from source, change configuration, or continue into local/remote model deployment.
-
-### 2.1 Mock Mode
+## Mock Mode

 Mock mode is the recommended first path for OpenTalking. It does not require GPU, model weights, or an external inference service, but still validates the API, LLM, TTS, subtitle events, WebRTC, and browser playback path.

@@ -30,7 +16,7 @@ Use it for:
 - Checking whether LLM / TTS configuration works.
 - Previewing WebUI and session flow on a machine without GPU.

-#### Mock Mode Environment
+### Mock Mode Environment

 | Dependency | Recommended Version | Notes |
 | --- | --- | --- |
@@ -39,7 +25,7 @@ Use it for:
 | FFmpeg | Available as a system command | Audio/video processing dependency. |
 | GPU | Not required | Uses the built-in Mock static frame. |

-#### 1. Clone Repository
+### 1. Clone Repository

 ```bash
 export DIGITAL_HUMAN_HOME=/opt/digital_human
@@ -50,7 +36,7 @@ git clone https://github.com/datascale-ai/opentalking.git
 cd opentalking
 ```

-#### 2. Install Basic Dependencies
+### 2. Install Basic Dependencies

 Using `uv` is recommended:

@@ -72,7 +58,7 @@ pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple -e ".[dev]"
 cp .env.example .env
 ```

-#### 3. Configure Minimal Environment Variables
+### 3. Configure Minimal Environment Variables

 Edit `.env` and configure at least LLM and TTS. The example below uses an OpenAI-compatible endpoint and `edge` TTS:

@@ -87,7 +73,7 @@ OPENTALKING_TTS_EDGE_VOICE=zh-CN-XiaoxiaoNeural

 `edge` TTS does not require an API key. If you use DashScope STT or DashScope TTS, configure `OPENTALKING_STT_DASHSCOPE_API_KEY` or `OPENTALKING_TTS_DASHSCOPE_API_KEY` for that module.

-#### 4. Start Mock Mode
+### 4. Start Mock Mode

 ```bash
 cd "$DIGITAL_HUMAN_HOME/opentalking"
@@ -105,7 +91,7 @@ To specify ports:
 bash scripts/start_unified.sh --mock --api-port 8210 --web-port 5280
 ```

-#### 5. Open WebUI
+### 5. Open WebUI

 After startup, the terminal prints the WebUI URL. The default URL is:

@@ -113,19 +99,15 @@ After startup, the terminal prints the WebUI URL. The default URL is:
 http://127.0.0.1:5173
 ```

-![Mock mode WebUI home](../../assets/images/WebUI.png)
+![Mock mode WebUI home: realtime conversation workflow, avatar library, and session panel](../../assets/images/quick-start/mock-webui-home.png)

-*After startup, WebUI shows the avatar library, model selector, voice controls, and conversation area.*
-
-#### 6. Complete Your First Conversation
+### 6. Complete Your First Conversation

 In WebUI, select Mock / driverless mode, confirm LLM and TTS configuration, enter a short test sentence, and start the session. If the browser plays audio, shows subtitles, and displays the Mock frame, the base pipeline is working.

-![First conversation example](../../assets/images/product-demo-live-sales/05_product_intro.jpeg)
+![First Mock conversation workspace: driverless mode selected and ready to start](../../assets/images/quick-start/mock-first-session.png)

-*For the first validation, check user input, subtitle events, playback state, and video output.*
-
-### 2.2 QuickTalk Mode
+## QuickTalk Mode

 QuickTalk mode is a faster path toward real digital-human output. It can load QuickTalk weights locally and is suitable for single-machine validation on consumer CUDA GPUs.

@@ -134,7 +116,7 @@ Use it when:
 - You have an available NVIDIA GPU and CUDA environment.
 - You want to see real lip motion and avatar driving.

-#### QuickTalk Mode Environment
+### QuickTalk Mode Environment

 | Dependency | Recommended Version | Notes |
 | --- | --- | --- |
@@ -144,7 +126,7 @@ Use it when:
 | GPU | NVIDIA CUDA GPU | Start with a 3090 / 4090 class machine if possible. |
 | Weights | QuickTalk, HuBERT, InsightFace `buffalo_l` | Download or sync offline according to this page. |

-#### 1. Check GPU and System Environment
+### 1. Check GPU and System Environment

 QuickTalk mode requires a local CUDA GPU. Check:

@@ -155,7 +137,7 @@ python --version
 node --version
 ```

-#### 2. Install Model Dependencies
+### 2. Install Model Dependencies

 ```bash
 cd "$DIGITAL_HUMAN_HOME/opentalking"
@@ -163,7 +145,7 @@ uv sync --extra dev --extra models --python 3.11
 source .venv/bin/activate
 ```

-#### 3. Prepare QuickTalk Weights
+### 3. Prepare QuickTalk Weights

 Place local QuickTalk weights and dependencies under repository-root `models/quicktalk/`.

@@ -171,6 +153,8 @@ Place local QuickTalk weights and dependencies under repository-root `models/qui
 cd "$DIGITAL_HUMAN_HOME/opentalking"
 mkdir -p models/quicktalk/checkpoints

+uv pip install -U "huggingface_hub[cli]"
+
 # Optional: use a Hugging Face mirror when the network is slow.
 export HF_ENDPOINT=https://hf-mirror.com

@@ -233,15 +217,16 @@ models/
        ...
 ```

-#### 4. Prepare a Custom Avatar
+### 4. Prepare a Custom Avatar

 You can start with the built-in QuickTalk example avatar. Later, if you want to upload your own identity, use a clear frontal or half-body image and create a custom avatar in WebUI through “upload from local”.

-![Select or upload an avatar in WebUI](../../assets/images/product-demo-live-sales/02_select_avatar.jpeg)
+<div class="ot-figure-placeholder">
+  <strong>Screenshot placeholder: upload QuickTalk Avatar</strong>
+  <span>Add later: WebUI avatar library, upload entry, model selection, and preview result.</span>
+</div>

-*The WebUI avatar library supports built-in avatars and custom images through the upload entry.*
-
-#### 5. Start QuickTalk Mode
+### 5. Start QuickTalk Mode

 ```bash
 export OPENTALKING_TORCH_DEVICE=cuda:0
@@ -264,10 +249,32 @@ bash scripts/start_unified.sh \

 The first startup may build face cache and worker state, so it can take longer than Mock mode.

-#### 6. Select QuickTalk in WebUI
+### 6. Select QuickTalk in WebUI

 After opening WebUI, select a `QuickTalk` avatar and the `quicktalk` model, then start a session. If the video frame is generated along with audio, the local QuickTalk rendering path is available.

-![QuickTalk session output example](../../assets/images/companion/04_webrtc_connected.jpeg)
+## Next Steps

-*After selecting a QuickTalk avatar and model, check the generation state, connection status, and playback output.*
+### Platform Notes
+
+Read [Platform Notes](platform-notes.md) for Linux, macOS, Windows, mirrors, and common system dependencies.
+
+### Docker Deployment
+
+Read [Docker Deployment](docker-deployment.md) for containerized deployment.
+
+### WebUI Usage
+
+Read [WebUI Usage](../usage/webui/basic.md) to learn the page layout, avatar selection, voice configuration, and session operations.
+
+### Video Clone
+
+Read [Video Clone](../usage/webui/video-clone.md) when you want to drive a source avatar with camera frames or uploaded video through FasterLivePortrait.
+
+### Model Support
+
+Read [Model Support](../model-support/index.md) to choose Wav2Lip, QuickTalk, FasterLivePortrait, MuseTalk, FlashTalk, OmniRT, or later inference backends.
+
+### Command Line Tools
+
+Read [Command Line Tools](../usage/cli.md) for `opentalking-unified`, startup scripts, and common arguments.
--- a/docs/en/quick-start/windows-deployment.md
+++ b/docs/en/quick-start/windows-deployment.md
@@ -17,7 +17,7 @@ Windows Host
 Recommended directory structure:

 ```text
-/root/test/
+$DIGITAL_HUMAN_HOME/
 ├── opentalking/
 │   ├── .venv/
 │   ├── apps/
@@ -28,10 +28,10 @@ Recommended directory structure:
 │   ├── .venv/
 │   └── models/quicktalk/
 └── models/
-    └── quicktalk -> /root/test/opentalking/models/quicktalk
+    └── quicktalk -> $DIGITAL_HUMAN_HOME/opentalking/models/quicktalk
 ```

-Place code in WSL2's own Linux filesystem (e.g., `/root/test` or `/home/<user>/test`), not directly under `/mnt/d/...`.
+Place code in WSL2's own Linux filesystem (e.g., set `DIGITAL_HUMAN_HOME` to `/home/<user>/test`), not directly under `/mnt/d/...`.

 ---

@@ -99,47 +99,6 @@ If WSL2 can see the RTX 3050, the CUDA inference prerequisites are met.

 ---

-### 1.3 WSL2 Network Mode Selection
-
-WSL2 supports two network modes that directly impact OpenTalking's WebRTC real-time audio/video streaming and browser microphone access.
-
-**.wslconfig** (located at `%USERPROFILE%\.wslconfig` on Windows):
-
-```ini
-[wsl2]
-networkingMode=NAT        # default mode
-# networkingMode=Mirrored
-```
-
-After making changes, run `wsl --shutdown` and reopen the WSL2 terminal for changes to take effect.
-
-**Comparison**:
-
-| | NAT (default) | Mirrored |
-|---|---|---|
-| WebRTC ICE connectivity | ✅ Working (when accessed via WSL2 IP) | ⚠️ ICE candidates may fail |
-| Browser access | `http://<WSL2-IP>:5280` | `http://localhost:5280` |
-| Microphone permission | Requires adding insecure origin whitelist in browser | localhost works directly |
-| Service startup compatibility | ✅ Normal | ⚠️ May fail in some scenarios |
-
-**Recommendation**:
-
- Use **NAT mode** for daily development and debugging. Get the WSL2 IP with `hostname -I` and access via that address.
- If the WSL2 IP changes after a restart, run `hostname -I` again.
- For one-click install scripts or first-time setup, prefer **NAT mode**.
-
-**Enabling microphone in NAT mode**:
-
-Non-localhost HTTP origins are not treated as secure contexts by browsers, so `getUserMedia` access is blocked. Choose one of these workarounds:
-
- **Edge**: Navigate to `edge://flags/#unsafely-treat-insecure-origin-as-secure`, enter `http://<WSL2-IP>:5280`, set to Enabled, and restart.
- **Chrome**: Close all Chrome windows, then run in PowerShell:
-  ```powershell
-  & "C:\Program Files\Google\Chrome\Application\chrome.exe" --unsafely-treat-insecure-origin-as-secure="http://<WSL2-IP>:5280" --user-data-dir="%TEMP%\chrome-opentalking"
-  ```
-
---
-
 ## 2. WSL2 Base Dependencies

 The following commands run inside WSL2 Ubuntu. If running as root, `sudo` is not needed; otherwise prepend `sudo` to `apt` commands.
@@ -174,8 +133,8 @@ nvidia-smi
 Enter the working directory:

 ```bash
-mkdir -p /root/test
-cd /root/test
+# DIGITAL_HUMAN_HOME must point inside the WSL2 filesystem, e.g. /home/<user>/test
+mkdir -p "$DIGITAL_HUMAN_HOME"
+cd "$DIGITAL_HUMAN_HOME"
 ```

 Clone both repositories:
@@ -188,15 +147,15 @@ git clone https://github.com/datascale-ai/omnirt.git
 The final structure should be:

 ```text
-/root/test/opentalking
-/root/test/omnirt
+$DIGITAL_HUMAN_HOME/opentalking
+$DIGITAL_HUMAN_HOME/omnirt
 ```

 Verify:

 ```bash
-ls /root/test/opentalking
-ls /root/test/omnirt
+ls $DIGITAL_HUMAN_HOME/opentalking
+ls $DIGITAL_HUMAN_HOME/omnirt
 ```

 ### Path Notes
@@ -204,15 +163,15 @@ ls /root/test/omnirt
 If the code already exists on Windows, copy it to WSL2:

 ```bash
-rsync -a --info=progress2 /mnt/d/test_opentalking/opentalking/ /root/test/opentalking/
-rsync -a --info=progress2 /mnt/d/test_opentalking/omnirt/ /root/test/omnirt/
+rsync -a --info=progress2 /mnt/d/test_opentalking/opentalking/ $DIGITAL_HUMAN_HOME/opentalking/
+rsync -a --info=progress2 /mnt/d/test_opentalking/omnirt/ $DIGITAL_HUMAN_HOME/omnirt/
 ```

 If the code is already downloaded on a server, sync it to WSL2:

 ```bash
-rsync -avP root@<your-server-ip>:/root/lyf/temp/opentalking/ /root/test/opentalking/
-rsync -avP root@<your-server-ip>:/root/lyf/temp/omnirt/ /root/test/omnirt/
+rsync -avP <user>@<server-host>:<remote-dir>/opentalking/ $DIGITAL_HUMAN_HOME/opentalking/
+rsync -avP <user>@<server-host>:<remote-dir>/omnirt/ $DIGITAL_HUMAN_HOME/omnirt/
 ```

 ---
@@ -261,7 +220,7 @@ source ~/.bashrc
 Enter OpenTalking:

 ```bash
-cd /root/test/opentalking
+cd $DIGITAL_HUMAN_HOME/opentalking
 ```

 Create an isolated virtual environment:
@@ -280,7 +239,7 @@ which python
 Expected:

 ```text
-/root/test/opentalking/.venv/bin/python
+$DIGITAL_HUMAN_HOME/opentalking/.venv/bin/python
 ```

 Install base packages:
@@ -331,7 +290,7 @@ Installing CUDA PyTorch on Linux/WSL2 will pull `nvidia-cudnn-cu12`, `nvidia-cub
 QuickTalk weights should be placed at:

 ```text
-/root/test/opentalking/models/quicktalk/checkpoints/
+$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk/checkpoints/
 ```

 The complete structure should look like:
@@ -354,7 +313,7 @@ checkpoints/
 Check key files:

 ```bash
-cd /root/test/opentalking
+cd $DIGITAL_HUMAN_HOME/opentalking

 ls -lh models/quicktalk/checkpoints/quicktalk.pth
 ls -lh models/quicktalk/checkpoints/repair.npy
@@ -369,14 +328,14 @@ ls -lh models/quicktalk/checkpoints/auxiliary/models/buffalo_l/2d106det.onnx
 When starting QuickTalk in OmniRT, point directly to the `checkpoints` directory:

 ```bash
-export OMNIRT_QUICKTALK_MODEL_ROOT=/root/test/opentalking/models/quicktalk/checkpoints
+export OMNIRT_QUICKTALK_MODEL_ROOT=$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk/checkpoints
 ```

-If the benchmark script expects `/root/test/models/quicktalk`, create a symlink:
+If the benchmark script expects `$DIGITAL_HUMAN_HOME/models/quicktalk`, create a symlink:

 ```bash
-mkdir -p /root/test/models
-ln -sfn /root/test/opentalking/models/quicktalk /root/test/models/quicktalk
+mkdir -p $DIGITAL_HUMAN_HOME/models
+ln -sfn $DIGITAL_HUMAN_HOME/opentalking/models/quicktalk $DIGITAL_HUMAN_HOME/models/quicktalk
 ```

 ---
@@ -386,7 +345,7 @@ ln -sfn /root/test/opentalking/models/quicktalk /root/test/models/quicktalk
 Enter OmniRT:

 ```bash
-cd /root/test/omnirt
+cd $DIGITAL_HUMAN_HOME/omnirt
 ```

 Create an isolated virtual environment:
@@ -405,7 +364,7 @@ which python
 Expected:

 ```text
-/root/test/omnirt/.venv/bin/python
+$DIGITAL_HUMAN_HOME/omnirt/.venv/bin/python
 ```

 Install dependencies:
@@ -438,16 +397,16 @@ omnirt --help
 Expected:

 ```text
-/root/test/omnirt/.venv/bin/omnirt
+$DIGITAL_HUMAN_HOME/omnirt/.venv/bin/omnirt
 ```

 Sync models to OmniRT:

 ```bash
-mkdir -p /root/test/omnirt/models/quicktalk
+mkdir -p $DIGITAL_HUMAN_HOME/omnirt/models/quicktalk
 rsync -a --info=progress2 \
-  /root/test/opentalking/models/quicktalk/ \
-  /root/test/omnirt/models/quicktalk/
+  $DIGITAL_HUMAN_HOME/opentalking/models/quicktalk/ \
+  $DIGITAL_HUMAN_HOME/omnirt/models/quicktalk/
 ```

 ---
@@ -457,7 +416,7 @@ rsync -a --info=progress2 \
 Enter OpenTalking:

 ```bash
-cd /root/test/opentalking
+cd $DIGITAL_HUMAN_HOME/opentalking
 cp -n .env.example .env
 ```

@@ -489,7 +448,7 @@ LLM, TTS, and STT are independent providers. Edge TTS does not require a key and
 Open a new WSL2 terminal and enter OmniRT:

 ```bash
-cd /root/test/omnirt
+cd $DIGITAL_HUMAN_HOME/omnirt
 source .venv/bin/activate
 ```

@@ -500,7 +459,7 @@ export CUDA_VISIBLE_DEVICES=0
 export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb=128

 export OMNIRT_QUICKTALK_RUNTIME=1
-export OMNIRT_QUICKTALK_MODEL_ROOT=/root/test/opentalking/models/quicktalk/checkpoints
+export OMNIRT_QUICKTALK_MODEL_ROOT=$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk/checkpoints
 export OMNIRT_QUICKTALK_DEVICE=cuda:0
 export OMNIRT_QUICKTALK_HUBERT_DEVICE=cuda:0
 ```
@@ -532,7 +491,7 @@ Key parameters:
 Open another WSL2 terminal and enter OpenTalking:

 ```bash
-cd /root/test/opentalking
+cd $DIGITAL_HUMAN_HOME/opentalking
 source .venv/bin/activate
 ```

@@ -578,12 +537,12 @@ nvidia-smi --query-gpu=memory.used --format=csv,noheader

 | Location | Correct approach |
 | --- | --- |
-| Code directory | Place in `/root/test/opentalking` and `/root/test/omnirt` |
-| OpenTalking venv | `/root/test/opentalking/.venv` |
-| OmniRT venv | `/root/test/omnirt/.venv` |
-| QuickTalk weights | `/root/test/opentalking/models/quicktalk/checkpoints` |
+| Code directory | Place in `$DIGITAL_HUMAN_HOME/opentalking` and `$DIGITAL_HUMAN_HOME/omnirt` |
+| OpenTalking venv | `$DIGITAL_HUMAN_HOME/opentalking/.venv` |
+| OmniRT venv | `$DIGITAL_HUMAN_HOME/omnirt/.venv` |
+| QuickTalk weights | `$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk/checkpoints` |
 | OmniRT QuickTalk root | Point to `.../checkpoints` |
-| Benchmark compatibility path | `ln -sfn /root/test/opentalking/models/quicktalk /root/test/models/quicktalk` |
+| Benchmark compatibility path | `ln -sfn $DIGITAL_HUMAN_HOME/opentalking/models/quicktalk $DIGITAL_HUMAN_HOME/models/quicktalk` |
 | Low-VRAM config | `resolution=160/128`, `batch=1`, `HuBERT=cpu` |

 ---
@@ -597,7 +556,7 @@ Before running the benchmark, verify each item:
 nvidia-smi

 # OpenTalking environment
-cd /root/test/opentalking
+cd $DIGITAL_HUMAN_HOME/opentalking
 source .venv/bin/activate
 python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

@@ -607,7 +566,7 @@ ls -lh models/quicktalk/checkpoints/chinese-hubert-large/pytorch_model.bin
 ls -lh models/quicktalk/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx

 # OmniRT environment
-cd /root/test/omnirt
+cd $DIGITAL_HUMAN_HOME/omnirt
 source .venv/bin/activate
 which omnirt
 omnirt --help
@@ -621,7 +580,7 @@ npm --version
 If all checks pass, run:

 ```bash
-cd /root/test/opentalking
+cd $DIGITAL_HUMAN_HOME/opentalking
 source .venv/bin/activate
 bash scripts/run_opentalking_e2e_benchmark.sh \
  --tester xxx \
--- a/docs/en/recipes/index.md
+++ b/docs/en/recipes/index.md
@@ -0,0 +1,13 @@
+# Deployment Recipes
+
+This directory collects runnable combined deployment recipes. Use it when you already know the model or runtime goal and want a ready-to-run path.
+
+## Recipe Map
+
+- [Local Audio + QuickTalk](local-quicktalk-audio.md): fully local STT, TTS, and QuickTalk path for private validation.
+
+## Related Pages
+
+- [QuickTalk](../avatar_models/quicktalk.md)
+- [Speech Generation Models](../speech_models/index.md)
+- [Model Support](../model-support/index.md)
--- a/docs/en/recipes/local-quicktalk-audio.md
+++ b/docs/en/recipes/local-quicktalk-audio.md
@@ -0,0 +1,174 @@
+# Local STT/TTS + QuickTalk
+
+This page describes a single-machine path for private validation:
+
+- STT: local `SenseVoiceSmall`, CPU by default.
+- TTS: local `Fun-CosyVoice3-0.5B-2512`, served through the `local_cosyvoice` service.
+- Video: local `QuickTalk`, CUDA by default.
+- LLM: still configured through an OpenAI-compatible endpoint. If you already operate a local LLM server, point `OPENTALKING_LLM_BASE_URL` to that service.
+
+This path keeps the existing OpenTalking `/sessions/*`, `/tts/preview`, and session runner protocols unchanged. The frontend only chooses the local or API provider before a session starts. If an API provider is selected and its key is missing, startup fails before entering the digital-human session; OpenTalking does not silently fall back to local or cloud providers.
+
+## Hardware Guidance
+
+| Component | Default placement | Guidance |
+|-----------|-------------------|----------|
+| SenseVoiceSmall | CPU | Usually enough for short utterances and saves VRAM. |
+| Fun-CosyVoice3-0.5B-2512 | `cuda:0` | 12GB VRAM is recommended; use API TTS first on 8GB machines. |
+| QuickTalk | `cuda:0` | Watch peak VRAM and first-turn warmup when sharing the GPU with local TTS. |
+
+## Install Dependencies
+
+```bash title="terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --extra local-audio --extra quicktalk-cuda --python 3.11
+source .venv/bin/activate
+```
+
+The OpenTalking main venv runs the API, SenseVoice, and QuickTalk and keeps the
+project `transformers>=4.57,<6` dependency. Do not install the CosyVoice runtime
+into this venv.
+
+## Download Local Audio Models
+
+Do not commit model weights. The download helper uses ModelScope for these models by default; configure a Hugging Face mirror only for HF-backed models.
+
+```bash title="terminal"
+python scripts/download_local_audio_models.py \
+  --root ./avatar_models/local-audio \
+  --model sensevoice-small \
+  --model fun-cosyvoice3-0.5b-2512
+```
+
+Expected layout:
+
+```text
+avatar_models/local-audio/
+  iic__SenseVoiceSmall/
+  FunAudioLLM__Fun-CosyVoice3-0.5B-2512/
+```
+
+## Prepare QuickTalk Weights
+
+Place QuickTalk weights, HuBERT files, and InsightFace dependencies as described in [QuickTalk Local Deployment](../avatar_models/deployment/quicktalk-local.md):
+
+```text
+avatar_models/quicktalk/checkpoints/
+```
+
+The key setting is `OPENTALKING_QUICKTALK_ASSET_ROOT`, which must point to the directory containing `checkpoints/`.
+
+## Prepare the CosyVoice Runtime
+
+The recommended `local_cosyvoice` shape is a standalone Python service. Runtime source should stay outside git-tracked files; placing it under the model directory is fine:
+
+```bash title="terminal"
+mkdir -p ./avatar_models/local-audio/runtime
+git clone https://github.com/FunAudioLLM/CosyVoice.git ./avatar_models/local-audio/runtime/CosyVoice
+cd ./avatar_models/local-audio/runtime/CosyVoice
+git submodule update --init --recursive
+```
+
+Create the dedicated CosyVoice sidecar venv after the runtime checkout is ready:
+
+```bash title="terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
+```
+
+The sidecar venv is only for `scripts/local_cosyvoice_service.py` and the
+CosyVoice runtime. It pins `transformers==4.51.3` and stays separate from the
+OpenTalking main `.venv`.
+
+## `.env` Example
+
+```env title=".env"
+# LLM: separate module. Use DashScope, OpenAI, vLLM, Ollama, or a local OpenAI-compatible service.
+OPENTALKING_LLM_PROVIDER=openai_compatible
+OPENTALKING_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
+OPENTALKING_LLM_API_KEY=<llm-key>
+OPENTALKING_LLM_MODEL=qwen-flash
+
+# STT: local SenseVoiceSmall
+OPENTALKING_STT_DEFAULT_PROVIDER=sensevoice
+OPENTALKING_STT_ENABLED_PROVIDERS=sensevoice,dashscope
+OPENTALKING_STT_SENSEVOICE_MODEL=iic/SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_MODEL_DIR=./avatar_models/local-audio/iic__SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_DEVICE=cpu
+
+# TTS: local CosyVoice3
+OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
+OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
+OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL=FunAudioLLM/Fun-CosyVoice3-0.5B-2512
+OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL_DIR=./avatar_models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512
+OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR=./avatar_models/local-audio/runtime/CosyVoice
+OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL=http://127.0.0.1:19090/synthesize
+OPENTALKING_TTS_LOCAL_COSYVOICE_DEVICE=cuda:0
+OPENTALKING_COSYVOICE_VENV_DIR=./.venv-cosyvoice
+
+# Video: QuickTalk local
+OPENTALKING_DEFAULT_MODEL=quicktalk
+OPENTALKING_QUICKTALK_BACKEND=local
+OPENTALKING_QUICKTALK_ASSET_ROOT=./avatar_models/quicktalk
+OPENTALKING_QUICKTALK_WORKER_CACHE=1
+OPENTALKING_TORCH_DEVICE=cuda:0
+```
+
+If users can switch the frontend to API STT/TTS, configure the provider-specific keys explicitly:
+
+```env title=".env"
+OPENTALKING_STT_DASHSCOPE_API_KEY=<dashscope-stt-key>
+OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-tts-key>
+```
+
+## Startup Order
+
+Start the local TTS service first:
+
+```bash title="terminal"
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
+```
+
+Start OpenTalking:
+
+```bash title="terminal"
+bash scripts/start_unified.sh --backend local --model quicktalk
+```
+
+To set ports explicitly (the verification commands below assume API port 8000, so adjust them if you change `--api-port`):
+
+```bash title="terminal"
+bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8210 --web-port 5280
+```
+
+## Verification
+
+```bash title="terminal"
+curl -fsS http://127.0.0.1:19090/health
+curl -fsS http://127.0.0.1:8000/health
+curl -fsS http://127.0.0.1:8000/api/runtime/status
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
+Expected state:
+
+- `stt_provider` is `sensevoice`.
+- `tts_provider` is `local_cosyvoice`.
+- `quicktalk_backend` is `local`.
+- `/models` reports `quicktalk.connected=true` and `reason=local_runtime`.
+
+In the frontend, select `Local SenseVoiceSmall`, `Local CosyVoice3-0.5B-2512`, a
+shared avatar, and the `quicktalk` model, then test:
+
+1. Text input: `LLM -> local_cosyvoice -> QuickTalk -> WebRTC`.
+2. Microphone input: `SenseVoiceSmall -> LLM -> local_cosyvoice -> QuickTalk -> WebRTC`.
+3. `/tts/preview`: confirm that local system voices and cloned voices play.
+
+## Notes
+
+- `*_DEFAULT_PROVIDER` only controls the default selection. It is not a failure fallback chain.
+- LLM, STT, and TTS keys are independent. `DASHSCOPE_API_KEY` does not automatically configure any module.
+- The CosyVoice3 service returns audio as a stream, but first-chunk latency still depends on model inference and warmup.
+- On 8GB VRAM machines, keep `SenseVoiceSmall CPU + QuickTalk local` and use DashScope or Edge TTS first if local TTS is slow or OOMs.
+- Weights, runtime checkouts, avatar caches, and logs are deployment artifacts and should not be committed.
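The expected state listed under Verification in the Local STT/TTS + QuickTalk page above can also be checked by script. The following is a minimal sketch and not part of this commit: it assumes the runtime status payload exposes `stt_provider`, `tts_provider`, and `quicktalk_backend` as top-level JSON keys, which the page implies but does not spell out, and `check_runtime_status` is a hypothetical helper name.

```python
# Expected values copied from the "Expected state" list on the page.
EXPECTED = {
    "stt_provider": "sensevoice",
    "tts_provider": "local_cosyvoice",
    "quicktalk_backend": "local",
}


def check_runtime_status(payload: dict) -> list:
    """Return one message per field that differs from EXPECTED."""
    problems = []
    for key, want in EXPECTED.items():
        got = payload.get(key)  # assumption: the keys are top-level
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


# Example payload in which TTS was switched to a hosted provider.
sample = {
    "stt_provider": "sensevoice",
    "tts_provider": "edge",
    "quicktalk_backend": "local",
}
for line in check_runtime_status(sample):
    print(line)
```

In practice the payload would come from the runtime status endpoint shown in the Verification block, parsed with `json.loads`. If the real payload nests these fields, only the lookup inside the loop needs to change.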
--- a/docs/en/reference/benchmark.md
+++ b/docs/en/reference/benchmark.md
@@ -22,7 +22,7 @@ This script reads input assets according to the benchmark configuration, starts
 Enter OpenTalking:

 ```bash
-cd /root/test/opentalking
+cd "$DIGITAL_HUMAN_HOME/opentalking"
 source .venv/bin/activate
 ```

@@ -75,7 +75,7 @@ bash scripts/run_opentalking_e2e_benchmark.sh \
 Find results:

 ```bash
-find /root/test/opentalking -name "result.json" -o -name "result.csv" -o -name "report.md" -o -name "*.tar.gz"
+find "$DIGITAL_HUMAN_HOME/opentalking" -name "result.json" -o -name "result.csv" -o -name "report.md" -o -name "*.tar.gz"
 ```

 ### Notes
--- a/docs/en/speech_models/index.md
+++ b/docs/en/speech_models/index.md
@@ -0,0 +1,8 @@
+# Speech Models
+
+This directory collects speech-related model deployment, weight download, and verification for OpenTalking. Speech models are split into two groups:
+
+- [Speech Recognition Models](stt.md): convert microphone or uploaded audio into text; locally deployable models include [SenseVoice](stt/sensevoice.md).
+- [Speech Generation Models](tts.md): convert LLM text output into audio; locally deployable models include [CosyVoice](tts/cosyvoice.md), [IndexTTS](tts/indextts.md), and [Qwen3-TTS](tts/qwen3-tts.md).
+
+The LLM decides what to say and is not classified as a speech model; this section covers input recognition and output synthesis.
--- a/docs/en/speech_models/llm-stt.md
+++ b/docs/en/speech_models/llm-stt.md
@@ -0,0 +1,77 @@
+# LLM and STT
+
+The LLM decides what the digital human says. STT is required only when users speak
+through the microphone; text-only `speak` requests do not need STT.
+
+## LLM
+
+OpenTalking uses an OpenAI-compatible chat-completions interface. DashScope is the
+default because it works with the default Chinese demo settings.
+
+```env title=".env"
+OPENTALKING_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
+OPENTALKING_LLM_API_KEY=<dashscope-api-key>
+OPENTALKING_LLM_MODEL=qwen-flash
+```
+
+Common alternatives:
+
+| Provider | Configuration notes |
+|----------|---------------------|
+| OpenAI | Set `OPENTALKING_LLM_BASE_URL=https://api.openai.com/v1` and use an OpenAI model id. |
+| vLLM | Point `OPENTALKING_LLM_BASE_URL` to the vLLM OpenAI-compatible server. |
+| Ollama | Use the Ollama OpenAI-compatible endpoint, usually `http://localhost:11434/v1`. |
+| DeepSeek | Use the provider's OpenAI-compatible base URL and model id. |
+
+Verify the API key and endpoint by starting OpenTalking and sending a text `speak`
+request after creating a `mock` session.
+
+## STT
+
+Select the STT provider with `OPENTALKING_STT_DEFAULT_PROVIDER`. The frontend can also select local STT or API STT before a session starts. When API STT is selected, the provider-specific key must be configured; it is not populated from the LLM key.
+
+### DashScope Paraformer realtime
+
+```env title=".env"
+OPENTALKING_STT_DEFAULT_PROVIDER=dashscope
+OPENTALKING_STT_DASHSCOPE_API_KEY=<dashscope-api-key>
+OPENTALKING_STT_DASHSCOPE_MODEL=paraformer-realtime-v2
+```
+
+For DashScope-based deployments, LLM and STT may share the same key value, but it
+must be written explicitly to both `OPENTALKING_LLM_API_KEY` and
+`OPENTALKING_STT_DASHSCOPE_API_KEY`. If microphone input fails while text `speak` works,
+check the STT module key first.
+
+### Local SenseVoiceSmall
+
+```env title=".env"
+OPENTALKING_STT_DEFAULT_PROVIDER=sensevoice
+OPENTALKING_STT_ENABLED_PROVIDERS=sensevoice,dashscope
+OPENTALKING_STT_SENSEVOICE_MODEL=iic/SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_MODEL_DIR=./avatar_models/local-audio/iic__SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_DEVICE=cpu
+```
+
+SenseVoiceSmall uses the local FunASR adapter and supports both uploaded audio and WebSocket PCM microphone input. CPU inference is usually enough for short realtime utterances, which makes it a good match for QuickTalk local.
+
+Download the weights:
+
+```bash title="terminal"
+uv sync --extra dev --extra models --extra local-audio --python 3.11
+python scripts/download_local_audio_models.py \
+  --root ./avatar_models/local-audio \
+  --model sensevoice-small
+```
+
+## Verification
+
+```bash title="terminal"
+curl -fsS http://127.0.0.1:8000/health
+curl -s -X POST http://127.0.0.1:8000/sessions \
+  -H 'content-type: application/json' \
+  -d '{"avatar_id":"demo-avatar","model":"mock"}'
+```
+
+Then use the frontend microphone flow to confirm STT events and LLM responses appear
+in the session event stream.
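The key rule on the LLM and STT page above, that the STT key is never populated from the LLM key, can be expressed as a small pre-flight check. This is an illustrative sketch rather than OpenTalking's actual startup logic: `missing_stt_key` is a hypothetical helper, and it only covers the DashScope STT variable named on the page.

```python
STT_DASHSCOPE_KEY = "OPENTALKING_STT_DASHSCOPE_API_KEY"


def missing_stt_key(env: dict, selected_provider: str) -> list:
    """Return the key variables that the selected STT provider still needs."""
    if selected_provider != "dashscope":
        return []  # local SenseVoice does not need an API key
    value = env.get(STT_DASHSCOPE_KEY, "").strip()
    # The LLM key is deliberately not consulted as a fallback.
    return [] if value else [STT_DASHSCOPE_KEY]


# Only the LLM key is set, so DashScope STT is still unconfigured.
env = {"OPENTALKING_LLM_API_KEY": "sk-example"}
print(missing_stt_key(env, "dashscope"))   # ['OPENTALKING_STT_DASHSCOPE_API_KEY']
print(missing_stt_key(env, "sensevoice"))  # []
```

Passing `dict(os.environ)` as `env` would run the same check against the live environment.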
--- a/docs/en/speech_models/stt.md
+++ b/docs/en/speech_models/stt.md
@@ -0,0 +1,52 @@
+# Speech Recognition Models
+
+Speech recognition models convert microphone or uploaded audio into text. Text-only `speak` requests do not require STT; configure this section only when users speak through voice input.
+
+## Provider Options
+
+| Provider / model | Best for | Required configuration |
+| --- | --- | --- |
+| DashScope Paraformer realtime | Hosted realtime Chinese speech recognition and the default microphone path | `OPENTALKING_STT_DASHSCOPE_API_KEY` |
+| [SenseVoiceSmall](stt/sensevoice.md) | Local short-utterance recognition for private deployments and QuickTalk local setups | SenseVoiceSmall weights and FunASR dependencies |
+
+## DashScope Paraformer Realtime
+
+```env title=".env"
+OPENTALKING_STT_DEFAULT_PROVIDER=dashscope
+OPENTALKING_STT_DASHSCOPE_API_KEY=<dashscope-api-key>
+OPENTALKING_STT_DASHSCOPE_MODEL=paraformer-realtime-v2
+```
+
+For DashScope deployments, LLM and STT may use the same actual key, but it must be written separately to `OPENTALKING_LLM_API_KEY` and `OPENTALKING_STT_DASHSCOPE_API_KEY`.
+
+## Local SenseVoiceSmall
+
+```env title=".env"
+OPENTALKING_STT_DEFAULT_PROVIDER=sensevoice
+OPENTALKING_STT_ENABLED_PROVIDERS=sensevoice,dashscope
+OPENTALKING_STT_SENSEVOICE_MODEL=iic/SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_MODEL_DIR=./avatar_models/local-audio/iic__SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_DEVICE=cpu
+```
+
+Download the weights:
+
+```bash title="Terminal"
+uv sync --extra dev --extra models --extra local-audio --python 3.11
+python scripts/download_local_audio_models.py \
+  --root ./avatar_models/local-audio \
+  --model sensevoice-small
+```
+
+SenseVoiceSmall uses the local FunASR adapter and supports both uploaded audio and WebSocket PCM microphone input. CPU inference is usually enough for short realtime utterances.
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:8000/health
+curl -s -X POST http://127.0.0.1:8000/sessions \
+  -H 'content-type: application/json' \
+  -d '{"avatar_id":"demo-avatar","model":"mock"}'
+```
+
+Then use the frontend microphone flow to confirm STT events and LLM responses appear in the session event stream.
--- a/docs/en/speech_models/stt/sensevoice.md
+++ b/docs/en/speech_models/stt/sensevoice.md
@@ -0,0 +1,54 @@
+# SenseVoice Local Deployment
+
+SenseVoiceSmall is the recommended local speech-recognition model for OpenTalking. Use it for private deployments, short realtime utterances, and the local audio + QuickTalk path.
+
+## Use Cases
+
+- Microphone audio should not be sent to an external STT service.
+- Short realtime recognition should run on CPU.
+- STT, TTS, and QuickTalk local need to run on one machine for validation.
+
+## Weight Preparation
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+uv sync --extra dev --extra models --extra local-audio --python 3.11
+
+python scripts/download_local_audio_models.py \
+  --root ./avatar_models/local-audio \
+  --model sensevoice-small
+```
+
+## Configuration
+
+```env title=".env"
+OPENTALKING_STT_DEFAULT_PROVIDER=sensevoice
+OPENTALKING_STT_ENABLED_PROVIDERS=sensevoice,dashscope
+OPENTALKING_STT_SENSEVOICE_MODEL=iic/SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_MODEL_DIR=./avatar_models/local-audio/iic__SenseVoiceSmall
+OPENTALKING_STT_SENSEVOICE_DEVICE=cpu
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+bash scripts/start_unified.sh --backend mock --model mock --api-port 8000 --web-port 5173
+```
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:8000/health
+curl -s http://127.0.0.1:8000/api/runtime/status | jq
+```
+
+Then select microphone input in the WebUI and confirm that STT results and LLM responses appear in the event stream.
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| Model directory not found | Check that `OPENTALKING_STT_SENSEVOICE_MODEL_DIR` points to the downloaded directory. |
+| Recognition latency is high | Validate short utterances on CPU first; use a dedicated STT service for long audio or high concurrency. |
+| API STT key errors | Local SenseVoice does not read the DashScope key; confirm that the frontend selected local STT. |
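For the "Model directory not found" row in the SenseVoice page above, a quick check before starting OpenTalking can save a restart. The sketch below is an assumption-laden illustration: `model_dir_problem` is a hypothetical helper, and it only tests that the directory exists and is non-empty, not that the weights are complete.

```python
from pathlib import Path


def model_dir_problem(model_dir: str):
    """Return a short problem description, or None if the directory looks usable."""
    path = Path(model_dir).expanduser()
    if not path.is_dir():
        return f"{path} is not a directory"
    if not any(path.iterdir()):
        return f"{path} is empty; re-run the download script"
    return None


# Path taken from OPENTALKING_STT_SENSEVOICE_MODEL_DIR in the configuration block.
print(model_dir_problem("./avatar_models/local-audio/iic__SenseVoiceSmall"))
```

Relative paths resolve against the current working directory, so run the check from the same directory the start script uses.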
--- a/docs/en/speech_models/tts.md
+++ b/docs/en/speech_models/tts.md
@@ -0,0 +1,27 @@
+# Speech Generation Models
+
+Speech generation models are usually integrated as TTS providers. They convert LLM
+output into audio that drives the talking-head backend. This page is only for model
+selection and navigation; weight preparation, startup, verification, and troubleshooting
+live in the model pages.
+
+## Provider Options
+
+| Provider | Type | Best for | Entry |
+|----------|------|----------|-------|
+| `edge` | Hosted / online | First run, CPU evaluation, no API key | `.env` provider config |
+| `dashscope` | Hosted API | Chinese realtime TTS, voice cloning, DashScope deployments | `.env` provider config |
+| `cosyvoice` | Self-hosted service | Existing CosyVoice WebSocket / HTTP service | Service-specific docs |
+| `elevenlabs` | Hosted API | Hosted multilingual voices | `.env` provider config |
+| `local_cosyvoice` | Local deployment | Local Chinese TTS, built-in voices, and cloned voices | [CosyVoice](tts/cosyvoice.md) |
+| `indextts` | Local deployment / OmniRT | Controllable dubbing, emotion control, and voice cloning | [IndexTTS](tts/indextts.md) |
+| `local_qwen3_tts` | Local deployment | Local Qwen3-TTS Base voice cloning | [Qwen3-TTS](tts/qwen3-tts.md) |
+
+## Local Model Entries
+
+- [CosyVoice Local Deployment](tts/cosyvoice.md)
+- [IndexTTS Local Deployment](tts/indextts.md)
+- [Qwen3-TTS Local Deployment](tts/qwen3-tts.md)
+
+Each local model page contains use cases, weight preparation, startup commands,
+verification commands, and common errors.
--- a/docs/en/speech_models/tts/cosyvoice.md
+++ b/docs/en/speech_models/tts/cosyvoice.md
@@ -0,0 +1,199 @@
+# CosyVoice Deployment
+
+CosyVoice can be integrated through two OpenTalking providers:
+
+- `local_cosyvoice`: OpenTalking manages a local CosyVoice sidecar, useful for
+  single-machine or private deployments.
+- `cosyvoice`: connect to an existing CosyVoice WebSocket / HTTP service, useful when
+  your team already operates a TTS service.
+
+For local deployment, run CosyVoice as a standalone sidecar service and let OpenTalking
+consume PCM audio over HTTP.
+
+## Use Cases
+
+- Local Chinese TTS with built-in voices or cloned voices.
+- TTS inference should be isolated from the main OpenTalking process.
+- SenseVoice, CosyVoice, and QuickTalk local should form a full local audio pipeline.
+
+## Weight Preparation
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+uv sync --extra dev --extra models --extra local-audio --python 3.11
+
+python scripts/download_local_audio_models.py \
+  --root ./avatar_models/local-audio \
+  --model fun-cosyvoice3-0.5b-2512
+```
+
+To enable TensorRT / FP16, download the extra ONNX assets from Hugging Face and place
+them in the same CosyVoice3 model directory:
+
+```bash title="Terminal"
+env HF_ENDPOINT=https://huggingface.co \
+  python - <<'PY'
+from huggingface_hub import hf_hub_download
+repo = "yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX"
+target = "./avatar_models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512"
+for name in [
+    "flow.decoder.estimator.autocast_fp16.onnx",
+    "flow.decoder.estimator.streaming.autocast_fp16.onnx",
+]:
+    hf_hub_download(repo_id=repo, filename=name, repo_type="model", local_dir=target)
+PY
+```
+
+These assets are used as follows:
+
+| Asset | Source | Required for |
+|-------|--------|--------------|
+| `flow.decoder.estimator.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | `FP16 + LOAD_TRT=1`; first startup builds the GPU-specific `flow.decoder.estimator.autocast_fp16.mygpu.plan`. |
+| `flow.decoder.estimator.streaming.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | Optional streaming fp16 ONNX asset; keep it beside the estimator ONNX for runtime compatibility. |
+
+Generated `*.mygpu.plan` files are machine-specific TensorRT engines. Do not copy them
+between different GPU / CUDA / TensorRT environments; rebuild them from ONNX on the
+target host.
+
+Prepare the CosyVoice runtime:
+
+```bash title="Terminal"
+mkdir -p ./avatar_models/local-audio/runtime
+git clone https://github.com/FunAudioLLM/CosyVoice.git ./avatar_models/local-audio/runtime/CosyVoice
+cd ./avatar_models/local-audio/runtime/CosyVoice
+git submodule update --init --recursive
+```
+
+Create the sidecar venv:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
+```
+
+If TensorRT is required, install TRT dependencies into the CosyVoice sidecar venv, not
+into the main OpenTalking `.venv`:
+
+```bash title="Terminal"
+PIP_EXTRA_INDEX_URL=https://pypi.nvidia.com/ \
+OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 \
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
+```
+
+## Configuration
+
+Local sidecar:
+
+```env title=".env"
+OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
+OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
+OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL=FunAudioLLM/Fun-CosyVoice3-0.5B-2512
+OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL_DIR=./avatar_models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512
+OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR=./avatar_models/local-audio/runtime/CosyVoice
+OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL=http://127.0.0.1:19090/synthesize
+OPENTALKING_TTS_LOCAL_COSYVOICE_DEVICE=cuda:0
+OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
+OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=0
+```
+
+The main OpenTalking `.venv` is for orchestration, SenseVoice, and the video backend. Keep CosyVoice in a separate sidecar venv so its runtime dependencies do not conflict with the main environment.
+
+Existing CosyVoice service:
+
+```env title=".env"
+OPENTALKING_TTS_DEFAULT_PROVIDER=cosyvoice
+OPENTALKING_TTS_ENABLED_PROVIDERS=cosyvoice,dashscope,edge
+OPENTALKING_TTS_COSYVOICE_URL=http://127.0.0.1:19090/synthesize
+```
+
+## Start Command
+
+Default mode, where FP16 is enabled automatically on CUDA (`FP16=auto`) and TensorRT is disabled:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
+```
+
+FP16 + TensorRT mode:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+export OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
+export OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
+```
+
+On first startup with `LOAD_TRT=1`, if
+`flow.decoder.estimator.autocast_fp16.onnx` exists in the model directory, the
+CosyVoice runtime builds a GPU-specific TensorRT plan. Startup can take longer than
+the default mode. `start_local_cosyvoice.sh` automatically adds the sidecar venv's
+`site-packages/tensorrt_libs` directory to `LD_LIBRARY_PATH`.
+
+Start OpenTalking from another terminal:
+
+```bash title="Terminal"
+bash scripts/start_unified.sh --backend mock --model mock --api-port 8000 --web-port 5173
+```
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:19090/health
+curl -fsS http://127.0.0.1:8000/health
+```
+
+Check that the sidecar enabled FP16 / TRT as expected:
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:19090/health | python3 -m json.tool
+```
+
+The health payload should show `fp16=true`; when TRT is enabled, it should show
+`load_trt=true`.
+
+After creating a `mock` session, call `/speak` to confirm that OpenTalking receives
+CosyVoice audio:
+
+```bash title="Terminal"
+SID=<session-id>
+curl -s -X POST "http://127.0.0.1:8000/sessions/$SID/speak" \
+  -H 'content-type: application/json' \
+  -d '{"text":"Hello, this is a local CosyVoice speech test."}'
+```
+
+## Benchmark Baseline
+
+The benchmark called the sidecar `/synthesize` endpoint directly. TTFB is the time to the
+first PCM byte, and RTF is wall time divided by audio duration, so values below 1 are
+faster than real time. The RTX 3090 baseline used a standalone CosyVoice3 sidecar venv with
+`FP16 + LOAD_TRT=1` and the autocast fp16 TensorRT plan loaded. The RTX 4090 results used the
+same OpenTalking sidecar path with `TOKEN_HOP_LEN=8`, `TOKEN_MAX_HOP_LEN=16`, and `STREAM_SCALE_FACTOR=1`.
+
+| Device | Mode | Text length | TTFB | Wall time | Audio duration | RTF |
+|---|---|---:|---:|---:|---:|---:|
+| RTX 3090 | FP16 + TRT autocast | 43 chars | 0.683 s | 6.215 s | 7.200 s | 0.863 |
+| RTX 3090 | FP16 + TRT autocast | 42 chars | 0.642 s | 5.858 s | 6.960 s | 0.842 |
+| RTX 3090 | FP16 + TRT autocast | 29 chars | 0.639 s | 5.771 s | 6.520 s | 0.885 |
+| RTX 3090 | **Average** | **-** | **0.655 s** | **5.948 s** | **6.893 s** | **0.863** |
+| RTX 4090 | FP16 CUDA | 39 chars | 1.316 s | 11.662 s | 6.800 s | 1.715 |
+| RTX 4090 | FP16 CUDA | 38 chars | 0.895 s | 11.199 s | 7.120 s | 1.573 |
+| RTX 4090 | FP16 CUDA | 21 chars | 1.110 s | 9.493 s | 5.640 s | 1.683 |
+| RTX 4090 | **FP16 CUDA average** | **-** | **1.107 s** | **10.785 s** | **6.520 s** | **1.657** |
+| RTX 4090 | FP16 + TRT autocast | 39 chars | 0.772 s | 7.507 s | 6.800 s | 1.104 |
+| RTX 4090 | FP16 + TRT autocast | 38 chars | 0.560 s | 5.613 s | 7.120 s | 0.788 |
+| RTX 4090 | FP16 + TRT autocast | 21 chars | 0.507 s | 4.435 s | 5.640 s | 0.786 |
+| RTX 4090 | **FP16 + TRT autocast average** | **-** | **0.613 s** | **5.852 s** | **6.520 s** | **0.893** |
+
+This baseline covers only the TTS sidecar. It does not include STT, LLM, QuickTalk,
+WebRTC, or browser playback latency.
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| `transformers` version conflict | Keep CosyVoice in a separate sidecar venv; do not install it into the main OpenTalking `.venv`. |
+| High first-chunk latency | First chunk depends on model inference and voice loading; prewarm in production. |
+| OpenTalking cannot reach the service | Check `OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL` and the sidecar port. |
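The RTF column in the CosyVoice benchmark table above is consistent with wall time divided by audio duration, and the average row matches the mean of the per-run RTF values. A minimal sketch that reproduces the three RTX 3090 rows from the table's own numbers (`rtf` is a hypothetical helper name):

```python
from statistics import mean

# (wall_time_s, audio_duration_s) for the three RTX 3090 runs in the table.
RUNS_3090 = [(6.215, 7.200), (5.858, 6.960), (5.771, 6.520)]


def rtf(wall_s: float, audio_s: float) -> float:
    """Real-time factor: below 1.0 means synthesis finishes faster than playback."""
    return wall_s / audio_s


per_run = [round(rtf(w, a), 3) for w, a in RUNS_3090]
print(per_run)                  # [0.863, 0.842, 0.885]
print(round(mean(per_run), 3))  # 0.863
```

The same arithmetic applied to the RTX 4090 rows gives an RTF above 1 for plain FP16 CUDA, meaning synthesis took longer than the audio it produced in those runs.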
--- a/docs/en/speech_models/tts/indextts.md
+++ b/docs/en/speech_models/tts/indextts.md
@@ -0,0 +1,93 @@
+# IndexTTS Local Deployment
+
+IndexTTS is integrated through OpenTalking's `indextts` provider. Use it for controllable dubbing, emotion control, and cloned voices. This page covers the same-machine HTTP sidecar shape.
+
+## Use Cases
+
+- More voice control than the default Edge TTS path.
+- IndexTTS runtime should be isolated from the main OpenTalking process.
+- TTS must run locally instead of through a hosted API.
+
+## Weight Preparation
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+mkdir -p ./avatar_models/local-audio
+
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+uv sync --extra dev --extra models --extra local-audio --python 3.11
+
+python scripts/download_local_audio_models.py \
+  --root ./avatar_models/local-audio \
+  --model indextts2 \
+  --model indextts2-w2v-bert \
+  --model indextts2-maskgct \
+  --model indextts2-campplus \
+  --model indextts2-bigvgan
+```
+
+Prepare the runtime:
+
+```bash title="Terminal"
+mkdir -p ./avatar_models/local-audio/runtime
+GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/index-tts/index-tts.git ./avatar_models/local-audio/runtime/index-tts
+cd ./avatar_models/local-audio/runtime/index-tts
+uv sync --python 3.11
+uv pip install fastapi "uvicorn[standard]" soundfile
+```
+
+## Configuration
+
+```env title=".env"
+OPENTALKING_TTS_DEFAULT_PROVIDER=indextts
+OPENTALKING_TTS_INDEXTTS_BACKEND=local
+OPENTALKING_TTS_INDEXTTS_SERVICE_URL=http://127.0.0.1:19092/synthesize
+```
+
+When OmniRT hosts the IndexTTS runtime, OpenTalking still exposes
+`provider=indextts`; `backend=omnirt` switches it to the remote resident service.
+OmniRT owns model loading, segmented streaming, and token-window streaming:
+
+```env title=".env"
+OPENTALKING_TTS_DEFAULT_PROVIDER=indextts
+OPENTALKING_TTS_INDEXTTS_BACKEND=omnirt
+OPENTALKING_TTS_OMNIRT_INDEXTTS_SERVICE_URL=http://127.0.0.1:9012/v1/text2audio/indextts
+OPENTALKING_TTS_OMNIRT_INDEXTTS_STREAMING_MODE=token_window
+OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_SIZE=40
+```
+
+## Start Command
+
+Start the IndexTTS sidecar first, then start OpenTalking. The exact sidecar command depends on the IndexTTS runtime version; make sure it exposes an HTTP endpoint matching `OPENTALKING_TTS_INDEXTTS_SERVICE_URL`.
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+# Run the sidecar with the interpreter from the IndexTTS runtime's own venv.
+./avatar_models/local-audio/runtime/index-tts/.venv/bin/python \
+  scripts/local_indextts_service.py --host 127.0.0.1 --port 19092
+```
+
+Then start OpenTalking:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8000 --web-port 5173
+```
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:8000/health
+curl -fsS --max-time 300 http://127.0.0.1:8000/runtime/status | jq '.tts_providers.indextts.backend, .tts_providers.indextts.resolved_provider'
+```
+
+After creating a `mock` session, call `/speak` to verify that the TTS provider returns audio.
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| Sidecar API path mismatch | Check that the sidecar's HTTP route and port match `OPENTALKING_TTS_INDEXTTS_SERVICE_URL`. |
+| Missing downloaded files | Re-run the download script and confirm all five `indextts2*` model directories exist. |
+| Dependency conflicts | Keep the IndexTTS runtime in its own venv. |
+| Slow first startup | The downloader resumes partial downloads; confirm the model directories are complete, then restart the sidecar. |
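The "Sidecar API path mismatch" row in the IndexTTS page above comes down to comparing the configured service URL with the host, port, and route the sidecar actually listens on. A minimal sketch of that comparison, assuming the route is `/synthesize` as in the configured URL; `service_url_matches` is a hypothetical helper:

```python
from urllib.parse import urlsplit


def service_url_matches(service_url: str, host: str, port: int,
                        route: str = "/synthesize") -> bool:
    """True when the configured URL targets the sidecar's host, port, and route."""
    parts = urlsplit(service_url)
    return (parts.hostname, parts.port, parts.path) == (host, port, route)


# The sidecar in the start command listens on 127.0.0.1:19092.
print(service_url_matches("http://127.0.0.1:19092/synthesize", "127.0.0.1", 19092))  # True
print(service_url_matches("http://127.0.0.1:19190/synthesize", "127.0.0.1", 19092))  # False
```

A `False` result means OpenTalking would send synthesis requests to an address where no sidecar is listening.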
--- a/docs/en/speech_models/tts/qwen3-tts.md
+++ b/docs/en/speech_models/tts/qwen3-tts.md
@@ -0,0 +1,80 @@
+# Qwen3-TTS Local Deployment
+
+Qwen3-TTS is integrated through OpenTalking's `local_qwen3_tts` provider. It runs as a local HTTP sidecar and is useful for private deployments that need Qwen3-TTS Base voice-cloning behavior.
+
+## Use Cases
+
+- Local Qwen3-TTS Base generation or voice cloning is required.
+- The TTS runtime should be isolated from the main OpenTalking process.
+- Reference audio and reference text are available for the Base model's voice-clone input.
+
+## Weight Preparation
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+mkdir -p ./avatar_models/local-audio
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download Qwen/Qwen3-TTS-12Hz-0.6B-Base \
+  --local-dir ./avatar_models/local-audio/Qwen__Qwen3-TTS-12Hz-0.6B-Base
+```
+
+## Sidecar Environment
+
+Use a separate venv for Qwen3-TTS to avoid dependency conflicts with the main OpenTalking environment:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+uv venv .venv-qwen3-tts --python 3.11
+source .venv-qwen3-tts/bin/activate
+uv pip install -e ".[local-qwen3-tts-service]"
+```
+
+## Configuration
+
+```env title=".env"
+OPENTALKING_TTS_DEFAULT_PROVIDER=local_qwen3_tts
+OPENTALKING_LOCAL_QWEN3_TTS_SERVICE_URL=http://127.0.0.1:19091/synthesize
+OPENTALKING_LOCAL_QWEN3_TTS_MODEL_DIR=./avatar_models/local-audio/Qwen__Qwen3-TTS-12Hz-0.6B-Base
+OPENTALKING_LOCAL_QWEN3_TTS_DEVICE=cuda:0
+OPENTALKING_LOCAL_QWEN3_TTS_DTYPE=bfloat16
+OPENTALKING_LOCAL_QWEN3_TTS_REF_AUDIO=/path/to/reference.wav
+OPENTALKING_LOCAL_QWEN3_TTS_REF_TEXT="Transcript matching the reference audio"
+```
+
+## Start Command
+
+Start the Qwen3-TTS sidecar first:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+source .venv-qwen3-tts/bin/activate
+python scripts/local_qwen3_tts_service.py --host 127.0.0.1 --port 19091
+```
+
+Start OpenTalking from another terminal:
+
+```bash title="Terminal"
+cd "$OPENTALKING_HOME"
+bash scripts/start_unified.sh --backend mock --model mock --api-port 8000 --web-port 5173
+```
+
+## Verification
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:19091/health
+curl -s -X POST http://127.0.0.1:19091/synthesize \
+  -H 'content-type: application/json' \
+  -d '{"text":"Hello, this is a local Qwen3-TTS test."}' \
+  --output /tmp/qwen3-tts-test.wav
+```
+
+## Common Errors
+
+| Symptom | Action |
+|---------|--------|
+| `reference audio and reference text` error | The Base model requires reference audio and text; set `OPENTALKING_LOCAL_QWEN3_TTS_REF_AUDIO` and `OPENTALKING_LOCAL_QWEN3_TTS_REF_TEXT`. |
+| Model directory not found | Check that `OPENTALKING_LOCAL_QWEN3_TTS_MODEL_DIR` points to the downloaded directory. |
+| Dependency conflicts | Use the separate `.venv-qwen3-tts`; do not install sidecar dependencies into the main `.venv`. |
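The Qwen3-TTS verification command above saves the response to `/tmp/qwen3-tts-test.wav`, but a successful `curl` exit does not prove the file contains audio. The sketch below inspects the file header with the standard-library `wave` module. It assumes the sidecar returns a PCM WAV container, which the `.wav` filename suggests but the page does not state, and `wav_summary` is a hypothetical helper.

```python
import wave


def wav_summary(path: str) -> dict:
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate if rate else 0.0,
        }
```

Calling `wav_summary("/tmp/qwen3-tts-test.wav")` after the `curl` command should report a non-zero duration. If the sidecar returned a JSON error body instead of audio, `wave.open` raises an exception (typically `wave.Error`), which is itself a useful signal.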
--- a/docs/en/tutorials/cases/custom-avatar.md
+++ b/docs/en/tutorials/cases/custom-avatar.md
@@ -2,9 +2,9 @@

 ## Goal

-Prepare a custom avatar that OpenTalking can discover and use in a browser session. Each
-talking-head model has its own asset expectations; this case uses Wav2Lip-style assets as
-the smallest runnable path.
+Prepare a custom avatar that OpenTalking can discover and use in a browser session.
+The avatar itself is a shared visual asset; this case uses the Wav2Lip preparation
+scripts to generate the smallest runnable reference-frame derivatives.

 ## Prerequisites

@@ -14,7 +14,7 @@ the smallest runnable path.

 ## Steps

-Create a Wav2Lip avatar from an image:
+Create reference-frame derivatives from an image:

 ```bash title="Terminal"
 python scripts/prepare_wav2lip_image_asset.py \
@@ -25,7 +25,7 @@ python scripts/prepare_wav2lip_image_asset.py \
  --fps 25
 ```

-Create a Wav2Lip avatar from a video:
+Create reference-frame derivatives from a video:

 ```bash title="Terminal"
 python scripts/prepare_wav2lip_video_asset.py \
--- a/docs/en/tutorials/cases/flashtalk.md
+++ b/docs/en/tutorials/cases/flashtalk.md
@@ -10,7 +10,7 @@ to OmniRT or the external model service.

 - [Mock E2E](mock-e2e.md) has passed.
 - SoulX-FlashTalk-14B and wav2vec2 weights are prepared as described in
-  [Talking-head Models → FlashTalk](../../model-deployment/flashtalk.md).
+  [Talking-head Models → FlashTalk](../../avatar_models/flashtalk.md).
 - CUDA evaluation needs a 4090/A100-class GPU; Ascend evaluation needs the CANN environment.

 ## Steps
--- a/docs/en/tutorials/cases/wav2lip.md
+++ b/docs/en/tutorials/cases/wav2lip.md
@@ -10,7 +10,7 @@ can move to `local`.

 - [Mock E2E](mock-e2e.md) has passed.
 - `wav2lip384.pth` and `s3fd.pth` are downloaded as described in
-  [Talking-head Models → Wav2Lip](../../model-deployment/wav2lip/local.md).
+  [Talking-head Models → Wav2Lip](../../avatar_models/wav2lip.md).
 - An OmniRT checkout exists next to `opentalking/`.

 ## Steps
@@ -40,8 +40,8 @@ bash scripts/quickstart/start_all.sh --omnirt http://127.0.0.1:9000
 curl -fsS http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="wav2lip")'
 ```

-The status should report `backend: omnirt` and `connected: true`. In the browser, choose a
-Wav2Lip-compatible avatar before starting the session.
+The status should report `backend: omnirt` and `connected: true`. In the browser,
+choose an available avatar before starting the session.

 ## Troubleshooting

@@ -49,4 +49,4 @@ Wav2Lip-compatible avatar before starting the session.
 |---------|--------|
 | `/models` reports `not_configured` | Check `OMNIRT_ENDPOINT` in the active `.env` and restart OpenTalking. |
 | OmniRT exits during startup | Inspect the script log path and verify the Wav2Lip/S3FD weight filenames. |
-| Avatar mismatch | Use an avatar with `model_type: wav2lip` or regenerate assets with `scripts/prepare_wav2lip_*_asset.py`. |
+| Avatar asset unavailable | Check that the avatar is uploaded and readable, and that the session configuration is complete. |
--- a/docs/en/tutorials/configuration.md
+++ b/docs/en/tutorials/configuration.md
@@ -54,11 +54,10 @@ and does not require an API key.

 | Variable | Default | Description |
 |----------|---------|-------------|
-| `OPENTALKING_TTS_DEFAULT_PROVIDER` | `edge` | One of `edge`, `dashscope`, `local_cosyvoice`, `indextts`, `cosyvoice`, `elevenlabs`, `openai_compatible`, `xiaomi_mimo`. |
-| `OPENTALKING_TTS_ENABLED_PROVIDERS` | _empty_ | TTS providers shown in the frontend and status page, for example `edge,dashscope,local_cosyvoice,indextts`. |
+| `OPENTALKING_TTS_DEFAULT_PROVIDER` | `edge` | One of `edge`, `dashscope`, `local_cosyvoice`, `cosyvoice`, `elevenlabs`. |
+| `OPENTALKING_TTS_ENABLED_PROVIDERS` | _empty_ | TTS providers shown in the frontend and status page, for example `edge,dashscope,local_cosyvoice`. |
 | `OPENTALKING_TTS_DASHSCOPE_MODEL` | _empty_ | TTS model id; DashScope Qwen realtime TTS commonly uses `qwen3-tts-flash-realtime`. |
 | `OPENTALKING_TTS_DASHSCOPE_API_KEY` | _empty_ | TTS module API key. It is not populated from LLM or STT fallback keys. |
-| `OPENTALKING_TTS_INDEXTTS_BACKEND` | `local` | Backend used by `indextts`: `local` connects to a same-host IndexTTS HTTP sidecar; `omnirt` connects to an OmniRT resident service. |
 | `OPENTALKING_TTS_EDGE_VOICE` | `zh-CN-XiaoxiaoNeural` | Edge TTS voice. |
 | `OPENTALKING_TTS_DASHSCOPE_VOICE` | `Cherry` | DashScope Qwen realtime TTS voice. |
 | `OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL` | _empty_ | Local CosyVoice service HTTP URL, for example `http://127.0.0.1:19090/synthesize`. |
@@ -72,7 +71,7 @@ Configuration for DashScope realtime TTS and ElevenLabs is documented in
 The variables in this section are consulted only when the client selects `wav2lip`,
 `musetalk`, `flashtalk`, or `flashhead`. The `mock` backend ignores all entries here.
 For weight downloads and model-specific startup commands, see
-[Models](../model-deployment/index.md).
+[Models](../deployment/index.md).

 OpenTalking selects the inference entry point per model through `backend`; it is not
 tied to one inference platform. Recommended defaults:
@@ -224,7 +223,7 @@ infrastructure:
  worker_url: http://127.0.0.1:9001
 flashtalk:
  mode: off
-  ckpt_dir: ./models/SoulX-FlashTalk-14B
+  ckpt_dir: ./avatar_models/SoulX-FlashTalk-14B
  port: 8765
 flashhead:
  ws_url: ""
--- a/docs/en/tutorials/index.md
+++ b/docs/en/tutorials/index.md
@@ -18,7 +18,7 @@ Verify orchestration with `mock` first, then connect a real talking-head backend
 |----------|----------|
 | First real lip-sync model | [Wav2Lip integration](cases/wav2lip.md) |
 | High-quality FlashTalk/OmniRT path | [FlashTalk integration](cases/flashtalk.md) |
-| Model, weight, and deployment selection | [Model deployment](../model-deployment/index.md) |
+| Model, weight, and deployment selection | [Model deployment](../deployment/index.md) |

 These pages focus on low-level integration steps. If you want to start from business
 scenarios, see [Use Cases](../cases/index.md).
--- a/docs/en/tutorials/install-from-source.md
+++ b/docs/en/tutorials/install-from-source.md
@@ -271,7 +271,7 @@ System resources required:

 For production deployments that require horizontal Worker scaling or component
 isolation. The architecture and operational characteristics are documented in
-[Deployment](../model-deployment/deployment.md#api-and-worker-split).
+[Deployment](../deployment/index.md).

 ### Prerequisites

@@ -335,7 +335,7 @@ For single-host production deployments using source installation:
   WantedBy=multi-user.target
   ```

-4. Configure the production checklist items documented in [Deployment → Production checklist](../model-deployment/deployment.md#production-checklist).
+4. Configure the production checklist items documented in [Deployment → Production checklist](../deployment/index.md).

 ## Updates

--- a/docs/en/tutorials/installation.md
+++ b/docs/en/tutorials/installation.md
@@ -18,13 +18,13 @@ streamlined first-run procedure, see the [Quickstart](quickstart.md).
 | Evaluation on Ascend NPU | Huawei 910B (CANN 8.0+) | Source install on the host CANN environment | [From source → Ascend 910B](install-from-source.md#ascend-910b) |
 | Continuous integration | CPU | Source install or Docker Compose, depending on reproducibility needs | [From source](install-from-source.md#development-cpu-mock-synthesis) or [Docker Compose → CPU profile](install-with-docker.md#cpu-profile) |
 | Production single-host deployment | Linux + GPU or NPU | Source install or Docker, depending on operations preference | [From source → Production](install-from-source.md#production-deployment) or [Docker Compose](install-with-docker.md) |
-| Production multi-host deployment with horizontal Worker scaling | Linux + GPU or NPU | Source install, API/Worker split, external Redis | [From source → API and Worker split](install-from-source.md#api-and-worker-split) and [Deployment](../model-deployment/deployment.md) |
+| Production multi-host deployment with horizontal Worker scaling | Linux + GPU or NPU | Source install, API/Worker split, external Redis | [From source → API and Worker split](install-from-source.md#api-and-worker-split) and [Deployment](../deployment/index.md) |

 ## Platform support matrix

 | Platform | Synthesis backends | Notes |
 |----------|-------------------|-------|
-| macOS (Apple Silicon and Intel) | `mock`, experimental `quicktalk` local on Apple Silicon | Suitable for orchestration and frontend development. QuickTalk local can be tested on Apple Silicon with `quicktalk-cpu`; see [QuickTalk on Apple Silicon](../model-deployment/quicktalk/apple-silicon.md) for the full path. Realtime production paths still target Linux GPU/NPU or OmniRT. |
+| macOS (Apple Silicon and Intel) | `mock` | Suitable for orchestration and frontend development. Real talking-head models are not supported on macOS. |
 | Linux x86_64 + CUDA 12 | `mock`, `wav2lip`, `musetalk`, `flashtalk`, `flashhead`, `quicktalk` | Primary deployment target. |
 | Linux aarch64 + Ascend 910B (CANN 8.0+) | `mock`, `wav2lip`, `flashtalk` | NPU production deployment path. |
 | Windows | `mock` (WSL2 recommended) | Not part of the continuous integration matrix. |
@@ -70,4 +70,4 @@ curl -s http://127.0.0.1:8000/models | jq
 - [From source](install-from-source.md) — install from a git checkout. Covers development, production, and Ascend variants.
 - [Docker Compose](install-with-docker.md) — install with the packaged Docker stack for reproducible deployments.
 - [Configuration](configuration.md) — required environment configuration after installation.
-- [Deployment](../model-deployment/deployment.md) — selecting a runtime topology.
+- [Deployment](../deployment/index.md) — selecting a runtime topology.
--- a/docs/en/tutorials/quickstart.md
+++ b/docs/en/tutorials/quickstart.md
@@ -97,7 +97,7 @@ validation of the pipeline before a real model is integrated.

 Once the mock path has been verified, the system may be reconfigured to use a real
 talking-head model. The complete per-model weight download and startup procedures are
-documented in [Models](../model-deployment/index.md). The shortest paths are:
+documented in [Models](../deployment/index.md). The shortest paths are:

 === "wav2lip"

@@ -119,7 +119,7 @@ documented in [Models](../model-deployment/index.md). The shortest paths are:

    Restart `start_all.sh` and select `wav2lip` in the model selector. For
    China-friendly download alternatives, see
-    [Models → Wav2Lip](../model-deployment/wav2lip/local.md).
+    [Models → Wav2Lip](../avatar_models/wav2lip.md).

 === "FlashTalk"

@@ -138,7 +138,7 @@ documented in [Models](../model-deployment/index.md). The shortest paths are:

    Select `flashtalk` in the model selector. For FlashTalk weight directories,
    CUDA/Ascend startup, and domestic mirror links, see
-    [Models → FlashTalk](../model-deployment/flashtalk.md).
+    [Models → FlashTalk](../../avatar_models/flashtalk.md).

 === "Ascend 910B"

@@ -176,7 +176,7 @@ The following table lists common installation issues and their resolutions.
 ## Next steps

 - [Configuration](configuration.md) — reference for all environment variables and YAML fields.
-- [Models](../model-deployment/index.md) — end-to-end setup for each supported model backend.
-- [Deployment](../model-deployment/deployment.md) — multi-process deployment, Docker Compose, and production guidance.
+- [Models](../deployment/index.md) — end-to-end setup for each supported model backend.
+- [Deployment](../deployment/index.md) — multi-process deployment, Docker Compose, and production guidance.
 - [Architecture](../docs/architecture.md) — system internals and event bus schema.
 - [API interfaces](../docs/api/index.md) — complete HTTP and WebSocket endpoint documentation.
--- a/docs/en/usage/cli.md
+++ b/docs/en/usage/cli.md
@@ -120,7 +120,7 @@ If several OpenTalking instances are running, check `status.sh` first.

 ### `prepare_wav2lip_image_asset.py`

-Prepare one image as a Wav2Lip avatar asset:
+Prepare one image as avatar reference-frame derivatives:

 ```bash
 uv run python scripts/prepare_wav2lip_image_asset.py \
@@ -134,7 +134,7 @@ The script writes `manifest.json`, `reference.png`, `preview.png`, frame assets,

 ### `prepare_wav2lip_video_asset.py`

-Prepare one video as a Wav2Lip avatar asset:
+Prepare one video as avatar reference-frame derivatives:

 ```bash
 uv run python scripts/prepare_wav2lip_video_asset.py \
--- a/docs/en/usage/index.md
+++ b/docs/en/usage/index.md
@@ -10,6 +10,7 @@ This section is for developers and integrators who have completed the basic setu

 - Starting OpenTalking services, the frontend, and helper scripts.
 - Using WebUI for avatar, model, voice, and session configuration.
+- Using WebUI Video Clone to drive a source avatar with camera frames or uploaded video.
 - Preparing custom avatars and previewing or cloning voices.
 - Common parameters, ports, backends, and environment files.

@@ -25,9 +26,9 @@ Start with [Command Line Tools](./cli.md), then use [Advanced CLI Arguments](./c

 ### WebUI

-WebUI is best for interactive validation. It provides avatar selection, model selection, TTS provider and voice configuration, text or voice conversation, video creation, video clone, and status feedback.
+WebUI is best for interactive validation. It provides avatar selection, model selection, TTS provider and voice configuration, text or voice conversation, Video Clone, and status feedback.

-Start with [WebUI Basic Usage](./webui/basic.md), then continue to [Custom Avatar](./webui/custom-avatar.md), [Voice and TTS](./webui/voice-and-tts.md), [Video Creation](./webui/video-creation.md), or [Video Clone](./webui/video-clone.md) when needed.
+Start with [WebUI Basic Usage](./webui/basic.md), then continue to [Custom Avatar](./webui/custom-avatar.md), [Voice and TTS](./webui/voice-and-tts.md), or [Video Clone](./webui/video-clone.md) when needed.

 ## Recommended Paths

@@ -52,21 +53,11 @@ Mock mode does not require model weights, so it is the fastest way to verify the
 2. Select or preview a default voice in WebUI.
 3. If voice cloning is needed, prepare provider credentials, sample audio, and public access settings.

-### I Want to Create a Digital-Human Video
+### I Want to Drive an Avatar with Video

-1. Confirm WebUI opens and at least one avatar is available.
-2. Read [Video Creation](./webui/video-creation.md), then choose a connected talking-head model for offline generation.
-3. Choose an audio source: upload audio, synthesize text, or clone a voice first.
-4. Preview the result on the right or open the asset library to download the generated video.
-
-Video Creation is an offline or near-realtime generation workflow. It does not require a realtime conversation session or WebRTC session.
-
-### I Want to Drive an Avatar with Camera Expressions
-
-1. Start the FasterLivePortrait OmniRT runtime according to the deployment guide.
-2. Read [Video Clone](./webui/video-clone.md), then choose the source avatar.
-3. Use a camera or uploaded driving video for expression and head-motion input.
-4. Tune pasteback, driving crop, animation region, and mouth controls as needed.
+1. Prepare and start the FasterLivePortrait OmniRT runtime.
+2. Open WebUI and switch to Video Clone.
+3. Read [Video Clone](./webui/video-clone.md), select a source avatar, then use camera frames or an uploaded video as driving input.

 ### I Want to Start Services from CLI

@@ -92,7 +83,6 @@ The recommended getting-started path currently focuses on WebUI and CLI. Detaile

 - Start and debug services with scripts: [Command Line Tools](./cli.md).
 - Learn the UI workflow: [WebUI Basic Usage](./webui/basic.md).
+- Drive an avatar with camera or uploaded video: [Video Clone](./webui/video-clone.md).
 - Add your own avatar: [Custom Avatar](./webui/custom-avatar.md).
 - Configure speech: [Voice and TTS](./webui/voice-and-tts.md).
-- Generate offline videos: [Video Creation](./webui/video-creation.md).
-- Drive an avatar with camera or uploaded video: [Video Clone](./webui/video-clone.md).
--- a/docs/en/usage/webui/basic.md
+++ b/docs/en/usage/webui/basic.md
@@ -14,16 +14,6 @@ Use WebUI to:

 WebUI is not a production admin system or a full asset management platform. It is a visual validation and debugging entry point.

-## Workflow Entrypoints
-
-The top navigation exposes different workflows:
-
-- “Realtime Conversation”: select avatar, model, and voice, then enter the LLM / TTS / talking-head pipeline.
-- “Video Creation”: select an avatar and audio source, then generate an offline digital-human video.
-- “Video Clone”: keep one digital-human asset as the source, then drive its expression and head motion with a camera or uploaded video.
-
-Video Creation and Video Clone are independent from the realtime conversation `speak` queue. Video Creation is for downloadable narrated videos. Video Clone is for validating camera-driven expression cloning after the FasterLivePortrait runtime is available.
-
 ## Open WebUI

 Start services with:
@@ -40,12 +30,19 @@ http://127.0.0.1:5173

 If you changed ports, use the URL printed by the terminal.

-![WebUI first screen: workflow tabs, model selection, avatar library, and session panel.](../../../assets/images/usage/webui/webui-first-screen.png)
-
-*WebUI first screen: workflow tabs, model selection, avatar library, and session panel.*
+![WebUI first screen: realtime conversation workflow, avatar library, and settings panel](../../../assets/images/usage/webui-first-screen.png)

 ## Page Layout

+### Top Workflows
+
+The top navigation switches between workflows. The most common entries are Realtime Conversation and Video Clone:
+
+- Realtime Conversation is for digital-human sessions involving LLM / TTS / STT.
+- Video Clone is for FasterLivePortrait video driving, where camera frames or uploaded video are only the driving input.
+
+For details, see [Video Clone](./video-clone.md).
+
 ### Avatar Selection

 The avatar area lists available digital humans. Each item usually has a preview image, name, and type label. Custom avatars are marked as custom and can be deleted.
@@ -70,35 +67,6 @@ The conversation panel is used for text input, replies, and digital human playba

 Start with short text to verify first frame, audio, and captions before testing long prompts or continuous voice.

-### Video Creation Panel
-
-After entering “Video Creation”, the page has three columns:
-
-- Left Source: select the avatar for the narrated video, or upload an image to create a new avatar.
-- Center Offline Generation: choose the generation model, title, and audio source. Audio can come from an upload, TTS text, or a cloned voice.
-- Right Result: preview, download, or open the asset library after generation completes.
-
-See [Video Creation](./video-creation.md) for the detailed workflow.
-
-### Video Clone Panel
-
-After entering “Video Clone”, the page has three columns:
-
-- Left Source: select an existing avatar or upload a new source image. The source is the digital-human asset that will be driven.
-- Center Output: cloned output, connection status, sent/received frames, dropped frames, and latency.
-- Right Driving: select a camera, set FPS/resolution, or upload a driving video. Driving only provides expression and head motion; it does not become the identity.
-
-For source uploads, use a clear frontal or half-body image. The uploaded image is added to the avatar library and selected automatically. Uploading a driving video is a separate flow for testing a selfie video as the motion input.
-
-Useful controls:
-
-- “Pasteback”: preserve the original source composition instead of showing only a zoomed head.
-- “Crop driving face”: off by default; enable it only when the driving face is small or unstable.
-- “Mouth opening” and “lip retargeting”: tune mouth motion. Retargeting can improve mouth shape, but aggressive settings may reduce motion to simple vertical opening.
-- “Animation region”: choose mouth-only for lip tests, or full expression for richer motion.
-
-See [Video Clone](./video-clone.md) for the detailed workflow.
-
 ### Status and Errors

 WebUI displays connection, session, model, and TTS errors. When something fails, read the page message first, then inspect API and WebUI logs.
@@ -139,9 +107,7 @@ Hello, please briefly introduce OpenTalking.

 After reply, audio, and video are working, test more complex input.

-![Pre-session confirmation: selected avatar, driver model, and voice.](../../../assets/images/usage/webui/webui-session-ready.png)
-
-*Pre-session confirmation: check avatar, driver model, and voice before clicking Start Conversation.*
+![WebUI session workspace: selected avatar, model, and session panel](../../../assets/images/quick-start/mock-first-session.png)

 ## Common Operations

@@ -161,28 +127,6 @@ Voice changes affect future replies. Already generated audio is not re-synthesiz

 The page shows conversation text, generated replies, and some status events. For detailed backend events, inspect API logs or later reference materials.

-### Use Video Creation
-
-After OpenTalking is running:
-
-1. Switch the top navigation to “Video Creation”.
-2. Select an existing avatar on the left, or upload an image to create a narrated-video avatar.
-3. Choose `quicktalk` or `wav2lip` as the generation model.
-4. Choose an audio source: upload audio, synthesize text, or clone a voice first.
-5. Click Generate and Save.
-6. Preview the result on the right or open the asset library to download it.
-
-### Use Video Clone
-
-After FasterLivePortrait and OmniRT are started according to the model documentation:
-
-1. Switch the top navigation to “Video Clone”.
-2. Select an existing avatar on the left, or upload a new source image.
-3. Select a camera on the right; for uploaded-video testing, upload a driving video.
-4. Adjust FPS, resolution, animation region, and mouth controls as needed.
-5. Click Start and inspect the output in the center.
-6. Click Stop or switch pages to release the camera and WebSocket.
-
 ### Stop or Recreate Session

 If inference stalls, audio breaks, or configuration changes behave unexpectedly, stop the current session and create a new one. If needed, restart services:
@@ -214,11 +158,3 @@ Check browser mute state, TTS provider configuration, voice availability, creden
 ### Microphone Unavailable

 Check browser permissions, system microphone permissions, and whether the page is opened from `localhost` or `127.0.0.1`.
-
-### Video Clone Cannot Start the Camera
-
-Open the page from `localhost` or `127.0.0.1`, allow camera permissions, and make sure the camera is not occupied by another app. If camera access is unavailable, upload a driving video first to validate the backend video-clone service.
-
-### Video Clone Service Connection Fails
-
-Check `/video-clone/status`, then verify that the OmniRT FasterLivePortrait runtime is running. Startup steps are covered in [FasterLivePortrait](../../model-support/models/fasterliveportrait.md).
--- a/docs/en/usage/webui/custom-avatar.md
+++ b/docs/en/usage/webui/custom-avatar.md
@@ -14,14 +14,17 @@ This page explains:

 ## Avatar and Model Compatibility

-OpenTalking is gradually decoupling avatars from models so one avatar can be reused where possible, but models still have different asset requirements.
+OpenTalking treats avatars as shared visual assets: one avatar can be reused across
+different talking-head models. Model-specific caches, template videos, reference
+frames, or preprocessing files are generated by upload flows, preparation scripts, or
+the first session.

 General guidance:

 - Image avatars are best for quick validation.
 - Video avatars preserve more natural motion but require more preparation.
-- QuickTalk can generate a template video from an uploaded image for quick validation.
-- Wav2Lip depends more on preprocessed frames, mouth metadata, and manifest files.
+- Deployment pages explain how to generate the derivatives a model needs; the
+  avatar itself is not model-specific.

 If unsure, start with image upload in WebUI.

@@ -61,15 +64,16 @@ Video assets are currently best prepared with scripts and placed under the avata
 5. Enter a new avatar name and upload an image.
 6. Select the new avatar after processing completes.

-![Custom avatar upload entry in WebUI.](../../../assets/images/usage/webui/custom-avatar-upload.png)
-
-*Custom avatar upload entry in WebUI. Click it to choose a local image and create a new avatar.*
+<div class="ot-figure-placeholder">
+  <strong>Screenshot placeholder: upload custom avatar</strong>
+  <span>To be added: base avatar, name input, and image upload entry.</span>
+</div>

 If the result is not good, try a clearer image with a more frontal face.

 ### Prepare a Wav2Lip Image Asset

-To create a built-in Wav2Lip avatar from one image:
+To create built-in reference-frame derivatives from one image:

 ```bash
 uv run python scripts/prepare_wav2lip_image_asset.py \
--- a/docs/en/usage/webui/video-clone.md
+++ b/docs/en/usage/webui/video-clone.md
@@ -1,129 +1,98 @@
 # Video Clone

-Video Clone keeps one digital-human avatar as the source and uses a camera or uploaded video as the driving input. The driving input controls expression, head motion, and mouth motion on the source avatar.
+Video Clone is a WebUI workflow that sits alongside Realtime Conversation. It keeps an OpenTalking avatar-library asset as the source, then uses browser camera frames or an uploaded video as the driving video so the avatar follows the user's expression, mouth, and head motion.

-It does not enter the LLM / STT / TTS conversation path and does not reuse the Realtime Conversation `speak` queue. The main v1 flow is live camera driving; uploaded driving video is available for testing a recorded selfie video.
+![Video Clone workspace](../../../assets/images/usage/video-clone-workspace.png)

-![Video Clone page: source, clone output, and driving controls.](../../../assets/images/usage/webui/video-clone.png)
+## When to Use It

-*Video Clone page: source avatar on the left, clone output in the center, and camera or driving-video controls on the right.*
+- Validate FasterLivePortrait video-driven output.
+- Use a camera to test realtime expression and head-motion following.
+- Upload a selfie video to inspect driving-video mouth shape, crop, and pasteback behavior.
+
+Video Clone does not start an LLM conversation and does not call TTS or STT. It is separate from the audio-driven Realtime Conversation workflow.

 ## Prerequisites

-Video Clone depends on the FasterLivePortrait runtime. Start OmniRT according to the [FasterLivePortrait guide](../../model-support/models/fasterliveportrait.md), then confirm OpenTalking can see the video-clone status.
+1. OmniRT is running the FasterLivePortrait runtime.
+2. OpenTalking API can reach OmniRT.
+3. WebUI model status reports `fasterliveportrait` as connected.
+4. The browser can access the camera. Use `localhost`, `127.0.0.1`, or HTTPS.

-Common check:
-
-```bash
-curl -s http://127.0.0.1:8000/video-clone/status | python3 -m json.tool
-```
-
-If the status is disconnected, check the OmniRT endpoint, FasterLivePortrait source dependency, and model weight paths.
-
-## Source and Driving
-
-Two concepts matter:
-
-- `source`: the digital-human avatar shown in the final output. It comes from the OpenTalking avatar library or from an uploaded source image.
-- `driving`: the face input that provides expression, head motion, and mouth motion. It can come from a camera or uploaded driving video.
-
-The camera user does not become the digital-human identity. The camera only provides the motion signal; the output remains the selected source avatar.
+If services are not running yet, first follow the [FasterLivePortrait model page](../../../avatar_models/fasterliveportrait.md) and complete both its Start OmniRT and Start OpenTalking WebUI steps.

 ## Page Layout

-### Left Source
+### Source Avatar

-Use the left column to fix the source avatar:
+The source panel lists digital-human assets from the avatar library. Click an asset to make it the output character. Camera or uploaded video frames do not become the source; they only provide motion.

-- Click an existing avatar to switch the source.
-- Click Upload Source Image to upload a local image as a new source.
-- After upload, OpenTalking adds the image to the avatar library and selects it.
+If the output appears as an over-zoomed head, make sure `Pasteback` is enabled. It pastes the animated face back into the original source image so the body, background, and aspect ratio are preserved.

-Use a clear frontal or half-body image. Avoid heavy occlusion, extreme side faces, or very dark images.
+### Clone Output

-### Center Output
+The center panel shows the selected source and the cloned output. The status strip shows sent frames, received frames, dropped frames, and latency.

-The center column shows clone output. The top controls include:
+After stopping, click the change-avatar control to return to source selection and choose another asset.

-- “Record output”: record the current output and save it to the exported video asset library.
-- “Change avatar”: return to source selection.
-- Status button: shows stopped, connecting, or running state.
+### Driving Input

-The bottom status shows sent frames, received frames, dropped frames, and latency.
+The right panel is where you select the camera, FPS, and resolution and check the local preview. After you click Start, the browser samples camera frames through a canvas timer and sends them to the backend.

-### Right Driving
+Uploaded driving video is a secondary testing path for comparing the same selfie video under different parameters.

-The right column configures driving input:
+## Live Camera Driving

-- “Camera”: select the local camera.
-- “FPS”: frontend frame sampling rate.
-- “Resolution”: frame size sent to the runtime.
-- “Mirror preview”: mirrors camera preview and sent frames for selfie use.
-- “Upload driving video”: loop a local video as the driving input.
+1. Open WebUI and switch to Video Clone.
+2. Select a digital-human source on the left.
+3. Select camera, FPS, and resolution on the right.
+4. Click Start and allow browser camera permission.
+5. Watch the center output and status strip.
+6. Click Stop, and leave the page only after the camera preview closes.

-If the browser cannot open the camera, upload a driving video first to validate the backend service.
-
-## Steps
-
-1. Start the FasterLivePortrait OmniRT runtime.
-2. Start OpenTalking and open WebUI.
-3. Switch the top navigation to “Video Clone”.
-4. Select a source avatar on the left, or upload a new source image.
-5. Select a camera on the right, or upload a driving video.
-6. Adjust FPS, resolution, animation region, and mouth controls as needed.
-7. Click Start.
-8. Inspect the center output; click Record Output when you need to save it.
-9. Click Stop or switch workflows to release camera tracks, WebSocket, and the current clone session.
-
-## Parameter Tips
-
-### Pasteback
-
-Keep it enabled by default. Pasteback preserves the original source composition and avoids showing only a zoomed face.
-
-### Crop Driving Face
-
-Keep it disabled by default. Over-cropping uploaded driving video can make mouth shape and head position feel unnatural. Enable it only when the driving face is too small or face detection is unstable.
-
-### Animation Region
-
-- “Full expression”: full head motion and expression demo.
-- “Expression”: expression-focused motion.
-- “Pose”: head-pose-focused motion.
-- “Mouth”: lip-only checks.
- “Eyes”: blink and eye-motion checks.
-
-### Mouth Controls
-
-Mouth opening increases or reduces mouth amplitude. Lip retargeting can improve mouth closure, but aggressive settings may collapse motion into simple vertical opening. Change one parameter at a time.
+Start with `12fps` and `448px`. Increase FPS or resolution only after output is stable.

 ## Uploaded Driving Video

-Uploaded driving video does not change the source identity. It only provides motion input.
+Uploaded video is for validating driving-video expression, mouth, and crop behavior. Use a clear frontal or half-body selfie video. Avoid a tiny face, heavy occlusion, extreme head turns, or very narrow aspect ratios.

-Recommended driving video:
+If uploaded-video output looks worse than camera output:

-- Clear, unobstructed face.
-- Face not too far from the camera.
-- Face stays in frame.
-- Short clips for initial tuning.
+- Disable `Crop driving face` so the driving face is not cropped too tightly.
+- Enable `Pasteback` so output is not a cropped head-only view.
+- Enable `Lip retargeting` and disable `Relative motion`.
+- Change the driving region from `Mouth` to `Expression` or `All` and check whether mouth corners and cheeks recover.

-If uploaded video makes the mouth look puffy or unable to open, disable Crop Driving Face first, then check face position and scale in the driving video.
+## Parameter Suggestions
+
+| Parameter | Effect | Suggestion |
+| --- | --- | --- |
+| Motion amplitude | Overall driving strength | Start from `1.0` |
+| Expression amplitude | Expression and mouth strength | Start from `1.0` |
+| Head amplitude | Overall head motion | Start from `0.3` |
+| Mouth opening | Mouth open/close amplitude | `0.8-1.3` |
+| Yaw / pitch / roll | Pose components | Lower the component that looks too strong |
+| Pasteback | Preserve source composition | Keep enabled |
+| Stitching | Stabilize face boundary | Keep enabled |
+| Relative motion | Preserve source base pose | Usually disable when lip retargeting is enabled |
+| Lip normalize | Reduce initial mouth-shape offset | Keep enabled |
+| Lip retargeting | Improve mouth following | Try when the mouth is puffy or does not open enough |
+| Crop driving face | Crop input-video face | Disable when uploaded-video aspect ratio looks wrong |

 ## Common Issues

-### Camera Does Not Open
+### Cannot Start Camera or Video Clone Service

-Open the page from `localhost` or `127.0.0.1`, allow camera permission, and make sure the camera is not occupied by another application.
+Check browser permissions, page origin (`localhost` / `127.0.0.1` / HTTPS), and whether FasterLivePortrait is connected in `/models`.

-### Video Clone Service Fails to Connect
+### Uploaded Video Mouth Looks Puffy or Too Closed

-Check `/video-clone/status`, confirm the FasterLivePortrait runtime is running, and verify the OpenTalking OmniRT endpoint points to the right service.
+This is usually related to driving-video crop, face position, scaling, or lip parameters. Disable `Crop driving face` first, then try enabling `Lip retargeting` with `Relative motion` disabled.

-### Mouth Only Opens Vertically
+### Lip Retargeting Turns into Mostly Vertical Mouth Opening

-Reduce lip retargeting strength, or switch back to a fuller animation region. Mouth-only is useful for lip checks, but full demos should usually use Full Expression.
+Lip retargeting strengthens mouth open/close. If relative motion stays enabled, mouth corners and cheek movement can become weak. Disable `Relative motion` and switch the driving region to `Expression` or `All`.

-### Head Is Zoomed In
+### Avatar Aspect Ratio Looks Wrong After Selection

-Enable Pasteback. If the source image itself is face-heavy, use a half-body or wider-composition source image.
+Enable `Pasteback` and choose a source with the desired original composition. Video Clone should use the source image for output composition; the driving video only provides motion.
--- a/docs/en/user-guide/deployment.md
+++ b/docs/en/user-guide/deployment.md
@@ -7,6 +7,6 @@ search:
 # Deployment

 !!! note "Page moved"
-    This page only keeps compatibility with old links. The latest content is available at [Model Deployment → Deployment](../model-deployment/deployment.md).
+    This page only keeps compatibility with old links. The latest content is available at [Model Deployment → Deployment](../deployment/index.md).

-[Go to the new page →](../model-deployment/deployment.md)
+[Go to the new page →](../deployment/index.md)
--- a/docs/zh/avatar_models/avatar.md
+++ b/docs/zh/avatar_models/avatar.md
@@ -0,0 +1,47 @@
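+# Avatar Assets
+
+Avatar assets define the visual appearance of a digital human. OpenTalking currently treats an avatar as a general-purpose session asset:
+the same avatar can be reused by different talking-head models, and each model generates its own
+caches, templates, or preprocessing artifacts on demand at startup or when a session is created.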
+
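+## Minimum Rules
+
+A usable avatar bundle should contain at least:
+
+- `manifest.json`: declares basic information such as `id`, display name, dimensions, frame rate, and sample rate.
+- `preview.png`: shown in the WebUI avatar library.
+- Optional material: a single reference image, extracted frames, a template video, or model-generated caches.
+
+Do not author an avatar as an asset specific to QuickTalk, MuseTalk, or Wav2Lip. Derived artifacts that a model needs
+(for example QuickTalk templates, Wav2Lip reference frames, or the MuseTalk `prepared/` directory) should be generated by preparation scripts, the upload flow,
+or deployment commands.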
+
+## Example manifest
+
+```json title="examples/avatars/demo-avatar/manifest.json"
+{
+  "id": "demo-avatar",
+  "name": "Demo Avatar",
+  "fps": 25,
+  "sample_rate": 16000,
+  "width": 512,
+  "height": 512,
+  "metadata": {}
+}
+```
+
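+## Preparation and Verification
+
+For the full schema and preparation scripts, see:
+
+- [Avatar asset format](../docs/avatar-format.md)
+- [Models → Talking-head models](talking-head.md)
+
+Verify that the service recognizes the avatar:
+
+```bash title="Terminal"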
+curl -s http://127.0.0.1:8000/avatars | jq
+```
+
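+When troubleshooting, check three things together: the session `model`, whether the service can read the avatar, and whether the corresponding backend in `/models`
+is connected.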
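The manifest rules above can be checked before an avatar is uploaded. The following is a minimal sketch, not the project's schema validator: the field names come from the example manifest, while treating every one of them as required and the `check_manifest` helper itself are assumptions made for illustration.

```python
import json

# Field names are taken from the example manifest; treating all of them as
# required is an assumption of this sketch, not the project's official schema.
REQUIRED_FIELDS = {
    "id": str,
    "name": str,
    "fps": int,
    "sample_rate": int,
    "width": int,
    "height": int,
}


def check_manifest(text: str) -> list[str]:
    """Return a list of problems found in a manifest JSON string."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["manifest must be a JSON object"]

    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}")
    return problems
```

An empty list means the manifest carries the basic fields; the authoritative schema is the avatar format document linked above.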
--- a/docs/zh/avatar_models/deployment/musetalk-local.md
+++ b/docs/zh/avatar_models/deployment/musetalk-local.md
@@ -0,0 +1,68 @@
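+# MuseTalk Local Deployment
+
+In Local mode, OpenTalking starts the MuseTalk adapter and runs the official preprocessing pipeline before a session is created. It suits teams that want MuseTalk quality but are not yet ready to deploy OmniRT separately.
+
+## When to Use
+
+- A single-machine CUDA environment where the Web/API and the MuseTalk runtime share one host.
+- You need OpenTalking to generate the avatar's `prepared/` artifacts automatically.
+- You can accept the extra time spent loading DWPose, face parsing, and the VAE before the first session.
+
+## Preparing Weights
+
+MuseTalk local needs the model weights, the official source code, and a preprocessing Python environment:
+
+```bash title="Terminal"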
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OPENTALKING_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OPENTALKING_MODEL_ROOT" "$DIGITAL_HUMAN_HOME/model-repos"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download TMElyralab/MuseTalk \
+  --local-dir "$OPENTALKING_MODEL_ROOT"
+
+git clone https://github.com/TMElyralab/MuseTalk.git \
+  "$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+```
+
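+In the end, directories such as `musetalk/`, `sd-vae-ft-mse/`, `whisper/`, `dwpose/`, and `face-parse-bisenet/` must be present under the model root. The repository script is the recommended way to check the preprocessing environment:
+
+```bash title="Terminal"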
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/quickstart/prepare_local_musetalk.sh
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --python 3.11
+
+export OPENTALKING_MUSETALK_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+export OPENTALKING_MUSETALK_REPO="$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+export OPENTALKING_MUSETALK_PREPROCESS_PYTHON="$DIGITAL_HUMAN_HOME/runtimes/musetalk-preprocess/venv/bin/python"
+
+bash scripts/start_unified.sh --backend local --model musetalk --api-port 18000 --web-port 18173
+```
+
+When a session is created and the avatar has no `prepared/prepared_info.json`, OpenTalking first runs the official MuseTalk preprocessing and then loads the session.
+
+## Verification Commands
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:18000/health
+curl -s http://127.0.0.1:18000/models | jq '.statuses[] | select(.id=="musetalk")'
+```
+
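+The expected result is `backend=local` and `connected=true`.
+
+## Common Errors
+
+| Symptom | Fix |
+|------|------|
+| `No module named 'mmcv._ext'` | The preprocessing Python needs the full `mmcv` package; `mmcv-lite` alone is not enough. |
+| Preprocessing fails | Check `OPENTALKING_MUSETALK_REPO`, `dwpose`, and `face-parse-bisenet`. |
+| First session is slow | The time spent on preprocessing and VAE loading is expected; you can generate `prepared/` ahead of time for frequently used avatars. |
+| Avatar resource unavailable | Check that the avatar has been uploaded and is readable, and confirm that the session configuration is complete. |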
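The directory check described under weight preparation can also be scripted. This is a small sketch under stated assumptions: the directory names are the ones listed in this guide (which says "such as", so the list may not be exhaustive), and `missing_model_dirs` is a hypothetical helper, not part of OpenTalking or `prepare_local_musetalk.sh`.

```python
from pathlib import Path

# Directory names as listed in this guide; the guide's list may be incomplete.
EXPECTED_DIRS = (
    "musetalk",
    "sd-vae-ft-mse",
    "whisper",
    "dwpose",
    "face-parse-bisenet",
)


def missing_model_dirs(model_root: str) -> list[str]:
    """Return the expected sub-directories that are absent under model_root."""
    root = Path(model_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_model_dirs(os.environ["OPENTALKING_MUSETALK_MODEL_ROOT"])` before starting the service gives a quick hint about which download step to repeat.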
--- a/docs/zh/avatar_models/deployment/musetalk-omnirt.md
+++ b/docs/zh/avatar_models/deployment/musetalk-omnirt.md
@@ -0,0 +1,70 @@
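+# MuseTalk OmniRT Deployment
+
+In OmniRT mode, an external MuseTalk service handles weight loading, the official runtime, and GPU scheduling; OpenTalking only connects to the unified `/v1/audio2video/musetalk` endpoint.
+
+## When to Use
+
+- MuseTalk has heavy dependencies and you want it isolated from the main OpenTalking process.
+- The Web/API and the inference GPU are deployed separately.
+- You need to share the OmniRT service entry point with models such as Wav2Lip and QuickTalk.
+
+## Preparing Weights
+
+```bash title="Terminal"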
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OMNIRT_MODEL_ROOT" "$DIGITAL_HUMAN_HOME/model-repos"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download TMElyralab/MuseTalk \
+  --local-dir "$OMNIRT_MODEL_ROOT"
+
+git clone https://github.com/TMElyralab/MuseTalk.git \
+  "$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+```
+
+Confirm that `musetalk/`, `sd-vae-ft-mse/`, `whisper/`, `dwpose/`, and `face-parse-bisenet/` are all under `$OMNIRT_MODEL_ROOT`.
+
+## Start Command
+
+Use the quickstart script to prepare and start the MuseTalk runtime:
+
+```bash title="Terminal"
+cd "$OMNIRT_HOME"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+export OMNIRT_MUSETALK_REPO="$DIGITAL_HUMAN_HOME/model-repos/MuseTalk"
+export OMNIRT_MUSETALK_DEVICE=cuda
+export OMNIRT_MUSETALK_PORT=8766
+
+bash scripts/quickstart/start_omnirt_musetalk.sh
+```
+
+Then start OpenTalking:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model musetalk \
+  --omnirt http://127.0.0.1:9000 \
+  --api-port 8310 \
+  --web-port 5380
+```
+
+## Verification Commands
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:9000/v1/audio2video/models | jq
+curl -s http://127.0.0.1:8310/models | jq '.statuses[] | select(.id=="musetalk")'
+```
+
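+## Common Errors
+
+| Symptom | Fix |
+|------|------|
+| OmniRT does not list `musetalk` | Check `OMNIRT_MUSETALK_REPO`, the model directory, and the startup script logs. |
+| `reason=omnirt_unavailable` | Check the OpenTalking `--omnirt` address and the OmniRT port. |
+| MuseTalk sub-service port conflict | Change `OMNIRT_MUSETALK_PORT`. |
+| First load is slow | MuseTalk preloading and avatar preprocessing take a long time; warming up is recommended in production. |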
--- a/docs/zh/avatar_models/deployment/quicktalk-apple-silicon.md
+++ b/docs/zh/avatar_models/deployment/quicktalk-apple-silicon.md
@@ -0,0 +1,66 @@
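+# QuickTalk Apple Silicon Deployment
+
+Apple Silicon is suitable for validating configuration, avatars, and the frontend pipeline. CUDA or OmniRT is still recommended for real-time production inference with QuickTalk; on a Mac, treat it primarily as a development mode.
+
+## When to Use
+
+- Preparing weights, checking manifests, and validating the WebUI flow on an M-series Mac.
+- CUDA is not convenient to use, but you need to reuse the QuickTalk directory layout.
+- You plan to sync the same set of assets to a Linux GPU machine or an OmniRT service.
+
+## Preparing Weights
+
+The directory layout matches the Linux local mode:
+
+```bash title="Terminal"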
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+mkdir -p models/quicktalk/checkpoints
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download datascale-ai/quicktalk \
+  quicktalk.pth \
+  repair.npy \
+  chinese-hubert-large/config.json \
+  chinese-hubert-large/preprocessor_config.json \
+  chinese-hubert-large/pytorch_model.bin \
+  --local-dir models/quicktalk/checkpoints
+```
+
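+If this machine is used only for documentation and asset checks, you can skip the CUDA-related dependencies and just confirm that the weights directory, a general-purpose avatar, and any optional template resources exist.
+
+## Start Command
+
+Validate the API/WebUI with `mock` first, then switch to the QuickTalk asset check:
+
+```bash title="Terminal"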
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --extra quicktalk-cpu --python 3.11
+
+export OPENTALKING_TORCH_DEVICE=mps
+export OPENTALKING_QUICKTALK_ASSET_ROOT="$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk"
+export OPENTALKING_QUICKTALK_WORKER_CACHE=0
+
+bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8210 --web-port 5280
+```
+
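+If a dependency or operator does not support MPS, switch to `--backend mock` to validate the product flow, or sync the same `models/quicktalk/` directory to a CUDA machine and run it there.
+
+## Verification Commands
+
+```bash title="Terminal"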
+curl -fsS http://127.0.0.1:8210/health
+curl -s http://127.0.0.1:8210/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
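+On Apple Silicon, `connected=false` does not necessarily mean the assets are wrong; focus on whether `reason` points to a missing dependency, missing weights, or an unsupported device.
+
+## Common Errors
+
+| Symptom | Fix |
+|------|------|
+| MPS operator not supported | Run real inference on a CUDA machine or an OmniRT service; keep the Mac for asset validation only. |
+| ONNX Runtime provider mismatch | Use the `quicktalk-cpu` dependencies or switch to Linux CUDA. |
+| Template video not found | If a fixed template video is configured, use an accessible absolute path or a repository-relative asset path. |
+| Slow downloads | Set `HF_ENDPOINT`, or download on a machine with network access first and then sync. |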
--- a/docs/zh/avatar_models/deployment/quicktalk-local.md
+++ b/docs/zh/avatar_models/deployment/quicktalk-local.md
@@ -0,0 +1,86 @@
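+# QuickTalk Local Deployment
+
+Local mode loads the QuickTalk adapter inside the OpenTalking process. It suits a single CUDA machine for validating real-time talking-head speech, debugging the avatar cache, and confirming the frontend-to-backend pipeline before introducing OmniRT.
+
+## When to Use
+
+- `mock` already works and you now need real talking-head output.
+- Single-machine deployment, with the GPU, WebUI, and API all on the same host.
+- You need `opentalking-prepare-cache` to pre-warm the QuickTalk cache for frequently used general-purpose avatars.
+
+## Preparing Weights
+
+Weights all live under `models/quicktalk/` in the repository root. Set `HF_ENDPOINT` when the network is slow.
+
+```bash title="Terminal"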
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+mkdir -p models/quicktalk/checkpoints
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download datascale-ai/quicktalk \
+  quicktalk.pth \
+  repair.npy \
+  chinese-hubert-large/config.json \
+  chinese-hubert-large/preprocessor_config.json \
+  chinese-hubert-large/pytorch_model.bin \
+  --local-dir models/quicktalk/checkpoints
+```
+
+InsightFace `buffalo_l` has to be prepared separately:
+
+```bash title="Terminal"
+mkdir -p /tmp/opentalking-insightface models/quicktalk/checkpoints/auxiliary/models
+curl -L \
+  -o /tmp/opentalking-insightface/buffalo_l.zip \
+  https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_l.zip
+unzip -q -o /tmp/opentalking-insightface/buffalo_l.zip \
+  -d /tmp/opentalking-insightface
+rsync -a /tmp/opentalking-insightface/buffalo_l/ \
+  models/quicktalk/checkpoints/auxiliary/models/buffalo_l/
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --extra quicktalk-cuda --python 3.11
+
+export OPENTALKING_TORCH_DEVICE=cuda:0
+export OPENTALKING_QUICKTALK_ASSET_ROOT="$DIGITAL_HUMAN_HOME/opentalking/models/quicktalk"
+export OPENTALKING_QUICKTALK_WORKER_CACHE=1
+
+bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8210 --web-port 5280
+```
+
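+Open `http://localhost:5280` and select a general-purpose avatar and the `quicktalk` model. If you need a fixed template video,
+confirm in the session or deployment configuration that the template resource is accessible.
+
+## Verification Commands
+
+```bash title="Terminal"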
+curl -fsS http://127.0.0.1:8210/health
+curl -s http://127.0.0.1:8210/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
+The expected result is `backend=local` and `connected=true`. To generate the cache ahead of time:
+
+```bash title="Terminal"
+opentalking-prepare-cache \
+  --model quicktalk \
+  --avatars-root examples/avatars \
+  --quicktalk-model-root models/quicktalk \
+  --device cuda:0 \
+  --model-backend pth \
+  --verify
+```
+
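+## Common Errors
+
+| Symptom | Fix |
+|------|------|
+| `connected=false` | Check `OPENTALKING_QUICKTALK_ASSET_ROOT`, the CUDA device, and `models/quicktalk/checkpoints`. |
+| Long wait on the first turn | Set `OPENTALKING_QUICKTALK_WORKER_CACHE=1` or run `opentalking-prepare-cache` ahead of time. |
+| Avatar fails to load | Check that the service can read the avatar; if a fixed template video is configured, confirm that its path is accessible. |
+| Hugging Face download fails | Configure `HF_ENDPOINT`, or download offline first and sync into the same directory. |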
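The `/models` check used in the verification step can be reproduced in Python when `jq` is not available. This is a hedged sketch: the top-level `statuses` array is inferred from the `jq` filter in this guide, the `backend` and `connected` keys come from the expected output, and `model_status` and `is_ready` are hypothetical helper names.

```python
import json


def model_status(models_json: str, model_id: str):
    """Return the status entry for model_id, or None if it is not listed.

    Mirrors `jq '.statuses[] | select(.id=="...")'`; the `statuses` key is
    inferred from that filter rather than from an API reference.
    """
    payload = json.loads(models_json)
    for status in payload.get("statuses", []):
        if status.get("id") == model_id:
            return status
    return None


def is_ready(status, backend: str) -> bool:
    """True when the entry reports the expected backend and is connected."""
    return (
        bool(status)
        and status.get("backend") == backend
        and status.get("connected") is True
    )
```

For the local deployment in this guide, the readiness condition would be `is_ready(model_status(body, "quicktalk"), "local")`, where `body` is the text returned by `GET /models`.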
--- a/docs/zh/avatar_models/deployment/quicktalk-omnirt.md
+++ b/docs/zh/avatar_models/deployment/quicktalk-omnirt.md
@@ -0,0 +1,82 @@
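+# QuickTalk OmniRT Deployment
+
+OmniRT mode moves QuickTalk inference out of the OpenTalking process. It suits sharing one service endpoint across several models, isolating GPU dependencies, or deploying the inference service on a separate machine.
+
+## When to Use
+
+- OpenTalking handles only sessions, TTS, and WebRTC, while QuickTalk runs in a separate service.
+- The same OmniRT endpoint needs to expose `quicktalk`, `wav2lip`, and other models at once.
+- You need a clearer separation between Web service resources and inference GPU resources.
+
+## Preparing Weights
+
+By default OmniRT reads `$OMNIRT_MODEL_ROOT/quicktalk`:
+
+```bash title="Terminal"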
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OMNIRT_MODEL_ROOT/quicktalk/checkpoints"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download datascale-ai/quicktalk \
+  quicktalk.pth \
+  repair.npy \
+  chinese-hubert-large/config.json \
+  chinese-hubert-large/preprocessor_config.json \
+  chinese-hubert-large/pytorch_model.bin \
+  --local-dir "$OMNIRT_MODEL_ROOT/quicktalk/checkpoints"
+```
+
+Confirm that `quicktalk.pth`, `repair.npy`, HuBERT, and InsightFace `buffalo_l` are all under the QuickTalk model directory; for how to prepare InsightFace, see [Local](quicktalk-local.md).
+
+## Start Command
+
+Start OmniRT first:
+
+```bash title="Terminal"
+cd "$OMNIRT_HOME"
+uv sync --extra server --extra quicktalk-cuda --python 3.11
+source .venv/bin/activate
+
+export OMNIRT_QUICKTALK_RUNTIME=1
+export OMNIRT_QUICKTALK_MODEL_ROOT="$OMNIRT_MODEL_ROOT/quicktalk"
+export OMNIRT_QUICKTALK_CHECKPOINT="$OMNIRT_MODEL_ROOT/quicktalk/checkpoints/quicktalk.pth"
+export OMNIRT_QUICKTALK_DEVICE=cuda:0
+export OMNIRT_QUICKTALK_HUBERT_DEVICE=cuda:0
+export OMNIRT_QUICKTALK_MAX_LONG_EDGE=900
+export OMNIRT_QUICKTALK_MAX_TEMPLATE_SECONDS=1
+
+omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
+```
+
+Then start OpenTalking:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model quicktalk \
+  --omnirt http://127.0.0.1:9000 \
+  --api-port 8310 \
+  --web-port 5380
+```
+
+## Verification Commands
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:9000/v1/audio2video/models | jq
+curl -s http://127.0.0.1:8310/models | jq '.statuses[] | select(.id=="quicktalk")'
+```
+
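+On the OpenTalking side, the expected result is `backend=omnirt` and `connected=true`.
+
+## Common Errors
+
+| Symptom | Fix |
+|------|------|
+| `reason=omnirt_unavailable` | Check the OmniRT port, `OMNIRT_ENDPOINT`, and `/v1/audio2video/models`. |
+| OmniRT does not list `quicktalk` | Check `OMNIRT_QUICKTALK_RUNTIME=1`, the checkpoint path, and the startup logs. |
+| Slow first frame or high GPU memory use | Adjust `OMNIRT_QUICKTALK_MAX_LONG_EDGE`, the HuBERT device, or the warm-up strategy. |
+| Avatar resource unavailable | Check that the selected avatar has been uploaded and is readable, and confirm that the session configuration is complete. |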
--- a/docs/zh/avatar_models/deployment/wav2lip-local.md
+++ b/docs/zh/avatar_models/deployment/wav2lip-local.md
@@ -0,0 +1,64 @@
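+# Wav2Lip Local Deployment
+
+Local mode uses the Wav2Lip adapter built into OpenTalking and is the lightest path for validating real lip sync. It suits a single GPU machine, asset validation, and low-cost demos.
+
+## When to Use
+
+- Switching from `mock` to a real talking-head model for the first time.
+- You want inference to run inside the OpenTalking process without deploying OmniRT as well.
+- You use built-in or custom general-purpose avatars and let the Wav2Lip pipeline read reference images or frame assets on demand.
+
+## Preparing Weights
+
+```bash title="Terminal"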
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+mkdir -p models/wav2lip
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download Pypa/wav2lip384 \
+  wav2lip384.pth \
+  --local-dir models/wav2lip
+hf download rippertnt/wav2lip \
+  s3fd.pth \
+  --local-dir models/wav2lip
+
+stat models/wav2lip/wav2lip384.pth
+stat models/wav2lip/s3fd.pth
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+uv sync --extra dev --extra models --python 3.11
+
+export OPENTALKING_WAV2LIP_MODEL_ROOT="$DIGITAL_HUMAN_HOME/opentalking/models/wav2lip"
+export OPENTALKING_WAV2LIP_DEVICE=cuda
+export OPENTALKING_WAV2LIP_BATCH_SIZE=16
+export OPENTALKING_WAV2LIP_MAX_LONG_EDGE=832
+export OPENTALKING_WAV2LIP_FACE_DET_DEVICE=cpu
+
+bash scripts/start_unified.sh --backend local --model wav2lip --api-port 8210 --web-port 5280
+```
+
+Open `http://localhost:5280` and select an available avatar and the `wav2lip` model.
+
+## Verification Commands
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:8210/health
+curl -s http://127.0.0.1:8210/models | jq '.statuses[] | select(.id=="wav2lip")'
+```
+
+The expected result is `backend=local` and `connected=true`. The first load initializes the checkpoint, S3FD, and the avatar cache, which can take tens of seconds.
+
+## Common Errors
+
+| Symptom | Fix |
+|------|------|
+| Checkpoint not found | Check `OPENTALKING_WAV2LIP_MODEL_ROOT` and the two `.pth` files. |
+| Out of GPU memory | Lower `OPENTALKING_WAV2LIP_BATCH_SIZE` or `OPENTALKING_WAV2LIP_MAX_LONG_EDGE`. |
+| Slow first frame | Set `OPENTALKING_PREWARM_AVATARS=singer` to pre-warm frequently used avatars. |
+| Quality enhancement raises an error | `easy_enhanced` requires GFPGAN and a configured `OPENTALKING_WAV2LIP_GFPGAN_CHECKPOINT`. |
--- a/docs/zh/avatar_models/deployment/wav2lip-omnirt.md
+++ b/docs/zh/avatar_models/deployment/wav2lip-omnirt.md
@@ -0,0 +1,74 @@
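+# Wav2Lip OmniRT Deployment
+
+OmniRT mode turns Wav2Lip inference into a service. It suits decoupling OpenTalking from model dependencies, or enabling several talking-head models on the same OmniRT endpoint.
+
+## When to Use
+
+- The Web/API and the inference GPU are deployed separately.
+- You need to manage models uniformly through `/v1/audio2video/{model}`.
+- You want to reuse OmniRT's preloading, batching, and device configuration.
+
+## Preparing Weights
+
+```bash title="Terminal"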
+export DIGITAL_HUMAN_HOME="$HOME/digital-human"
+export OMNIRT_MODEL_ROOT="$DIGITAL_HUMAN_HOME/models"
+mkdir -p "$OMNIRT_MODEL_ROOT/wav2lip"
+
+uv pip install -U "huggingface_hub[cli]"
+export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"
+
+hf download Pypa/wav2lip384 \
+  wav2lip384.pth \
+  --local-dir "$OMNIRT_MODEL_ROOT/wav2lip"
+hf download rippertnt/wav2lip \
+  s3fd.pth \
+  --local-dir "$OMNIRT_MODEL_ROOT/wav2lip"
+```
+
+## Start Command
+
+```bash title="Terminal"
+cd "$OMNIRT_HOME"
+uv sync --extra server --extra wav2lip-cuda --python 3.11
+source .venv/bin/activate
+
+export OMNIRT_WAV2LIP_RUNTIME=1
+export OMNIRT_WAV2LIP_MODELS_DIR="$OMNIRT_MODEL_ROOT"
+export OMNIRT_WAV2LIP_CHECKPOINT="$OMNIRT_MODEL_ROOT/wav2lip/wav2lip384.pth"
+export OMNIRT_WAV2LIP_DEVICE=cuda
+export OMNIRT_WAV2LIP_FACE_DET_DEVICE=cpu
+export OMNIRT_WAV2LIP_BATCH_SIZE=16
+export OMNIRT_WAV2LIP_MAX_LONG_EDGE=832
+export OMNIRT_WAV2LIP_PRELOAD=1
+
+omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
+```
+
+Start OpenTalking in another terminal:
+
+```bash title="Terminal"
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+bash scripts/start_unified.sh \
+  --backend omnirt \
+  --model wav2lip \
+  --omnirt http://127.0.0.1:9000 \
+  --api-port 8310 \
+  --web-port 5380
+```
+
+## Verification Commands
+
+```bash title="Terminal"
+curl -fsS http://127.0.0.1:9000/v1/audio2video/models | jq
+curl -s http://127.0.0.1:8310/models | jq '.statuses[] | select(.id=="wav2lip")'
+```
+
+## Common Errors
+
+| Symptom | Fix |
+|------|------|
+| OmniRT does not load Wav2Lip | Check `OMNIRT_WAV2LIP_RUNTIME=1` and `OMNIRT_WAV2LIP_CHECKPOINT`. |
+| `reason=omnirt_unavailable` | Check the OpenTalking `--omnirt` address and OmniRT health. |
+| High end-to-end latency | Lower the batch size, cap `MAX_LONG_EDGE`, and enable `OMNIRT_WAV2LIP_PRELOAD=1`. |
+| Avatar resource unavailable | Confirm that the avatar resources are readable and check that the session configuration is complete. |
--- a/docs/zh/model-deployment/fasterliveportrait.md
+++ b/docs/zh/model-deployment/fasterliveportrait.md
@@ -6,7 +6,7 @@
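 | Model ID | `fasterliveportrait` |
 | Backend | `omnirt` |
 | Evidence level | Documented; the real-time pipeline is exposed through the OmniRT runtime |
-| Recommended use | Single-GPU real-time audio-driven avatars, pasteback onto the original asset image, video clone, hot-updating amplitudes from the frontend |
+| Recommended use | Single-GPU real-time audio-driven avatars, pasteback onto the original asset image, hot-updating amplitudes from the frontend |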

 ## Common Issues

@@ -18,35 +18,19 @@
 | The browser shows the model but session creation fails | Select an avatar whose `model_type` matches `fasterliveportrait`, or prepare a matching avatar bundle. |


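-FasterLivePortrait currently also goes through the OmniRT `audio2video` compatibility path. OpenTalking handles sessions, TTS/audio streaming, WebRTC playback, and pushing frontend parameters; OmniRT keeps FasterLivePortrait and JoyVASA loaded and exposes them uniformly at `/v1/audio2video/fasterliveportrait`. The OpenTalking repository has no in-process `local` backend; even for a single-machine deployment you need to start OmniRT on the same machine and point OpenTalking at the local `http://127.0.0.1:9000`.
+FasterLivePortrait currently also goes through the OmniRT `audio2video` compatibility path. OpenTalking handles sessions, TTS/audio streaming, WebRTC playback, and pushing frontend parameters; OmniRT keeps FasterLivePortrait and JoyVASA loaded and exposes them uniformly at `/v1/audio2video/fasterliveportrait`.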

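 This path suits a single-GPU real-time digital human: by default it uses 25fps, 1-second audio chunks, and a 448-wide real-time profile, and pastes the animated head back onto the original asset image. When a full-body image is uploaded, driving still uses the face region detected by FasterLivePortrait; the body itself does not get new motion.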

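-The same runtime can also serve the "video clone" workflow: OpenTalking fixes a digital-human image from the avatar library as the source, takes browser camera frames or an uploaded video frame by frame as the driving input, and forwards them to OmniRT `/v1/avatar/video-clone/fasterliveportrait`. This path does not go through the LLM, STT, or TTS, and does not reuse the `speak` queue of real-time conversation.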
-
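 ## 1. Prepare Code and Weights

-A few directory variables need to be set first. `FASTERLIVEPORTRAIT_HOME` is the FasterLivePortrait source checkout; `OMNIRT_MODEL_ROOT` is the root directory for model weights. Do not put weights inside the OpenTalking or OmniRT repositories.
-
-```bash title="Terminal"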
-export DIGITAL_HUMAN_HOME="${DIGITAL_HUMAN_HOME:-/path/to/digital_human}"
-export OPENTALKING_HOME="${OPENTALKING_HOME:-$DIGITAL_HUMAN_HOME/opentalking}"
-export OMNIRT_REPO="${OMNIRT_REPO:-$DIGITAL_HUMAN_HOME/omnirt}"
-export FASTERLIVEPORTRAIT_HOME="${FASTERLIVEPORTRAIT_HOME:-$DIGITAL_HUMAN_HOME/FasterLivePortrait}"
-export OMNIRT_MODEL_ROOT="${OMNIRT_MODEL_ROOT:-/path/to/model}"
-export FASTERLIVEPORTRAIT_REF="${FASTERLIVEPORTRAIT_REF:-5dcf03aa2e6b2eb2a55b971efdc28fc0afdb1494}"
-```
-
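-OpenTalking video clone and the OmniRT runtime currently depend on FasterLivePortrait's fine-grained motion control, the TRT output-order fix, and the checkpoint-loading fix for newer PyTorch versions. For deployment, pin to the `zyairehhh/FasterLivePortrait` fork for now; switch to the upstream package once these patches land in an official stable release.
+Two directories are needed: the FasterLivePortrait source checkout and the actual checkpoint directory. If you are not using symlinks, simply copy or download into the model root.

 ```bash title="Terminal"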
 if [ ! -d "$FASTERLIVEPORTRAIT_HOME/.git" ]; then
-  git clone https://github.com/zyairehhh/FasterLivePortrait.git "$FASTERLIVEPORTRAIT_HOME"
+  git clone https://github.com/KlingAIResearch/LivePortrait.git "$FASTERLIVEPORTRAIT_HOME"
 fi

-git -C "$FASTERLIVEPORTRAIT_HOME" fetch origin master
-git -C "$FASTERLIVEPORTRAIT_HOME" checkout "$FASTERLIVEPORTRAIT_REF"
-
 mkdir -p "$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints"
 ```

@@ -81,12 +65,9 @@ test -f "$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints/chinese-hubert-base/p

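 ## 2. Prepare the OmniRT Environment

-On a server, it is recommended to put the `uv` cache on the data disk and speed up dependency installation through a PyPI mirror. `PIP_INDEX_URL` is a fallback for the few build steps that still read pip configuration.
-
 ```bash title="Terminal"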
-cd "$OMNIRT_REPO"
+cd "$OMNIRT_HOME"
 export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export PIP_INDEX_URL="${PIP_INDEX_URL:-$UV_DEFAULT_INDEX}"
 export UV_CACHE_DIR="${UV_CACHE_DIR:-$DIGITAL_HUMAN_HOME/.uv-cache}"
 uv sync --extra server --extra fasterliveportrait --python 3.11
 ```
@@ -95,36 +76,24 @@ FasterLivePortrait 实时路径默认使用 TensorRT。`fasterliveportrait` extr

 Before deploying, you can confirm that `uv run python -c "import tensorrt as trt; print(trt.__version__)"` prints a version number.

-The TensorRT wheel places `libnvinfer.so.10` under `site-packages/tensorrt_libs` in the OmniRT `.venv`. Before starting the TRT runtime, add this directory to the dynamic library search path; otherwise `libgrid_sample_3d_plugin.so` fails with `libnvinfer.so.10: cannot open shared object file`:
-
-```bash title="Terminal"
-export TRT_LIB_DIR="$OMNIRT_REPO/.venv/lib/python3.11/site-packages/tensorrt_libs"
-export LD_LIBRARY_PATH="$TRT_LIB_DIR:${LD_LIBRARY_PATH:-}"
-```
-
-
 ## 3. Start the OmniRT FasterLivePortrait runtime

 ```bash title="Terminal"
-cd "$OMNIRT_REPO"
-mkdir -p "$DIGITAL_HUMAN_HOME/logs"
-nohup env \
-  OMNIRT_FASTLIVEPORTRAIT_RUNTIME=1 \
-  OMNIRT_FASTLIVEPORTRAIT_LOAD_MODELS=1 \
-  OMNIRT_FASTLIVEPORTRAIT_ROOT="$FASTERLIVEPORTRAIT_HOME" \
-  OMNIRT_FASTLIVEPORTRAIT_CHECKPOINTS_DIR="$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints" \
-  OMNIRT_FASTLIVEPORTRAIT_CFG=configs/trt_infer.yaml \
-  OMNIRT_FASTLIVEPORTRAIT_DEVICE=cuda:0 \
-  OMNIRT_FASTLIVEPORTRAIT_JPEG_QUALITY=85 \
-  uv run omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda \
-  > "$DIGITAL_HUMAN_HOME/logs/omnirt-fasterliveportrait-9000.log" 2>&1 &
-echo $! > "$DIGITAL_HUMAN_HOME/logs/omnirt-fasterliveportrait-9000.pid"
+cd "$OMNIRT_HOME"
+OMNIRT_FASTLIVEPORTRAIT_RUNTIME=1 \
+OMNIRT_FASTLIVEPORTRAIT_LOAD_MODELS=1 \
+OMNIRT_FASTLIVEPORTRAIT_ROOT="$FASTERLIVEPORTRAIT_HOME" \
+OMNIRT_FASTLIVEPORTRAIT_CHECKPOINTS_DIR="$OMNIRT_MODEL_ROOT/FasterLivePortrait/checkpoints" \
+OMNIRT_FASTLIVEPORTRAIT_CFG=configs/trt_infer.yaml \
+OMNIRT_FASTLIVEPORTRAIT_DEVICE=cuda:0 \
+OMNIRT_FASTLIVEPORTRAIT_JPEG_QUALITY=85 \
+uv run omnirt serve-avatar-ws --host 0.0.0.0 --port 9000 --backend cuda
 ```

 After the service starts, verify that OmniRT reports the model:

 ```bash title="Terminal"
-curl -s http://127.0.0.1:9000/v1/audio2video/models | python3 -m json.tool
+curl -s http://127.0.0.1:9000/v1/audio2video/models | jq '.statuses[] | select(.id=="fasterliveportrait")'
 ```

 The expected status looks like:
@@ -135,16 +104,6 @@ curl -s http://127.0.0.1:9000/v1/audio2video/models | python3 -m json.tool

 ## 4. Configure and Start OpenTalking

-Sync the OpenTalking environment first. This continues to use the same uv mirror and cache directory as OmniRT.
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
-export PIP_INDEX_URL="${PIP_INDEX_URL:-$UV_DEFAULT_INDEX}"
-export UV_CACHE_DIR="${UV_CACHE_DIR:-$DIGITAL_HUMAN_HOME/.uv-cache}"
-uv sync --extra dev --python 3.11
-```
-
 By default OpenTalking configures `fasterliveportrait` with `backend: omnirt`. The real-time profile parameters live in `configs/synthesis/fasterliveportrait.yaml`; common defaults:

 ```yaml title="configs/synthesis/fasterliveportrait.yaml"
@@ -169,30 +128,27 @@ flag_stitching: true
 head_only_pasteback: false
 ```

-Start OpenTalking and point it at OmniRT. `scripts/start_unified.sh` sets `OPENTALKING_FASTLIVEPORTRAIT_BACKEND=omnirt`, `OPENTALKING_DEFAULT_MODEL=fasterliveportrait`, and `OMNIRT_ENDPOINT`, and brings up the WebUI after the API has started:
+Start OpenTalking and point it at OmniRT:

 ```bash title="Terminal"
 cd "$OPENTALKING_HOME"
-bash scripts/start_unified.sh \
-  --backend omnirt \
-  --model fasterliveportrait \
-  --omnirt http://127.0.0.1:9000 \
-  --api-port 8000 \
-  --web-port 5173 \
-  --host 0.0.0.0
+OMNIRT_ENDPOINT=http://127.0.0.1:9000 \
+OPENTALKING_OMNIRT_ENDPOINT=http://127.0.0.1:9000 \
+uv run opentalking-unified --host 0.0.0.0 --port 8000
 ```

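-The previous step already starts the WebUI. If you only need to restart the frontend, or the backend is already running on port `8000`, run this in another terminal:
+Frontend:

 ```bash title="Terminal"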
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
+cd "$OPENTALKING_HOME/apps/web"
+npm ci
+VITE_BACKEND_PORT=8000 npm run dev -- --host 0.0.0.0 --port 5173
 ```

 Verify that OpenTalking can see the model:

 ```bash title="Terminal"
-curl -s http://127.0.0.1:8000/models | python3 -m json.tool
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="fasterliveportrait")'
 ```

 Expected:
@@ -201,18 +157,6 @@ curl -s http://127.0.0.1:8000/models | python3 -m json.tool
 {"id":"fasterliveportrait","backend":"omnirt","connected":true,"reason":"omnirt"}
 ```

-Also verify the video clone entry point:
-
-```bash title="Terminal"
-curl -s http://127.0.0.1:8000/video-clone/status | python3 -m json.tool
-```
-
-Expected:
-
-```json
-{"model":"fasterliveportrait","connected":true,"reason":"omnirt"}
-```
-
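 ## 5. Frontend Parameters and Hot Updates

 After you select `FasterLivePortrait` in the frontend, a "FasterLivePortrait amplitude" (“FasterLivePortrait 幅度”) configuration area appears. When no session is running, clicking "Apply configuration" (“应用配置”) saves the settings for the next session; while a session is running, clicking "Apply live" (“实时应用”) takes effect from the next audio chunk, with no session restart needed.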
@@ -234,39 +178,10 @@ curl -s http://127.0.0.1:8000/video-clone/status | python3 -m json.tool

 Start with `head_motion_multiplier=0.3`, `pose_motion_multiplier=0.35`, `yaw_multiplier=0.85`, `roll_multiplier=0.85`, `animation_region=lip`, `expression_multiplier=1.0`, `mouth_open_multiplier=1.25`, `mouth_corner_multiplier=0.85`, `cheek_jaw_multiplier=0.9`, and `cfg_scale=4.0`, and keep `flag_relative_motion=true`. If the head sways noticeably from side to side, first lower `yaw_multiplier` to `0.7`; if the mouth looks pouty or the smile is too wide, first lower `mouth_corner_multiplier` to `0.75`; if you need richer expressions, then switch the driving region from `lip` to `all`. Do not drop frames to gain speed.

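-## 6. Video Clone Mode
-
-Video clone sits alongside "Real-time conversation" at the top of the WebUI. After entering "Video clone":
-
-- Source: on the left, select an existing digital human or upload a new source image. The source is the digital-human asset being driven.
-- Driving: on the right, select the camera or upload a driving video. The driving input only provides expression, head motion, and mouth motion.
-- Output: the middle shows the live output, and the status bar shows frames sent, frames received, dropped frames, and latency.
-
-The frontend connects to OpenTalking: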
-
-```text
-ws://<opentalking-host>/video-clone/fasterliveportrait/ws
-```
-
-OpenTalking then forwards the source image and the driving frame stream to OmniRT:
-
-```text
-ws://<omnirt-host>/v1/avatar/video-clone/fasterliveportrait
-```
-
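-Common debugging tips:
-
-- To keep the original image composition, turn on "Pasteback".
-- If the mouth will not open with an uploaded driving video, first raise "Mouth opening"; if the mouth turns into purely vertical opening, lower "Lip retargeting".
-- If the mouth looks puffy or misplaced, first turn off "Crop driving face" and confirm that the driving input is not cropped too tightly.
-- If camera permission fails, confirm that the page is opened through `localhost` / `127.0.0.1` or HTTPS; you can also upload a driving video first to validate the backend service.
-
-After stopping or switching pages, the frontend releases the camera track, the WebSocket, and the current video-clone session.
-
-## 7. Performance Acceptance
+## 6. Performance Acceptance

 ```bash title="Terminal"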
-cd "$OMNIRT_REPO"
+cd "$OMNIRT_HOME"
 uv run python scripts/bench_fasterliveportrait_ws.py \
  --url ws://127.0.0.1:9000/v1/audio2video/fasterliveportrait \
  --duration 30 \
--- a/docs/zh/model-deployment/flashhead.md
+++ b/docs/zh/model-deployment/flashhead.md
@@ -50,7 +50,7 @@ bash scripts/quickstart/start_all.sh
 ## `/models` Verification

 ```bash title="Terminal"
-curl -s http://127.0.0.1:8000/models | python3 -m json.tool
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="flashhead")'
 ```

 After the WebSocket URL is configured, expect:
@@ -65,15 +65,4 @@ curl -s http://127.0.0.1:8000/models | python3 -m json.tool
 |------|------|
 | `reason=not_configured` | Set `OPENTALKING_FLASHHEAD_WS_URL`. |
 | WebSocket handshake fails | Check the FlashHead service path, port, and cross-machine networking. |
-| Avatar mismatch | Use an avatar with `model_type: flashhead`. |
-
-## Frontend Entry Point
-
-Once the model or backend service is running, access it through the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
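-For a remote server deployment, forward a local browser port to port `5173` on the server, then open `http://127.0.0.1:5173`.
+| Avatar mismatch | Confirm that the service can read the avatar and that the FlashHead server can access the required reference image. |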
--- a/docs/zh/model-deployment/flashtalk.md
+++ b/docs/zh/model-deployment/flashtalk.md
@@ -69,7 +69,7 @@ bash scripts/quickstart/start_omnirt_flashtalk.sh --device npu --nproc 8
 ## `/models` Verification

 ```bash title="Terminal"
-curl -s http://127.0.0.1:8000/models | python3 -m json.tool
+curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="flashtalk")'
 ```

 Expected:
@@ -86,14 +86,3 @@ curl -s http://127.0.0.1:8000/models | python3 -m json.tool
 | CUDA OOM | Lower `OPENTALKING_FLASHTALK_FRAME_NUM`, `OPENTALKING_FLASHTALK_SAMPLE_STEPS`, or the resolution. |
 | NPU import fails | Confirm that CANN has been sourced and that the `torch_npu`, driver, and CANN versions match. |
 | `reason=not_configured` | Configure `OMNIRT_ENDPOINT` or use `start_all.sh --omnirt ...`. |
-
-## Frontend Entry Point
-
-Once the model or backend service is running, access it through the OpenTalking WebUI:
-
-```bash title="Terminal"
-cd "$OPENTALKING_HOME"
-bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0
-```
-
-For a remote server deployment, forward a local browser port to port `5173` on the server, then open `http://127.0.0.1:5173`.