feat: add local CosyVoice TRT sidecar deployment (#119)
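
Reviewer note: the preview cap added in `apps/api/routes/tts_preview.py` can be read in isolation. The following is a minimal sketch, not the route itself: it assumes a synchronous iterator of plain integer sample lists instead of the real async `synthesize_stream` and NumPy arrays, and `collect_preview` and `fake_stream` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, List, Optional

LOCAL_COSYVOICE_PREVIEW_SECONDS = 3.0  # same constant as in the diff


def preview_sample_limit(provider: Optional[str], sample_rate: int) -> Optional[int]:
    """Mirror of _preview_sample_limit: only local_cosyvoice is capped."""
    if provider == "local_cosyvoice":
        return max(1, int(sample_rate * LOCAL_COSYVOICE_PREVIEW_SECONDS))
    return None


def collect_preview(chunks: Iterable[List[int]], limit: Optional[int]) -> List[List[int]]:
    """Hypothetical helper: gather chunks until the sample limit is reached."""
    collected: List[List[int]] = []
    total = 0
    for chunk in chunks:
        if chunk:
            collected.append(list(chunk))
            total += len(chunk)
        if limit is not None and total >= limit:
            break  # stop pulling from the stream once enough audio is buffered
    return collected


def fake_stream(n: int, sample_rate: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for a TTS stream: n one-second chunks."""
    for _ in range(n):
        yield [1] * sample_rate


limit = preview_sample_limit("local_cosyvoice", 16000)
capped = collect_preview(fake_stream(20, 16000), limit)
uncapped = collect_preview(fake_stream(20, 16000), preview_sample_limit("other", 16000))
```

Because the check runs after a whole chunk is appended, the preview can exceed the limit by up to one chunk; the cap bounds how much is pulled from the stream rather than trimming the audio to exactly three seconds.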

2026-07-03 15:22:34 +08:00 · 2026-06-23 23:06:58 +08:00
parent 61f4007965
commit 7f37c3b49d
14 changed files with 691 additions and 21 deletions
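
Reviewer note: the benchmark table added to the TTS deployment docs reports RTF without stating the formula. Assuming RTF here means wall time divided by audio duration (an assumption, though it reproduces the table's figures), the rows can be checked with a short sketch; `rtf` is a hypothetical helper name.

```python
# (wall_time_s, audio_duration_s) pairs copied from the benchmark table in this commit.
rows = [(6.215, 7.200), (5.858, 6.960), (5.771, 6.520)]


def rtf(wall_s: float, audio_s: float) -> float:
    """Real-time factor under the assumed definition: synthesis time / audio length."""
    return wall_s / audio_s


per_row = [round(rtf(w, a), 3) for w, a in rows]
mean_rtf = round(sum(rtf(w, a) for w, a in rows) / len(rows), 3)
```

Under this definition a value below 1 means the audio is synthesized faster than it plays back.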
--- a/README.en.md
+++ b/README.en.md
@@ -126,7 +126,7 @@ OpenTalking's **orchestration layer** (API / Worker / frontend) and **digital-hu
 | Fast trial | `mock` | CPU / no GPU | Validate API, LLM, TTS, WebRTC, and browser playback without downloading model weights | [Quickstart](docs/en/user-guide/quickstart.md) |
 | Entry validation | `quicktalk` / `wav2lip` | RTX 3050 Laptop, RTX 3060, RTX 4060 | Run real video rendering for demos and deployment validation; lower the resolution on low-memory devices | [QuickTalk](docs/en/model-deployment/quicktalk.md) / [Wav2Lip](docs/en/model-deployment/wav2lip-local.md) |
 | Consumer-GPU single machine | `quicktalk` / `wav2lip` / `musetalk` | RTX 3090, RTX 4090 | Closer to real-time local demos, private validation, and lightweight pre-production evaluation | [Model deployment](docs/en/model-deployment/index.md) |
-| Fully local private path | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or similar GPU | Run STT, TTS, and video driving with local models to reduce external dependencies | [Local STT/TTS + QuickTalk](docs/en/model-deployment/local-quicktalk-audio.md) |
+| Fully local private path | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or similar GPU | Run STT, TTS, and video driving locally; OpenTalking uses the main `.venv`, while CosyVoice runs in a dedicated sidecar venv | [Local STT/TTS + QuickTalk](docs/en/model-deployment/local-quicktalk-audio.md) |
 | High-quality remote inference | `flashtalk` / `flashhead` / `fasterliveportrait` + OmniRT | Multi-GPU, Ascend 910B2, remote GPU service | Multi-card, GPU/NPU, production isolation, higher visual quality, or video clone workflows | [FlashTalk](docs/en/model-deployment/flashtalk.md) / [FasterLivePortrait](docs/en/model-deployment/fasterliveportrait.md) |
 | Docker / production deployment | API, Web, Worker, external model services | Single GPU, remote GPU, distributed cluster | Service deployment, remote GPU, distributed runtime, and production validation | [Deployment](docs/en/user-guide/deployment.md) |

--- a/README.md
+++ b/README.md
@@ -126,7 +126,7 @@ OpenTalking's **orchestration layer** (API / Worker / frontend) and **digital-hu
 | Fast trial | `mock` | CPU / no GPU | Validate API, LLM, TTS, WebRTC, and browser playback without downloading model weights | [Quickstart](docs/en/user-guide/quickstart.md) |
 | Entry validation | `quicktalk` / `wav2lip` | RTX 3050 Laptop, RTX 3060, RTX 4060 | Run real video rendering for demos and deployment validation; lower the resolution on low-memory devices | [QuickTalk](docs/en/model-deployment/quicktalk.md) / [Wav2Lip](docs/en/model-deployment/wav2lip-local.md) |
 | Consumer-GPU single machine | `quicktalk` / `wav2lip` / `musetalk` | RTX 3090, RTX 4090 | Closer to real-time local demos, private validation, and lightweight pre-production evaluation | [Model deployment](docs/en/model-deployment/index.md) |
-| Fully local private path | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or similar GPU | Run STT, TTS, and video driving with local models to reduce external dependencies | [Local STT/TTS + QuickTalk](docs/en/model-deployment/local-quicktalk-audio.md) |
+| Fully local private path | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or similar GPU | Run STT, TTS, and video driving locally; OpenTalking uses the main `.venv`, while CosyVoice runs in a dedicated sidecar venv | [Local STT/TTS + QuickTalk](docs/en/model-deployment/local-quicktalk-audio.md) |
 | High-quality remote inference | `flashtalk` / `flashhead` / `fasterliveportrait` + OmniRT | Multi-GPU, Ascend 910B2, remote GPU service | Multi-card, GPU/NPU, production isolation, higher visual quality, or video clone workflows | [FlashTalk](docs/en/model-deployment/flashtalk.md) / [FasterLivePortrait](docs/en/model-deployment/fasterliveportrait.md) |
 | Docker / production deployment | API, Web, Worker, external model services | Single GPU, remote GPU, distributed cluster | Service deployment, remote GPU, distributed runtime, and production validation | [Deployment](docs/en/user-guide/deployment.md) |

--- a/README.zh.md
+++ b/README.zh.md
@@ -126,7 +126,7 @@ OpenTalking 的 **编排层**（API / Worker / 前端）和 **数字人合成后
 | 快速体验 | `mock` | CPU / 无 GPU | 不下载模型权重，先验证 API、LLM、TTS、WebRTC 与浏览器播放链路 | [快速开始](docs/zh/user-guide/quickstart.md) |
 | 入门验证 | `quicktalk` / `wav2lip` | RTX 3050 Laptop、RTX 3060、RTX 4060 | 能跑通真实视频渲染，适合功能演示和部署验证；低显存设备建议降低分辨率 | [QuickTalk](docs/zh/model-deployment/quicktalk.md) / [Wav2Lip](docs/zh/model-deployment/wav2lip-local.md) |
 | 消费级显卡单机 | `quicktalk` / `wav2lip` / `musetalk` | RTX 3090、RTX 4090 | 更接近实时体验，适合本地 demo、私有化验证和轻量生产前评估 | [模型部署](docs/zh/model-deployment/index.md) |
-| 全本地私有化 | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 或同级 GPU | STT、TTS、视频驱动都走本地模型，减少外部依赖 | [本地 STT/TTS + QuickTalk](docs/zh/model-deployment/local-quicktalk-audio.md) |
+| 全本地私有化 | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 或同级 GPU | STT、TTS、视频驱动都走本地；OpenTalking 使用主 `.venv`，CosyVoice 使用独立 sidecar venv | [本地 STT/TTS + QuickTalk](docs/zh/model-deployment/local-quicktalk-audio.md) |
 | 高质量远端推理 | `flashtalk` / `flashhead` / `fasterliveportrait` + OmniRT | 多卡 GPU、Ascend 910B2、远端 GPU 服务 | 多卡、GPU/NPU、生产隔离、更高画质或视频克隆 | [FlashTalk](docs/zh/model-deployment/flashtalk.md) / [FasterLivePortrait](docs/zh/model-deployment/fasterliveportrait.md) |
 | Docker / 生产部署 | API、Web、Worker、外部模型服务分离 | 单机 GPU、远端 GPU、分布式集群 | 服务化部署、远端 GPU、分布式和生产验证 | [部署文档](docs/zh/user-guide/deployment.md) |

--- a/apps/api/routes/tts_preview.py
+++ b/apps/api/routes/tts_preview.py
@@ -24,6 +24,7 @@ router = APIRouter(prefix="/tts", tags=["tts"])
 logger = logging.getLogger(__name__)

 MAX_PREVIEW_TEXT_CHARS = 1000
+LOCAL_COSYVOICE_PREVIEW_SECONDS = 3.0
 _INDEXTTS_PROVIDERS = {"indextts", "local_indextts", "omnirt_indextts"}
 PreviewUploadFile = UploadFile | StarletteUploadFile

@@ -36,6 +37,12 @@ class TTSPreviewRequest(BaseModel):
    indextts_config: dict[str, Any] | None = None


+def _preview_sample_limit(provider: str | None, sample_rate: int) -> int | None:
+    if provider == "local_cosyvoice":
+        return max(1, int(sample_rate * LOCAL_COSYVOICE_PREVIEW_SECONDS))
+    return None
+
+
 def _wav_bytes(chunks: list[np.ndarray], sample_rate: int) -> bytes:
    pcm = np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.int16)
    pcm = np.asarray(pcm, dtype="<i2").reshape(-1)
@@ -215,12 +222,17 @@ async def preview_tts(request: Request) -> Response:
    )
    chunks: list[np.ndarray] = []
    effective_sample_rate = sample_rate
+    sample_limit = _preview_sample_limit(provider, sample_rate)
+    total_samples = 0
    try:
        async for chunk in tts.synthesize_stream(text, voice=voice):
            arr = np.asarray(chunk.data, dtype=np.int16).reshape(-1)
            if arr.size:
                chunks.append(arr.copy())
+                total_samples += int(arr.size)
            effective_sample_rate = int(chunk.sample_rate or effective_sample_rate)
+            if sample_limit is not None and total_samples >= sample_limit:
+                break
    except Exception as exc:
        raise HTTPException(status_code=502, detail=f"TTS preview failed: {exc}") from exc
    finally:
--- a/apps/api/tests/test_tts_preview.py
+++ b/apps/api/tests/test_tts_preview.py
@@ -246,6 +246,45 @@ def test_tts_preview_form_passes_indextts_emotion_audio_file(monkeypatch):
    assert calls[0]["emotion_audio_bytes"] == b"RIFFemotion"


+
+def test_tts_preview_local_cosyvoice_returns_after_enough_preview_audio(monkeypatch):
+    from apps.api.routes import tts_preview
+
+    yielded: list[int] = []
+
+    class FakeTTS:
+        async def synthesize_stream(self, text: str, voice: str | None = None):
+            for i in range(20):
+                yielded.append(i)
+                yield AudioChunk(
+                    data=np.ones(16000, dtype=np.int16),
+                    sample_rate=16000,
+                    duration_ms=1000.0,
+                )
+
+    def fake_build_tts_adapter(**kwargs):
+        return FakeTTS()
+
+    monkeypatch.setattr(tts_preview, 'build_tts_adapter', fake_build_tts_adapter)
+
+    app = FastAPI()
+    app.include_router(tts_preview.router)
+    client = TestClient(app)
+
+    response = client.post(
+        '/tts/preview',
+        json={
+            'text': '你好，我正在测试音色。',
+            'voice': 'local-office-serena',
+            'tts_provider': 'local_cosyvoice',
+            'tts_model': 'FunAudioLLM/Fun-CosyVoice3-0.5B-2512',
+        },
+    )
+
+    assert response.status_code == 200
+    assert response.content.startswith(b'RIFF')
+    assert 1 <= len(yielded) < 20
+
 def test_tts_preview_rejects_empty_text():
    from apps.api.routes import tts_preview

--- a/docs/en/model-deployment/recipes/local-quicktalk-audio.md
+++ b/docs/en/model-deployment/recipes/local-quicktalk-audio.md
@@ -60,13 +60,18 @@ OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-tts-key>
 ## Install and Models

 ```bash title="terminal"
-uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --extra quicktalk-cuda --python 3.11
+uv sync --extra dev --extra models --extra local-audio --extra quicktalk-cuda --python 3.11
 python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model sensevoice-small \
  --model fun-cosyvoice3-0.5b-2512
 ```

+Use the main `.venv` for OpenTalking, SenseVoice, and QuickTalk. Create a
+separate CosyVoice sidecar venv after the runtime checkout.
+
+For CosyVoice3 model sources and the optional fp16 TensorRT ONNX files, see [TTS deployment](../tts.md#local-cosyvoice3-05b).
+
 Prepare QuickTalk weights as described in [QuickTalk Local](../quicktalk/local.md). Put the CosyVoice runtime under the model directory:

 ```bash title="terminal"
@@ -74,6 +79,9 @@ mkdir -p ./models/local-audio/runtime
 git clone https://github.com/FunAudioLLM/CosyVoice.git ./models/local-audio/runtime/CosyVoice
 cd ./models/local-audio/runtime/CosyVoice
 git submodule update --init --recursive
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
 ```

 ## Start
@@ -81,8 +89,7 @@ git submodule update --init --recursive
 Start the local TTS service first:

 ```bash title="terminal"
-OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
-python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
 ```

 Then start OpenTalking:
--- a/docs/en/model-deployment/tts.md
+++ b/docs/en/model-deployment/tts.md
@@ -60,12 +60,41 @@ export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/s
 export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
 export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
 export UV_LINK_MODE=copy
-uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --python 3.11
+uv sync --extra dev --extra models --extra local-audio --python 3.11
 .venv/bin/python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model fun-cosyvoice3-0.5b-2512
 ```

+This downloads the base CosyVoice3 model from ModelScope:
+
+| Asset | Source | Destination |
+|---|---|---|
+| Base CosyVoice3 weights | ModelScope `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | `./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/` |
+
+The base model directory must include the files used by the sidecar runtime,
+including `cosyvoice3.yaml`, `llm.pt`, `flow.pt`, `hift.pt`,
+`speech_tokenizer_v3.onnx`, `speech_tokenizer_v3.batch.onnx`, `campplus.onnx`,
+and `flow.decoder.estimator.fp32.onnx`. The built-in zero-shot voice also needs
+a prompt wav configured by `OPENTALKING_TTS_LOCAL_COSYVOICE_PROMPT_AUDIO`; cloned
+voices store their own prompt wav under the local voice directory.
+
+For fp16 TensorRT, download the extra ONNX assets from Hugging Face and place
+them in the same base model directory:
+
+| Asset | Source | Required for |
+|---|---|---|
+| `flow.decoder.estimator.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | `FP16 + LOAD_TRT=1`; OpenTalking builds `flow.decoder.estimator.autocast_fp16.mygpu.plan` from it. |
+| `flow.decoder.estimator.streaming.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | Optional streaming fp16 ONNX asset; keep beside the estimator ONNX for runtime compatibility. |
+
+The generated `*.mygpu.plan` files are machine-specific TensorRT engines. Do not
+copy them between different GPU / TensorRT / CUDA environments; rebuild them on
+the target host from the ONNX files.
+
+This main `.venv` is for OpenTalking, SenseVoice, and the video backend. Keep
+CosyVoice in its own sidecar venv so its `transformers==4.51.3` runtime does not
+conflict with OpenTalking's `transformers>=4.57,<6`.
+
 Prepare the CosyVoice runtime:

 ```bash title="Terminal"
@@ -75,18 +104,50 @@ cd ./models/local-audio/runtime/CosyVoice
 git submodule update --init --recursive
 ```

+Create or update the CosyVoice sidecar venv:
+
+```bash title="Terminal"
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
+```
+
+If you need TensorRT, install the TRT dependencies into the CosyVoice sidecar venv, not into OpenTalking's main `.venv`:
+
+```bash title="Terminal"
+PIP_EXTRA_INDEX_URL=https://pypi.nvidia.com/ OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 \
+  OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice bash scripts/prepare_cosyvoice_venv.sh
+```
+
 Start the local TTS service:

 ```bash title="Terminal"
-OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
-python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
 ```

 In prior GPU validation, the main CosyVoice3 issue was not a single TTFA number but seed-dependent output-length drift. The local CosyVoice service therefore keeps two stability guards on by default: `OPENTALKING_TTS_LOCAL_COSYVOICE_MASK_STOP_TOKENS=1` masks every stop token exposed by the CosyVoice LLM, and `OPENTALKING_TTS_LOCAL_COSYVOICE_MAX_TOKEN_TEXT_RATIO=6` bounds the token/text ratio so long prompts do not occasionally produce runaway audio. Keep these guards enabled for realtime use.

 TensorRT is optional. Enable it only after the current CosyVoice runtime, CUDA, onnxruntime-gpu/TensorRT engines, and model directory are compatible:

-```env title=".env"
+```bash title="Terminal"
+.venv-cosyvoice/bin/python -c "import tensorrt as trt; print(trt.__version__)"
+test -f ./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/flow.decoder.estimator.fp32.onnx
+```
+
+For CosyVoice3 fp16 TRT, prefer the official autocast fp16 ONNX asset. A TRT engine can be built from `flow.decoder.estimator.fp32.onnx`, but some GPU/TensorRT combinations can produce NaNs or silent audio. Before enabling `FP16 + LOAD_TRT`, place `flow.decoder.estimator.autocast_fp16.onnx` in the same model directory. If the server needs a proxy for Hugging Face, inject proxy variables only for the download command; do not add them to the OpenTalking service env:
+
+```bash title="Terminal"
+env ALL_PROXY=socks5h://127.0.0.1:7890 HTTPS_PROXY=socks5h://127.0.0.1:7890 \
+  HF_ENDPOINT=https://huggingface.co .venv-cosyvoice/bin/python - <<'PY'
+from huggingface_hub import hf_hub_download
+repo = "yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX"
+target = "./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512"
+for name in ["flow.decoder.estimator.autocast_fp16.onnx", "flow.decoder.estimator.streaming.autocast_fp16.onnx"]:
+    hf_hub_download(repo_id=repo, filename=name, repo_type="model", local_dir=target)
+PY
+```
+
+```env title="scripts/quickstart/env"
+OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
 OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1
 OPENTALKING_TTS_LOCAL_COSYVOICE_TRT_CONCURRENT=1
 OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_HOP_LEN=8
@@ -94,12 +155,26 @@ OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_MAX_HOP_LEN=16
 OPENTALKING_TTS_LOCAL_COSYVOICE_STREAM_SCALE_FACTOR=1
 ```

-After startup, check the sidecar health payload and verify `runtime_flags.load_trt`, `streaming`, `llm_token_ratio`, and `llm_stop_token_patch`:
+`start_local_cosyvoice.sh` automatically adds the sidecar venv's `site-packages/tensorrt_libs` directory to `LD_LIBRARY_PATH`. On first startup with `FP16 + LOAD_TRT=1`, if `flow.decoder.estimator.autocast_fp16.onnx` exists in the model directory, OpenTalking builds the GPU-specific `flow.decoder.estimator.autocast_fp16.mygpu.plan` from it; this can take longer than a normal startup. SenseVoice still runs in the OpenTalking main `.venv`; it does not need, and should not inherit, the CosyVoice TRT settings.
+
+After startup, check the sidecar health payload and verify `runtime_flags.load_trt`, `runtime.trt_autocast_fp16`, `streaming`, `llm_token_ratio`, and `llm_stop_token_patch`:

 ```bash title="Terminal"
 curl -fsS http://127.0.0.1:19090/health | python3 -m json.tool
 ```

+The following numbers were measured on a Linux server with an NVIDIA RTX 3090,
+a CosyVoice3 sidecar venv, `FP16 + LOAD_TRT=1`, and the autocast fp16 TensorRT
+plan loaded. The benchmark called the sidecar `/synthesize` endpoint directly
+and measured TTFB as the arrival time of the first PCM byte:
+
+| Text length | TTFB | Wall time | Audio duration | RTF |
+|---:|---:|---:|---:|---:|
+| 43 chars | 0.683 s | 6.215 s | 7.200 s | 0.863 |
+| 42 chars | 0.642 s | 5.858 s | 6.960 s | 0.842 |
+| 29 chars | 0.639 s | 5.771 s | 6.520 s | 0.885 |
+| **Average** | **0.655 s** | **5.948 s** | **6.893 s** | **0.863** |
+
 For the full local speech input, speech synthesis, and QuickTalk video chain, see [Local STT/TTS + QuickTalk](recipes/local-quicktalk-audio.md).

 ## IndexTTS Deployment (provider = indextts)
--- a/docs/zh/model-deployment/recipes/local-quicktalk-audio.md
+++ b/docs/zh/model-deployment/recipes/local-quicktalk-audio.md
@@ -60,13 +60,18 @@ OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-tts-key>
 ## 安装与模型

 ```bash title="终端"
-uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --extra quicktalk-cuda --python 3.11
+uv sync --extra dev --extra models --extra local-audio --extra quicktalk-cuda --python 3.11
 python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model sensevoice-small \
  --model fun-cosyvoice3-0.5b-2512
 ```

+主 `.venv` 只负责 OpenTalking、SenseVoice 和 QuickTalk。CosyVoice runtime
+准备好后，创建独立 sidecar venv。
+
+CosyVoice3 主权重来源和可选 fp16 TensorRT ONNX 文件见 [TTS 部署](../tts.md)。
+
 QuickTalk 权重按 [QuickTalk Local](../quicktalk/local.md) 页面准备。CosyVoice runtime 放在模型目录下即可：

 ```bash title="终端"
@@ -74,6 +79,9 @@ mkdir -p ./models/local-audio/runtime
 git clone https://github.com/FunAudioLLM/CosyVoice.git ./models/local-audio/runtime/CosyVoice
 cd ./models/local-audio/runtime/CosyVoice
 git submodule update --init --recursive
+cd "$DIGITAL_HUMAN_HOME/opentalking"
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
 ```

 ## 启动
@@ -81,8 +89,7 @@ git submodule update --init --recursive
 先启动本地 TTS service：

 ```bash title="终端"
-OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
-python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
 ```

 再启动 OpenTalking：
--- a/docs/zh/model-deployment/tts.md
+++ b/docs/zh/model-deployment/tts.md
@@ -60,12 +60,39 @@ export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/s
 export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
 export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
 export UV_LINK_MODE=copy
-uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --python 3.11
+uv sync --extra dev --extra models --extra local-audio --python 3.11
 .venv/bin/python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model fun-cosyvoice3-0.5b-2512
 ```

+这一步会从 ModelScope 下载 CosyVoice3 主模型：
+
+| 资产 | 来源 | 目标目录 |
+|---|---|---|
+| CosyVoice3 主权重 | ModelScope `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | `./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/` |
+
+主模型目录至少需要包含 sidecar runtime 会加载的文件，包括 `cosyvoice3.yaml`、
+`llm.pt`、`flow.pt`、`hift.pt`、`speech_tokenizer_v3.onnx`、
+`speech_tokenizer_v3.batch.onnx`、`campplus.onnx` 和
+`flow.decoder.estimator.fp32.onnx`。内置 zero-shot 音色还需要
+`OPENTALKING_TTS_LOCAL_COSYVOICE_PROMPT_AUDIO` 指向一段 prompt wav；复刻音色会把
+自己的 prompt wav 保存在本地音色目录。
+
+如果要启用 fp16 TensorRT，再从 Hugging Face 下载额外 ONNX 资产，并放到同一个主模型目录：
+
+| 资产 | 来源 | 用途 |
+|---|---|---|
+| `flow.decoder.estimator.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | `FP16 + LOAD_TRT=1` 必需；OpenTalking 会由它生成 `flow.decoder.estimator.autocast_fp16.mygpu.plan`。 |
+| `flow.decoder.estimator.streaming.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | 可选 streaming fp16 ONNX 资产；建议和 estimator ONNX 放在一起，保持 runtime 兼容。 |
+
+生成的 `*.mygpu.plan` 是和机器绑定的 TensorRT engine，不要跨 GPU / TensorRT /
+CUDA 环境复制；在目标机器上由 ONNX 重新构建。
+
+这个主 `.venv` 负责 OpenTalking、SenseVoice 和视频后端。CosyVoice 需要独立
+sidecar venv，避免它的 `transformers==4.51.3` runtime 与 OpenTalking 的
+`transformers>=4.57,<6` 冲突。
+
 准备 CosyVoice runtime：

 ```bash title="终端"
@@ -75,18 +102,50 @@ cd ./models/local-audio/runtime/CosyVoice
 git submodule update --init --recursive
 ```

+创建或更新 CosyVoice 专用 sidecar venv：
+
+```bash title="终端"
+OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
+  bash scripts/prepare_cosyvoice_venv.sh
+```
+
+如果要启用 TensorRT，把 TRT 依赖安装在 CosyVoice sidecar venv 中，不要安装进 OpenTalking 主 `.venv`：
+
+```bash title="终端"
+PIP_EXTRA_INDEX_URL=https://pypi.nvidia.com/ OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 \
+  OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice bash scripts/prepare_cosyvoice_venv.sh
+```
+
 启动本地 TTS service：

 ```bash title="终端"
-OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
-python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
+bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
 ```

 在既有 GPU 验证中，CosyVoice3 的关键问题不是单次 TTFA，而是随机种子导致的生成长度漂移。OpenTalking 的本地 CosyVoice service 因此默认保留两类稳定性保护：`OPENTALKING_TTS_LOCAL_COSYVOICE_MASK_STOP_TOKENS=1` 会屏蔽 CosyVoice LLM 暴露的全部 stop token，`OPENTALKING_TTS_LOCAL_COSYVOICE_MAX_TOKEN_TEXT_RATIO=6` 会限制 token/text 比例，避免长文本偶发生成过长音频。不要为了追求更快首包把这两个保护关掉。

 TensorRT 是可选加速。只有当当前 CosyVoice runtime、CUDA、onnxruntime-gpu/TensorRT engine 与模型目录匹配时再开启：

-```env title=".env"
+```bash title="终端"
+.venv-cosyvoice/bin/python -c "import tensorrt as trt; print(trt.__version__)"
+test -f ./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/flow.decoder.estimator.fp32.onnx
+```
+
+CosyVoice3 的 fp16 TRT 推荐使用官方 autocast fp16 ONNX 资产。普通 `flow.decoder.estimator.fp32.onnx` 可以生成 TRT engine，但在部分 GPU/TensorRT 组合上会出现 NaN 或静音；如果要开启 `FP16 + LOAD_TRT`，先把 `flow.decoder.estimator.autocast_fp16.onnx` 放到同一个模型目录。服务器需要代理访问 Hugging Face 时，只给下载命令临时注入代理变量即可，不要写入 OpenTalking 主服务环境：
+
+```bash title="终端"
+env ALL_PROXY=socks5h://127.0.0.1:7890 HTTPS_PROXY=socks5h://127.0.0.1:7890 \
+  HF_ENDPOINT=https://huggingface.co .venv-cosyvoice/bin/python - <<'PY'
+from huggingface_hub import hf_hub_download
+repo = "yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX"
+target = "./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512"
+for name in ["flow.decoder.estimator.autocast_fp16.onnx", "flow.decoder.estimator.streaming.autocast_fp16.onnx"]:
+    hf_hub_download(repo_id=repo, filename=name, repo_type="model", local_dir=target)
+PY
+```
+
+```env title="scripts/quickstart/env"
+OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
 OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1
 OPENTALKING_TTS_LOCAL_COSYVOICE_TRT_CONCURRENT=1
 OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_HOP_LEN=8
@@ -94,12 +153,25 @@ OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_MAX_HOP_LEN=16
 OPENTALKING_TTS_LOCAL_COSYVOICE_STREAM_SCALE_FACTOR=1
 ```

-启动后先检查 sidecar 健康信息，确认 `runtime_flags.load_trt`、`streaming`、`llm_token_ratio` 和 `llm_stop_token_patch` 符合预期：
+`start_local_cosyvoice.sh` 会自动把 sidecar venv 里的 `site-packages/tensorrt_libs` 加入 `LD_LIBRARY_PATH`。首次启动 `FP16 + LOAD_TRT=1` 时，如果模型目录里存在 `flow.decoder.estimator.autocast_fp16.onnx`，OpenTalking 会从它生成当前 GPU 对应的 `flow.decoder.estimator.autocast_fp16.mygpu.plan`；这个步骤可能比普通启动更久。SenseVoice 仍然运行在 OpenTalking 主 `.venv`，不需要也不应该跟随 CosyVoice TRT 配置。
+
+启动后先检查 sidecar 健康信息，确认 `runtime_flags.load_trt`、`runtime.trt_autocast_fp16`、`streaming`、`llm_token_ratio` 和 `llm_stop_token_patch` 符合预期：

 ```bash title="终端"
 curl -fsS http://127.0.0.1:19090/health | python3 -m json.tool
 ```

+在 NVIDIA RTX 3090 Linux 服务器上实测，CosyVoice3 使用独立 sidecar venv，已加载
+`FP16 + LOAD_TRT=1` 和 autocast fp16 TensorRT plan。测试直接请求 sidecar 的
+`/synthesize`，TTFB 按第一批 PCM 字节到达时间计算：
+
+| 文本长度 | TTFB | 总耗时 | 音频时长 | RTF |
+|---:|---:|---:|---:|---:|
+| 43 字 | 0.683 s | 6.215 s | 7.200 s | 0.863 |
+| 42 字 | 0.642 s | 5.858 s | 6.960 s | 0.842 |
+| 29 字 | 0.639 s | 5.771 s | 6.520 s | 0.885 |
+| **平均** | **0.655 s** | **5.948 s** | **6.893 s** | **0.863** |
+
 完整本地语音输入、语音合成和 QuickTalk 视频链路见 [本地 STT/TTS + QuickTalk](recipes/local-quicktalk-audio.md)。

 ## IndexTTS 部署（provider = indextts）
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -104,7 +104,7 @@ quicktalk-cpu = [
 ]
 quicktalk-cuda = [
  "imageio-ffmpeg>=0.5",
-  "onnxruntime-gpu>=1.24.0",
+  "onnxruntime-gpu>=1.24.0,<1.27",
 ]
 local-cosyvoice-service = [
  "fastapi>=0.109",
--- a/scripts/local_cosyvoice_service.py
+++ b/scripts/local_cosyvoice_service.py
@@ -1,6 +1,7 @@
 from __future__ import annotations

 import argparse
+import importlib
 import io
 import os
 import sys
@@ -18,6 +19,112 @@ from fastapi.responses import StreamingResponse
 from pydantic import BaseModel


+
+def _soundfile_load_wav(wav: str, target_sr: int):
+    import torch
+
+    audio, sr = sf.read(wav, dtype="float32", always_2d=False)
+    arr = np.asarray(audio, dtype=np.float32)
+    if arr.ndim > 1:
+        arr = arr.mean(axis=1)
+    tensor = torch.from_numpy(arr).unsqueeze(0)
+    if int(sr) == int(target_sr):
+        return tensor
+    try:
+        import torchaudio.functional as AF
+
+        return AF.resample(tensor, int(sr), int(target_sr))
+    except Exception:
+        import torch.nn.functional as F
+
+        n_dst = max(1, int(round(tensor.shape[-1] * int(target_sr) / int(sr))))
+        return F.interpolate(
+            tensor.unsqueeze(0),
+            size=n_dst,
+            mode="linear",
+            align_corners=False,
+        ).squeeze(0)
+
+
+def _build_strongly_typed_trt(trt_model: str, trt_kwargs: dict[str, Any], onnx_model: str) -> None:
+    import tensorrt as trt
+
+    logger = trt.Logger(trt.Logger.INFO)
+    builder = trt.Builder(logger)
+    network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
+    network = builder.create_network(network_flags)
+    parser = trt.OnnxParser(network, logger)
+    config = builder.create_builder_config()
+    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 32)
+    profile = builder.create_optimization_profile()
+    with open(onnx_model, "rb") as f:
+        if not parser.parse(f.read()):
+            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
+            raise RuntimeError(f"failed to parse {onnx_model}: {'; '.join(errors)}")
+    for i, name in enumerate(trt_kwargs["input_names"]):
+        profile.set_shape(name, trt_kwargs["min_shape"][i], trt_kwargs["opt_shape"][i], trt_kwargs["max_shape"][i])
+    config.add_optimization_profile(profile)
+    engine_bytes = builder.build_serialized_network(network, config)
+    if engine_bytes is None:
+        raise RuntimeError(f"failed to build TensorRT engine from {onnx_model}")
+    with open(trt_model, "wb") as f:
+        f.write(engine_bytes)
+
+
+def _patch_cosyvoice_autocast_fp16_trt() -> None:
+    try:
+        import cosyvoice.cli.model as cosy_model
+    except Exception:
+        return
+    if getattr(cosy_model, "_opentalking_autocast_fp16_trt_patched", False):
+        return
+
+    original_convert = cosy_model.convert_onnx_to_trt
+    original_load_trt = cosy_model.CosyVoiceModel.load_trt
+
+    def convert_onnx_to_trt(trt_model, trt_kwargs, onnx_model, fp16):
+        onnx_path = Path(str(onnx_model))
+        if fp16 and onnx_path.name == "flow.decoder.estimator.autocast_fp16.onnx":
+            print(f"building strongly-typed autocast fp16 TensorRT engine: {trt_model}", flush=True)
+            return _build_strongly_typed_trt(str(trt_model), trt_kwargs, str(onnx_model))
+        return original_convert(trt_model, trt_kwargs, onnx_model, fp16)
+
+    def load_trt(self, flow_decoder_estimator_model, flow_decoder_onnx_model, trt_concurrent, fp16):
+        if fp16:
+            model_dir = Path(str(flow_decoder_estimator_model)).parent
+            autocast_onnx = model_dir / "flow.decoder.estimator.autocast_fp16.onnx"
+            if autocast_onnx.exists():
+                flow_decoder_estimator_model = str(model_dir / "flow.decoder.estimator.autocast_fp16.mygpu.plan")
+                flow_decoder_onnx_model = str(autocast_onnx)
+                setattr(self, "_opentalking_trt_autocast_fp16", True)
+                setattr(self, "_opentalking_trt_plan", flow_decoder_estimator_model)
+                setattr(self, "_opentalking_trt_onnx", flow_decoder_onnx_model)
+                print(
+                    "using CosyVoice autocast fp16 TensorRT asset "
+                    f"onnx={flow_decoder_onnx_model} plan={flow_decoder_estimator_model}",
+                    flush=True,
+                )
+        return original_load_trt(self, flow_decoder_estimator_model, flow_decoder_onnx_model, trt_concurrent, fp16)
+
+    cosy_model.convert_onnx_to_trt = convert_onnx_to_trt
+    cosy_model.CosyVoiceModel.load_trt = load_trt
+    cosy_model._opentalking_autocast_fp16_trt_patched = True
+    print("patched cosyvoice autocast fp16 TensorRT loader", flush=True)
+
+
+def _patch_cosyvoice_load_wav() -> None:
+    patched: list[str] = []
+    for module_name in ("cosyvoice.utils.file_utils", "cosyvoice.cli.frontend"):
+        try:
+            module = importlib.import_module(module_name)
+        except Exception:
+            continue
+        setattr(module, "load_wav", _soundfile_load_wav)
+        patched.append(module_name)
+    if patched:
+        print(f"patched cosyvoice load_wav via soundfile modules={','.join(patched)}", flush=True)
+
+
 class SynthesizeRequest(BaseModel):
    text: str
    voice: str | None = None
@@ -79,6 +186,15 @@ def apply_streaming_tuning(
    return {"requested": requested, "applied": applied, "effective": effective}


+def ensure_cosyvoice_flow_half(cosyvoice: Any) -> bool:
+    model = _cosyvoice_model(cosyvoice)
+    flow = getattr(model, "flow", None)
+    if flow is None or not hasattr(flow, "half"):
+        return False
+    flow.half()
+    return True
+
+
 def reset_streaming_tuning(cosyvoice: Any) -> dict[str, Any]:
    model = _cosyvoice_model(cosyvoice)
    baseline = getattr(model, "_opentalking_streaming_tuning", None)
@@ -207,6 +323,9 @@ def current_runtime_info(cosyvoice: Any) -> dict[str, Any]:
        "fp16": bool(getattr(cosyvoice, "fp16", False)),
        "flow_decoder_estimator": estimator_type,
        "flow_decoder_trt": estimator_type == "TrtContextWrapper",
+        "trt_autocast_fp16": bool(getattr(model, "_opentalking_trt_autocast_fp16", False)),
+        "trt_plan": getattr(model, "_opentalking_trt_plan", ""),
+        "trt_onnx": getattr(model, "_opentalking_trt_onnx", ""),
    }


@@ -293,8 +412,10 @@ class CosyVoiceService:
        for path in (runtime, matcha):
            if str(path) not in sys.path:
                sys.path.insert(0, str(path))
+        _patch_cosyvoice_load_wav()
        try:
            from cosyvoice.cli.cosyvoice import AutoModel
+            _patch_cosyvoice_autocast_fp16_trt()
        except ImportError as exc:
            raise RuntimeError(
                "CosyVoice runtime is not importable. Clone FunAudioLLM/CosyVoice and install its requirements in this service venv."
@@ -318,6 +439,9 @@ class CosyVoiceService:
            "trt_concurrent": self.trt_concurrent,
        }
        self._model, self._loaded_model_kwargs = _instantiate_automodel(AutoModel, model_kwargs)
+        flow_half_applied = False
+        if self.load_trt and self.fp16:
+            flow_half_applied = ensure_cosyvoice_flow_half(self._model)
        self._apply_runtime_tuning()
        # Keep the service zero-shot first so it does not require precomputed spk2info.pt.
        print(
@@ -325,6 +449,7 @@ class CosyVoiceService:
            f"model={self.model_dir} runtime={runtime} device={self.device} "
            f"fp16={self.fp16} load_jit={self.load_jit} load_trt={self.load_trt} "
            f"load_vllm={self.load_vllm} trt_concurrent={self.trt_concurrent} "
+            f"flow_half_applied={flow_half_applied} "
            f"seconds={time.perf_counter() - t0:.3f}",
            flush=True,
        )
@@ -386,6 +511,7 @@ class CosyVoiceService:
                "torch",
                "torchaudio",
                "numpy",
+                "onnxruntime-gpu",
                "onnxruntime",
            ),
        }
@@ -542,9 +668,10 @@ class CosyVoiceService:
            self.model()
            return
        req = SynthesizeRequest(text=warmup_text)
+        # Exhaust the stream so CosyVoice releases its request state and model lock.
        stream, _sr = self.synthesize_pcm_stream(req)
        for _chunk in stream:
-            break
+            pass


 def create_app(service: CosyVoiceService) -> FastAPI:
--- a/scripts/prepare_cosyvoice_venv.sh
+++ b/scripts/prepare_cosyvoice_venv.sh
@@ -0,0 +1,121 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+script_dir="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+repo_root="$(cd -- "$script_dir/.." && pwd)"
+
+venv_dir="${OPENTALKING_COSYVOICE_VENV_DIR:-$repo_root/.venv-cosyvoice}"
+runtime_dir="${OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR:-$repo_root/models/local-audio/runtime/CosyVoice}"
+requirements_file="${OPENTALKING_COSYVOICE_REQUIREMENTS:-$runtime_dir/requirements.txt}"
+
+export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
+
+if [[ ! -d "$runtime_dir" ]]; then
+  echo "Missing CosyVoice runtime: $runtime_dir" >&2
+  echo "Clone FunAudioLLM/CosyVoice there, or set OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR." >&2
+  exit 1
+fi
+
+if [[ ! -f "$requirements_file" ]]; then
+  echo "Missing CosyVoice requirements: $requirements_file" >&2
+  exit 1
+fi
+
+resolve_bootstrap_python() {
+  if [[ -n "${OPENTALKING_COSYVOICE_BOOTSTRAP_PYTHON:-}" ]]; then
+    printf '%s\n' "$OPENTALKING_COSYVOICE_BOOTSTRAP_PYTHON"
+    return 0
+  fi
+  if [[ -n "${PYTHON:-}" ]]; then
+    printf '%s\n' "$PYTHON"
+    return 0
+  fi
+  if command -v python3.11 >/dev/null 2>&1; then
+    command -v python3.11
+    return 0
+  fi
+  if command -v uv >/dev/null 2>&1; then
+    if uv python find 3.11 >/dev/null 2>&1; then
+      uv python find 3.11
+      return 0
+    fi
+    uv python install 3.11 >/dev/null
+    uv python find 3.11
+    return 0
+  fi
+  command -v python3
+}
+
+python_bin="$(resolve_bootstrap_python)"
+if [[ ! -x "$venv_dir/bin/python" ]]; then
+  echo "Creating CosyVoice venv: $venv_dir"
+  "$python_bin" -m venv "$venv_dir"
+fi
+
+venv_python="$venv_dir/bin/python"
+tmp_dir="${OPENTALKING_COSYVOICE_TMPDIR:-$venv_dir/.tmp}"
+pip_cache_dir="${OPENTALKING_COSYVOICE_PIP_CACHE_DIR:-$venv_dir/.pip-cache}"
+mkdir -p "$tmp_dir" "$pip_cache_dir"
+export TMPDIR="$tmp_dir"
+export PIP_CACHE_DIR="$pip_cache_dir"
+find "$tmp_dir" -mindepth 1 -maxdepth 1 -name 'pip-*' -exec rm -rf {} +
+
+pip_install_initial() {
+  "$venv_python" -m pip install \
+    --retries "${OPENTALKING_COSYVOICE_PIP_RETRIES:-10}" \
+    --timeout "${OPENTALKING_COSYVOICE_PIP_TIMEOUT:-120}" \
+    "$@"
+}
+
+echo "Installing CosyVoice runtime dependencies"
+pip_install_initial --upgrade "pip<26" "setuptools<81" wheel
+
+pip_common_args=(
+  --retries "${OPENTALKING_COSYVOICE_PIP_RETRIES:-10}"
+  --timeout "${OPENTALKING_COSYVOICE_PIP_TIMEOUT:-120}"
+)
+if "$venv_python" -m pip install --help | grep -q -- '--resume-retries'; then
+  pip_common_args+=(--resume-retries "${OPENTALKING_COSYVOICE_PIP_RESUME_RETRIES:-10}")
+fi
+
+pip_install() {
+  "$venv_python" -m pip install "${pip_common_args[@]}" "$@"
+}
+
+pip_install "numpy==1.26.4" "Cython>=3.0"
+filtered_requirements="$(mktemp)"
+trap 'rm -f "$filtered_requirements"' EXIT
+filter_pattern='^[[:space:]]*(openai-whisper|pyworld|torch|torchaudio)=='
+if [[ "${OPENTALKING_COSYVOICE_INSTALL_TENSORRT:-0}" != "1" ]]; then
+  filter_pattern='^[[:space:]]*(openai-whisper|pyworld|torch|torchaudio|tensorrt-cu12.*)=='
+fi
+grep -Ev "$filter_pattern" "$requirements_file" \
+  | grep -Ev '^[[:space:]]*--extra-index-url[[:space:]]+https://download\.pytorch\.org/' \
+  >"$filtered_requirements"
+pip_install \
+  "torch==2.3.1" \
+  "torchaudio==2.3.1"
+pip_install -r "$filtered_requirements"
+pip_install --no-build-isolation "openai-whisper==20231117"
+pip_install --no-build-isolation "pyworld==0.3.4"
+
+echo "Installing OpenTalking sidecar service dependencies"
+pip_install \
+  "fastapi>=0.109" \
+  "uvicorn[standard]>=0.27" \
+  "pydantic>=2" \
+  "numpy>=1.24,<2" \
+  "soundfile>=0.12" \
+  "transformers==4.51.3"
+
+"$venv_python" - <<'PY'
+import importlib.metadata as metadata
+
+for package in ("transformers", "tokenizers", "torch", "torchaudio", "onnxruntime-gpu", "onnxruntime"):
+    try:
+        print(f"{package}={metadata.version(package)}")
+    except metadata.PackageNotFoundError:
+        print(f"{package}=missing")
+PY
+
+echo "CosyVoice venv is ready: $venv_dir"
--- a/scripts/quickstart/env.example
+++ b/scripts/quickstart/env.example
@@ -45,6 +45,11 @@

 # Local CosyVoice3 sidecar. Keep TensorRT off until the CosyVoice runtime has
 # built/loaded compatible TRT engines for this GPU and model directory.
+# Prepare the sidecar venv with OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 before
+# setting OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1. The sidecar starter prepends
+# the sidecar venv's site-packages/tensorrt_libs directory to LD_LIBRARY_PATH automatically.
+# For CosyVoice3 fp16 TRT, place the official flow.decoder.estimator.autocast_fp16.onnx
+# in the model directory; OpenTalking will build flow.decoder.estimator.autocast_fp16.mygpu.plan.
 # OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
 # OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
 # OPENTALKING_LOCAL_AUDIO_MODEL_ROOT=$DIGITAL_HUMAN_HOME/models/local-audio
--- a/scripts/quickstart/start_local_cosyvoice.sh
+++ b/scripts/quickstart/start_local_cosyvoice.sh
@@ -0,0 +1,205 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+script_dir="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+repo_root="$(cd -- "$script_dir/../.." && pwd)"
+default_home="$(cd -- "$repo_root/.." && pwd)"
+# shellcheck disable=SC1091
+source "$script_dir/_helpers.sh"
+
+usage() {
+  cat <<'USAGE'
+Usage:
+  bash scripts/quickstart/start_local_cosyvoice.sh [--host HOST] [--port PORT] [--env FILE]
+
+Options:
+  --host HOST  Bind host for the local CosyVoice sidecar. Defaults to 127.0.0.1.
+  --port PORT  Bind port. Defaults to OPENTALKING_TTS_LOCAL_COSYVOICE_PORT, then the port in OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL, then 19090.
+  --env FILE   Source a quickstart env file before starting the sidecar.
+  --help       Show this help.
+USAGE
+}
+
+env_file="${OPENTALKING_QUICKSTART_ENV:-$script_dir/env}"
+host="${OPENTALKING_TTS_LOCAL_COSYVOICE_HOST:-127.0.0.1}"
+port=""
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --host)
+      if [[ $# -lt 2 ]]; then
+        echo "Missing value for --host" >&2
+        exit 2
+      fi
+      host="$2"
+      shift 2
+      ;;
+    --port)
+      if [[ $# -lt 2 ]]; then
+        echo "Missing value for --port" >&2
+        exit 2
+      fi
+      port="$2"
+      shift 2
+      ;;
+    --env)
+      if [[ $# -lt 2 ]]; then
+        echo "Missing value for --env" >&2
+        exit 2
+      fi
+      env_file="$2"
+      export OPENTALKING_QUICKSTART_ENV="$env_file"
+      shift 2
+      ;;
+    --help|-h)
+      usage
+      exit 0
+      ;;
+    *)
+      echo "Unknown argument: $1" >&2
+      usage >&2
+      exit 2
+      ;;
+  esac
+done
+
+quickstart_source_env "$env_file"
+
+export DIGITAL_HUMAN_HOME="${DIGITAL_HUMAN_HOME:-$default_home}"
+run_dir="$DIGITAL_HUMAN_HOME/run"
+log_dir="$DIGITAL_HUMAN_HOME/logs"
+mkdir -p "$run_dir" "$log_dir"
+
+if [[ -z "$port" ]]; then
+  port="${OPENTALKING_TTS_LOCAL_COSYVOICE_PORT:-}"
+fi
+if [[ -z "$port" && -n "${OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL:-}" ]]; then
+  port="$(
+    python3 - <<'PY'
+import os
+from urllib.parse import urlparse
+
+url = os.environ.get("OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL", "")
+parsed = urlparse(url)
+print(parsed.port or "")
+PY
+  )"
+fi
+port="${port:-19090}"
+
+resolve_cosyvoice_python() {
+  if [[ -n "${OPENTALKING_COSYVOICE_PYTHON:-}" ]]; then
+    if [[ -x "$OPENTALKING_COSYVOICE_PYTHON" ]]; then
+      printf '%s\n' "$OPENTALKING_COSYVOICE_PYTHON"
+      return 0
+    fi
+    echo "OPENTALKING_COSYVOICE_PYTHON is not executable: $OPENTALKING_COSYVOICE_PYTHON" >&2
+    return 1
+  fi
+
+  local candidate_dir=""
+  for candidate_dir in \
+    "${OPENTALKING_COSYVOICE_VENV_DIR:-}" \
+    "$repo_root/.venv-cosyvoice" \
+    "$DIGITAL_HUMAN_HOME/.venv-cosyvoice" \
+    "/root/cosyvoice/.venv"
+  do
+    [[ -n "$candidate_dir" ]] || continue
+    if [[ -x "$candidate_dir/bin/python" ]]; then
+      printf '%s\n' "$candidate_dir/bin/python"
+      return 0
+    fi
+  done
+
+  echo "Missing CosyVoice sidecar venv." >&2
+  echo "Create it first: OPENTALKING_COSYVOICE_VENV_DIR=\"$repo_root/.venv-cosyvoice\" bash scripts/prepare_cosyvoice_venv.sh" >&2
+  return 1
+}
+
+cosy_python="$(resolve_cosyvoice_python)"
+case "$cosy_python" in
+  "$repo_root/.venv/"*)
+    echo "Refusing to start local CosyVoice from the OpenTalking main venv: $cosy_python" >&2
+    echo "Use OPENTALKING_COSYVOICE_VENV_DIR or OPENTALKING_COSYVOICE_PYTHON for the sidecar venv." >&2
+    exit 1
+    ;;
+esac
+
+cosy_site_packages="$("$cosy_python" - <<'PY'
+import sysconfig
+
+print(sysconfig.get_paths().get("purelib", ""))
+PY
+)"
+cosy_trt_lib_dir="$cosy_site_packages/tensorrt_libs"
+if [[ -d "$cosy_trt_lib_dir" ]]; then
+  export LD_LIBRARY_PATH="$cosy_trt_lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+fi
+
+pid_file="$run_dir/local-cosyvoice-$port.pid"
+log_file="$log_dir/local-cosyvoice-$port.log"
+
+if [[ -f "$pid_file" ]]; then
+  old_pid="$(cat "$pid_file" 2>/dev/null || true)"
+  if [[ -n "$old_pid" ]] && kill -0 "$old_pid" >/dev/null 2>&1; then
+    if curl --max-time 2 -fsS "http://127.0.0.1:$port/health" >/dev/null 2>&1; then
+      echo "Local CosyVoice is already running: pid=$old_pid port=$port"
+      echo "Log: $log_file"
+      exit 0
+    fi
+    echo "Local CosyVoice pid $old_pid is alive but /health on port $port did not respond; removing pid file." >&2
+  fi
+  rm -f "$pid_file"
+fi
+
+if quickstart_port_in_use "$port"; then
+  echo "Local CosyVoice port $port is already in use." >&2
+  quickstart_describe_port "$port" >&2 || true
+  exit 1
+fi
+
+echo "Starting Local CosyVoice"
+echo "  repo:    $repo_root"
+echo "  python:  $cosy_python"
+echo "  host:    $host"
+echo "  port:    $port"
+echo "  log:     $log_file"
+if [[ -d "$cosy_trt_lib_dir" ]]; then
+  echo "  trt lib: $cosy_trt_lib_dir"
+fi
+
+(
+  cd "$repo_root"
+  export PYTHONPATH="$repo_root${PYTHONPATH:+:$PYTHONPATH}"
+  export OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD="${OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD:-1}"
+  if declare -F quickstart_detach >/dev/null 2>&1; then
+    quickstart_detach "$log_file" "$cosy_python" scripts/local_cosyvoice_service.py --host "$host" --port "$port" >"$pid_file"
+  else
+    setsid "$cosy_python" scripts/local_cosyvoice_service.py --host "$host" --port "$port" >"$log_file" 2>&1 < /dev/null &
+    echo "$!" >"$pid_file"
+  fi
+)
+
+pid="$(cat "$pid_file" 2>/dev/null || true)"
+if [[ -z "$pid" ]]; then
+  echo "Failed to capture Local CosyVoice pid." >&2
+  exit 1
+fi
+
+for _ in {1..120}; do
+  if ! kill -0 "$pid" >/dev/null 2>&1; then
+    echo "Local CosyVoice exited during startup. Last log lines:" >&2
+    tail -80 "$log_file" >&2 || true
+    rm -f "$pid_file"
+    exit 1
+  fi
+  if curl --max-time 2 -fsS "http://127.0.0.1:$port/health" >/dev/null 2>&1; then
+    echo "Local CosyVoice is up: http://127.0.0.1:$port"
+    exit 0
+  fi
+  sleep 1
+done
+
+echo "Local CosyVoice did not become ready in 120s. Last log lines:" >&2
+tail -80 "$log_file" >&2 || true
+exit 1