Mirror of https://github.com/datascale-ai/opentalking.git (synced 2026-07-03 15:22:34 +08:00)
feat: add local CosyVoice TRT sidecar deployment (#119)
@@ -126,7 +126,7 @@ OpenTalking's **orchestration layer** (API / Worker / frontend) and **digital-hu
| Fast trial | `mock` | CPU / no GPU | Validate API, LLM, TTS, WebRTC, and browser playback without downloading model weights | [Quickstart](docs/en/user-guide/quickstart.md) |
| Entry validation | `quicktalk` / `wav2lip` | RTX 3050 Laptop, RTX 3060, RTX 4060 | Run real video rendering for demos and deployment validation; lower the resolution on low-memory devices | [QuickTalk](docs/en/model-deployment/quicktalk.md) / [Wav2Lip](docs/en/model-deployment/wav2lip-local.md) |
| Consumer-GPU single machine | `quicktalk` / `wav2lip` / `musetalk` | RTX 3090, RTX 4090 | Closer to real-time local demos, private validation, and lightweight pre-production evaluation | [Model deployment](docs/en/model-deployment/index.md) |
| Fully local private path | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or similar GPU | Run STT, TTS, and video driving with local models to reduce external dependencies | [Local STT/TTS + QuickTalk](docs/en/model-deployment/local-quicktalk-audio.md) |
| Fully local private path | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or similar GPU | Run STT, TTS, and video driving locally; OpenTalking uses the main `.venv`, while CosyVoice runs in a dedicated sidecar venv | [Local STT/TTS + QuickTalk](docs/en/model-deployment/local-quicktalk-audio.md) |
| High-quality remote inference | `flashtalk` / `flashhead` / `fasterliveportrait` + OmniRT | Multi-GPU, Ascend 910B2, remote GPU service | Multi-card, GPU/NPU, production isolation, higher visual quality, or video clone workflows | [FlashTalk](docs/en/model-deployment/flashtalk.md) / [FasterLivePortrait](docs/en/model-deployment/fasterliveportrait.md) |
| Docker / production deployment | API, Web, Worker, external model services | Single GPU, remote GPU, distributed cluster | Service deployment, remote GPU, distributed runtime, and production validation | [Deployment](docs/en/user-guide/deployment.md) |

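@@ -126,7 +126,7 @@ OpenTalking's **orchestration layer** (API / Worker / frontend) and **digital-human synthesis back
| Fast trial | `mock` | CPU / no GPU | Validate the API, LLM, TTS, WebRTC, and browser playback chain first, without downloading model weights | [Quickstart](docs/zh/user-guide/quickstart.md) |
| Entry validation | `quicktalk` / `wav2lip` | RTX 3050 Laptop, RTX 3060, RTX 4060 | Runs real video rendering end to end; suited to feature demos and deployment validation; lowering the resolution is recommended on low-VRAM devices | [QuickTalk](docs/zh/model-deployment/quicktalk.md) / [Wav2Lip](docs/zh/model-deployment/wav2lip-local.md) |
| Consumer-GPU single machine | `quicktalk` / `wav2lip` / `musetalk` | RTX 3090, RTX 4090 | Closer to a real-time experience; suited to local demos, private-deployment validation, and lightweight pre-production evaluation | [Model deployment](docs/zh/model-deployment/index.md) |
| Fully local private deployment | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or an equivalent GPU | STT, TTS, and video driving all use local models, reducing external dependencies | [Local STT/TTS + QuickTalk](docs/zh/model-deployment/local-quicktalk-audio.md) |
| Fully local private deployment | `sensevoice` + `local_cosyvoice` + `quicktalk` | RTX 3090 / 4090 or an equivalent GPU | STT, TTS, and video driving all run locally; OpenTalking uses the main `.venv`, and CosyVoice uses a dedicated sidecar venv | [Local STT/TTS + QuickTalk](docs/zh/model-deployment/local-quicktalk-audio.md) |
| High-quality remote inference | `flashtalk` / `flashhead` / `fasterliveportrait` + OmniRT | Multi-GPU, Ascend 910B2, remote GPU service | Multi-card, GPU/NPU, production isolation, higher visual quality, or video cloning | [FlashTalk](docs/zh/model-deployment/flashtalk.md) / [FasterLivePortrait](docs/zh/model-deployment/fasterliveportrait.md) |
| Docker / production deployment | API, Web, Worker, and external model services run separately | Single-machine GPU, remote GPU, distributed cluster | Service deployment, remote GPU, distributed runtime, and production validation | [Deployment docs](docs/zh/user-guide/deployment.md) |
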
@@ -24,6 +24,7 @@ router = APIRouter(prefix="/tts", tags=["tts"])
logger = logging.getLogger(__name__)

MAX_PREVIEW_TEXT_CHARS = 1000
LOCAL_COSYVOICE_PREVIEW_SECONDS = 3.0
_INDEXTTS_PROVIDERS = {"indextts", "local_indextts", "omnirt_indextts"}
PreviewUploadFile = UploadFile | StarletteUploadFile

@@ -36,6 +37,12 @@ class TTSPreviewRequest(BaseModel):
    indextts_config: dict[str, Any] | None = None


def _preview_sample_limit(provider: str | None, sample_rate: int) -> int | None:
    if provider == "local_cosyvoice":
        return max(1, int(sample_rate * LOCAL_COSYVOICE_PREVIEW_SECONDS))
    return None


def _wav_bytes(chunks: list[np.ndarray], sample_rate: int) -> bytes:
    pcm = np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.int16)
    pcm = np.asarray(pcm, dtype="<i2").reshape(-1)

@@ -215,12 +222,17 @@ async def preview_tts(request: Request) -> Response:
    )
    chunks: list[np.ndarray] = []
    effective_sample_rate = sample_rate
    sample_limit = _preview_sample_limit(provider, sample_rate)
    total_samples = 0
    try:
        async for chunk in tts.synthesize_stream(text, voice=voice):
            arr = np.asarray(chunk.data, dtype=np.int16).reshape(-1)
            if arr.size:
                chunks.append(arr.copy())
                total_samples += int(arr.size)
                effective_sample_rate = int(chunk.sample_rate or effective_sample_rate)
            if sample_limit is not None and total_samples >= sample_limit:
                break
    except Exception as exc:
        raise HTTPException(status_code=502, detail=f"TTS preview failed: {exc}") from exc
    finally:

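The early-stop behaviour added to `preview_tts` can be tried in isolation. The following is a minimal sketch, not the project's code: `fake_stream` and `collect_preview` are hypothetical stand-ins for the TTS adapter and the route body, and plain lists replace NumPy arrays so the example needs only the standard library. Only the limit formula is taken from `_preview_sample_limit` above.

```python
import asyncio

PREVIEW_SECONDS = 3.0  # same value as LOCAL_COSYVOICE_PREVIEW_SECONDS in the diff


def preview_sample_limit(provider, sample_rate):
    """Return a sample cap for providers that get a shortened preview."""
    if provider == "local_cosyvoice":
        return max(1, int(sample_rate * PREVIEW_SECONDS))
    return None


async def fake_stream(n_chunks, chunk_samples, pulled):
    """Hypothetical stand-in for tts.synthesize_stream(); yields plain lists."""
    for i in range(n_chunks):
        pulled.append(i)  # record how many chunks the consumer actually pulled
        yield [0] * chunk_samples


async def collect_preview(provider, sample_rate, pulled):
    """Accumulate samples and stop once the provider's cap is reached."""
    limit = preview_sample_limit(provider, sample_rate)
    total = 0
    async for chunk in fake_stream(20, sample_rate, pulled):
        total += len(chunk)
        if limit is not None and total >= limit:
            break  # enough audio buffered for a preview
    return total


pulled_local = []
total_local = asyncio.run(collect_preview("local_cosyvoice", 16000, pulled_local))
pulled_other = []
total_other = asyncio.run(collect_preview("some_other_provider", 16000, pulled_other))
print(total_local, len(pulled_local), total_other, len(pulled_other))
```

With one-second chunks at 16 kHz, the capped provider stops after three chunks, while an uncapped provider drains all twenty. This matches the `1 <= len(yielded) < 20` bound asserted by the new test.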
@@ -246,6 +246,45 @@ def test_tts_preview_form_passes_indextts_emotion_audio_file(monkeypatch):
    assert calls[0]["emotion_audio_bytes"] == b"RIFFemotion"


def test_tts_preview_local_cosyvoice_returns_after_enough_preview_audio(monkeypatch):
    from apps.api.routes import tts_preview

    yielded: list[int] = []

    class FakeTTS:
        async def synthesize_stream(self, text: str, voice: str | None = None):
            for i in range(20):
                yielded.append(i)
                yield AudioChunk(
                    data=np.ones(16000, dtype=np.int16),
                    sample_rate=16000,
                    duration_ms=1000.0,
                )

    def fake_build_tts_adapter(**kwargs):
        return FakeTTS()

    monkeypatch.setattr(tts_preview, 'build_tts_adapter', fake_build_tts_adapter)

    app = FastAPI()
    app.include_router(tts_preview.router)
    client = TestClient(app)

    response = client.post(
        '/tts/preview',
        json={
            'text': '你好,我正在测试音色。',
            'voice': 'local-office-serena',
            'tts_provider': 'local_cosyvoice',
            'tts_model': 'FunAudioLLM/Fun-CosyVoice3-0.5B-2512',
        },
    )

    assert response.status_code == 200
    assert response.content.startswith(b'RIFF')
    assert 1 <= len(yielded) < 20


def test_tts_preview_rejects_empty_text():
    from apps.api.routes import tts_preview

@@ -60,13 +60,18 @@ OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-tts-key>
## Install and Models

```bash title="terminal"
uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --extra quicktalk-cuda --python 3.11
uv sync --extra dev --extra models --extra local-audio --extra quicktalk-cuda --python 3.11
python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model sensevoice-small \
  --model fun-cosyvoice3-0.5b-2512
```

Use the main `.venv` for OpenTalking, SenseVoice, and QuickTalk. Create a
separate CosyVoice sidecar venv after the runtime checkout.

For CosyVoice3 model sources and the optional fp16 TensorRT ONNX files, see [TTS deployment](../tts.md#local-cosyvoice3-05b).

Prepare QuickTalk weights as described in [QuickTalk Local](../quicktalk/local.md). Put the CosyVoice runtime under the model directory:

```bash title="terminal"

@@ -74,6 +79,9 @@ mkdir -p ./models/local-audio/runtime
git clone https://github.com/FunAudioLLM/CosyVoice.git ./models/local-audio/runtime/CosyVoice
cd ./models/local-audio/runtime/CosyVoice
git submodule update --init --recursive
cd "$DIGITAL_HUMAN_HOME/opentalking"
OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
  bash scripts/prepare_cosyvoice_venv.sh
```

## Start

@@ -81,8 +89,7 @@ git submodule update --init --recursive
Start the local TTS service first:

```bash title="terminal"
OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
  python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
```

Then start OpenTalking:

@@ -60,12 +60,41 @@ export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/s
export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_LINK_MODE=copy
uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --python 3.11
uv sync --extra dev --extra models --extra local-audio --python 3.11
.venv/bin/python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model fun-cosyvoice3-0.5b-2512
```

This downloads the base CosyVoice3 model from ModelScope:

| Asset | Source | Destination |
|---|---|---|
| Base CosyVoice3 weights | ModelScope `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | `./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/` |

The base model directory must include the files used by the sidecar runtime,
including `cosyvoice3.yaml`, `llm.pt`, `flow.pt`, `hift.pt`,
`speech_tokenizer_v3.onnx`, `speech_tokenizer_v3.batch.onnx`, `campplus.onnx`,
and `flow.decoder.estimator.fp32.onnx`. The built-in zero-shot voice also needs
a prompt wav configured by `OPENTALKING_TTS_LOCAL_COSYVOICE_PROMPT_AUDIO`; cloned
voices store their own prompt wav under the local voice directory.
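Before starting the sidecar, it can help to confirm that the files listed above are actually present. The following is a minimal sketch, not part of OpenTalking: `missing_model_files` is a hypothetical helper, the file list is copied from the paragraph above, and the demo runs against a temporary directory rather than a real model checkout.

```python
import tempfile
from pathlib import Path

# File names copied from the documentation paragraph above.
REQUIRED_FILES = [
    "cosyvoice3.yaml",
    "llm.pt",
    "flow.pt",
    "hift.pt",
    "speech_tokenizer_v3.onnx",
    "speech_tokenizer_v3.batch.onnx",
    "campplus.onnx",
    "flow.decoder.estimator.fp32.onnx",
]


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo: create every file except the last one in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES[:-1]:
        (Path(tmp) / name).touch()
    demo_missing = missing_model_files(tmp)

print(demo_missing)
```

For a real check, the same function could be pointed at `./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512`; an empty result means every listed file was found.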

For fp16 TensorRT, download the extra ONNX assets from Hugging Face and place
them in the same base model directory:

| Asset | Source | Required for |
|---|---|---|
| `flow.decoder.estimator.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | `FP16 + LOAD_TRT=1`; OpenTalking builds `flow.decoder.estimator.autocast_fp16.mygpu.plan` from it. |
| `flow.decoder.estimator.streaming.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | Optional streaming fp16 ONNX asset; keep beside the estimator ONNX for runtime compatibility. |

The generated `*.mygpu.plan` files are machine-specific TensorRT engines. Do not
copy them between different GPU / TensorRT / CUDA environments; rebuild them on
the target host from the ONNX files.

This main `.venv` is for OpenTalking, SenseVoice, and the video backend. Keep
CosyVoice in its own sidecar venv so its `transformers==4.51.3` runtime does not
conflict with OpenTalking's `transformers>=4.57,<6`.
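The reason the two environments cannot share one venv follows directly from the version pins. The sketch below is a simplified illustration that compares dotted versions as integer tuples; it is not how `uv` or `pip` resolve dependencies, and it ignores pre-release and local version suffixes. `parse_version` and `in_range` are hypothetical helper names.

```python
def parse_version(text):
    """Turn '4.51.3' into (4, 51, 3); assumes purely numeric components."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, lower, upper):
    """True when lower <= version < upper, compared as integer tuples."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


sidecar_pin = "4.51.3"  # transformers version pinned by the CosyVoice runtime
main_accepts_sidecar_pin = in_range(sidecar_pin, "4.57", "6")
main_accepts_newer = in_range("4.57.0", "4.57", "6")
print(main_accepts_sidecar_pin, main_accepts_newer)
```

Because `4.51.3` falls below the `>=4.57` lower bound, no single installed `transformers` version can satisfy both constraints, which is why the sidecar gets its own venv.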

Prepare the CosyVoice runtime:

```bash title="Terminal"

@@ -75,18 +104,50 @@ cd ./models/local-audio/runtime/CosyVoice
git submodule update --init --recursive
```

Create or update the CosyVoice sidecar venv:

```bash title="Terminal"
OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
  bash scripts/prepare_cosyvoice_venv.sh
```

If you need TensorRT, install the TRT dependencies into the CosyVoice sidecar venv, not into OpenTalking's main `.venv`:

```bash title="Terminal"
PIP_EXTRA_INDEX_URL=https://pypi.nvidia.com/ OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 \
  OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice bash scripts/prepare_cosyvoice_venv.sh
```

Start the local TTS service:

```bash title="Terminal"
OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
  python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
```

In prior GPU validation, the main CosyVoice3 issue was not a single TTFA number but seed-dependent output-length drift. The local CosyVoice service therefore keeps two stability guards on by default: `OPENTALKING_TTS_LOCAL_COSYVOICE_MASK_STOP_TOKENS=1` masks every stop token exposed by the CosyVoice LLM, and `OPENTALKING_TTS_LOCAL_COSYVOICE_MAX_TOKEN_TEXT_RATIO=6` bounds the token/text ratio so long prompts do not occasionally produce runaway audio. Keep these guards enabled for realtime use.
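The token/text ratio guard described above can be pictured as a simple cap on how many speech tokens may be generated for a given amount of input text. The following sketch only illustrates that idea; how the sidecar and the CosyVoice LLM apply the ratio internally is not shown in this diff, and `max_speech_tokens` and `clamp_generation` are hypothetical names.

```python
def max_speech_tokens(text_token_count, max_ratio=6.0, floor=1):
    """Upper bound on generated speech tokens for a given text length."""
    return max(floor, int(text_token_count * max_ratio))


def clamp_generation(generated_tokens, text_token_count, max_ratio=6.0):
    """Truncate a token sequence that ran past the ratio-based cap."""
    cap = max_speech_tokens(text_token_count, max_ratio)
    return generated_tokens[:cap]


# A 20-token text with a runaway 500-token generation is cut to 20 * 6 = 120.
runaway = list(range(500))
clamped = clamp_generation(runaway, text_token_count=20)
# A generation that stays under the cap is left untouched.
normal = clamp_generation(list(range(90)), text_token_count=20)
print(len(clamped), len(normal))
```

Under this reading, a ratio of 6 means output length scales with input length instead of depending on whether a particular random seed happens to emit a stop token.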

TensorRT is optional. Enable it only after the current CosyVoice runtime, CUDA, onnxruntime-gpu/TensorRT engines, and model directory are compatible:

```env title=".env"
```bash title="Terminal"
.venv-cosyvoice/bin/python -c "import tensorrt as trt; print(trt.__version__)"
test -f ./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/flow.decoder.estimator.fp32.onnx
```

For CosyVoice3 fp16 TRT, prefer the official autocast fp16 ONNX asset. A TRT engine can be built from `flow.decoder.estimator.fp32.onnx`, but some GPU/TensorRT combinations can produce NaNs or silent audio. Before enabling `FP16 + LOAD_TRT`, place `flow.decoder.estimator.autocast_fp16.onnx` in the same model directory. If the server needs a proxy for Hugging Face, inject proxy variables only for the download command; do not add them to the OpenTalking service env:

```bash title="Terminal"
env ALL_PROXY=socks5h://127.0.0.1:7890 HTTPS_PROXY=socks5h://127.0.0.1:7890 \
  HF_ENDPOINT=https://huggingface.co .venv-cosyvoice/bin/python - <<'PY'
from huggingface_hub import hf_hub_download
repo = "yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX"
target = "./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512"
for name in ["flow.decoder.estimator.autocast_fp16.onnx", "flow.decoder.estimator.streaming.autocast_fp16.onnx"]:
    hf_hub_download(repo_id=repo, filename=name, repo_type="model", local_dir=target)
PY
```

```env title="scripts/quickstart/env"
OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1
OPENTALKING_TTS_LOCAL_COSYVOICE_TRT_CONCURRENT=1
OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_HOP_LEN=8

@@ -94,12 +155,26 @@ OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_MAX_HOP_LEN=16
OPENTALKING_TTS_LOCAL_COSYVOICE_STREAM_SCALE_FACTOR=1
```

After startup, check the sidecar health payload and verify `runtime_flags.load_trt`, `streaming`, `llm_token_ratio`, and `llm_stop_token_patch`:
`start_local_cosyvoice.sh` automatically adds the sidecar venv's `site-packages/tensorrt_libs` directory to `LD_LIBRARY_PATH`. On first startup with `FP16 + LOAD_TRT=1`, if `flow.decoder.estimator.autocast_fp16.onnx` exists in the model directory, OpenTalking builds the GPU-specific `flow.decoder.estimator.autocast_fp16.mygpu.plan` from it; this can take longer than a normal startup. SenseVoice still runs in the OpenTalking main `.venv` and should not follow the CosyVoice TRT settings.

After startup, check the sidecar health payload and verify `runtime_flags.load_trt`, `runtime.trt_autocast_fp16`, `streaming`, `llm_token_ratio`, and `llm_stop_token_patch`:

```bash title="Terminal"
curl -fsS http://127.0.0.1:19090/health | python3 -m json.tool
```
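The manual check of the health payload could also be scripted. The sketch below is an assumption-heavy illustration: the exact JSON layout returned by `/health` is not shown in this diff, so `SAMPLE_HEALTH` is a hypothetical payload that merely uses the field names mentioned above, and `lookup` is a hypothetical helper that resolves dotted paths.

```python
import json

# Hypothetical payload: only the key names come from the docs, the layout is assumed.
SAMPLE_HEALTH = json.loads("""
{
  "runtime_flags": {"load_trt": true},
  "runtime": {"trt_autocast_fp16": true},
  "streaming": true,
  "llm_token_ratio": 6,
  "llm_stop_token_patch": true
}
""")

EXPECTED_KEYS = [
    "runtime_flags.load_trt",
    "runtime.trt_autocast_fp16",
    "streaming",
    "llm_token_ratio",
    "llm_stop_token_patch",
]


def lookup(payload, dotted):
    """Resolve 'a.b.c' inside nested dicts; return None when any part is absent."""
    node = payload
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


report = {key: lookup(SAMPLE_HEALTH, key) for key in EXPECTED_KEYS}
missing = [key for key, value in report.items() if value is None]
print(report)
print("missing:", missing)
```

Against a live sidecar, `SAMPLE_HEALTH` would be replaced by the parsed output of the `curl` command above; any key reported as missing suggests the payload is laid out differently from this sketch or that the feature is not active.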

Measured on a Linux server with an NVIDIA RTX 3090, CosyVoice3 sidecar venv,
`FP16 + LOAD_TRT=1`, and the autocast fp16 TensorRT plan loaded. The benchmark
called the sidecar `/synthesize` endpoint directly and measured first PCM byte
arrival as TTFB:

| Text length | TTFB | Wall time | Audio duration | RTF |
|---:|---:|---:|---:|---:|
| 43 chars | 0.683 s | 6.215 s | 7.200 s | 0.863 |
| 42 chars | 0.642 s | 5.858 s | 6.960 s | 0.842 |
| 29 chars | 0.639 s | 5.771 s | 6.520 s | 0.885 |
| **Average** | **0.655 s** | **5.948 s** | **6.893 s** | **0.863** |
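The RTF column is consistent with wall time divided by audio duration, so values below 1.0 mean synthesis ran faster than playback. The short sketch below recomputes the column and the averages from the three measured rows; the numbers are copied from the table, and the variable names are illustrative only.

```python
# (text length in chars, TTFB s, wall time s, audio duration s), copied from the table
rows = [
    (43, 0.683, 6.215, 7.200),
    (42, 0.642, 5.858, 6.960),
    (29, 0.639, 5.771, 6.520),
]

# Real-time factor: time spent synthesizing per second of audio produced.
rtfs = [round(wall / audio, 3) for _, _, wall, audio in rows]

mean_ttfb = round(sum(r[1] for r in rows) / len(rows), 3)
mean_wall = round(sum(r[2] for r in rows) / len(rows), 3)
mean_audio = round(sum(r[3] for r in rows) / len(rows), 3)
mean_rtf = round(sum(wall / audio for _, _, wall, audio in rows) / len(rows), 3)

print(rtfs, mean_ttfb, mean_wall, mean_audio, mean_rtf)
```

The recomputed values agree with the table to three decimal places.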

For the full local speech input, speech synthesis, and QuickTalk video chain, see [Local STT/TTS + QuickTalk](recipes/local-quicktalk-audio.md).

## IndexTTS Deployment (provider = indextts)

@@ -60,13 +60,18 @@ OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-tts-key>
## Install and Models

```bash title="Terminal"
uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --extra quicktalk-cuda --python 3.11
uv sync --extra dev --extra models --extra local-audio --extra quicktalk-cuda --python 3.11
python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model sensevoice-small \
  --model fun-cosyvoice3-0.5b-2512
```

The main `.venv` is only for OpenTalking, SenseVoice, and QuickTalk. Once the CosyVoice runtime
is ready, create a separate sidecar venv.

For the CosyVoice3 base-weight sources and the optional fp16 TensorRT ONNX files, see [TTS deployment](../tts.md).

Prepare the QuickTalk weights as described on the [QuickTalk Local](../quicktalk/local.md) page. The CosyVoice runtime can simply be placed under the model directory:

```bash title="Terminal"

@@ -74,6 +79,9 @@ mkdir -p ./models/local-audio/runtime
git clone https://github.com/FunAudioLLM/CosyVoice.git ./models/local-audio/runtime/CosyVoice
cd ./models/local-audio/runtime/CosyVoice
git submodule update --init --recursive
cd "$DIGITAL_HUMAN_HOME/opentalking"
OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
  bash scripts/prepare_cosyvoice_venv.sh
```

## Start

@@ -81,8 +89,7 @@ git submodule update --init --recursive
Start the local TTS service first:

```bash title="Terminal"
OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
  python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
```

Then start OpenTalking:

@@ -60,12 +60,39 @@ export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/s
export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_LINK_MODE=copy
uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --python 3.11
uv sync --extra dev --extra models --extra local-audio --python 3.11
.venv/bin/python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model fun-cosyvoice3-0.5b-2512
```

This step downloads the base CosyVoice3 model from ModelScope:

| Asset | Source | Destination directory |
|---|---|---|
| Base CosyVoice3 weights | ModelScope `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | `./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/` |

The base model directory must contain at least the files that the sidecar runtime loads, including `cosyvoice3.yaml`,
`llm.pt`, `flow.pt`, `hift.pt`, `speech_tokenizer_v3.onnx`,
`speech_tokenizer_v3.batch.onnx`, `campplus.onnx`, and
`flow.decoder.estimator.fp32.onnx`. The built-in zero-shot voice also needs
`OPENTALKING_TTS_LOCAL_COSYVOICE_PROMPT_AUDIO` to point to a prompt wav; cloned voices store
their own prompt wav in the local voice directory.

To enable fp16 TensorRT, also download the extra ONNX assets from Hugging Face and place them in the same base model directory:

| Asset | Source | Purpose |
|---|---|---|
| `flow.decoder.estimator.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | Required for `FP16 + LOAD_TRT=1`; OpenTalking generates `flow.decoder.estimator.autocast_fp16.mygpu.plan` from it. |
| `flow.decoder.estimator.streaming.autocast_fp16.onnx` | Hugging Face `yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX` | Optional streaming fp16 ONNX asset; keeping it beside the estimator ONNX is recommended for runtime compatibility. |

The generated `*.mygpu.plan` files are TensorRT engines bound to the machine. Do not copy them across GPU / TensorRT /
CUDA environments; rebuild them from the ONNX files on the target machine.

This main `.venv` is for OpenTalking, SenseVoice, and the video backend. CosyVoice needs its own
sidecar venv so that its `transformers==4.51.3` runtime does not conflict with OpenTalking's
`transformers>=4.57,<6`.

Prepare the CosyVoice runtime:

```bash title="Terminal"

@@ -75,18 +102,50 @@ cd ./models/local-audio/runtime/CosyVoice
git submodule update --init --recursive
```

Create or update the dedicated CosyVoice sidecar venv:

```bash title="Terminal"
OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
  bash scripts/prepare_cosyvoice_venv.sh
```

To enable TensorRT, install the TRT dependencies into the CosyVoice sidecar venv, not into OpenTalking's main `.venv`:

```bash title="Terminal"
PIP_EXTRA_INDEX_URL=https://pypi.nvidia.com/ OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 \
  OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice bash scripts/prepare_cosyvoice_venv.sh
```

Start the local TTS service:

```bash title="Terminal"
OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
  python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090
bash scripts/quickstart/start_local_cosyvoice.sh --port 19090
```

In prior GPU validation, the key CosyVoice3 issue was not a single TTFA measurement but generation-length drift caused by the random seed. OpenTalking's local CosyVoice service therefore keeps two stability guards on by default: `OPENTALKING_TTS_LOCAL_COSYVOICE_MASK_STOP_TOKENS=1` masks all stop tokens exposed by the CosyVoice LLM, and `OPENTALKING_TTS_LOCAL_COSYVOICE_MAX_TOKEN_TEXT_RATIO=6` limits the token/text ratio so that long text does not occasionally produce overly long audio. Do not turn these two guards off in pursuit of a faster first packet.

TensorRT is an optional acceleration. Enable it only when the current CosyVoice runtime, CUDA, onnxruntime-gpu/TensorRT engine, and model directory match:

```env title=".env"
```bash title="Terminal"
.venv-cosyvoice/bin/python -c "import tensorrt as trt; print(trt.__version__)"
test -f ./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512/flow.decoder.estimator.fp32.onnx
```

For CosyVoice3 fp16 TRT, the official autocast fp16 ONNX asset is recommended. The plain `flow.decoder.estimator.fp32.onnx` can produce a TRT engine, but on some GPU/TensorRT combinations it yields NaNs or silence; to enable `FP16 + LOAD_TRT`, first place `flow.decoder.estimator.autocast_fp16.onnx` in the same model directory. If the server needs a proxy to reach Hugging Face, inject the proxy variables temporarily for the download command only; do not write them into the OpenTalking main service environment:

```bash title="Terminal"
env ALL_PROXY=socks5h://127.0.0.1:7890 HTTPS_PROXY=socks5h://127.0.0.1:7890 \
  HF_ENDPOINT=https://huggingface.co .venv-cosyvoice/bin/python - <<'PY'
from huggingface_hub import hf_hub_download
repo = "yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX"
target = "./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512"
for name in ["flow.decoder.estimator.autocast_fp16.onnx", "flow.decoder.estimator.streaming.autocast_fp16.onnx"]:
    hf_hub_download(repo_id=repo, filename=name, repo_type="model", local_dir=target)
PY
```

```env title="scripts/quickstart/env"
OPENTALKING_TTS_LOCAL_COSYVOICE_FP16=auto
OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1
OPENTALKING_TTS_LOCAL_COSYVOICE_TRT_CONCURRENT=1
OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_HOP_LEN=8

@@ -94,12 +153,25 @@ OPENTALKING_TTS_LOCAL_COSYVOICE_TOKEN_MAX_HOP_LEN=16
OPENTALKING_TTS_LOCAL_COSYVOICE_STREAM_SCALE_FACTOR=1
```

After startup, first check the sidecar health information and confirm that `runtime_flags.load_trt`, `streaming`, `llm_token_ratio`, and `llm_stop_token_patch` match expectations:
`start_local_cosyvoice.sh` automatically adds the `site-packages/tensorrt_libs` directory in the sidecar venv to `LD_LIBRARY_PATH`. On the first startup with `FP16 + LOAD_TRT=1`, if `flow.decoder.estimator.autocast_fp16.onnx` exists in the model directory, OpenTalking generates the `flow.decoder.estimator.autocast_fp16.mygpu.plan` for the current GPU from it; this step can take longer than a normal startup. SenseVoice still runs in the OpenTalking main `.venv`; it does not need to, and should not, follow the CosyVoice TRT configuration.

After startup, first check the sidecar health information and confirm that `runtime_flags.load_trt`, `runtime.trt_autocast_fp16`, `streaming`, `llm_token_ratio`, and `llm_stop_token_patch` match expectations:

```bash title="Terminal"
curl -fsS http://127.0.0.1:19090/health | python3 -m json.tool
```

Measured on an NVIDIA RTX 3090 Linux server, with CosyVoice3 in a dedicated sidecar venv and
`FP16 + LOAD_TRT=1` plus the autocast fp16 TensorRT plan loaded. The test requested the sidecar's
`/synthesize` endpoint directly, and TTFB is the arrival time of the first batch of PCM bytes:

| Text length | TTFB | Total time | Audio duration | RTF |
|---:|---:|---:|---:|---:|
| 43 chars | 0.683 s | 6.215 s | 7.200 s | 0.863 |
| 42 chars | 0.642 s | 5.858 s | 6.960 s | 0.842 |
| 29 chars | 0.639 s | 5.771 s | 6.520 s | 0.885 |
| **Average** | **0.655 s** | **5.948 s** | **6.893 s** | **0.863** |

For the full local speech input, speech synthesis, and QuickTalk video chain, see [Local STT/TTS + QuickTalk](recipes/local-quicktalk-audio.md).

## IndexTTS Deployment (provider = indextts)

@@ -104,7 +104,7 @@ quicktalk-cpu = [
]
quicktalk-cuda = [
    "imageio-ffmpeg>=0.5",
    "onnxruntime-gpu>=1.24.0",
    "onnxruntime-gpu>=1.24.0,<1.27",
]
local-cosyvoice-service = [
    "fastapi>=0.109",

@@ -1,6 +1,7 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import importlib
|
||||
import io
|
||||
import os
|
||||
import sys
|
||||
@@ -18,6 +19,112 @@ from fastapi.responses import StreamingResponse
|
||||
from pydantic import BaseModel
|
||||
|
||||
|
||||
|
||||
def _soundfile_load_wav(wav: str, target_sr: int):
|
||||
import torch
|
||||
|
||||
audio, sr = sf.read(wav, dtype="float32", always_2d=False)
|
||||
arr = np.asarray(audio, dtype=np.float32)
|
||||
if arr.ndim > 1:
|
||||
arr = arr.mean(axis=1)
|
||||
tensor = torch.from_numpy(arr).unsqueeze(0)
|
||||
if int(sr) == int(target_sr):
|
||||
return tensor
|
||||
try:
|
||||
import torchaudio.functional as AF
|
||||
|
||||
return AF.resample(tensor, int(sr), int(target_sr))
|
||||
except Exception:
|
||||
import torch.nn.functional as F
|
||||
|
||||
n_dst = max(1, int(round(tensor.shape[-1] * int(target_sr) / int(sr))))
|
||||
return F.interpolate(
|
||||
tensor.unsqueeze(0),
|
||||
size=n_dst,
|
||||
mode="linear",
|
||||
align_corners=False,
|
||||
).squeeze(0)
|
||||
|
||||
|
||||
def _build_strongly_typed_trt(trt_model: str, trt_kwargs: dict[str, Any], onnx_model: str) -> None:
|
||||
import tensorrt as trt
|
||||
|
||||
logger = trt.Logger(trt.Logger.INFO)
|
||||
builder = trt.Builder(logger)
|
||||
network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
|
||||
network = builder.create_network(network_flags)
|
||||
parser = trt.OnnxParser(network, logger)
|
||||
config = builder.create_builder_config()
|
||||
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 32)
|
||||
profile = builder.create_optimization_profile()
|
||||
with open(onnx_model, "rb") as f:
|
||||
if not parser.parse(f.read()):
|
||||
errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
|
||||
raise RuntimeError(f"failed to parse {onnx_model}: {'; '.join(errors)}")
|
||||
for i, name in enumerate(trt_kwargs["input_names"]):
|
||||
profile.set_shape(name, trt_kwargs["min_shape"][i], trt_kwargs["opt_shape"][i], trt_kwargs["max_shape"][i])
|
||||
config.add_optimization_profile(profile)
|
||||
engine_bytes = builder.build_serialized_network(network, config)
|
||||
if engine_bytes is None:
|
||||
raise RuntimeError(f"failed to build TensorRT engine from {onnx_model}")
|
||||
with open(trt_model, "wb") as f:
|
||||
f.write(engine_bytes)
|
||||
|
||||
|
||||
def _patch_cosyvoice_autocast_fp16_trt() -> None:
|
||||
try:
|
||||
import cosyvoice.cli.model as cosy_model
|
||||
except Exception:
|
||||
return
|
||||
if getattr(cosy_model, "_opentalking_autocast_fp16_trt_patched", False):
|
||||
return
|
||||
|
||||
original_convert = cosy_model.convert_onnx_to_trt
|
||||
original_load_trt = cosy_model.CosyVoiceModel.load_trt
|
||||
|
||||
def convert_onnx_to_trt(trt_model, trt_kwargs, onnx_model, fp16):
|
||||
onnx_path = Path(str(onnx_model))
|
||||
if fp16 and onnx_path.name == "flow.decoder.estimator.autocast_fp16.onnx":
|
||||
print(f"building strongly-typed autocast fp16 TensorRT engine: {trt_model}", flush=True)
|
||||
return _build_strongly_typed_trt(str(trt_model), trt_kwargs, str(onnx_model))
|
||||
return original_convert(trt_model, trt_kwargs, onnx_model, fp16)
|
||||
|
||||
def load_trt(self, flow_decoder_estimator_model, flow_decoder_onnx_model, trt_concurrent, fp16):
|
||||
if fp16:
|
||||
model_dir = Path(str(flow_decoder_estimator_model)).parent
|
||||
autocast_onnx = model_dir / "flow.decoder.estimator.autocast_fp16.onnx"
|
||||
if autocast_onnx.exists():
|
||||
flow_decoder_estimator_model = str(model_dir / "flow.decoder.estimator.autocast_fp16.mygpu.plan")
|
||||
flow_decoder_onnx_model = str(autocast_onnx)
|
||||
setattr(self, "_opentalking_trt_autocast_fp16", True)
|
||||
setattr(self, "_opentalking_trt_plan", flow_decoder_estimator_model)
|
||||
setattr(self, "_opentalking_trt_onnx", flow_decoder_onnx_model)
|
||||
print(
|
||||
"using CosyVoice autocast fp16 TensorRT asset "
|
||||
f"onnx={flow_decoder_onnx_model} plan={flow_decoder_estimator_model}",
|
||||
flush=True,
|
||||
)
|
||||
return original_load_trt(self, flow_decoder_estimator_model, flow_decoder_onnx_model, trt_concurrent, fp16)
|
||||
|
||||
cosy_model.convert_onnx_to_trt = convert_onnx_to_trt
|
||||
cosy_model.CosyVoiceModel.load_trt = load_trt
|
||||
cosy_model._opentalking_autocast_fp16_trt_patched = True
|
||||
print("patched cosyvoice autocast fp16 TensorRT loader", flush=True)
|
||||
|
||||
|
||||
def _patch_cosyvoice_load_wav() -> None:
    patched: list[str] = []
    for module_name in ("cosyvoice.utils.file_utils", "cosyvoice.cli.frontend"):
        try:
            module = importlib.import_module(module_name)
        except Exception:
            continue
        setattr(module, "load_wav", _soundfile_load_wav)
        patched.append(module_name)
    if patched:
        print(f"patched cosyvoice load_wav via soundfile modules={','.join(patched)}", flush=True)


class SynthesizeRequest(BaseModel):
    text: str
    voice: str | None = None
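The patch above wraps CosyVoice's TensorRT loader: when fp16 is requested and `flow.decoder.estimator.autocast_fp16.onnx` sits next to the default estimator, both paths are redirected to the autocast pair before the original loader runs. The sketch below illustrates that wrap-and-redirect pattern in isolation. It is not the service code: `make_fake_module`, `FakeModel`, `patch_load_trt`, and the `_patched` flag are hypothetical stand-ins, and the guard at the top is an assumption inferred from the `_opentalking_autocast_fp16_trt_patched` flag, since the actual check lies outside this excerpt.

```python
from pathlib import Path
import types

AUTOCAST_ONNX = "flow.decoder.estimator.autocast_fp16.onnx"
AUTOCAST_PLAN = "flow.decoder.estimator.autocast_fp16.mygpu.plan"


def make_fake_module():
    """Build a hypothetical stand-in for the module that owns the model class."""

    class FakeModel:
        def load_trt(self, plan_path, onnx_path, fp16):
            # The real loader would build or load an engine; this one only records arguments.
            self.loaded = (plan_path, onnx_path, fp16)

    return types.SimpleNamespace(FakeModel=FakeModel)


def patch_load_trt(module):
    """Wrap module.FakeModel.load_trt once; return False if already patched."""
    if getattr(module, "_patched", False):  # assumed guard, see lead-in
        return False
    original = module.FakeModel.load_trt

    def load_trt(self, plan_path, onnx_path, fp16):
        if fp16:
            model_dir = Path(plan_path).parent
            autocast_onnx = model_dir / AUTOCAST_ONNX
            if autocast_onnx.exists():
                # Redirect both paths to the autocast asset pair.
                plan_path = str(model_dir / AUTOCAST_PLAN)
                onnx_path = str(autocast_onnx)
        # Always delegate to the original loader, with or without redirection.
        return original(self, plan_path, onnx_path, fp16)

    module.FakeModel.load_trt = load_trt
    module._patched = True
    return True
```

Calling `patch_load_trt` twice on the same module leaves a single wrapper in place, and with `fp16=False` or a missing ONNX file the wrapper passes the caller's paths through unchanged.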
@@ -79,6 +186,15 @@ def apply_streaming_tuning(
    return {"requested": requested, "applied": applied, "effective": effective}


def ensure_cosyvoice_flow_half(cosyvoice: Any) -> bool:
    model = _cosyvoice_model(cosyvoice)
    flow = getattr(model, "flow", None)
    if flow is None or not hasattr(flow, "half"):
        return False
    flow.half()
    return True


def reset_streaming_tuning(cosyvoice: Any) -> dict[str, Any]:
    model = _cosyvoice_model(cosyvoice)
    baseline = getattr(model, "_opentalking_streaming_tuning", None)
@@ -207,6 +323,9 @@ def current_runtime_info(cosyvoice: Any) -> dict[str, Any]:
        "fp16": bool(getattr(cosyvoice, "fp16", False)),
        "flow_decoder_estimator": estimator_type,
        "flow_decoder_trt": estimator_type == "TrtContextWrapper",
        "trt_autocast_fp16": bool(getattr(model, "_opentalking_trt_autocast_fp16", False)),
        "trt_plan": getattr(model, "_opentalking_trt_plan", ""),
        "trt_onnx": getattr(model, "_opentalking_trt_onnx", ""),
    }


@@ -293,8 +412,10 @@ class CosyVoiceService:
        for path in (runtime, matcha):
            if str(path) not in sys.path:
                sys.path.insert(0, str(path))
        _patch_cosyvoice_load_wav()
        try:
            from cosyvoice.cli.cosyvoice import AutoModel
            _patch_cosyvoice_autocast_fp16_trt()
        except ImportError as exc:
            raise RuntimeError(
                "CosyVoice runtime is not importable. Clone FunAudioLLM/CosyVoice and install its requirements in this service venv."
@@ -318,6 +439,9 @@ class CosyVoiceService:
            "trt_concurrent": self.trt_concurrent,
        }
        self._model, self._loaded_model_kwargs = _instantiate_automodel(AutoModel, model_kwargs)
        flow_half_applied = False
        if self.load_trt and self.fp16:
            flow_half_applied = ensure_cosyvoice_flow_half(self._model)
        self._apply_runtime_tuning()
        # Keep the service zero-shot first so it does not require precomputed spk2info.pt.
        print(
@@ -325,6 +449,7 @@ class CosyVoiceService:
            f"model={self.model_dir} runtime={runtime} device={self.device} "
            f"fp16={self.fp16} load_jit={self.load_jit} load_trt={self.load_trt} "
            f"load_vllm={self.load_vllm} trt_concurrent={self.trt_concurrent} "
            f"flow_half_applied={flow_half_applied} "
            f"seconds={time.perf_counter() - t0:.3f}",
            flush=True,
        )
@@ -386,6 +511,7 @@ class CosyVoiceService:
                "torch",
                "torchaudio",
                "numpy",
                "onnxruntime-gpu",
                "onnxruntime",
            ),
        }
@@ -542,9 +668,10 @@ class CosyVoiceService:
            self.model()
            return
        req = SynthesizeRequest(text=warmup_text)
        # Exhaust the stream so CosyVoice releases its request state and model lock.
        stream, _sr = self.synthesize_pcm_stream(req)
        for _chunk in stream:
            pass


def create_app(service: CosyVoiceService) -> FastAPI:

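The warmup hunk above replaces an early `break` with a full drain of the stream, and its comment gives the reason: the stream has to be exhausted before CosyVoice releases its request state and model lock. The sketch below shows why that matters for a generator that cleans up in a `finally` block. It is an illustration only: `pcm_stream`, `warmup_exhaust`, and `warmup_break` are hypothetical names, and the assumption that the real stream holds a lock until its generator finishes comes from the comment, not from the CosyVoice source.

```python
import threading


def pcm_stream(lock, chunks):
    """Hypothetical streaming generator that holds `lock` until it finishes."""
    lock.acquire()
    try:
        for chunk in chunks:
            yield chunk
    finally:
        # Runs when the generator is exhausted or explicitly closed.
        lock.release()


def warmup_exhaust(lock, chunks):
    """Drain the stream; report whether the lock is still held afterwards."""
    stream = pcm_stream(lock, chunks)
    for _chunk in stream:
        pass
    return lock.locked()


def warmup_break(lock, chunks):
    """Stop after the first chunk; the suspended generator still holds the lock."""
    stream = pcm_stream(lock, chunks)
    for _chunk in stream:
        break
    still_locked = lock.locked()
    stream.close()  # explicit close runs the finally block and releases the lock
    return still_locked
```

With a non-empty chunk list, `warmup_exhaust` reports the lock as released, while `warmup_break` reports it as still held until `close()` is called. Draining the stream avoids relying on that explicit close or on garbage collection.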
121
scripts/prepare_cosyvoice_venv.sh
Executable file
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
set -euo pipefail

script_dir="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
repo_root="$(cd -- "$script_dir/.." && pwd)"

venv_dir="${OPENTALKING_COSYVOICE_VENV_DIR:-$repo_root/.venv-cosyvoice}"
runtime_dir="${OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR:-$repo_root/models/local-audio/runtime/CosyVoice}"
requirements_file="${OPENTALKING_COSYVOICE_REQUIREMENTS:-$runtime_dir/requirements.txt}"

export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"

if [[ ! -d "$runtime_dir" ]]; then
  echo "Missing CosyVoice runtime: $runtime_dir" >&2
  echo "Clone FunAudioLLM/CosyVoice there, or set OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR." >&2
  exit 1
fi

if [[ ! -f "$requirements_file" ]]; then
  echo "Missing CosyVoice requirements: $requirements_file" >&2
  exit 1
fi

resolve_bootstrap_python() {
  if [[ -n "${OPENTALKING_COSYVOICE_BOOTSTRAP_PYTHON:-}" ]]; then
    printf '%s\n' "$OPENTALKING_COSYVOICE_BOOTSTRAP_PYTHON"
    return 0
  fi
  if [[ -n "${PYTHON:-}" ]]; then
    printf '%s\n' "$PYTHON"
    return 0
  fi
  if command -v python3.11 >/dev/null 2>&1; then
    command -v python3.11
    return 0
  fi
  if command -v uv >/dev/null 2>&1; then
    if uv python find 3.11 >/dev/null 2>&1; then
      uv python find 3.11
      return 0
    fi
    uv python install 3.11 >/dev/null
    uv python find 3.11
    return 0
  fi
  command -v python3
}

python_bin="$(resolve_bootstrap_python)"
if [[ ! -x "$venv_dir/bin/python" ]]; then
  echo "Creating CosyVoice venv: $venv_dir"
  "$python_bin" -m venv "$venv_dir"
fi

venv_python="$venv_dir/bin/python"
tmp_dir="${OPENTALKING_COSYVOICE_TMPDIR:-$venv_dir/.tmp}"
pip_cache_dir="${OPENTALKING_COSYVOICE_PIP_CACHE_DIR:-$venv_dir/.pip-cache}"
mkdir -p "$tmp_dir" "$pip_cache_dir"
export TMPDIR="$tmp_dir"
export PIP_CACHE_DIR="$pip_cache_dir"
find "$tmp_dir" -mindepth 1 -maxdepth 1 -name 'pip-*' -exec rm -rf {} +

pip_install_initial() {
  "$venv_python" -m pip install \
    --retries "${OPENTALKING_COSYVOICE_PIP_RETRIES:-10}" \
    --timeout "${OPENTALKING_COSYVOICE_PIP_TIMEOUT:-120}" \
    "$@"
}

echo "Installing CosyVoice runtime dependencies"
pip_install_initial --upgrade "pip<26" "setuptools<81" wheel

pip_common_args=(
  --retries "${OPENTALKING_COSYVOICE_PIP_RETRIES:-10}"
  --timeout "${OPENTALKING_COSYVOICE_PIP_TIMEOUT:-120}"
)
if "$venv_python" -m pip install --help | grep -q -- '--resume-retries'; then
  pip_common_args+=(--resume-retries "${OPENTALKING_COSYVOICE_PIP_RESUME_RETRIES:-10}")
fi

pip_install() {
  "$venv_python" -m pip install "${pip_common_args[@]}" "$@"
}

pip_install "numpy==1.26.4" "Cython>=3.0"
filtered_requirements="$(mktemp)"
trap 'rm -f "$filtered_requirements"' EXIT
filter_pattern='^[[:space:]]*(openai-whisper|pyworld|torch|torchaudio)=='
if [[ "${OPENTALKING_COSYVOICE_INSTALL_TENSORRT:-0}" != "1" ]]; then
  filter_pattern='^[[:space:]]*(openai-whisper|pyworld|torch|torchaudio|tensorrt-cu12.*)=='
fi
grep -Ev "$filter_pattern" "$requirements_file" \
  | grep -Ev '^[[:space:]]*--extra-index-url[[:space:]]+https://download\.pytorch\.org/' \
  >"$filtered_requirements"
pip_install \
  "torch==2.3.1" \
  "torchaudio==2.3.1"
pip_install -r "$filtered_requirements"
pip_install --no-build-isolation "openai-whisper==20231117"
pip_install --no-build-isolation "pyworld==0.3.4"

echo "Installing OpenTalking sidecar service dependencies"
pip_install \
  "fastapi>=0.109" \
  "uvicorn[standard]>=0.27" \
  "pydantic>=2" \
  "numpy>=1.24,<2" \
  "soundfile>=0.12" \
  "transformers==4.51.3"

"$venv_python" - <<'PY'
import importlib.metadata as metadata

for package in ("transformers", "tokenizers", "torch", "torchaudio", "onnxruntime-gpu", "onnxruntime"):
    try:
        print(f"{package}={metadata.version(package)}")
    except metadata.PackageNotFoundError:
        print(f"{package}=missing")
PY

echo "CosyVoice venv is ready: $venv_dir"
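The prepare script above filters the upstream `requirements.txt` before installing it: pins for `openai-whisper`, `pyworld`, `torch`, and `torchaudio` are dropped because the script installs those itself, `tensorrt-cu12*` pins are dropped unless `OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1`, and the PyTorch `--extra-index-url` line is removed. The sketch below restates the two `grep -Ev` passes in Python so the matching rules are easier to check. `filter_requirements` is a hypothetical helper, and `\s` is used as an approximation of `[[:space:]]`.

```python
import re

# Packages the script installs itself, so their pins are removed from the file.
INSTALLED_SEPARATELY = ("openai-whisper", "pyworld", "torch", "torchaudio")


def filter_requirements(lines, install_tensorrt=False):
    """Return the requirement lines that would survive both grep -Ev passes."""
    names = list(INSTALLED_SEPARATELY)
    if not install_tensorrt:
        # Matches tensorrt-cu12 and any suffixed variant such as tensorrt-cu12-libs.
        names.append(r"tensorrt-cu12.*")
    drop_pin = re.compile(r"^\s*(" + "|".join(names) + r")==")
    drop_index = re.compile(r"^\s*--extra-index-url\s+https://download\.pytorch\.org/")
    return [
        line
        for line in lines
        if not drop_pin.search(line) and not drop_index.search(line)
    ]
```

Because the pattern requires `==` directly after the package name, a pin such as `torchvision==...` is not caught by the `torch` alternative and stays in the filtered file.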
@@ -45,6 +45,11 @@

# Local CosyVoice3 sidecar. Keep TensorRT off until the CosyVoice runtime has
# built/loaded compatible TRT engines for this GPU and model directory.
# Prepare the sidecar venv with OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1 before
# setting OPENTALKING_TTS_LOCAL_COSYVOICE_LOAD_TRT=1. The sidecar starter adds
# .venv-cosyvoice/site-packages/tensorrt_libs to LD_LIBRARY_PATH automatically.
# For CosyVoice3 fp16 TRT, place the official flow.decoder.estimator.autocast_fp16.onnx
# in the model directory; OpenTalking will build flow.decoder.estimator.autocast_fp16.mygpu.plan.
# OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
# OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
# OPENTALKING_LOCAL_AUDIO_MODEL_ROOT=$DIGITAL_HUMAN_HOME/models/local-audio
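The env comments above list two prerequisites for turning on TensorRT: the sidecar venv must contain the `tensorrt_libs` directory (installed when `OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1`), and for fp16 the autocast ONNX file must be in the model directory. A preflight check along those lines might look like the sketch below. `trt_preflight` and its parameters are hypothetical and not part of this commit; the function takes plain arguments rather than reading environment variables, because only some of the relevant variable names appear in this diff.

```python
from pathlib import Path

AUTOCAST_ONNX = "flow.decoder.estimator.autocast_fp16.onnx"


def trt_preflight(model_dir, site_packages, load_trt, fp16):
    """Return human-readable notes; an empty list means nothing to flag."""
    notes = []
    if not load_trt:
        # TensorRT is off, so neither prerequisite applies.
        return notes
    if not (Path(site_packages) / "tensorrt_libs").is_dir():
        notes.append(
            "tensorrt_libs not found; prepare the venv with "
            "OPENTALKING_COSYVOICE_INSTALL_TENSORRT=1"
        )
    if fp16 and not (Path(model_dir) / AUTOCAST_ONNX).is_file():
        notes.append(
            f"{AUTOCAST_ONNX} not found in {model_dir}; "
            "the autocast fp16 plan will not be built"
        )
    return notes
```

The second note is informational rather than fatal: per the loader patch in this commit, a missing autocast ONNX file means the original CosyVoice TRT loader runs with its default paths.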
205
scripts/quickstart/start_local_cosyvoice.sh
Executable file
@@ -0,0 +1,205 @@
#!/usr/bin/env bash
set -euo pipefail

script_dir="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
repo_root="$(cd -- "$script_dir/../.." && pwd)"
default_home="$(cd -- "$repo_root/.." && pwd)"
# shellcheck disable=SC1091
source "$script_dir/_helpers.sh"

usage() {
  cat <<'USAGE'
Usage:
  bash scripts/quickstart/start_local_cosyvoice.sh [--host HOST] [--port PORT] [--env FILE]

Options:
  --host HOST   Bind host for the local CosyVoice sidecar. Defaults to 127.0.0.1.
  --port PORT   Bind port. Defaults to OPENTALKING_TTS_LOCAL_COSYVOICE_PORT, then the port in OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL, then 19090.
  --env FILE    Source a quickstart env file before starting the sidecar.
  --help        Show this help.
USAGE
}

env_file="${OPENTALKING_QUICKSTART_ENV:-$script_dir/env}"
host="${OPENTALKING_TTS_LOCAL_COSYVOICE_HOST:-127.0.0.1}"
port=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    --host)
      if [[ $# -lt 2 ]]; then
        echo "Missing value for --host" >&2
        exit 2
      fi
      host="$2"
      shift 2
      ;;
    --port)
      if [[ $# -lt 2 ]]; then
        echo "Missing value for --port" >&2
        exit 2
      fi
      port="$2"
      shift 2
      ;;
    --env)
      if [[ $# -lt 2 ]]; then
        echo "Missing value for --env" >&2
        exit 2
      fi
      env_file="$2"
      export OPENTALKING_QUICKSTART_ENV="$env_file"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      echo "Unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

quickstart_source_env "$env_file"

export DIGITAL_HUMAN_HOME="${DIGITAL_HUMAN_HOME:-$default_home}"
run_dir="$DIGITAL_HUMAN_HOME/run"
log_dir="$DIGITAL_HUMAN_HOME/logs"
mkdir -p "$run_dir" "$log_dir"

if [[ -z "$port" ]]; then
  port="${OPENTALKING_TTS_LOCAL_COSYVOICE_PORT:-}"
fi
if [[ -z "$port" && -n "${OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL:-}" ]]; then
  port="$(
    python3 - <<'PY'
import os
from urllib.parse import urlparse

url = os.environ.get("OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL", "")
parsed = urlparse(url)
print(parsed.port or "")
PY
  )"
fi
port="${port:-19090}"

resolve_cosyvoice_python() {
  if [[ -n "${OPENTALKING_COSYVOICE_PYTHON:-}" ]]; then
    if [[ -x "$OPENTALKING_COSYVOICE_PYTHON" ]]; then
      printf '%s\n' "$OPENTALKING_COSYVOICE_PYTHON"
      return 0
    fi
    echo "OPENTALKING_COSYVOICE_PYTHON is not executable: $OPENTALKING_COSYVOICE_PYTHON" >&2
    return 1
  fi

  local candidate_dir=""
  for candidate_dir in \
    "${OPENTALKING_COSYVOICE_VENV_DIR:-}" \
    "$repo_root/.venv-cosyvoice" \
    "$DIGITAL_HUMAN_HOME/.venv-cosyvoice" \
    "/root/cosyvoice/.venv"
  do
    [[ -n "$candidate_dir" ]] || continue
    if [[ -x "$candidate_dir/bin/python" ]]; then
      printf '%s\n' "$candidate_dir/bin/python"
      return 0
    fi
  done

  echo "Missing CosyVoice sidecar venv." >&2
  echo "Create it first: OPENTALKING_COSYVOICE_VENV_DIR=\"$repo_root/.venv-cosyvoice\" bash scripts/prepare_cosyvoice_venv.sh" >&2
  return 1
}

cosy_python="$(resolve_cosyvoice_python)"
case "$cosy_python" in
  "$repo_root/.venv/"*)
    echo "Refusing to start local CosyVoice from the OpenTalking main venv: $cosy_python" >&2
    echo "Use OPENTALKING_COSYVOICE_VENV_DIR or OPENTALKING_COSYVOICE_PYTHON for the sidecar venv." >&2
    exit 1
    ;;
esac

cosy_site_packages="$("$cosy_python" - <<'PY'
import sysconfig

print(sysconfig.get_paths().get("purelib", ""))
PY
)"
cosy_trt_lib_dir="$cosy_site_packages/tensorrt_libs"
if [[ -d "$cosy_trt_lib_dir" ]]; then
  export LD_LIBRARY_PATH="$cosy_trt_lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi

pid_file="$run_dir/local-cosyvoice-$port.pid"
log_file="$log_dir/local-cosyvoice-$port.log"

if [[ -f "$pid_file" ]]; then
  old_pid="$(cat "$pid_file" 2>/dev/null || true)"
  if [[ -n "$old_pid" ]] && kill -0 "$old_pid" >/dev/null 2>&1; then
    if curl --max-time 2 -fsS "http://127.0.0.1:$port/health" >/dev/null 2>&1; then
      echo "Local CosyVoice is already running: pid=$old_pid port=$port"
      echo "Log: $log_file"
      exit 0
    fi
    echo "Stale Local CosyVoice pid file: pid=$old_pid port=$port" >&2
  fi
  rm -f "$pid_file"
fi

if quickstart_port_in_use "$port"; then
  echo "Local CosyVoice port $port is already in use." >&2
  quickstart_describe_port "$port" >&2 || true
  exit 1
fi

echo "Starting Local CosyVoice"
echo " repo: $repo_root"
echo " python: $cosy_python"
echo " host: $host"
echo " port: $port"
echo " log: $log_file"
if [[ -d "$cosy_trt_lib_dir" ]]; then
  echo " trt lib: $cosy_trt_lib_dir"
fi

(
  cd "$repo_root"
  export PYTHONPATH="$repo_root${PYTHONPATH:+:$PYTHONPATH}"
  export OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD="${OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD:-1}"
  if declare -F quickstart_detach >/dev/null 2>&1; then
    quickstart_detach "$log_file" "$cosy_python" scripts/local_cosyvoice_service.py --host "$host" --port "$port" >"$pid_file"
  else
    setsid "$cosy_python" scripts/local_cosyvoice_service.py --host "$host" --port "$port" >"$log_file" 2>&1 < /dev/null &
    echo "$!" >"$pid_file"
  fi
)

pid="$(cat "$pid_file" 2>/dev/null || true)"
if [[ -z "$pid" ]]; then
  echo "Failed to capture Local CosyVoice pid." >&2
  exit 1
fi

for _ in {1..120}; do
  if ! kill -0 "$pid" >/dev/null 2>&1; then
    echo "Local CosyVoice exited during startup. Last log lines:" >&2
    tail -80 "$log_file" >&2 || true
    rm -f "$pid_file"
    exit 1
  fi
  if curl --max-time 2 -fsS "http://127.0.0.1:$port/health" >/dev/null 2>&1; then
    echo "Local CosyVoice is up: http://127.0.0.1:$port"
    exit 0
  fi
  sleep 1
done

echo "Local CosyVoice did not become ready in 120s. Last log lines:" >&2
tail -80 "$log_file" >&2 || true
exit 1
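The starter script above resolves the sidecar port in a fixed order: the `--port` argument, then `OPENTALKING_TTS_LOCAL_COSYVOICE_PORT`, then the port component of `OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL`, and finally `19090`. The sketch below restates that precedence in Python. `resolve_port` is a hypothetical helper; unlike the shell script, which keeps the value as a string, it converts the result to `int`, which is an added assumption.

```python
from urllib.parse import urlparse

PORT_VAR = "OPENTALKING_TTS_LOCAL_COSYVOICE_PORT"
URL_VAR = "OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL"


def resolve_port(cli_port="", env=None, default=19090):
    """Pick the sidecar port: CLI value, then env port, then URL port, then default."""
    env = env or {}
    if cli_port:
        return int(cli_port)
    env_port = env.get(PORT_VAR, "")
    if env_port:
        return int(env_port)
    url = env.get(URL_VAR, "")
    if url:
        # urlparse(...).port is None when the URL carries no explicit port.
        parsed_port = urlparse(url).port
        if parsed_port:
            return parsed_port
    return default
```

A service URL without an explicit port, such as `http://localhost`, falls through to the default, which matches the shell script printing an empty string in that case.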