Avatar Format
An avatar bundle defines the visual identity of a digital human together with the metadata required to align mouth motion with audio. OpenTalking reads avatar bundles when a session is created and treats them as shared visual assets for the active talking-head model; model-specific caches, templates, or preprocessing artifacts are created by the corresponding deployment flow.
This page documents the directory layout, the manifest.json schema, the scripts that
generate avatar bundles, and the validation endpoints.
Directory layout
Each avatar bundle is a single subdirectory containing a manifest.json file:
examples/avatars/
├── demo-avatar/
│   ├── manifest.json
│   └── preview.png
├── singer-wav2lip/
│   ├── manifest.json
│   ├── preview.png
│   └── frames/
│       ├── frame_00000.png
│       ├── frame_00001.png
│       └── ...
└── singer-musetalk/
    ├── manifest.json
    ├── preview.png
    └── full_frames/
        └── ...
Common layout conventions:
| Content | Required | Description |
|---|---|---|
| manifest.json | Yes | Basic avatar information and optional metadata. |
| preview.png | Recommended | Preview image for the WebUI avatar library. |
| frames/ | Optional | Ordered image sequence, commonly used by Wav2Lip-style reference-frame flows. |
| full_frames/ | Optional | Video frame sequence, commonly used by MuseTalk preprocessing. |
| prepared/ | Optional | Preprocessing artifacts generated by models such as MuseTalk. |
| Template video | Optional | Derived or external asset that models such as QuickTalk may use at runtime. |
A preview.png file is recommended; the frontend uses it to populate the avatar picker.
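The conventions above can be checked with a few lines of Python. The following is a minimal sketch, not part of OpenTalking: the function name `inspect_bundle` and the shape of its report are hypothetical, and only the entry names come from the table.

```python
from pathlib import Path

# Entry names are taken from the layout table; the helper itself is
# illustrative and is not an OpenTalking API.
REQUIRED = ("manifest.json",)
RECOMMENDED = ("preview.png",)
OPTIONAL_DIRS = ("frames", "full_frames", "prepared")


def inspect_bundle(bundle_dir):
    """Report which conventional entries exist in one avatar bundle directory."""
    root = Path(bundle_dir)
    return {
        "missing_required": [n for n in REQUIRED if not (root / n).is_file()],
        "missing_recommended": [n for n in RECOMMENDED if not (root / n).is_file()],
        "present_dirs": [n for n in OPTIONAL_DIRS if (root / n).is_dir()],
    }
```

A bundle with an empty `missing_required` list satisfies the only hard requirement in the table; the other two keys are informational.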
manifest.json schema
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Globally unique identifier referenced by the client. |
| name | string | No | Display name. Defaults to id. |
| model_type | string | No | Legacy manifest type field; do not rely on it to bind an avatar to a model. |
| fps | number | Yes | Target output frame rate. Typical value: 25. |
| sample_rate | number | Yes | Audio sample rate aligned with the TTS output. Typical value: 16000. |
| width | number | Yes | Output video width in pixels. |
| height | number | Yes | Output video height in pixels. |
| version | string | No | Asset version string. |
| metadata | object | No | Arbitrary additional fields for upload provenance, derivatives, or runtime metadata. |
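To make the required and optional fields concrete, here is a minimal sketch of a manifest check. It is a hypothetical helper (`normalize_manifest` is not the OpenTalking loader); the field names, types, and the `name` default follow the table, while treating a missing `metadata` as an empty object is an assumption.

```python
# Required fields and their JSON types, as listed in the schema table.
REQUIRED_FIELDS = {
    "id": str,
    "fps": (int, float),
    "sample_rate": (int, float),
    "width": (int, float),
    "height": (int, float),
}


def normalize_manifest(raw):
    """Return a copy with defaults applied, or raise ValueError listing problems."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in raw:
            problems.append(f"missing required field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(raw[field], bool) or not isinstance(raw[field], expected):
            problems.append(f"wrong type for field: {field}")
    if problems:
        raise ValueError("; ".join(problems))
    manifest = dict(raw)
    manifest.setdefault("name", manifest["id"])  # name defaults to id
    manifest.setdefault("metadata", {})          # assumption: absent metadata is empty
    return manifest
```

Collecting every problem before raising gives a single error message per manifest, which tends to be easier to act on than failing at the first missing field.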
Mouth metadata
When an avatar includes mouth localization data, store it under metadata.animation:
{
  "source_image_hash": "<sha256>",
  "animation": {
    "mouth_center": [0.5, 0.56],
    "mouth_rx": 0.06,
    "mouth_ry": 0.02,
    "outer_lip": [[0.45, 0.55], [0.5, 0.53], [0.55, 0.55]],
    "inner_mouth": [[0.47, 0.55], [0.53, 0.55], [0.5, 0.57]]
  }
}
Coordinates are normalized to the image dimensions. When a single-image avatar is
uploaded through /avatars/custom, OpenTalking attempts mouth detection using
MediaPipe locally. If detection fails, the upload succeeds without an animation
field; model backends fall back to their own built-in alignment when possible.
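Because the coordinates are normalized, a consumer has to scale them by the manifest's width and height before drawing or cropping. The sketch below shows one way to do that. It assumes that x values scale with width, y values scale with height, and that mouth_rx and mouth_ry follow the same rule; the helper names are hypothetical.

```python
def to_pixels(point, width, height):
    """Scale one normalized (x, y) pair to integer pixel coordinates."""
    x, y = point
    return (round(x * width), round(y * height))


def mouth_ellipse_pixels(animation, width, height):
    """Convert mouth_center, mouth_rx and mouth_ry to pixel units."""
    center = to_pixels(animation["mouth_center"], width, height)
    # Assumption: mouth_rx is a fraction of the width, mouth_ry of the height.
    return {
        "center": center,
        "rx": round(animation["mouth_rx"] * width),
        "ry": round(animation["mouth_ry"] * height),
    }
```

The same `to_pixels` helper can be mapped over the `outer_lip` and `inner_mouth` point lists.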
The wav2lip_postprocess_mode flag controls the server-side post-processing mode.
The local Wav2Lip backend in OpenTalking defaults to easy_improved; easy_enhanced is
also accepted by the backend and API, but it requires GFPGAN dependencies and checkpoint assets.
Generic manifest example
{
  "id": "demo-avatar",
  "name": "Demo Avatar",
  "fps": 25,
  "sample_rate": 16000,
  "width": 512,
  "height": 512,
  "version": "1.0",
  "metadata": {}
}
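The fps and sample_rate values together determine how much audio corresponds to one video frame. The arithmetic is sketched below; how a particular model backend actually chunks audio is not described on this page, so the helper names are hypothetical and the code shows only the relationship between the two fields.

```python
def samples_per_frame(sample_rate, fps):
    """Number of audio samples that span one video frame."""
    return sample_rate / fps


def frame_index_for_sample(sample_index, sample_rate, fps):
    """Index of the video frame that contains a given audio sample."""
    return int(sample_index * fps // sample_rate)
```

With the values from the example manifest, 16000 / 25 gives 640 samples per frame, so samples 0 through 639 fall in frame 0 and sample 640 starts frame 1.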
Avatar bundle preparation
From a video file
python scripts/prepare_wav2lip_video_asset.py \
--source /path/to/source.mp4 \
--output-dir examples/avatars/my-avatar \
--avatar-id my-avatar \
--name "My Avatar" \
--fps 25
The script performs the following steps:
- Extracts frames using ffmpeg and writes them to examples/avatars/my-avatar/frames/.
- Runs MediaPipe mouth detection and records the results in metadata.animation.
- Generates manifest.json and preview.png.
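The frame extraction step can be approximated with a plain ffmpeg invocation. The sketch below only builds the argument list and is not taken from prepare_wav2lip_video_asset.py; the function name is hypothetical, and the exact filters the script uses are unknown. The `-start_number 0` option is included because ffmpeg numbers image sequences from 1 by default, while the bundle layout starts at frame_00000.png.

```python
from pathlib import Path


def build_ffmpeg_extract_command(source, output_dir, fps=25):
    """Build an ffmpeg argument list that writes frames/frame_00000.png onward."""
    frames_dir = Path(output_dir) / "frames"
    return [
        "ffmpeg",
        "-i", str(source),
        "-vf", f"fps={fps}",      # resample to the target frame rate
        "-start_number", "0",     # begin numbering at 0 instead of ffmpeg's default 1
        str(frames_dir / "frame_%05d.png"),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` once the frames directory exists; the helper does not create it.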
From a single image
python scripts/prepare_wav2lip_image_asset.py \
--source /path/to/face.jpg \
--output-dir examples/avatars/my-avatar-static \
--avatar-id my-avatar-static
This produces a single frame at frames/frame_00000.png along with a complete manifest.
Interactive preparation
The script prompts for the source file, model type, and avatar identifier.
Validation
REST endpoint
curl -s http://127.0.0.1:8000/avatars | jq
# [
# {"id":"demo-avatar","name":"Demo","model_type":"mock","width":512,...},
# ...
# ]
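The same check can be scripted in Python with the standard library. This is a minimal sketch that assumes the response is a JSON array of objects with an id field, as in the sample output above; `fetch_avatars` and `index_by_id` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_avatars(base_url="http://127.0.0.1:8000"):
    """GET /avatars and return the decoded JSON list."""
    with urlopen(f"{base_url}/avatars", timeout=10) as response:
        return json.load(response)


def index_by_id(avatars):
    """Map each avatar id to its entry, rejecting duplicate ids."""
    index = {}
    for entry in avatars:
        if entry["id"] in index:
            raise ValueError(f"duplicate avatar id: {entry['id']}")
        index[entry["id"]] = entry
    return index
```

Calling `index_by_id(fetch_avatars())` against a running server gives a quick confirmation that a newly prepared bundle is listed and that its id does not collide with an existing one.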
Programmatic validation
from opentalking.avatar.validator import list_avatar_dirs
from opentalking.avatar.loader import load_avatar_bundle

# Load every bundle under examples/avatars in strict mode and print its identity.
for path in list_avatar_dirs("./examples/avatars"):
    bundle = load_avatar_bundle(path, strict=True)
    print(bundle.manifest.id, bundle.manifest.model_type)
The strict=True parameter raises an exception when required fields or subdirectories
are missing; this mode is appropriate for continuous integration.
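For a CI job it is often more useful to report every broken bundle than to stop at the first one. The sketch below wraps an arbitrary loader for that purpose. The exception type raised in strict mode is not documented on this page, so the code catches `Exception` broadly; `validate_all` is a hypothetical helper.

```python
def validate_all(paths, load):
    """Call load(path) for every path and collect failures instead of stopping early."""
    failures = {}
    for path in paths:
        try:
            load(path)
        except Exception as exc:  # concrete exception type is an unknown here
            failures[str(path)] = str(exc)
    return failures
```

In a pipeline, `load` would be something like `lambda p: load_avatar_bundle(p, strict=True)`, and the job would exit with a non-zero status when the returned dictionary is not empty.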
Custom upload
The frontend's avatar creation flow invokes POST /avatars/custom with a multipart
request body:
| Field | Description |
|---|---|
| name | Display name. |
| base_avatar_id | Identifier of the avatar whose manifest serves as the template. |
| image | Portrait image uploaded by the user. |
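A client outside the frontend could assemble the same request with the standard library. The following sketch builds the multipart body by hand; the three form field names come from the table, while the helper names, the default filename, and the image/jpeg content type are assumptions.

```python
import uuid
from urllib.request import Request


def build_multipart(fields, file_field, filename, file_bytes, content_type):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    body = bytearray()
    for key, value in fields.items():
        body += f"--{boundary}\r\n".encode()
        body += f'Content-Disposition: form-data; name="{key}"\r\n\r\n'.encode()
        body += f"{value}\r\n".encode()
    body += f"--{boundary}\r\n".encode()
    body += f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'.encode()
    body += f"Content-Type: {content_type}\r\n\r\n".encode()
    body += file_bytes + b"\r\n"
    body += f"--{boundary}--\r\n".encode()  # closing boundary
    return bytes(body), f"multipart/form-data; boundary={boundary}"


def build_custom_avatar_request(base_url, name, base_avatar_id, image_bytes,
                                filename="face.jpg"):
    """Prepare (but do not send) a POST /avatars/custom request."""
    body, content_type = build_multipart(
        {"name": name, "base_avatar_id": base_avatar_id},
        "image", filename, image_bytes, "image/jpeg",
    )
    return Request(
        f"{base_url}/avatars/custom",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the upload; the response format is not described here, so the sketch stops before parsing it.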
The server copies the base avatar's manifest, overrides id and name, sets
metadata.custom_avatar=true, writes the uploaded image to frames/frame_00000.png,
and runs mouth detection.
Only avatars marked custom_avatar=true may be removed through
DELETE /avatars/{avatar_id}.
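The manifest handling described above amounts to a copy with three overrides, plus a flag check before deletion. The sketch below illustrates that logic only; it is not the server implementation, the function names are hypothetical, and how the server chooses the new id is not specified here, so the id is taken as a parameter.

```python
import copy


def derive_custom_manifest(base_manifest, new_id, name):
    """Copy a base manifest, override id and name, and mark it as custom."""
    manifest = copy.deepcopy(base_manifest)  # leave the base avatar untouched
    manifest["id"] = new_id
    manifest["name"] = name
    manifest.setdefault("metadata", {})["custom_avatar"] = True
    return manifest


def is_deletable(manifest):
    """Mirror the rule that only avatars marked custom_avatar=true may be removed."""
    return manifest.get("metadata", {}).get("custom_avatar") is True
```

The deep copy matters because metadata is a nested object: a shallow copy would set custom_avatar on the base avatar's metadata as well.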
Source files
| File | Responsibility |
|---|---|
| opentalking/avatar/loader.py | Manifest parsing and bundle loading. |
| opentalking/avatar/validator.py | Directory traversal and strict validation. |
| opentalking/avatar/mouth_metadata.py | MediaPipe mouth detection. |
| apps/api/routes/avatars.py | REST endpoint implementations. |