Avatar Format

An avatar bundle defines the visual identity of a digital human together with the metadata required to align mouth motion with audio. OpenTalking reads avatar bundles when a session is created and treats them as shared visual assets for the active talking-head model; model-specific caches, templates, or preprocessing artifacts are created by the corresponding deployment flow.

This page documents the directory layout, the manifest.json schema, the scripts that generate avatar bundles, and the validation endpoints.

Directory layout

Each avatar bundle is a single subdirectory containing a manifest.json file:

examples/avatars/
├── demo-avatar/
│   ├── manifest.json
│   └── preview.png
├── singer-wav2lip/
│   ├── manifest.json
│   ├── preview.png
│   └── frames/
│       ├── frame_00000.png
│       ├── frame_00001.png
│       └── ...
└── singer-musetalk/
    ├── manifest.json
    ├── preview.png
    └── full_frames/
        └── ...

Common layout conventions:

| Content | Required | Description |
| --- | --- | --- |
| manifest.json | Yes | Basic avatar information and optional metadata. |
| preview.png | Recommended | Preview image for the WebUI avatar library. |
| frames/ | Optional | Ordered image sequence, commonly used by Wav2Lip-style reference-frame flows. |
| full_frames/ | Optional | Video frame sequence, commonly used by MuseTalk preprocessing. |
| prepared/ | Optional | Preprocessing artifacts generated by models such as MuseTalk. |
| Template video | Optional | Derived or external asset that models such as QuickTalk may use at runtime. |

A preview.png file is recommended; the frontend uses it to populate the avatar picker.
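
As a rough illustration of these conventions, the sketch below walks an avatar root and reports which of the optional entries each bundle contains. The helper names `find_bundles` and `describe_bundle` are hypothetical and are not part of OpenTalking; the template video is left out because the table does not give it a fixed file name.

```python
from pathlib import Path

# Optional entries with fixed names, taken from the layout table.
OPTIONAL_ENTRIES = ("preview.png", "frames", "full_frames", "prepared")


def find_bundles(root: Path) -> list[Path]:
    """Return the subdirectories of root that contain a manifest.json, sorted by path."""
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / "manifest.json").is_file()
    )


def describe_bundle(bundle_dir: Path) -> dict:
    """Report whether the manifest exists and which optional entries are present."""
    return {
        "has_manifest": (bundle_dir / "manifest.json").is_file(),
        "optional_present": [
            name for name in OPTIONAL_ENTRIES if (bundle_dir / name).exists()
        ],
    }


if __name__ == "__main__":
    root = Path("examples/avatars")
    if root.is_dir():
        for bundle in find_bundles(root):
            print(bundle.name, describe_bundle(bundle))
```

A bundle without preview.png still counts as a bundle here, which matches the table: only manifest.json is required.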

manifest.json schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Globally unique identifier referenced by the client. |
| name | string | No | Display name. Defaults to id. |
| model_type | string | No | Legacy manifest type field; do not rely on it to bind an avatar to a model. |
| fps | number | Yes | Target output frame rate. Typical value: 25. |
| sample_rate | number | Yes | Audio sample rate aligned with the TTS output. Typical value: 16000. |
| width | number | Yes | Output video width in pixels. |
| height | number | Yes | Output video height in pixels. |
| version | string | No | Asset version string. |
| metadata | object | No | Arbitrary additional fields for upload provenance, derivatives, or runtime metadata. |
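
The schema can be checked with a few lines of plain Python. The following is a minimal sketch, not the validator that ships with OpenTalking: `check_manifest` and `display_name` are hypothetical helpers, and the type rules (numbers as `int` or `float`, booleans rejected) are assumptions drawn from the table.

```python
# Required fields and their expected Python types, taken from the schema table.
REQUIRED_FIELDS = {
    "id": str,
    "fps": (int, float),
    "sample_rate": (int, float),
    "width": (int, float),
    "height": (int, float),
}


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the required fields look valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in manifest:
            problems.append(f"missing required field: {field}")
            continue
        value = manifest[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for field: {field}")
    if "metadata" in manifest and not isinstance(manifest["metadata"], dict):
        problems.append("wrong type for field: metadata")
    return problems


def display_name(manifest: dict) -> str:
    """Apply the documented default: name falls back to id."""
    return manifest.get("name") or manifest["id"]
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a manifest at once.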

Mouth metadata

When an avatar includes mouth localization data, store it under metadata.animation. The example below shows the contents of the metadata object:

{
  "source_image_hash": "<sha256>",
  "animation": {
    "mouth_center": [0.5, 0.56],
    "mouth_rx": 0.06,
    "mouth_ry": 0.02,
    "outer_lip": [[0.45, 0.55], [0.5, 0.53], [0.55, 0.55]],
    "inner_mouth": [[0.47, 0.55], [0.53, 0.55], [0.5, 0.57]]
  }
}

Coordinates are normalized to the image dimensions. When a single-image avatar is uploaded through /avatars/custom, OpenTalking attempts mouth detection locally using MediaPipe. If detection fails, the upload still succeeds, but the metadata has no animation field; model backends then fall back to their own built-in alignment when possible.

The wav2lip_postprocess_mode flag controls the server-side post-processing mode. The local Wav2Lip backend in OpenTalking defaults to easy_improved; easy_enhanced is also accepted by the backend and API, but it requires the GFPGAN dependencies and checkpoint assets.
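
To make the normalized mouth coordinates concrete, the sketch below maps them to pixel positions for a given frame size. It assumes that x values (and mouth_rx) scale with the width and y values (and mouth_ry) scale with the height; the page does not spell this out, and the helper names are hypothetical.

```python
def to_pixels(point, width, height):
    """Map a normalized [x, y] pair to integer pixel coordinates."""
    x, y = point
    return (round(x * width), round(y * height))


def mouth_geometry_pixels(animation, width, height):
    """Convert a metadata.animation object to pixel units.

    Assumption: mouth_rx is a fraction of the width and mouth_ry a
    fraction of the height.
    """
    return {
        "center": to_pixels(animation["mouth_center"], width, height),
        "radii": (
            round(animation["mouth_rx"] * width),
            round(animation["mouth_ry"] * height),
        ),
        "outer_lip": [to_pixels(p, width, height) for p in animation["outer_lip"]],
        "inner_mouth": [to_pixels(p, width, height) for p in animation["inner_mouth"]],
    }


if __name__ == "__main__":
    animation = {
        "mouth_center": [0.5, 0.56],
        "mouth_rx": 0.06,
        "mouth_ry": 0.02,
        "outer_lip": [[0.45, 0.55], [0.5, 0.53], [0.55, 0.55]],
        "inner_mouth": [[0.47, 0.55], [0.53, 0.55], [0.5, 0.57]],
    }
    print(mouth_geometry_pixels(animation, 512, 512))
```

Because the stored values are resolution-independent, the same metadata can be reused if the frames are rescaled.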

Generic manifest example

{
  "id": "demo-avatar",
  "name": "Demo Avatar",
  "fps": 25,
  "sample_rate": 16000,
  "width": 512,
  "height": 512,
  "version": "1.0",
  "metadata": {}
}

Avatar bundle preparation

From a video file

terminal
python scripts/prepare_wav2lip_video_asset.py \
    --source /path/to/source.mp4 \
    --output-dir examples/avatars/my-avatar \
    --avatar-id my-avatar \
    --name "My Avatar" \
    --fps 25

The script performs the following steps:

  1. Extracts frames using ffmpeg and writes them to examples/avatars/my-avatar/frames/.
  2. Runs MediaPipe mouth detection and records the results in metadata.animation.
  3. Generates manifest.json and preview.png.
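
The first step can be approximated outside the script. The sketch below builds an ffmpeg command that resamples the video to the target frame rate and writes zero-based, five-digit frame names matching the layout shown earlier. It is an assumption about how such an extraction could be done, not a copy of what prepare_wav2lip_video_asset.py runs; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_extract_command(source: str, output_dir: str, fps: int = 25) -> list[str]:
    """Build an ffmpeg argument list that writes frames/frame_00000.png onward."""
    frames_dir = Path(output_dir) / "frames"
    return [
        "ffmpeg",
        "-i", source,
        "-vf", f"fps={fps}",       # resample to the target frame rate
        "-start_number", "0",      # image sequence numbering starts at 0
        str(frames_dir / "frame_%05d.png"),
    ]


def extract_frames(source: str, output_dir: str, fps: int = 25) -> None:
    """Create the frames directory and run ffmpeg; raises if ffmpeg fails."""
    (Path(output_dir) / "frames").mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_command(source, output_dir, fps), check=True)
```

Keeping command construction separate from execution makes the arguments easy to inspect or log before ffmpeg is invoked.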

From a single image

terminal
python scripts/prepare_wav2lip_image_asset.py \
    --source /path/to/face.jpg \
    --output-dir examples/avatars/my-avatar-static \
    --avatar-id my-avatar-static

This produces a single frame at frames/frame_00000.png along with a complete manifest.

Interactive preparation

terminal
bash scripts/prepare-avatar.sh

The script prompts for the source file, model type, and avatar identifier.

Validation

REST endpoint

terminal
curl -s http://127.0.0.1:8000/avatars | jq
# [
#   {"id":"demo-avatar","name":"Demo","model_type":"mock","width":512,...},
#   ...
# ]
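
The same check can be scripted. This is a minimal sketch using only the Python standard library; it assumes the server is reachable at the address used in the curl example and that the response is a JSON array of objects with at least an id field, as the sample output suggests. `fetch_avatars` and `summarize` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # address used in the curl example


def fetch_avatars(base_url: str = BASE_URL) -> list[dict]:
    """GET /avatars and decode the JSON array returned by the server."""
    with urlopen(f"{base_url}/avatars", timeout=5) as resp:
        return json.load(resp)


def summarize(avatars: list[dict]) -> list[str]:
    """Format one line per avatar; name falls back to id when absent."""
    return [f'{a["id"]}: {a.get("name", a["id"])}' for a in avatars]


if __name__ == "__main__":
    for line in summarize(fetch_avatars()):
        print(line)
```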

Programmatic validation

from opentalking.avatar.validator import list_avatar_dirs
from opentalking.avatar.loader import load_avatar_bundle

for path in list_avatar_dirs("./examples/avatars"):
    bundle = load_avatar_bundle(path, strict=True)
    print(bundle.manifest.id, bundle.manifest.model_type)

With strict=True, load_avatar_bundle raises an exception when required fields or subdirectories are missing; this mode is appropriate for continuous integration.

Custom upload

The frontend's avatar creation flow invokes POST /avatars/custom with a multipart request body:

| Field | Description |
| --- | --- |
| name | Display name. |
| base_avatar_id | Identifier of the avatar whose manifest serves as the template. |
| image | Portrait image uploaded by the user. |

The server copies the base avatar's manifest, overrides id and name, sets metadata.custom_avatar=true, writes the uploaded image to frames/frame_00000.png, and runs mouth detection.

Only avatars marked custom_avatar=true may be removed through DELETE /avatars/{avatar_id}.
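
The manifest handling described above can be sketched with plain dictionaries. This illustrates the documented behaviour rather than the code in apps/api/routes/avatars.py: the function names are hypothetical, and how the server chooses the new id is not stated on this page, so it is passed in as a parameter.

```python
import copy


def derive_custom_manifest(base: dict, new_id: str, name: str) -> dict:
    """Copy the base manifest, override id and name, and mark it as custom."""
    manifest = copy.deepcopy(base)  # leave the base avatar's manifest untouched
    manifest["id"] = new_id
    manifest["name"] = name
    metadata = dict(manifest.get("metadata") or {})
    metadata["custom_avatar"] = True
    manifest["metadata"] = metadata
    return manifest


def is_deletable(manifest: dict) -> bool:
    """Mirror the documented rule: only custom avatars may be removed."""
    metadata = manifest.get("metadata") or {}
    return metadata.get("custom_avatar") is True
```

The deep copy matters because the base manifest's metadata object would otherwise be shared with the derived manifest and marked as custom too.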

Source files

| File | Responsibility |
| --- | --- |
| opentalking/avatar/loader.py | Manifest parsing and bundle loading. |
| opentalking/avatar/validator.py | Directory traversal and strict validation. |
| opentalking/avatar/mouth_metadata.py | MediaPipe mouth detection. |
| apps/api/routes/avatars.py | REST endpoint implementations. |