---
title: "AI Lip-Sync Tools for Filmmaking — Comparison Research"
project: ai-filmmaking-workshop
type: research-doc
created: 2026-04-29
status: draft
version: 1.0
---

# AI Lip-Sync Tools for Filmmaking — Comparison Research

**Purpose:** A decision-making guide for selecting lip-sync technology in an AI filmmaking pipeline. Written for a team workshop context. Prioritizes free and open-source options, with clear guidance on when to pay.

---

## Executive Summary

Lip sync is the **hardest single problem** in AI filmmaking today. No tool solves it perfectly across all shot types. The professional approach is hybrid: use different tools for different shot categories, and structure your edit so only 20–30% of runtime actually shows lips forming words.

**If you can only choose one tool for your workshop:** Use **Sync.so** (free tier). It requires zero setup, runs in the cloud, and produces the best quality for standard dialogue shots. For teams with GPU access, **MuseTalk** is the best free self-hosted alternative.

---

## The Tools at a Glance

| # | Tool | Type | Free? | GPU Needed? | Best For |
|---|------|------|-------|-------------|----------|
| 1 | **Sync.so** | Cloud API | Free tier | No | Production quality, instant start |
| 2 | **MuseTalk** | Open-source | Yes | Yes (6GB+) | Self-hosted, zero-cost-at-scale |
| 3 | **Wav2Lip (OS)** | Open-source | Yes | Yes (4GB+) | Academic reference, learning fundamentals |
| 4 | **Runway Act-One** | Cloud SaaS | Trial only | No | Emotional performance transfer |
| 5 | **HeyGen** | Cloud SaaS | Free tier | No | Talking head / corporate avatar |

---

## 1. Sync.so (Sync Labs) ★★★★★ — The Production Standard

### Overview
Sync.so is the commercial API from **Synchronicity Labs**, the original creators of Wav2Lip. It represents 5+ years of research iteration beyond the open-source Wav2Lip model. The current model, **Lipsync-2**, is zero-shot — upload any video + audio, receive a lip-synced output. No training, no fine-tuning, no GPU.

### Technical Architecture
- **Base:** proprietary Lipsync-2 model (zero-shot; the successor to the team's Wav2Lip research line)
- **Audio encoding:** Whisper embeddings for multi-language phonetic understanding
- **Face encoding:** VAE latent space encoding for high-fidelity texture preservation
- **Face blending:** BiSeNet-based face parsing for seamless mouth integration
- **Infrastructure:** GPU-accelerated cloud (serverless scaling)

### Quality
Studio-grade for standard dialogue. Handles:
- ✓ Frontal and ¾-angle faces
- ✓ Multiple languages (phonetic, not language-specific)
- ✓ Different lighting conditions
- ✗ Extreme profiles (side-view)
- ✗ Very fast motion + dialogue simultaneously

### Pricing (as of April 2026)
| Plan | Cost | Credits | Best For |
|------|------|---------|----------|
| Free | $0/mo | Limited (trial) | Workshop demo, evaluation |
| Hobbyist | $5/mo | ~2 min video | Personal projects |
| Creator | $19/mo | ~10 min video | Independent creators |
| Growth | $49/mo | ~30 min video | Small studios |
| Scale | $249/mo | ~3 hrs video | Production companies |
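
To compare plans against self-hosting, it helps to convert them into an effective rate. A quick sanity check, using the approximate minute allowances from the table above:

```python
# Effective $/minute for each paid plan (April 2026 numbers from the table).
plans = {"Hobbyist": (5, 2), "Creator": (19, 10),
         "Growth": (49, 30), "Scale": (249, 180)}  # ($/month, ~minutes)
for name, (dollars, minutes) in plans.items():
    print(f"{name}: ~${dollars / minutes:.2f}/min")
# Hobbyist ~$2.50, Creator ~$1.90, Growth ~$1.63, Scale ~$1.38 per minute.
```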

### API Access
```python
# Python SDK -- install first with: pip install syncsdk
from sync import Sync
from sync.common import Audio, GenerationOptions, Video

client = Sync(api_key="YOUR_KEY").generations

# Inputs are passed as URLs, so source files need to be hosted somewhere reachable.
response = client.create(
    input=[Video(url="https://example.com/video.mp4"),
           Audio(url="https://example.com/audio.wav")],
    model="lipsync-2",
    options=GenerationOptions(sync_mode="cut_off"),
)
print(response)  # the call returns a generation job, not the finished video
```

Also available: TypeScript SDK, REST API, Web Studio (drag-and-drop).
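
Since `create` returns a generation job rather than the finished file, you poll for completion. A minimal sketch, continuing from the snippet above; `client.get`, `status`, and `output_url` are assumed names for illustration, not confirmed SDK surface, so verify them against docs.sync.so:

```python
# Hypothetical polling loop. `client.get`, `.status`, and `.output_url` are
# assumptions for illustration -- check docs.sync.so for the real names.
import time

job = client.create(
    input=[Video(url="https://example.com/video.mp4"),
           Audio(url="https://example.com/audio.wav")],
    model="lipsync-2",
)
while job.status not in ("COMPLETED", "FAILED"):  # assumed status values
    time.sleep(10)                                # renders take ~2-5 minutes
    job = client.get(job.id)                      # assumed re-fetch method
print(job.output_url if job.status == "COMPLETED" else "generation failed")
```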

### When to Use
- **Standard dialogue shots** where the character faces the camera
- **Quick iteration** — upload, get result in 2–5 minutes
- **Team workshops** — everyone can use it simultaneously, no GPU queue
- **Production deliverables** — client work, commercial output

### When NOT to Use
- High-volume batch processing (costs add up)
- Offline/air-gapped environments
- Side-profile or extreme-angle dialogue
- Full-body shots where face resolution is too low

### Workshop Fit: ★★★★★
Free tier is sufficient for demonstration. No installation. Immediate results. The best "first tool" for a workshop.

---

## 2. MuseTalk (Tencent Lyra Lab) ★★★★☆ — The Open-Source Champion

### Overview
Developed by **Lyra Lab at Tencent Music Entertainment**. Fully open-source: inference code, training code, and model weights are all public. Designed for real-time video dubbing — achieves 30fps+ inference speed on a single V100 GPU.

### Technical Architecture
- **Generative model:** UNet borrowed from the Stable Diffusion v1.4 architecture (but NOT a diffusion model — single-step latent inpainting; see the sketch after this list)
- **Image encoding:** Frozen VAE (ft-mse-vae), operating in latent space
- **Audio encoding:** Frozen Whisper-tiny model, cross-attention fusion
- **Face region:** 256×256 pixels centered on detected face
- **Training:** Two-stage strategy with spatio-temporal data sampling
- **Losses (v1.5):** Perceptual loss + GAN loss + sync loss
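
To make the single-step point concrete, here is a toy sketch of the per-frame data flow. Only the shapes come from the list above (a 4×32×32 VAE latent for the 256×256 crop, 384-dim Whisper-tiny features); the modules are stand-ins, not MuseTalk's real code:

```python
# Toy data flow for MuseTalk-style latent inpainting: ONE UNet pass per
# frame, no iterative denoising. Modules are stand-ins, shapes are real.
import torch

def mask_lower_half(latent: torch.Tensor) -> torch.Tensor:
    """Zero the mouth half of the latent; the UNet re-synthesizes it."""
    masked = latent.clone()
    masked[..., latent.shape[-2] // 2:, :] = 0
    return masked

latent = torch.randn(1, 4, 32, 32)   # frozen VAE latent of a 256x256 face crop
audio = torch.randn(1, 50, 384)      # frozen Whisper-tiny features (d_model=384)

unet = lambda z, context: z          # stand-in for the audio-conditioned UNet
out_latent = unet(mask_lower_half(latent), context=audio)  # single forward pass
# out_latent then goes through the frozen VAE decoder back to pixels.
```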

### Performance
| Metric | Value |
|--------|-------|
| Inference speed | 30fps+ (NVIDIA Tesla V100) |
| Face resolution | 256×256 |
| VRAM requirement | 6GB+ (RTX 3060, RTX 4060, A4000) |
| Languages | Chinese, English, Japanese (tested) |
| Real-time capable | Yes (with streaming pipeline) |

### Quality
Very good — best among fully open-source options. v1.5 (March 2025) significantly improved:
- **Clarity:** GAN loss sharpens facial details that L1 loss alone blurs
- **Identity preservation:** Perceptual loss maintains the actor's facial identity
- **Sync accuracy:** Dedicated sync loss improves lip-audio alignment

Limitations:
- Face-only (256×256 region) — neck/body not animated
- Requires well-lit, frontal faces for best results
- GPU-dependent; slower on CPU

### Setup
```bash
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk
pip install -r requirements.txt
# Download pretrained weights from HuggingFace:
# https://huggingface.co/TMElyralab/MuseTalk
# Video/audio pairs are listed in a YAML config rather than passed as flags:
python -m scripts.inference --inference_config configs/inference/test.yaml
```
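
Because inputs are listed in a config file, batching is a small scripting job. The hedged sketch below writes one config per clip and shells out to the inference script; the `task_0` / `video_path` / `audio_path` schema mirrors the repo's example config, but double-check field names against the version you clone:

```python
# Batch wrapper sketch: one generated YAML config per clip, then one
# inference run per config. Schema mirrors configs/inference/test.yaml --
# verify it against the MuseTalk version you actually clone.
import subprocess
from pathlib import Path

import yaml  # pip install pyyaml

clips = [("shots/s01.mp4", "audio/s01.wav"),
         ("shots/s02.mp4", "audio/s02.wav")]

for i, (video, audio) in enumerate(clips):
    cfg = Path(f"configs/inference/batch_{i}.yaml")
    cfg.write_text(yaml.safe_dump(
        {f"task_{i}": {"video_path": video, "audio_path": audio}}))
    subprocess.run(["python", "-m", "scripts.inference",
                    "--inference_config", str(cfg)], check=True)
```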

### When to Use
- **Self-hosted pipelines** — process unlimited videos at zero per-unit cost
- **High-volume processing** — dubbing 100+ short clips
- **Privacy-sensitive work** — no data leaves your infrastructure
- **Research and customization** — training code available for fine-tuning
- **Real-time applications** — live avatar dubbing

### When NOT to Use
- Quick one-off shots (Sync.so is faster)
- No GPU available
- Need animation beyond the face region (MuseTalk animates only a 256×256 face crop)
- Team members without technical setup skills

### Workshop Fit: ★★★☆☆
Requires GPU + 30-minute installation. Best as a **demo station** (one machine, projected), not per-person. Use Sync.so for hands-on; show MuseTalk as the free-at-scale alternative.

---

## 3. Wav2Lip (Open-Source Original) ★★★☆☆ — The Benchmark

### Overview
The original academic work (2020) from Prajwal et al. at IIIT Hyderabad that established the GAN-based lip-sync paradigm. **Critically: the open-source model is far below the quality of the commercial Sync.so models later built by the same team.** The research has moved on.

### Technical Architecture
- GAN-based: generator produces synthetic lower-face, discriminator judges lip-sync quality
- Pretrained SyncNet expert discriminator for audio-visual coherence (see the toy sketch below)
- Face detection preprocessing (S3FD or similar)
- Operates in pixel space (not latent), at 96×96 in the original release; community forks extend this to 288×288
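
For teaching purposes, the expert-discriminator objective fits in a few lines. This is a toy restatement with random tensors standing in for the two SyncNet branches, not the repository's training code:

```python
# Toy restatement of the SyncNet objective: cosine similarity between
# audio and mouth-window embeddings, trained with binary cross-entropy.
import torch
import torch.nn.functional as F

video_emb = torch.rand(8, 512)  # stand-in: embeddings of 5-frame mouth windows
audio_emb = torch.rand(8, 512)  # stand-in: embeddings of matching mel windows
labels = torch.ones(8, 1)       # 1 = in-sync pair, 0 = deliberately shifted pair

# Non-negative embeddings keep cosine similarity in [0, 1], so it can be
# read as P(in sync) and plugged straight into BCE.
prob = F.cosine_similarity(video_emb, audio_emb).unsqueeze(1)
sync_loss = F.binary_cross_entropy(prob, labels)
print(float(sync_loss))
```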

### Quality
Noticeably lower than modern alternatives:
- Visible seam between generated mouth and original face
- Struggles with non-frontal angles
- Color/lighting mismatch in the blending boundary
- Lower resolution output

### When to Use
- **Academic baseline** — cited in papers, well-understood behavior
- **Learning GAN-based video generation** — simplified architecture, good teaching tool
- **Historical context** — understanding how the field evolved

### When NOT to Use
- Any production work (use Sync.so or MuseTalk instead)
- Workshop hands-on (poor results are demotivating)
- Any project where visual quality matters

### Workshop Fit: ★★☆☆☆
Mention as historical reference + benchmark. Don't demo hands-on. Show a side-by-side comparison: Wav2Lip OS vs MuseTalk vs Sync.so — the quality gap tells the story of 5 years of progress.

---

## 4. Runway Act-One ★★★★★ — The Performance Tool

### Overview
Act-One is Runway's facial expression transfer system. Unlike standard lip-sync tools that only animate the mouth, Act-One **transfers a full facial performance** — eyes, brows, micro-expressions, head tilt — from a reference "driving" video to a target character.

### How It Works
1. Record a reference performance video (human actor delivering the line)
2. Provide a target character image or video
3. Act-One maps the performance to the target, preserving the emotional nuance

### Quality
Exceptional for character acting. The transferred expressions feel human because they come from a human performance. This is fundamentally different from audio-driven lip sync — it captures *how* a line is delivered, not just *that* it matches.

### Pricing
Part of Runway subscription (from $15/mo). Usage-based limits apply.

### When to Use
- **Emotional monologues** — where acting matters more than perfect sync
- **Character close-ups** — where micro-expressions tell the story
- **When you have a reference performance** — an actor or yourself performing the scene

### When NOT to Use
- Standard dialogue where expression is neutral
- Wide shots (face too small for expression transfer to read)
- Budget-constrained projects (expensive at scale)
- No reference performance available

### Workshop Fit: ★★★☆☆
Spectacular demo piece, but requires Runway subscription. Show a pre-made example. If budget allows, one person does a live demo.

---

## 5. HeyGen ★★★★☆ — The Talking Head Specialist

### Overview
HeyGen generates AI avatar videos: upload a photo or a short video clip, type or upload dialogue, and it produces a talking-head video with lip-synced speech. Voice cloning is built in.

### Quality
Very good within its domain: locked-off, frontal, talking-head shots. The limitation is the domain itself — it's an avatar, not a cinematic character.

### Pricing
- Free: 1 minute/month
- Creator: $24/month (~15 minutes)
- Business: $72/month (~45 minutes)

### When to Use
- **Direct-to-camera monologue** — spokesperson, narration, address
- **Corporate/promotional** — product explainer, company announcement
- **Quick avatar creation** — no video source material needed

### When NOT to Use
- Cinematic scenes with camera movement
- Multi-character dialogue
- Complex emotional performances
- Shots where the character is doing anything other than talking to camera

### Workshop Fit: ★★★★☆
Instant gratification, easy demo. Everyone can create a talking avatar in 2 minutes. Good for the "wow factor" segment. Limited for actual filmmaking.

---

## Comparative Summary

| Dimension | Sync.so | MuseTalk | Wav2Lip OS | Runway Act-One | HeyGen |
|-----------|---------|----------|------------|----------------|--------|
| Lip sync accuracy | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★★★ | ★★★★☆ |
| Visual quality | ★★★★★ | ★★★★☆ | ★★☆☆☆ | ★★★★★ | ★★★★☆ |
| Emotional expression | ★★★☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★★★ | ★★★☆☆ |
| Multi-angle support | ★★★★☆ | ★★★★☆ | ★★☆☆☆ | ★★★★☆ | ★☆☆☆☆ |
| Setup ease | ★★★★★ | ★★☆☆☆ | ★★☆☆☆ | ★★★★☆ | ★★★★★ |
| Free tier | ★★★☆☆ | ★★★★★ | ★★★★★ | ★☆☆☆☆ | ★★☆☆☆ |
| API / automation | ★★★★★ | ★★★☆☆ | ★★☆☆☆ | ★★☆☆☆ | ★★★★☆ |
| Offline capable | ✗ | ✓ | ✓ | ✗ | ✗ |

---

## Decision Matrix: Which Tool When?

```
Your shot type:

DIRECT-TO-CAMERA TALKING HEAD
└─ HeyGen (fastest) or Sync.so (higher quality)

CHARACTER CLOSE-UP, EMOTIONAL DELIVERY
├─ Runway Act-One (if you have a reference performance)
└─ Sync.so (if no reference performance)

STANDARD DIALOGUE, FRONTAL/¾ ANGLE
├─ Sync.so (best quality, no setup)
└─ MuseTalk (if zero per-unit cost needed)

SIDE PROFILE or WIDE SHOT
├─ None — lip sync won't be visible
└─ Use voiceover over B-roll

BATCH PROCESSING (50+ shots)
├─ MuseTalk (self-hosted, free at scale)
└─ Sync.so API (pay-per-second, cloud scale)

PRIVACY-SENSITIVE (no cloud)
└─ MuseTalk (run locally, air-gapped)
```
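
For a shot-list pass, the same tree is easy to encode; the category keys below are invented for this example:

```python
# The decision matrix above as a lookup table (keys are made up for this
# example; map them to whatever your shot list uses).
RECOMMENDATION = {
    "talking_head":    "HeyGen (fastest) or Sync.so (higher quality)",
    "emotional_cu":    "Runway Act-One with a reference performance, else Sync.so",
    "standard":        "Sync.so; MuseTalk if per-unit cost must be zero",
    "profile_or_wide": "Skip lip sync; use voiceover over B-roll",
    "batch":           "MuseTalk self-hosted, or Sync.so API at cloud scale",
    "private":         "MuseTalk (local, air-gapped)",
}

def pick_tool(shot_type: str) -> str:
    return RECOMMENDATION.get(shot_type, "Unknown shot type; see matrix above")

print(pick_tool("standard"))  # -> Sync.so; MuseTalk if per-unit cost must be zero
```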

---

## The Hybrid Strategy (Professional Approach)

No single tool handles every shot. A real production uses:

| Shot Type | Tool | % of Runtime |
|-----------|------|-------------|
| Close-up dialogue (emotional) | Runway Act-One | 10% |
| Standard dialogue | Sync.so | 15–20% |
| Voiceover over B-roll | No sync needed | 50–60% |
| Talking head / narration | HeyGen or Sync.so | 10–15% |
| Wide / action (no visible lips) | No sync needed | 10% |

**The insight:** Most AI filmmakers overspend on lip sync. A well-structured edit needs perfect sync on only 20–30% of its runtime. The rest is voiceover, reaction shots, cutaways, and wide shots where mouths aren't visible. Structure your edit accordingly.
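
To see what that means in minutes, here is the split applied to a hypothetical 10-minute short; the table gives ranges, so the shares below are one consistent reading that sums to 100%:

```python
# Minutes per category for a hypothetical 10-minute short, using one
# reading of the ranges above that sums to 100%.
runtime_min = 10.0
split = {
    "Runway Act-One (emotional close-ups)": 0.10,
    "Sync.so (standard dialogue)":          0.20,
    "HeyGen or Sync.so (talking head)":     0.10,
    "No sync (voiceover over B-roll)":      0.50,
    "No sync (wide / action)":              0.10,
}
for shot, share in split.items():
    print(f"{shot}: {runtime_min * share:.1f} min")
# 60% of the runtime needs no lip-sync tool at all, and the 2.0 minutes of
# Sync.so dialogue fits inside the $5 Hobbyist plan from the pricing table.
```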

---

## Source References

| Source | Type | URL |
|--------|------|-----|
| Sync.so Docs | Primary | https://sync.so |
| Sync.so API Docs | Primary | https://docs.sync.so |
| MuseTalk GitHub | Primary | https://github.com/TMElyralab/MuseTalk |
| MuseTalk Paper | Academic | https://arxiv.org/abs/2410.10122 |
| Wav2Lip GitHub | Primary | https://github.com/Rudrabha/Wav2Lip |
| Wav2Lip Paper | Academic | https://arxiv.org/abs/2008.10010 |
| Runway Act-One | Primary | https://runwayml.com |
| HeyGen | Primary | https://heygen.com |

---

*Research compiled April 2026. Tool pricing and capabilities change rapidly — verify before production use.*
