Model deep-dive
Kling 3.0 Omni native lip sync, explained
"Lip sync" used to mean a second tool: you generated a video, then ran the mouth through a separate model to match an audio track. Kling 3.0 Omni, Kuaishou's omni-modal video model, does it in one pass instead — the mouth motion is generated from the audio inside the model, at the same time as everything else in the shot. That sounds like a small distinction, but it changes how clean the result looks and how much fiddling you do afterward. Here's what native lip sync actually means, the Kling 3.0 Omni capabilities that make it useful, and how to make a talking clip in Renoise.
What "native" lip sync means
A post-processing lip sync pipeline works in two stages. First a video model generates the footage; then a second model takes an audio clip and re-warps the mouth region to match the phonemes. Because the mouth is edited after the fact, the seams show up: the lower face can look pasted on, jaw and cheek motion don't always follow, and the timing drifts on fast speech.
Native lip sync folds that into the generation itself. The model takes the audio as an input alongside the prompt and reference images, and produces the mouth, jaw, and facial motion that fits the words as part of the same render — not as an edit layered on top. Because the whole face is generated together, the mouth moves with the cheeks, the expression matches the line, and the timing is locked to the audio from the first frame.
That's the difference that matters: with a post step you're correcting a finished video; with native lip sync the talking is baked in.
Kling 3.0 Omni specs
Kling 3.0 Omni is built as an omni-modal model — lip sync is one capability among several that work together. Here's what it does, as integrated in Renoise:
| Capability | Kling 3.0 Omni |
|---|---|
| Clip length | 3–15s (≤10s when a reference video is included) |
| Resolution | 720p / 1080p |
| Aspect ratios | 5 (16:9 / 9:16 / 1:1 / 4:3 / 3:4) |
| Input modalities | 5+ (text, image, audio, video, and more) |
| Lip sync | Native, audio-driven |
| Multi-subject consistency | Yes — tracks multiple subjects in one shot |
| Storyboard | Up to 6 shots in a single job |
| Physics | Physical-dynamics simulation |
| References | Up to 7 images (≤4 with a reference video) + 1 video |
A few of these are worth unpacking, because they're what make lip sync usable beyond a single talking head.
Multi-subject consistency
A talking scene is rarely one face. Kling 3.0 Omni can hold multiple subjects consistent within a shot (two people in a dialogue, a presenter next to a product), so the right mouth moves for the right line and each subject keeps their appearance across the clip. Because this is handled at the model layer, it is much tighter than in older models, though like any AI video model it can still drift, so review the result rather than assuming a perfect lock.
Up to 6 shots in one storyboard
Instead of generating clips one at a time and stitching them, you can describe up to 6 shots in a single storyboard job. That keeps a character and setting coherent across cuts — useful for a short dialogue scene or a multi-beat ad where each shot needs the same speaker.
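If a scene runs longer than six shots, it has to be spread over more than one storyboard job. Here is a minimal Python sketch of that planning step. The function name and the plain-string shot descriptions are hypothetical, used only for illustration; this is not a Renoise or Kling API, and the only figure taken from the article is the 6-shot limit.

```python
from typing import List

# From the spec table: a single storyboard job holds up to 6 shots.
MAX_SHOTS_PER_JOB = 6


def split_into_storyboard_jobs(
    shots: List[str], max_shots: int = MAX_SHOTS_PER_JOB
) -> List[List[str]]:
    """Group shot descriptions, in order, into batches of at most max_shots."""
    if max_shots < 1:
        raise ValueError("max_shots must be at least 1")
    return [shots[i:i + max_shots] for i in range(0, len(shots), max_shots)]


if __name__ == "__main__":
    # Eight placeholder shots for a short dialogue scene.
    scene = [f"Shot {n}: speaker delivers line {n}" for n in range(1, 9)]
    for job_number, job in enumerate(split_into_storyboard_jobs(scene), start=1):
        print(f"Job {job_number}: {len(job)} shots")
```

The coherence described here applies within one job. Whether a character stays consistent across separate jobs is not stated, so reusing the same reference images for every job is a reasonable precaution rather than a guarantee.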
5+ input modalities and physical dynamics
The model accepts 5+ input modalities — text, image, audio, video and more — which is exactly why native lip sync works: the audio is just another first-class input. On top of that, its physical-dynamics simulation keeps motion plausible (hair, cloth, gesture), so a talking subject still moves like a real one rather than a floating face.
Reference handling
You can attach up to 7 reference images to anchor a character, style, or scene. If you also supply a reference video (one video clip), the image budget drops to 4 and clip length caps at 10 seconds — a deliberate trade, because a reference video already carries a lot of motion and identity information.
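These limits are easy to check before you queue a render. The sketch below encodes only the numbers given in this article (3–15s and 7 images without a reference video, 10s and 4 images with one). `ClipPlan` and `check_clip_plan` are hypothetical names for illustration, not part of any Renoise or Kling API.

```python
from dataclasses import dataclass
from typing import List

# Limits as stated in the spec table.
MIN_DURATION_S = 3
MAX_DURATION_S = 15
MAX_DURATION_WITH_VIDEO_S = 10
MAX_IMAGES = 7
MAX_IMAGES_WITH_VIDEO = 4


@dataclass
class ClipPlan:
    """Hypothetical description of a planned clip (illustration only)."""
    duration_s: float
    image_refs: int
    has_reference_video: bool = False


def check_clip_plan(plan: ClipPlan) -> List[str]:
    """Return a list of human-readable problems; an empty list means the plan fits."""
    problems: List[str] = []
    max_images = MAX_IMAGES_WITH_VIDEO if plan.has_reference_video else MAX_IMAGES
    max_duration = (
        MAX_DURATION_WITH_VIDEO_S if plan.has_reference_video else MAX_DURATION_S
    )

    if plan.image_refs < 0:
        problems.append("image_refs cannot be negative")
    elif plan.image_refs > max_images:
        problems.append(
            f"{plan.image_refs} reference images exceeds the limit of {max_images}"
        )

    if not (MIN_DURATION_S <= plan.duration_s <= max_duration):
        problems.append(
            f"duration {plan.duration_s}s is outside {MIN_DURATION_S}-{max_duration}s"
        )
    return problems


if __name__ == "__main__":
    # Fine without a reference video, but breaks both caps once one is attached.
    print(check_clip_plan(ClipPlan(duration_s=12, image_refs=5)))
    print(check_clip_plan(ClipPlan(duration_s=12, image_refs=5, has_reference_video=True)))
```

The same 12-second, 5-image plan passes on its own and fails both checks once a reference video is attached, which is the trade described above.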
How to make a lip-sync clip in Renoise
On the Renoise Canvas, a talking clip is a few steps:
- Open the video tool and pick Kling 3.0 Omni as the model (or go straight to /videos?model=kling).
- Add your subject. Upload a reference image of the character or scene, or write a prompt to generate one. You can attach up to 7 reference images to lock the look.
- Add the audio you want the subject to speak or sing — this is the track the native lip sync is driven from.
- Write the prompt: describe the scene, camera, and delivery (tone, energy), not the mouth shapes — the model handles those from the audio.
- Set length and aspect ratio (3–15s; pick 9:16 for Reels/TikTok, 16:9 for YouTube). If you also add a reference video, keep the clip to 10 seconds or less.
- Generate, then review the sync and the subjects before exporting. Watermark-free export is available on paid plans.
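If you produce clips for several platforms, the aspect-ratio choice in the steps above can be written down as a small lookup. This is an illustrative Python sketch: the platform keys and the function name are assumptions, and the mapping only covers the pairings this article mentions (9:16 for Reels/TikTok, 16:9 for YouTube).

```python
# The five aspect ratios listed in the spec table.
SUPPORTED_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")

# Platform pairings mentioned in this article; extend as needed.
PLATFORM_RATIOS = {
    "reels": "9:16",
    "tiktok": "9:16",
    "youtube": "16:9",
}


def pick_aspect_ratio(platform: str) -> str:
    """Return the aspect ratio for a platform name (case-insensitive)."""
    key = platform.strip().lower()
    if key not in PLATFORM_RATIOS:
        known = ", ".join(sorted(PLATFORM_RATIOS))
        raise ValueError(f"unknown platform {platform!r}; known: {known}")
    ratio = PLATFORM_RATIOS[key]
    # Guard against a typo in the mapping drifting from the supported list.
    if ratio not in SUPPORTED_RATIOS:
        raise ValueError(f"{ratio} is not a supported aspect ratio")
    return ratio


if __name__ == "__main__":
    for name in ("TikTok", "YouTube"):
        print(name, "->", pick_aspect_ratio(name))
```

Picking the ratio from the target platform before generating avoids re-cropping a face afterward.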
For a single still photo that talks, see the AI talking photo guide; for syncing to a real, authorized person, see celebrity-style lip sync; and for music-driven motion, the AI dance video guide.
Lip sync of real people
If you want a real person's likeness to speak on camera, the face first goes through a one-time likeness review, in which you confirm you have the rights to use that person's likeness before generating. It's a consent step, not a creative limit: once a face is authorized, Kling 3.0 Omni drives it from your audio like any other subject.
Tips for clean results
- Keep clips short with a reference video. With a reference video attached, you're capped at 10 seconds and 4 image references — plan the shot around that rather than fighting it.
- Match the aspect ratio to the platform up front (9:16 vertical, 16:9 horizontal, 1:1 square) so you're not re-cropping a face later.
- Feed clean audio. Native lip sync follows the track it's given; clearer speech with less background noise produces tighter mouth timing.
- Use the storyboard for dialogue. When you need multiple shots of the same speaker, the up-to-6-shot job keeps them consistent better than generating each clip separately.
- Need longer, audio-generating shots instead? Seedance 2.0 (ByteDance) is also live in Renoise and generates its own audio — a good alternative when lip sync to a specific track isn't the point.