https://www.youtube.com/watch?v=j5iJ2k05ltc
We built lipsync-2, the first in a new generation of zero-shot lipsyncing models. It seamlessly edits any person's lip movements in a video to match any audio, without being trained or fine-tuned on that person.
Zero-shot lipsync models are versatile because they can edit any person and voice without being trained or fine-tuned on every speaker. But traditionally they lose traits unique to the person, like their speaking style, skin texture, and teeth.
With lipsync-2, we introduce a new capability in zero-shot lipsync: style preservation. We learn a representation of how a person speaks by watching how they speak in the input video. We train a spatiotemporal transformer that encodes the different mouth shapes in the input video into a style representation. A generative transformer synthesizes new mouth movements by conditioning on the new target speech and the learned style representation.
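For the curious, here is a minimal sketch of that two-stage idea in PyTorch. The module names, dimensions, crop sizes, and pooling scheme are illustrative assumptions rather than the production lipsync-2 architecture: a spatiotemporal encoder pools the input video's mouth crops into a style vector, and a decoder generates new mouth frames conditioned on the target audio features and that vector.

```python
# Illustrative sketch only: not the production lipsync-2 architecture.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Spatiotemporal encoder: mouth-region frames -> a single style embedding."""
    def __init__(self, d_model=512, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(3 * 96 * 96, d_model)                  # flattened mouth crops (assumed size)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, mouth_crops):                                  # (B, T, 3*96*96)
        tokens = self.encoder(self.proj(mouth_crops))                # (B, T, d_model)
        return tokens.mean(dim=1)                                    # pooled style vector

class MouthGenerator(nn.Module):
    """Generative decoder: target-speech features + style vector -> new mouth frames."""
    def __init__(self, d_model=512, n_layers=6):
        super().__init__()
        self.audio_proj = nn.Linear(80, d_model)                     # e.g. mel-spectrogram features
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.to_pixels = nn.Linear(d_model, 3 * 96 * 96)

    def forward(self, audio_feats, style):                           # (B, T, 80), (B, d_model)
        memory = style.unsqueeze(1)                                  # condition on the learned style
        hidden = self.decoder(self.audio_proj(audio_feats), memory)
        return self.to_pixels(hidden)                                # (B, T, 3*96*96)

style = StyleEncoder()(torch.randn(1, 120, 3 * 96 * 96))             # style learned from the input video
frames = MouthGenerator()(torch.randn(1, 120, 80), style)            # mouth movements for the new audio
```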
We built a simple API that lets you build workflows around our core lipsyncing models. You submit a video and an audio track (or a script and a voice ID to generate audio from), and get a response with the final output.
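As a rough illustration, a submit-and-poll flow might look like the sketch below. The endpoint path, field names, and response shape are placeholders rather than our documented schema, so check the API reference for the real contract.

```python
# Hedged sketch of submitting a lipsync job and polling for the result.
# Routes and fields are assumptions for illustration, not the official schema.
import time
import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.sync.so"                       # assumed base URL
headers = {"x-api-key": API_KEY}

# Submit a video + audio pair (or a script + voice ID) for lipsyncing.
job = requests.post(
    f"{BASE}/v2/generate",                         # hypothetical route
    headers=headers,
    json={
        "model": "lipsync-2",
        "input": [
            {"type": "video", "url": "https://example.com/talk.mp4"},
            {"type": "audio", "url": "https://example.com/dub.wav"},
        ],
    },
).json()

# Poll until the job finishes, then grab the output URL.
while True:
    status = requests.get(f"{BASE}/v2/generate/{job['id']}", headers=headers).json()
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

print(status.get("outputUrl"))
```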
We see thousands of developers and businesses integrating our APIs to build generative video workflows into their products and services.
Notice how we preserve Nicolas Cage's speaking style even across different languages. lipsync-2 is the first zero-shot lipsyncing model to achieve this.
We can even handle long videos with multiple speakers — we built a state-of-the-art active speaker detection pipeline that associates a unique voice with a unique face, and only applies lipsync when we detect that person is actively speaking.
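Conceptually, the gating works like the toy sketch below: lipsync is applied to a face track only on the segments where that face is the active speaker for its associated voice. The data structures and helper here are hypothetical simplifications, not our production pipeline.

```python
# Toy sketch of the active-speaker gating idea (hypothetical data structures).
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # seconds
    end: float
    face_id: str          # which on-screen face track this segment belongs to
    voice_id: str         # which diarized voice is speaking during the segment

def segments_to_lipsync(active_speaker_segments, face_to_voice):
    """Keep only the segments where the on-screen face matches its associated voice."""
    jobs = []
    for seg in active_speaker_segments:
        if face_to_voice.get(seg.face_id) == seg.voice_id:
            jobs.append((seg.face_id, seg.start, seg.end))
    return jobs

segments = [
    Segment(0.0, 4.2, face_id="face_A", voice_id="spk_1"),
    Segment(4.2, 9.0, face_id="face_B", voice_id="spk_2"),
    Segment(9.0, 12.5, face_id="face_A", voice_id="spk_2"),   # A on screen, but B is speaking
]
print(segments_to_lipsync(segments, {"face_A": "spk_1", "face_B": "spk_2"}))
# -> only the spans where the visible face is the one actually speaking get lipsynced
```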
https://www.youtube.com/watch?v=ZaXbiKdoBz8
It also works on animated characters, from Pixar-level animation to AI-generated characters.
https://www.youtube.com/watch?v=F_6lGFl6bcA
But translation is only the beginning. With the power to edit dialogue in any video in post-production, we're on the cusp of reimagining how we create, edit, and consume video forever.
Imagine a world where you only ever have to hit record once. lipsync-2 is the only model that lets you edit dialogue while preserving the original speaker's style, without needing to train or fine-tune beforehand.
In an age where we can generate any video by typing a few lines of text, we don’t have to limit ourselves to what we can capture with a camera.
https://youtube.com/shorts/KnzWtu3niKQ
For any YC company we’re giving away our Scale Plan for free for 4 months, plus $1000 to spend on usage.
With the Scale Plan you can run up to 15 jobs concurrently and process videos up to 30 minutes long at a time. Used at full capacity, that works out to roughly 90 minutes of generated video every hour.
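If you want to keep those 15 slots busy, a simple worker pool does the trick. In the sketch below, submit_lipsync_job is a hypothetical wrapper around the submit-and-poll flow shown earlier.

```python
# Rough sketch of saturating the 15-job concurrency limit with a worker pool.
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_JOBS = 15   # Scale Plan limit

def submit_lipsync_job(video_url, audio_url):
    """Hypothetical wrapper: POST the job and poll until it completes (see earlier sketch)."""
    ...

pairs = [
    ("https://example.com/v1.mp4", "https://example.com/a1.wav"),
    ("https://example.com/v2.mp4", "https://example.com/a2.wav"),
]

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_JOBS) as pool:
    results = list(pool.map(lambda p: submit_lipsync_job(*p), pairs))
```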
Launch an AI admaker, video translation tool, or any other content generation workflow you want and serve viral load with speed, reliability, and best-in-class quality.
Email us at yc@sync.so and we’ll get you set up.
At sync, AI lipsync is just the beginning.
We live in an extraordinary age.
A high schooler can craft a masterpiece with an iPhone. A studio can produce a movie at a tenth of the cost and 10x faster. Every video can be distributed worldwide in any language, instantly. Video is becoming as malleable as text.
But we have two fundamental problems to tackle before this is a reality:
[1] Large video models are great at generating entirely new scenes and worlds, but struggle with precise control and fine-grained edits. The ability to make subtle, intentional adjustments – the kind that separates good content from great content – doesn't exist yet.
[2] If video generation is world modeling, each human is a world unto themselves. We each have idiosyncrasies that make us unique — building primitives to capture, express, and modify them with high precision is the key to breaking through the uncanny valley.
We’re excited about lipsync-2, and for what’s coming up next. Reach out to founders@sync.so if you have any questions or are curious about our roadmap.
lipsync video to audio in any language in one shot
Prady + Pavan have been full-time on sync since June 2023
Rudrabha has been contributing greatly while finishing his PhD + joined full-time starting October 2023
Prajwal is finishing up his PhD and is joining full-time once he completes in May 2024 – his supervisor is Professor Andrew Zisserman (190+ citations / foremost expert in the field we are playing in). His proximity helps us stay sota + learn from the bleeding edge.
we're building generative models to modify / synthesize humans in video + hosting production APIs to let anyone plug them into their own apps / platforms / services.
today we're focused on visual dubbing – we built + launched an updated lip-synchronizing model that lets anyone lip-sync a video to an audio track in any language in near real-time for HD videos.
as part of the AI translation stack we're used as a post processing step to sync the lips in a video to the new dubbed audio track – this lets everyone around the world experience content like it was made for them in their native language (no more bad / misaligned dubs).
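a rough sketch of where that post-processing step sits in a typical dubbing stack (every function below is a hypothetical placeholder for an in-house or third-party service, not our actual API):

```python
# Sketch of an AI dubbing stack; all functions are hypothetical placeholders.

def transcribe(video_path: str) -> str:
    return "original transcript"            # ASR placeholder

def translate(text: str, language: str) -> str:
    return f"[{language}] {text}"           # machine translation placeholder

def synthesize_speech(text: str) -> str:
    return "dubbed_audio.wav"               # TTS / voice-clone placeholder

def lipsync(video_path: str, audio_path: str) -> str:
    return "lipsynced_output.mp4"           # lipsync as the final post-processing step

def dub_video(video_path: str, target_language: str) -> str:
    transcript = transcribe(video_path)
    translated = translate(transcript, target_language)
    dubbed_audio = synthesize_speech(translated)
    return lipsync(video_path, dubbed_audio)

print(dub_video("keynote.mp4", "es"))
```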
in the future we plan to build + host a suite of production-ready models to modify + generate a full human body digitally in video (ex. facial expressions, head + hand + eye movements, etc.) that can be used for anything from seamless localization of content (cross-language) to generative videos