HunyuanCustom Brings Single-Image Video Deepfakes, With Audio and Lip Sync

This article discusses a new release of the multimodal Hunyuan Video world model, called ‘HunyuanCustom'. The new paper's breadth of coverage, combined with several issues in many of the supplied example videos at the project page*, constrains us to more general coverage than usual, and to limited reproduction of the vast amount of video material accompanying this release (since many of the videos require significant re-editing and processing in order to improve the readability of the layout).

Please note additionally that the paper refers to the API-based generative system Kling as ‘Keling'. For clarity, I refer to ‘Kling' instead throughout.

Tencent is in the process of releasing a new version of its Hunyuan Video model, titled HunyuanCustom. The new release is apparently capable of making Hunyuan LoRA models redundant, by allowing the user to create ‘deepfake'-style video customization through a single image:

Click to play. Prompt: ‘A man is listening to music and cooking snail noodles in the kitchen'. The new method compared to both closed-source and open-source methods, including Kling, which is a significant force in this space. Source: https://hunyuancustom.github.io/ (warning: CPU/memory-intensive site!)

In the left-most column of the video above, we see the single source image supplied to HunyuanCustom, followed by the new system's interpretation of the prompt in the second column, next to it. The remaining columns show the results from various proprietary and FOSS systems: Kling; Vidu; Pika; Hailuo; and the Wan-based SkyReels-A2.

In the video below, we see renders of three scenarios central to this release: respectively, person + object; single-character emulation; and virtual try-on (person + clothes):

Click to play. Three examples edited from the material at the supporting site for Hunyuan Video.

We can notice a few things from these examples, mostly related to the system relying on a single source image, instead of multiple images of the same subject.

In the first clip, the man is essentially still facing the camera. He dips his head down and sideways at not much more than 20-25 degrees of rotation but, at an angle in excess of that, the system would really have to start guessing what he looks like in profile. This is hard, probably impossible to gauge accurately from a sole frontal image.

In the second example, we see that the little girl is smiling in the rendered video as she is in the single source image. Again, with this sole image as reference, HunyuanCustom would have to make a relatively uninformed guess about what her ‘resting face' looks like. Additionally, her face does not deviate from a camera-facing stance by more than in the prior example (‘man eating crisps').

In the last example, we see that since the source material – the woman and the clothes she is prompted into wearing – are not complete images, the render has cropped the scene to fit – which is actually rather a good solution to a data issue!

The point is that though the new system can handle multiple images (such as person + crisps, or person + clothes), it does not apparently allow for multiple angles or alternative views of a single character, so that diverse expressions or different angles could be accommodated. To this extent, the system may therefore struggle to replace the growing ecosystem of LoRA models that have sprung up around HunyuanVideo since its release last December, since these can help HunyuanVideo produce consistent characters from any perspective and with any facial expression represented in the training dataset (20-60 images is typical).

Wired for Sound

For audio, HunyuanCustom leverages the LatentSync system (notoriously difficult for hobbyists to set up and get good results from) for obtaining lip movements that are matched to the audio and text that the user supplies:

Features audio. Click to play. Various examples of lip-sync from the HunyuanCustom supplementary site, edited together.

At the time of writing, there are no English-language examples, but these seem to be rather good – the more so if the method of creating them is easily installable and accessible.

Editing Existing Video

The new system offers what appear to be very impressive results for video-to-video (V2V, or Vid2Vid) editing, wherein a section of an existing (real) video is masked off and intelligently replaced by a subject given in a single reference image. Below is an example from the supplementary materials site:

Click to play. Only the central object is targeted, but what remains around it also gets altered in a HunyuanCustom vid2vid pass.

As we can see, and as is standard in a vid2vid scenario, the entire video is to some degree altered by the process, though most altered in the targeted region, i.e., the plush toy. Presumably pipelines could be developed to create such transformations under a garbage matte approach that leaves the majority of the video content identical to the original. This is what Adobe Firefly does under the hood, and does quite well – but it is an under-studied process in the FOSS generative scene.

That said, most of the replacement examples provided do a better job of targeting these integrations, as we can see in the assembled compilation below:

Click to play. Diverse examples of interjected content using vid2vid in HunyuanCustom, exhibiting notable respect for the untargeted material.

A New Start?

This initiative is a development of the Hunyuan Video project, not a hard pivot away from that development stream. The project's enhancements are introduced as discrete architectural insertions rather than sweeping structural changes, aiming to allow the model to maintain identity fidelity across frames without relying on subject-specific fine-tuning, as with LoRA or textual inversion approaches.

To be clear, therefore, HunyuanCustom is not trained from scratch, but is instead a fine-tuning of the December 2024 HunyuanVideo foundation model.

Those who have developed HunyuanVideo LoRAs may wonder whether they will still work with this new edition, or whether they will have to reinvent the LoRA wheel yet again if they want more customization capabilities than are built into this new release.

In general, a heavily fine-tuned release of a hyperscale model alters the model weights enough that LoRAs made for the earlier model will not work properly, or at all, with the newly-refined model.

Sometimes, however, a fine-tune's fame can challenge its origins: one example of a fine-tune becoming an effective fork, with a dedicated ecosystem and following of its own, is the Pony Diffusion tuning of Stable Diffusion XL (SDXL). Pony currently has 592,000+ downloads on the ever-changing CivitAI domain, with a vast range of LoRAs that have used Pony (and not SDXL) as the base model, and which require Pony at inference time.

Releasing

The project page for the new paper (which is titled HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation) features links to a GitHub site that, as I write, just became functional, and appears to contain all code and necessary weights for local implementation, together with a projected timeline (where the only significant item yet to come is ComfyUI integration).

At the time of writing, the project's Hugging Face presence is still a 404. There is, however, an API-based version where one can apparently demo the system, so long as you can supply a WeChat scan code.

I have seldom seen such an elaborate and extensive use of such a wide variety of projects in one assembly, as is evident in HunyuanCustom – and presumably some of the licenses would in any case oblige a full release.

Two models are announced at the GitHub page: a 720px1280px version requiring 80GB of GPU peak memory, and a 512px896px version requiring 60GB of GPU peak memory.

The repository states ‘The minimum GPU memory required is 24GB for 720px1280px129f but very slow…We recommend using a GPU with 80GB of memory for better generation quality' – and notes that the system has so far only been tested on Linux.

The earlier Hunyuan Video model has, since official release, been quantized down to sizes where it can be run on less than 24GB of VRAM, and it seems reasonable to assume that the new model will likewise be adapted into more consumer-friendly forms by the community, and that it will quickly be adapted for use on Windows systems too.

Due to time constraints and the overwhelming amount of information accompanying this release, we can only take a broader, rather than in-depth, look at this release. Nonetheless, let's pop the hood on HunyuanCustom a little.

A Look at the Paper

The data pipeline for HunyuanCustom, apparently compliant with the GDPR framework, incorporates both synthesized and open-source video datasets, including OpenHumanVid, with eight core categories represented: humans, animals, plants, landscapes, vehicles, objects, architecture, and anime.

From the release paper, an overview of the diverse contributing packages in the HunyuanCustom data construction pipeline. Source: https://arxiv.org/pdf/2505.04512

Initial filtering begins with PySceneDetect, which segments videos into single-shot clips. TextBPN-Plus-Plus is then used to remove videos containing excessive on-screen text, subtitles, watermarks, or logos.

To address inconsistencies in resolution and duration, clips are standardized to five seconds in length and resized to 512 or 720 pixels on the short side. Aesthetic filtering is handled using Koala-36M, with a custom threshold of 0.06 applied for the custom dataset curated by the new paper's researchers.
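
As a rough illustration of this first stage, the sketch below uses PySceneDetect's documented quick-start API to split a video into single-shot clips, keeping only those long enough to yield a standardized five-second segment. The file name and the duration filter are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of shot segmentation with PySceneDetect, assuming a local
# video file; the 5-second minimum mirrors the paper's clip standardization.
from scenedetect import detect, ContentDetector

def split_into_shots(video_path: str):
    """Segment a video into single-shot clips; returns (start, end) in seconds."""
    scene_list = detect(video_path, ContentDetector())
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]

shots = split_into_shots("raw_video.mp4")   # hypothetical input file
usable = [(s, e) for s, e in shots if (e - s) >= 5.0]
```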

The subject extraction process combines the Qwen7B Large Language Model (LLM), the YOLO11X object recognition framework, and the popular InsightFace architecture, to identify and validate human identities.

For non-human subjects, QwenVL and Grounded SAM 2 are used to extract relevant bounding boxes, which are discarded if too small.
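
A hedged sketch of how the human-identity validation stage might be wired together appears below, pairing the real ultralytics and InsightFace APIs; the person-class check, the frame-pairing logic, and the 0.5 similarity threshold are assumptions for illustration, not values from the paper.

```python
# Illustrative human-subject validation: YOLO11X confirms a person is present,
# and InsightFace (ArcFace-family) embeddings confirm it is the same person
# across two frames of a clip. Threshold and pairing logic are assumed.
import numpy as np
from ultralytics import YOLO
from insightface.app import FaceAnalysis

detector = YOLO("yolo11x.pt")
face_app = FaceAnalysis()
face_app.prepare(ctx_id=0)

def has_person(frame: np.ndarray) -> bool:
    result = detector(frame, verbose=False)[0]
    return any(int(c) == 0 for c in result.boxes.cls)   # COCO class 0 = person

def same_identity(frame_a: np.ndarray, frame_b: np.ndarray, thresh: float = 0.5) -> bool:
    faces_a, faces_b = face_app.get(frame_a), face_app.get(frame_b)
    if not faces_a or not faces_b:
        return False
    sim = float(np.dot(faces_a[0].normed_embedding, faces_b[0].normed_embedding))
    return sim > thresh
```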

Examples of semantic segmentation with Grounded SAM 2, used in the HunyuanCustom project. Source: https://github.com/IDEA-Research/Grounded-SAM-2

Multi-subject extraction uses Florence2 for bounding box annotation, and Grounded SAM 2 for segmentation, followed by clustering and temporal segmentation of training frames.

The processed clips are further enhanced via annotation, using a proprietary structured-labeling system developed by the Hunyuan team, which furnishes layered metadata such as descriptions and camera motion cues.

Mask augmentation strategies, including conversion to bounding boxes, were applied during training to reduce overfitting and ensure the model adapts to diverse object shapes.

Audio data was synchronized using the aforementioned LatentSync, with clips discarded if synchronization scores fell below a minimum threshold.

The blind image quality assessment model HyperIQA was used to exclude videos scoring under 40 (on HyperIQA's bespoke scale). Valid audio tracks were then processed with Whisper to extract features for downstream tasks.
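
For the audio side, the Whisper feature-extraction step could plausibly look like the following, using the openai-whisper package's real API; the model size and file name are assumptions.

```python
# Sketch of extracting Whisper encoder features for downstream audio tasks.
import whisper

model = whisper.load_model("base")                 # model size is an assumption
audio = whisper.load_audio("clip_audio.wav")       # resampled to 16 kHz mono
audio = whisper.pad_or_trim(audio)                 # fixed 30-second window
mel = whisper.log_mel_spectrogram(audio).to(model.device)
features = model.embed_audio(mel.unsqueeze(0))     # (1, frames, d_model) encoder states
```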

The authors incorporated the LLaVA language assistant model during the annotation phase, and they emphasize the central role that this model has in HunyuanCustom. LLaVA is used to generate image captions and assist in aligning visual content with text prompts, supporting the construction of a coherent training signal across modalities:

The HunyuanCustom model supports identity-consistent video generation conditioned on text, image, audio, and video inputs.

By leveraging LLaVA's vision-language alignment capabilities, the pipeline gains an additional layer of semantic consistency between visual elements and their textual descriptions – particularly valuable in multi-subject or complex-scene scenarios.

Custom Video

To allow video generation based on a reference image and a prompt, two modules centered around LLaVA were created, first adapting the input structure of HunyuanVideo so that it could accept an image along with text.

This involved formatting the prompt in a way that embeds the image directly or tags it with a short identity description. A separator token was used to stop the image embedding from overwhelming the prompt content.

Since LLaVA's visual encoder tends to compress or discard fine-grained spatial details during the alignment of image and text features (particularly when translating a single reference image into a general semantic embedding), an identity enhancement module was incorporated. Since almost all video latent diffusion models have some difficulty maintaining an identity without a LoRA, even in a five-second clip, the performance of this module in community testing may prove significant.

In any case, the reference image is then resized and encoded using the causal 3D-VAE from the original HunyuanVideo model, and its latent inserted into the video latent across the temporal axis, with a spatial offset applied to prevent the image from being directly reproduced in the output, while still guiding generation.
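
A minimal PyTorch sketch of that injection idea follows; the tensor shapes, the use of torch.roll for the spatial offset, and the function names are assumptions made for clarity, not the released implementation.

```python
# Illustrative reference-latent injection: the encoded reference image is
# spatially shifted, then prepended to the video latent along the temporal
# axis, so it conditions generation without being copied verbatim.
import torch

def inject_reference(video_latent: torch.Tensor,
                     ref_latent: torch.Tensor,
                     offset: int = 4) -> torch.Tensor:
    """video_latent: (B, C, T, H, W); ref_latent: (B, C, 1, H, W)."""
    shifted = torch.roll(ref_latent, shifts=(offset, offset), dims=(-2, -1))
    return torch.cat([shifted, video_latent], dim=2)
```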

The model was trained using Flow Matching, with noise samples drawn from a logit-normal distribution – and the network was trained to recover the correct video from these noisy latents. LLaVA and the video generator were both fine-tuned together so that the image and prompt could guide the output more fluently and keep the subject identity consistent.
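
Flow Matching with logit-normal noise sampling is a well-documented recipe in recent diffusion work; a generic training-step sketch is shown below. This is the standard technique, not Tencent's code, and `model` is a placeholder for any velocity-predicting network.

```python
# Generic flow-matching loss with logit-normal timestep sampling.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x0: clean video latents, shape (B, ...)."""
    b = x0.shape[0]
    # t = sigmoid(n), n ~ N(0, 1): concentrates training at mid noise levels.
    t = torch.sigmoid(torch.randn(b, device=x0.device))
    t_exp = t.view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    xt = (1 - t_exp) * x0 + t_exp * noise    # linear interpolation path
    target = noise - x0                      # constant velocity along the path
    return F.mse_loss(model(xt, t), target)
```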

For multi-subject prompts, each image-text pair was embedded separately and assigned a distinct temporal position, allowing identities to be distinguished, and supporting the generation of scenes involving multiple interacting subjects.
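
One plausible reading of this mechanism is sketched below, where each subject embedding is offset by a positional code derived from its index; the sinusoidal scheme and dimensions are assumptions, intended only to show how distinct positions keep identities separable.

```python
# Illustrative multi-subject conditioning: each image-text embedding receives
# a distinct position code so downstream attention can tell subjects apart.
import torch

def tag_subjects(pair_embeddings: list, dim: int = 1024) -> torch.Tensor:
    out = []
    for idx, emb in enumerate(pair_embeddings):            # emb: (dim,)
        k = torch.arange(dim, dtype=torch.float32)
        pos_code = torch.sin(idx / (10000 ** (k / dim)))   # code at position idx
        out.append(emb + pos_code)
    return torch.stack(out)                                # (num_subjects, dim)
```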

Sound and Vision

HunyuanCustom conditions audio/speech generation using both user-input audio and a text prompt, allowing characters to speak within scenes that reflect the described setting.

To support this, an Identity-disentangled AudioNet module introduces audio features without disrupting the identity signals embedded from the reference image and prompt. These features are aligned with the compressed video timeline, divided into frame-level segments, and injected using a spatial cross-attention mechanism that keeps each frame isolated, preserving subject consistency and avoiding temporal interference.

A second temporal injection module provides finer control over timing and motion, working in tandem with AudioNet, mapping audio features to specific regions of the latent sequence, and using a Multi-Layer Perceptron (MLP) to convert them into token-wise motion offsets. This allows gestures and facial movement to follow the rhythm and emphasis of the spoken input with greater precision.
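
The token-wise offset idea can be sketched as a small PyTorch module, given per-frame audio features; all dimensions and the module name are assumptions rather than details from the paper.

```python
# Illustrative temporal injection: an MLP turns frame-level audio features
# into offsets added to that frame's latent tokens, nudging motion to track
# the speech without touching the identity pathway.
import torch
import torch.nn as nn

class AudioMotionMLP(nn.Module):
    def __init__(self, audio_dim: int = 512, token_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, token_dim),
            nn.SiLU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, tokens: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        """tokens: (B, T, N, D) latent tokens; audio: (B, T, A) per-frame features."""
        offsets = self.mlp(audio)             # (B, T, D): one offset per frame
        return tokens + offsets.unsqueeze(2)  # broadcast over the N spatial tokens
```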

HunyuanCustom allows subjects in existing videos to be edited directly, replacing or inserting people or objects into a scene without needing to rebuild the entire clip from scratch. This makes it useful for tasks that involve altering appearance or motion in a targeted way.

Click to play. A further example from the supplementary site.

To facilitate efficient subject-replacement in existing videos, the new system avoids the resource-intensive approach of recent methods such as the currently-popular VACE, or those that merge entire video sequences together, favoring instead the compression of a reference video using the pretrained causal 3D-VAE – aligning it with the generation pipeline's internal video latents, and then adding the two together. This keeps the process relatively lightweight, while still allowing external video content to guide the output.

A small neural network handles the alignment between the clean input video and the noisy latents used in generation. The system tests two ways of injecting this information: merging the two sets of features before compressing them again; and adding the features frame by frame. The second method works better, the authors found, and avoids quality loss while keeping the computational load unchanged.
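
The frame-wise variant the authors preferred can be pictured as below; the 1x1 convolution standing in for the ‘small neural network', and the 16-channel latent assumption, are both illustrative.

```python
# Illustrative frame-wise video conditioning: clean latents are aligned by a
# lightweight network and added to the noisy generation latents per frame.
import torch
import torch.nn as nn

align = nn.Conv3d(16, 16, kernel_size=1)   # channel count is an assumption

def inject_video_condition(noisy: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """noisy, clean: (B, C, T, H, W) latents from the causal 3D-VAE."""
    return noisy + align(clean)   # frame-by-frame addition; compute stays flat
```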

Data and Tests

In tests, the metrics used were: identity consistency, via ArcFace, which extracts facial embeddings from both the reference image and each frame of the generated video, and then calculates the average cosine similarity between them; subject similarity, via sending YOLO11x segments to Dino 2 for comparison; CLIP-B text-video alignment, which measures similarity between the prompt and the generated video; CLIP-B again, to calculate similarity between each frame and both its neighboring frames and the first frame, as a measure of temporal consistency; and dynamic degree, as defined by VBench.
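
The identity metric, at least, is straightforward to reproduce in outline: mean cosine similarity between the ArcFace embedding of the reference image and that of each generated frame. The sketch below uses InsightFace for the embeddings; frame extraction is assumed to happen elsewhere.

```python
# Sketch of the identity-consistency (Face-Sim) metric.
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis()
app.prepare(ctx_id=0)

def face_sim(reference: np.ndarray, frames: list) -> float:
    """reference: BGR image; frames: list of BGR video frames."""
    ref_emb = app.get(reference)[0].normed_embedding
    sims = [float(np.dot(ref_emb, f[0].normed_embedding))
            for f in (app.get(fr) for fr in frames) if f]
    return float(np.mean(sims)) if sims else 0.0
```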

As indicated earlier, the baseline closed-source competitors were Hailuo; Vidu 2.0; Kling (1.6); and Pika. The competing FOSS frameworks were VACE and SkyReels-A2.

Model performance evaluation comparing HunyuanCustom with leading video customization methods across ID consistency (Face-Sim), subject similarity (DINO-Sim), text-video alignment (CLIP-B-T), temporal consistency (Temp-Consis), and motion intensity (DD). Optimal and sub-optimal results are shown in bold and underlined, respectively.

Of these results, the authors state:

‘Our [HunyuanCustom] achieves the best ID consistency and subject consistency. It also achieves comparable results in prompt following and temporal consistency. [Hailuo] has the best clip score because it can follow text instructions well with only ID consistency, sacrificing the consistency of non-human subjects (the worst DINO-Sim). In terms of Dynamic-degree, [Vidu] and [VACE] perform poorly, which may be due to the small size of the model.'

Though the project site is saturated with comparison videos (the layout of which seems to have been designed for website aesthetics rather than easy comparison), it does not currently feature a video equivalent of the static results crammed together in the PDF, in regard to the first qualitative tests. Though I include it here, I encourage the reader to make a close examination of the videos at the project site, as they give a better impression of the outcomes:

From the paper, a comparison on object-centered video customization. Though the viewer should (as always) refer to the source PDF for better resolution, the videos at the project site might be a more illuminating resource in this case.

The authors comment here:

‘It can be seen that [Vidu], [Skyreels A2] and our method achieve relatively good results in prompt alignment and subject consistency, but our video quality is better than Vidu and Skyreels, thanks to the good video generation performance of our base model, i.e., [Hunyuanvideo-13B].

‘Among commercial products, though [Kling] has a good video quality, the first frame of the video has a copy-paste [problem], and sometimes the subject moves too fast and [blurs], leading to a poor viewing experience.'

The authors further comment that Pika performs poorly in terms of temporal consistency, introducing subtitle artifacts (effects from poor data curation, where text elements in video clips have been allowed to pollute the core concepts).

Hailuo maintains facial identity, they state, but fails to preserve full-body consistency. Among open-source methods, VACE, the researchers assert, is unable to maintain identity consistency, whereas they contend that HunyuanCustom produces videos with strong identity preservation, while retaining quality and diversity.

Next, tests were conducted for multi-subject video customization, against the same contenders. As in the previous example, the flattened PDF results are not direct equivalents of the videos available at the project site, but are unique among the results presented:

Comparisons using multi-subject video customizations. Please see the PDF for better detail and resolution.

The paper states:

‘[Pika] can generate the specified subjects but exhibits instability in video frames, with instances of a man disappearing in one scenario and a woman failing to open a door as prompted. [Vidu] and [VACE] partially capture human identity but lose significant details of non-human objects, indicating a limitation in representing non-human subjects.

‘[SkyReels A2] experiences severe frame instability, with noticeable changes in chips and numerous artifacts in the right scenario.

‘In contrast, our HunyuanCustom effectively captures both human and non-human subject identities, generates videos that adhere to the given prompts, and maintains high visual quality and stability.'

A further experiment was ‘virtual human advert', wherein the frameworks were tasked with merging a product with a person:

From the qualitative testing round, examples of neural ‘product placement'. Please see the PDF for better detail and resolution.

For this round, the authors state:

‘The [results] show that HunyuanCustom effectively maintains the identity of the human while preserving the details of the target product, including the text on it.

‘Furthermore, the interaction between the human and the product appears natural, and the video adheres closely to the given prompt, highlighting the significant potential of HunyuanCustom in generating advertisement videos.'

One area where video results would have been very useful was the qualitative round for audio-driven subject customization, where the character speaks the corresponding audio in a text-described scene and posture.

Partial results given for the audio round – though video results might have been preferable in this case. Only the top half of the PDF figure is reproduced here, as it is large and difficult to accommodate in this article. Please refer to the source PDF for better detail and resolution.

The authors assert:

‘Previous audio-driven human animation methods input a human image and an audio, where the human posture, attire, and environment remain consistent with the given image and cannot generate videos in other motion and environment, which may [restrict] their application.

‘…[Our] HunyuanCustom enables audio-driven human customization, where the character speaks the corresponding audio in a text-described scene and posture, allowing for more flexible and controllable audio-driven human animation.'

Further tests (please see the PDF for full details) included a round pitting the new system against VACE and Kling 1.6 for video subject replacement:

Testing subject replacement in video-to-video mode. Please refer to the source PDF for better detail and resolution.

Of these, the last tests presented in the new paper, the researchers opine:

‘VACE suffers from boundary artifacts due to strict adherence to the input masks, resulting in unnatural subject shapes and disrupted motion continuity. [Kling], in contrast, exhibits a copy-paste effect, where subjects are directly overlaid onto the video, leading to poor integration with the background.

‘In comparison, HunyuanCustom effectively avoids boundary artifacts, achieves seamless integration with the video background, and maintains strong identity preservation – demonstrating its superior performance in video editing tasks.'

Conclusion

This is a fascinating release, not least because it addresses something that the ever-discontent hobbyist scene has been complaining about more lately – the lack of lip-sync, so that the increased realism in systems such as Hunyuan Video and Wan 2.1 might be given a new dimension of authenticity.

Though the layout of almost all the comparative video examples at the project site makes it rather difficult to compare HunyuanCustom's capabilities against prior contenders, it must be noted that very, very few projects in the video synthesis space have the courage to pit themselves in tests against Kling, the commercial video diffusion API that is always hovering at or near the top of the leaderboards; Tencent appears to have made headway against this incumbent in a rather impressive manner.

* The issue being that some of the videos are so wide, short, and high-resolution that they will not play in standard video players such as VLC or Windows Media Player, showing black screens.

First published Thursday, May 8, 2025
