Hunyuanvideo15 #12696
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thank you for working on this. Eagerly awaiting this one (Wan doesn't work for me).

@tin2tin

I don't have time right now, but I'll definitely check it out later.
@tin2tin do you want to try with group offloading?
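For reference, a minimal sketch of what group offloading could look like here; the checkpoint id is a placeholder and the component names (transformer, text_encoder, vae) are assumptions until the pipeline lands:

import torch
from diffusers import DiffusionPipeline
from diffusers.hooks import apply_group_offloading

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    torch_dtype=torch.bfloat16,
)

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Keep only a few transformer blocks on the GPU at a time; smaller groups
# trade throughput for lower peak VRAM.
pipe.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=1,
)

# Non-diffusers modules (e.g. the text encoder) can use the functional helper.
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=1,
)

# Remaining small components can simply live on the GPU.
pipe.vae.to(onload_device)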
I only have 32 GB RAM, which is usually not enough for group offloading. I did try a single-file pre-quantized checkpoint, but I couldn't get it working.
My comments are mostly nits! I can help with the tests.
        self.tile_latent_min_width = tile_latent_min_width or self.tile_latent_min_width
        self.tile_overlap_factor = tile_overlap_factor or self.tile_overlap_factor

    def disable_tiling(self) -> None:
We could have the model class subclass from AutoencoderMixin to get rid of the common methods. Example:
class AutoencoderKL(ModelMixin, AutoencoderMixin, ConfigMixin, FromOriginalModelMixin, PeftAdapterMixin):
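From the user side, the common helpers in question are the tiling toggles shown in the snippet above; a usage sketch (the class name is inferred from the module file name, the checkpoint id is a placeholder, and the overlap value is illustrative):

import torch
from diffusers import AutoencoderKLHunyuanVideo15  # name inferred from the module file name

vae = AutoencoderKLHunyuanVideo15.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    subfolder="vae",
    torch_dtype=torch.float16,
)

# Tiled decoding keeps peak memory bounded for long or high-resolution videos;
# the overlap factor from the diff above can be overridden per call.
vae.enable_tiling(tile_overlap_factor=0.25)

# ... decode latents ...

vae.disable_tiling()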
        rope_theta: float = 256.0,
        rope_axes_dim: Tuple[int, ...] = (16, 56, 56),
        # YiYi Notes: config based on target_size_config https://github.com/yiyixuxu/hy15/blob/main/hyvideo/pipelines/hunyuan_video_pipeline.py#L205
        target_size: int = 640,  # did not name sample_size since it is in pixel space
Not that important, but we're still doing the AE decoding according to:

video = self.vae.decode(latents, return_dict=False)[0]

I got the impression that the DiT might be predicting directly in pixel space.
    @property
    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
    def attn_processors(self) -> Dict[str, AttentionProcessor]:
We can get rid of these common attention-related methods if we subclass the model from AttentionMixin:

diffusers/src/diffusers/models/transformers/transformer_flux2.py (lines 631 to 639 in 152f7ca):

class Flux2Transformer2DModel(
    ModelMixin,
    ConfigMixin,
    PeftAdapterMixin,
    FromOriginalModelMixin,
    FluxTransformer2DLoadersMixin,
    CacheMixin,
    AttentionMixin,
):
        encoder_hidden_states_cond_emb = self.cond_type_embed(
            torch.zeros_like(encoder_hidden_states[:, :, 0], dtype=torch.long)
        )
        encoder_hidden_states = encoder_hidden_states + encoder_hidden_states_cond_emb

        # byt5 text embedding
        encoder_hidden_states_2 = self.context_embedder_2(encoder_hidden_states_2)

        encoder_hidden_states_2_cond_emb = self.cond_type_embed(
            torch.ones_like(encoder_hidden_states_2[:, :, 0], dtype=torch.long)
        )
        encoder_hidden_states_2 = encoder_hidden_states_2 + encoder_hidden_states_2_cond_emb
Interesting that for encoder_hidden_states_cond_emb we do zeros_like, and for encoder_hidden_states_2_cond_emb we do ones_like.
        # image embed
        encoder_hidden_states_3 = self.image_embedder(image_embeds)
        is_t2v = torch.all(image_embeds == 0)
(nit):
I would imagine that in the pure T2V case, image_embeds would be None. But if that's not the case (the current code suggests image_embeds needs to be zeros), we could change the type hint to image_embeds: torch.Tensor and make it a positional argument?
They actually create an all-zero image_embeds for T2V.
            device=encoder_attention_mask.device,
        )
        encoder_hidden_states_3_cond_emb = self.cond_type_embed(
            2
I see. So, it's probably like:
- First condition embedding type: 0
- Second condition embedding type: 1
- Third condition embedding type: 2
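A standalone sketch of that pattern (shapes, dims, and the feature size are illustrative, not the PR's actual values):

import torch
import torch.nn as nn

batch, text_len, byt5_len, img_len, dim = 1, 256, 128, 16, 64

# One shared embedding table keyed by condition type:
# 0 = primary text tokens, 1 = ByT5 tokens, 2 = image tokens.
cond_type_embed = nn.Embedding(3, dim)

text_states = torch.randn(batch, text_len, dim)
byt5_states = torch.randn(batch, byt5_len, dim)
image_states = torch.randn(batch, img_len, dim)

text_states = text_states + cond_type_embed(torch.zeros(batch, text_len, dtype=torch.long))
byt5_states = byt5_states + cond_type_embed(torch.ones(batch, byt5_len, dtype=torch.long))
image_states = image_states + cond_type_embed(torch.full((batch, img_len), 2, dtype=torch.long))

# For T2V the pipeline passes an all-zero image_embeds tensor, which is how the
# transformer tells T2V apart from I2V:
image_embeds = torch.zeros(batch, img_len, dim)
is_t2v = torch.all(image_embeds == 0)  # True -> text-to-video branch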
Started tests in #12759 :) Completing now.
Co-authored-by: Sayak Paul <[email protected]>
- HunyuanVideo 1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.

- **H100/H800:** `_flash_3_hub` or `_flash_varlen_3`
Okay, I will work on adding the Hub variant for FA3 varlen so that we can ease the user experience a bit here.
- **A100/A800/RTX 4090:** `flash` or `flash_varlen`
- **Other GPUs:** `sage`
(nit): We could recommend flash_hub and sage_hub here as backends instead, to further promote the use of the kernels-based backends. It will also keep things central to the Hub.
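For example, switching the transformer to a Hub-backed kernel via the attention dispatcher might look like this (a sketch: the checkpoint id is a placeholder, and set_attention_backend is assumed to be the dispatcher entry point in recent diffusers):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

# Kernels-based backends are pulled from the Hub, so no local flash-attn build
# is required; "sage_hub" would be the analogous pick for non-Hopper GPUs.
pipe.transformer.set_attention_backend("flash_hub")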
* start tests.
* up
* up
* style.
* up
@bot /style
Style bot fixed some files and pushed the changes.
Congratulations on the commit. Do you have any suggestions on what I could try to get it working on 24 GB VRAM (RTX 4090) and 32 GB RAM?
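Not an official recommendation, but the usual diffusers levers for that budget are 4-bit quantization of the transformer plus CPU offloading; a sketch (the transformer class name and checkpoint id are assumptions, not the PR's confirmed names):

import torch
from diffusers import BitsAndBytesConfig, DiffusionPipeline
from diffusers import HunyuanVideo15Transformer3DModel  # class name assumed; check the PR

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = HunyuanVideo15Transformer3DModel.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Moves each component to the GPU only while it is needed.
pipe.enable_model_cpu_offload()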


https://huggingface.co/collections/hunyuanvideo-community/hunyuanvideo-15
testing script
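The collapsed testing script did not survive the export; in its place, a generic text-to-video sketch (the checkpoint id is a placeholder from the collection above, and the call arguments and output attribute are assumptions):

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id from the collection above
    torch_dtype=torch.bfloat16,
).to("cuda")

output = pipe(
    prompt="A cat walks on the grass, realistic style.",
    num_inference_steps=30,  # argument names and values are illustrative
    num_frames=61,
)

export_to_video(output.frames[0], "hunyuanvideo15_sample.mp4", fps=16)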