Hunyuanvideo15 #12696
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thank you for working on this. Eagerly awaiting this one (Wan doesn't work for me).

@tin2tin

I don't have time right now, but I'll definitely check it out later.
@tin2tin do you want to try with group offloading?
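For reference, a minimal sketch of what group offloading could look like here; the checkpoint id is a placeholder and the component names (transformer, text_encoder, vae) are assumptions until the pipeline lands:

import torch
from diffusers import DiffusionPipeline
from diffusers.hooks import apply_group_offloading

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    torch_dtype=torch.bfloat16,
)

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Keep only a few transformer blocks on the GPU at a time; smaller groups
# trade throughput for lower peak VRAM.
pipe.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=1,
)

# Non-diffusers modules (e.g. the text encoder) can use the functional helper.
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=1,
)

# Remaining small components can simply live on the GPU.
pipe.vae.to(onload_device)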
I only have 32 GB RAM, which is usually not enough for group offloading. I did try a single-file pre-quantized checkpoint, but I couldn't get it working.
My comments are mostly nits! I can help with the tests.
        self.tile_latent_min_width = tile_latent_min_width or self.tile_latent_min_width
        self.tile_overlap_factor = tile_overlap_factor or self.tile_overlap_factor

    def disable_tiling(self) -> None:
We could have the model class subclass from AutoencoderMixin to get rid of the common methods. Example:
class AutoencoderKL(ModelMixin, AutoencoderMixin, ConfigMixin, FromOriginalModelMixin, PeftAdapterMixin):
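From the user side, the common helpers in question are the tiling toggles shown in the snippet above; a usage sketch (the class name is inferred from the module file name, the checkpoint id is a placeholder, and the overlap value is illustrative):

import torch
from diffusers import AutoencoderKLHunyuanVideo15  # name inferred from the module file name

vae = AutoencoderKLHunyuanVideo15.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    subfolder="vae",
    torch_dtype=torch.float16,
)

# Tiled decoding keeps peak memory bounded for long or high-resolution videos;
# the overlap factor from the diff above can be overridden per call.
vae.enable_tiling(tile_overlap_factor=0.25)

# ... decode latents ...

vae.disable_tiling()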
        rope_theta: float = 256.0,
        rope_axes_dim: Tuple[int, ...] = (16, 56, 56),
        # YiYi Notes: config based on target_size_config https://github.com/yiyixuxu/hy15/blob/main/hyvideo/pipelines/hunyuan_video_pipeline.py#L205
        target_size: int = 640,  # did not name sample_size since it is in pixel space
Not that important, but we're still doing the AE decoding according to:

video = self.vae.decode(latents, return_dict=False)[0]

I got the impression that the DiT might be predicting directly in pixel space.
    @property
    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
    def attn_processors(self) -> Dict[str, AttentionProcessor]:
We can get rid of these common attention-related methods if we subclass the model from AttentionMixin:

diffusers/src/diffusers/models/transformers/transformer_flux2.py (lines 631 to 639 in 152f7ca):

class Flux2Transformer2DModel(
    ModelMixin,
    ConfigMixin,
    PeftAdapterMixin,
    FromOriginalModelMixin,
    FluxTransformer2DLoadersMixin,
    CacheMixin,
    AttentionMixin,
):
        encoder_hidden_states_cond_emb = self.cond_type_embed(
            torch.zeros_like(encoder_hidden_states[:, :, 0], dtype=torch.long)
        )
        encoder_hidden_states = encoder_hidden_states + encoder_hidden_states_cond_emb

        # byt5 text embedding
        encoder_hidden_states_2 = self.context_embedder_2(encoder_hidden_states_2)

        encoder_hidden_states_2_cond_emb = self.cond_type_embed(
            torch.ones_like(encoder_hidden_states_2[:, :, 0], dtype=torch.long)
        )
        encoder_hidden_states_2 = encoder_hidden_states_2 + encoder_hidden_states_2_cond_emb
Interesting that for encoder_hidden_states_cond_emb we do zeros_like, and for encoder_hidden_states_2_cond_emb we do ones_like.
        # image embed
        encoder_hidden_states_3 = self.image_embedder(image_embeds)
        is_t2v = torch.all(image_embeds == 0)
(nit):
I would imagine that in the pure T2V case, image_embeds would be None. But if that's not the case (the current code suggests image_embeds needs to be zeros), we could change the type hint to image_embeds: torch.Tensor and make it a positional argument?
They actually create an all-zero image_embeds for T2V.
            device=encoder_attention_mask.device,
        )
        encoder_hidden_states_3_cond_emb = self.cond_type_embed(
            2
I see. So, it's probably like:
- First condition embedding type: 0
- Second condition embedding type: 1
- Third condition embedding type: 2
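A standalone sketch of that pattern (shapes, dims, and the feature size are illustrative, not the PR's actual values):

import torch
import torch.nn as nn

batch, text_len, byt5_len, img_len, dim = 1, 256, 128, 16, 64

# One shared embedding table keyed by condition type:
# 0 = primary text tokens, 1 = ByT5 tokens, 2 = image tokens.
cond_type_embed = nn.Embedding(3, dim)

text_states = torch.randn(batch, text_len, dim)
byt5_states = torch.randn(batch, byt5_len, dim)
image_states = torch.randn(batch, img_len, dim)

text_states = text_states + cond_type_embed(torch.zeros(batch, text_len, dtype=torch.long))
byt5_states = byt5_states + cond_type_embed(torch.ones(batch, byt5_len, dtype=torch.long))
image_states = image_states + cond_type_embed(torch.full((batch, img_len), 2, dtype=torch.long))

# For T2V the pipeline passes an all-zero image_embeds tensor, which is how the
# transformer tells T2V apart from I2V:
image_embeds = torch.zeros(batch, img_len, dim)
is_t2v = torch.all(image_embeds == 0)  # True -> text-to-video branch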
Started tests in #12759 :) Completing now.
Co-authored-by: Sayak Paul <[email protected]>
- HunyuanVideo 1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.

- **H100/H800:** `_flash_3_hub` or `_flash_varlen_3`
Okay, I will work on adding the Hub variant for FA3 varlen so that we can ease the user experience a bit here.
- **A100/A800/RTX 4090:** `flash` or `flash_varlen`
- **Other GPUs:** `sage`
(nit): We could recommend flash_hub and sage_hub here as backends instead, to further promote the use of the kernels-based backends. It will also keep things central to the Hub.
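For example, switching the transformer to a Hub-backed kernel via the attention dispatcher might look like this (a sketch: the checkpoint id is a placeholder, and set_attention_backend is assumed to be the dispatcher entry point in recent diffusers):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

# Kernels-based backends are pulled from the Hub, so no local flash-attn build
# is required; "sage_hub" would be the analogous pick for non-Hopper GPUs.
pipe.transformer.set_attention_backend("flash_hub")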
* start tests.
* up
* up
* style.
* up
@bot /style
Style bot fixed some files and pushed the changes.
Congratulations on the commit. Do you have any suggestions on what I could try to get it working on 24 GB VRAM (RTX 4090) and 32 GB RAM?
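Not an official recommendation, but the usual diffusers levers for that budget are 4-bit quantization of the transformer plus CPU offloading; a sketch (the transformer class name and checkpoint id are assumptions, not the PR's confirmed names):

import torch
from diffusers import BitsAndBytesConfig, DiffusionPipeline
from diffusers import HunyuanVideo15Transformer3DModel  # class name assumed; check the PR

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = HunyuanVideo15Transformer3DModel.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Moves each component to the GPU only while it is needed.
pipe.enable_model_cpu_offload()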


https://huggingface.co/collections/hunyuanvideo-community/hunyuanvideo-15
testing script
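The collapsed testing script did not survive the export; in its place, a generic text-to-video sketch (the checkpoint id is a placeholder from the collection above, and the call arguments and output attribute are assumptions):

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "hunyuanvideo-community/<checkpoint>",  # placeholder repo id from the collection above
    torch_dtype=torch.bfloat16,
).to("cuda")

output = pipe(
    prompt="A cat walks on the grass, realistic style.",
    num_inference_steps=30,  # argument names and values are illustrative
    num_frames=61,
)

export_to_video(output.frames[0], "hunyuanvideo15_sample.mp4", fps=16)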