Skip to content

[feat]: support hy3d-2.1#1088

Open
Watebear wants to merge 1 commit into
mainfrom
hy3d
Open

[feat]: support hy3d-2.1#1088
Watebear wants to merge 1 commit into
mainfrom
hy3d

Conversation

@Watebear
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates the Hunyuan3D-2.1 image-to-3D-mesh generation pipeline, introducing new model runners, DiT network components with Mixture-of-Experts (MoE) support, flow-matching schedulers, and VAE-based mesh decoding utilities. It also includes image preprocessing with background removal and a torchvision compatibility fix. The review feedback identifies several implementation issues, including a dictionary iteration bug in the configuration logic, unreachable code in the conditioner, and potential crashes in the image processor. Furthermore, several performance optimizations are recommended for the MoE inference and runner modules to avoid unnecessary device synchronizations and excessive memory management overhead.

else:
raise ValueError(f"{config_file} must be `.yaml` file or it contains `base_config` key.")

config_file = {key: value for key, value in config_file if key != "base_config"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Iterating directly over a DictConfig (or any dict-like object) yields only the keys. Attempting to unpack it as key, value will raise a ValueError. You should use .items() to iterate over key-value pairs.

Suggested change
config_file = {key: value for key, value in config_file if key != "base_config"}
config_file = {key: value for key, value in config_file.items() if key != "base_config"}

if self.disable_drop:
return outputs
else:
random_p = torch.rand(len(image), device="cuda")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The device is hardcoded to "cuda". This will fail on other devices (e.g., CPU, MPS) or if the model is on a specific GPU index. It's better to use the device of the input tensor.

Suggested change
random_p = torch.rand(len(image), device="cuda")
random_p = torch.rand(len(image), device=image.device)

Comment on lines +255 to +258
outputs = {
"main": self.main_image_encoder(image, mask=mask, **kwargs),
}
return outputs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block of code is unreachable because the function returns on line 248 or 253.

Comment on lines +124 to +129
image = Image.open(image_path)
if image.mode == "RGB" and self.enable_rembg:
if self.rembg is None:
self.rembg = BackgroundRemover()
return self.rembg(image.convert("RGB"))
return image.convert("RGBA")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The image file is opened but not explicitly closed. Using a context manager ensures the file handle is released promptly.

Suggested change
image = Image.open(image_path)
if image.mode == "RGB" and self.enable_rembg:
if self.rembg is None:
self.rembg = BackgroundRemover()
return self.rembg(image.convert("RGB"))
return image.convert("RGBA")
def __call__(self, image_path: str) -> Image.Image:
with Image.open(image_path) as image:
if image.mode == "RGB" and self.enable_rembg:
if self.rembg is None:
self.rembg = BackgroundRemover()
return self.rembg(image.convert("RGB"))
return image.convert("RGBA")

Comment on lines +61 to +62
x_min, x_max = coords[0].min(), coords[0].max()
y_min, y_max = coords[1].min(), coords[1].max()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the mask is all zeros (e.g., if background removal fails or the image is empty), np.nonzero(mask) will return empty arrays. Calling .min() or .max() on an empty array will raise a ValueError. You should check if the mask contains any non-zero elements first.

Suggested change
x_min, x_max = coords[0].min(), coords[0].max()
y_min, y_max = coords[1].min(), coords[1].max()
coords = np.nonzero(mask)
if coords[0].size == 0:
raise ValueError("input image is empty")
x_min, x_max = coords[0].min(), coords[0].max()


expert_cache = torch.zeros_like(flat)
idxs = flat_topk_idx.argsort()
tokens_per_expert = flat_topk_idx.bincount(minlength=moe_weights.num_experts).cpu().numpy().cumsum(0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Moving data from GPU to CPU (.cpu().numpy()) inside the inference loop causes a device synchronization, which can significantly degrade performance. Consider performing these operations on the GPU using PyTorch operations like torch.cumsum.

expert_tokens = flat[exp_token_idx]
expert_out = infer_moe_ffn(moe_weights.experts[expert_idx], expert_tokens)
expert_out.mul_(flat_topk_weight[idxs[start_idx:end_idx]])
expert_cache = expert_cache.to(expert_out.dtype)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Re-casting expert_cache to expert_out.dtype inside the loop is inefficient as it creates a new tensor in every iteration. It's better to initialize expert_cache with the desired dtype or cast it once outside the loop.

Comment on lines +140 to +141
torch_device_module.empty_cache()
gc.collect()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling torch.cuda.empty_cache() and gc.collect() at the end of every encoder run is extremely expensive and can lead to significant performance overhead due to GPU synchronization and memory fragmentation. These should generally be avoided unless there is a specific, verified memory leak that cannot be addressed otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant