
[Bug Fix] Fix_muti_backend_bugs#2226

Draft
Galaxy1458 wants to merge 3 commits into master from fix_muti_backend_bugs

Conversation

@Galaxy1458
Collaborator

PR Category

Type of Change

Description

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change is responded to an issue.
  • Change is fully covered by a UT.

Performance

ops_module = importlib.import_module(f"_{vendor_name}.ops")
return importlib.import_module(module_name)
except ModuleNotFoundError:
print(
Contributor

Consider changing this to logger.info?

Collaborator Author

If the except branch is entered, this [note] needs to be shown to the user in any case.

Contributor

Even in that case, error messages are not supposed to be printed to stdout.
This module is at the core of the project, and the whole project can be scripted.
I noticed this unconditional print when writing test scripts.

By the way, it should be fine (though not good ...) if a vendor has no operators implemented.
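
A minimal sketch of what the reviewer is asking for, assuming the goal is only to keep the note visible without writing to stdout. The function name `import_vendor_ops` and the message wording are hypothetical and not taken from the PR; `logging` writes warnings to stderr by default, so scripted callers that parse stdout are not affected.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def import_vendor_ops(vendor_name):
    """Return the vendor's ops module, or None if the vendor ships none.

    Hypothetical helper for illustration; not the PR's actual function.
    """
    module_name = f"_{vendor_name}.ops"
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        # A vendor without operators is tolerated, so report it through
        # logging instead of an unconditional print to stdout.
        logger.warning(
            "No operators found for vendor %r (module %s)", vendor_name, module_name
        )
        return None
```

With no logging configuration at all, the warning still reaches the user through Python's last-resort handler on stderr, which covers the author's concern that the note must be shown in any case.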

Comment on lines +199 to +200
if _state.device_name in ("musa", "aipu", "npu", "txda", "ptpu", "gcu"):
_state.torch_device_fn_device = None
Contributor

I don't like this logic ... it is hard to maintain.
Please consider using a try ... except construct here...

Collaborator Author

Using a try ... except construct here doesn't conform to the original intention of the code. torch_device_fn_device is not necessarily non-existent. It could be that the vendor doesn't want to use it, or that it isn't needed at all.

Contributor

I mean 'except: pass'.
Don't enumerate the device names here ...
No one will remember that we planted some hardcoded, device-specific code here.
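
One way to read the reviewer's suggestion, sketched under assumptions: the helper `resolve_device_ctor` and the stand-in backend objects below are hypothetical, and the sketch assumes the value in question is an attribute named `device` on the backend object. It only covers the "attribute does not exist" case; the author's point that a vendor may deliberately opt out even when the attribute exists would need a separate mechanism, such as a flag declared by the vendor itself.

```python
from types import SimpleNamespace


def resolve_device_ctor(torch_device_fn):
    """Return torch_device_fn.device if the backend exposes it, else None.

    Hypothetical helper: it replaces a hardcoded list of device names with a
    capability check on the backend object itself.
    """
    try:
        return torch_device_fn.device
    except AttributeError:
        # The backend has no `device` entry point; fall back to None
        # without naming any particular vendor.
        return None


# Stand-in backend objects, for illustration only.
with_device = SimpleNamespace(device=lambda index: f"dev:{index}")
without_device = SimpleNamespace()
```

Catching `AttributeError` specifically, rather than a bare `except`, keeps unrelated failures visible.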

vendor_module = get_module("_" + vendor_name)
return vendor_module
if _state.vendor_module is None:
_state.vendor_module = get_module("_" + vendor_name)
Contributor

Is this get_module exception-free?

Collaborator Author

Yes, in principle, there is no exception. If there is an exception for unknown reasons, it must be reported.

Contributor

There are in general two approaches ...
either you catch all exceptions in get_module so the function never throws, and you may get an empty string here (check it);
or you allow it to throw an exception and catch that exception further up the call chain.
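
The two approaches can be sketched side by side. All function names here are hypothetical and are not the project's `get_module`; the first variant returns `None` as its "empty" value, which is an assumption made for the sketch.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def get_module_or_none(module_name):
    """Approach 1: never raise; the caller must check for None."""
    try:
        return importlib.import_module(module_name)
    except Exception as err:
        logger.warning("Could not import %s: %s", module_name, err)
        return None


def get_module_strict(module_name):
    """Approach 2: let import errors propagate to the caller."""
    return importlib.import_module(module_name)


def load_vendor_module(vendor_name):
    """Caller for approach 2: handle the failure once, higher in the call chain."""
    try:
        return get_module_strict("_" + vendor_name)
    except ImportError as err:
        # Re-raise with context so the original cause is preserved.
        raise RuntimeError(f"vendor backend {vendor_name!r} is unavailable") from err
```

Either choice works as long as it is applied consistently: with approach 1 every call site needs a `None` check, while with approach 2 exactly one layer owns the error handling.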

Galaxy1458 and others added 2 commits April 3, 2026 16:06
Refactor error messages for clarity and consistency.

Signed-off-by: Galaxy1458 <[email protected]>
try:
_state.heuristic_config_module = importlib.import_module(mod_name)
except Exception:
continue
Contributor

should be 'pass'?

Suggested change
- continue
+ pass

@@ -1,148 +1,104 @@
import os
Contributor

File mode changed from 644 to 755, please fix.

UNSUPPORT_FP64,
UNSUPPORT_INT64,
vendors,
)
Contributor

cool. This should be extracted.

Comment on lines +29 to +30
if hasattr(self, "initialized"):
return
Contributor

nit: personally I prefer your previous implementation, which makes the instance always have an initialized attribute. We only test whether that attribute is true, rather than whether the instance has a specific attribute.
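
A minimal sketch of the pattern the reviewer prefers, assuming a singleton whose `__init__` may be invoked more than once. The class name `DeviceDetector` and its single field are hypothetical stand-ins, not the PR's class.

```python
class DeviceDetector:
    """Hypothetical singleton whose __init__ body runs only once."""

    _instance = None
    initialized = False  # class-level default: every instance has the attribute

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, vendor_name=None):
        # Test the attribute's value rather than its existence.
        if self.initialized:
            return
        self.vendor_name = vendor_name
        self.initialized = True
```

Because `initialized` always exists, a reader or a static checker never has to reason about whether the attribute has been created yet.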

self.info = self.get_vendor(vendor_name)
self.vendor_name = self.info.vendor_name
self.name = self.info.device_name
self.vendor = vendors.get_all_vendors()[self.vendor_name]
Contributor

Always be careful when using [] ... it is disgusting that it may throw exceptions; use get(key, default) whenever possible.
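
A small sketch of the lookup with `dict.get`, assuming the registry is a plain dictionary keyed by vendor name. The helper `lookup_vendor` and the sample registry are hypothetical; the point is that a missing key becomes an explicit, descriptive error instead of a bare `KeyError`.

```python
def lookup_vendor(vendor_name, registry):
    """Look up a vendor entry without risking a bare KeyError.

    Hypothetical helper; `registry` stands in for whatever mapping the
    project's vendor table returns.
    """
    vendor = registry.get(vendor_name)
    if vendor is None:
        known = ", ".join(sorted(registry))
        raise ValueError(f"unknown vendor {vendor_name!r}; known vendors: {known}")
    return vendor


# Sample data for illustration only.
sample_registry = {"ascend": "ASCEND", "nvidia": "NVIDIA"}
```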

self.device_count = backend.gen_torch_device_object(
self.vendor_name
).device_count()
self.support_fp64 = self.vendor not in UNSUPPORT_FP64
Contributor

sigh ... this feature should be pushed down to vendor layers ... let's leave it to a future PR

if hasattr(torch, attr):
return vendor_name
try:
import torch_npu
Contributor

This damn import is still here ...

except Exception:
return None

with ThreadPoolExecutor() as executor:
Contributor

:) Cannot imagine that we are using multiple threads to do this.

Comment on lines +29 to +37
vendors.CAMBRICON,
vendors.ILUVATAR,
vendors.KUNLUNXIN,
vendors.MTHREADS,
vendors.AIPU,
vendors.ASCEND,
vendors.TSINGMICRO,
vendors.SUNRISE,
vendors.ENFLAME,
Contributor

Mind sorting the set by name?

vendors.AIPU,
vendors.TSINGMICRO,
vendors.SUNRISE,
vendors.ENFLAME,
Contributor

same here

"iluvatar": "corex",
"ascend": "npu",
"sunrise": "ptpu",
"enflame": "gcu",
Contributor

sort this dict?

self.arch_specialized_yaml_config = archEvent.autotune_configs
self.arch_heuristics_config = archEvent.heuristics_configs
except Exception as err:
print(f"[INFO] : {err}")
Contributor

Same suggestion here. Please log an error or a warning.

Comment on lines +188 to +189
if self.device.vendor_name == "hygon":
kwargs["num_ldmatrixes"] = current_config["num_ldmatrixes"]
Contributor

Maybe a more generic way is something like:

Suggested change
- if self.device.vendor_name == "hygon":
-     kwargs["num_ldmatrixes"] = current_config["num_ldmatrixes"]
+ kwargs["num_ldmatrixes"] = current_config.get("num_ldmatrixes", None)

The thing is that in this core module, we don't treat any vendor as a special beast.
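
A possible variant of the same idea, sketched under assumptions: instead of always setting the key (possibly to `None`), forward an optional key only when the configuration actually contains it. The names `OPTIONAL_KEYS` and `build_kwargs` are hypothetical, and `num_warps` is used only as an example of a required key. This matters if the receiving callable does not accept the keyword at all.

```python
# Keys that only some vendors define; hypothetical list for illustration.
OPTIONAL_KEYS = ("num_ldmatrixes",)


def build_kwargs(current_config):
    """Build keyword arguments without any vendor-specific branching."""
    kwargs = {"num_warps": current_config["num_warps"]}
    for key in OPTIONAL_KEYS:
        if key in current_config:
            # Forward the key only when the config provides it.
            kwargs[key] = current_config[key]
    return kwargs
```

The core module then stays vendor-neutral: a vendor opts in simply by putting the key in its configuration.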

num_stages=current_config["num_stages"],
num_ctas=current_config["num_ctas"],
)
)
Contributor

Nice refactoring ...
