Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/GGA: filter out unsupported resources from component list #10372

Merged
merged 1 commit into from
Jan 7, 2025

Conversation

evgeny-leksikov
Copy link
Contributor

What?

Filter out unsupported resources from GGA component

Why?

# Memory domain: mlx5_0
#     Component: gga
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 14 bytes
#           local memory handle is required for zcopy
#         memory types: host (access,reg,cache)
#   < no supported devices found >

full log: https://dev.azure.com/ucfconsort/0b36e3f0-8ab9-4a48-b68b-4b2350e02c88/_apis/build/builds/92757/logs/1012

How?

check capabilities

src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
tvegas1
tvegas1 previously approved these changes Dec 13, 2024
iyastreb
iyastreb previously approved these changes Dec 13, 2024
}

for (num_resources = 0, i = 0; i < num_ib_resources; ++i) {
status = uct_ib_mlx5_gga_md_open(component, resources[i].md_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opening the md during query is a too heavy operation, can we just do mlx5dv_open_device and call UCT_IB_MLX5_CMD_OP_QUERY_HCA_CAP? perhaps extract some common code with uct_ib_mlx5_devx_md_open to a helper function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU mlx5dv_open_device is a heaviest operation of md_open, do you think such refactoring makes sense to avoid a couple of malloc/free?

Comment on lines 849 to 857
mlx5_md = ucs_derived_of(md, uct_ib_mlx5_md_t);
if (!ucs_test_all_flags(mlx5_md->flags, UCT_GGA_MLX5_MD_CAPS)) {
ucs_debug("device %s does not match capabilities "
"required for GGA(%"PRIx32"), md_flags=%"PRIx32,
md_name, UCT_GGA_MLX5_MD_CAPS, mlx5_md->flags);
status = UCS_ERR_NO_DEVICE;
uct_ib_mlx5_devx_md_close(&md->super);
goto out_free_dev_list;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed?

src/uct/ib/mlx5/dv/ib_mlx5dv_md.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_md.h Outdated Show resolved Hide resolved
src/uct/ib/base/ib_md.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/dv/ib_mlx5dv_md.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/gga/gga_mlx5.c Outdated Show resolved Hide resolved
Comment on lines -727 to -729
!ucs_test_all_flags(mlx5_md->flags, UCT_IB_MLX5_MD_FLAG_DEVX |
UCT_IB_MLX5_MD_FLAG_INDIRECT_XGVMI |
UCT_IB_MLX5_MD_FLAG_MMO_DMA)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert to assertion?

@evgeny-leksikov
Copy link
Contributor Author

/azp run

Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@yosefe
Copy link
Contributor

yosefe commented Jan 1, 2025

/azp run perf

@openucx openucx deleted a comment from azure-pipelines bot Jan 1, 2025
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@yosefe yosefe merged commit 55290bb into openucx:master Jan 7, 2025
146 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants