Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dashboard] Remove ReportHead usage of DataSource.agents. #49878

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented Jan 16, 2025

ReportHead used to subscribe to DataSource.agents changes, to maintain a connection to every single O(#node) agents, just to make grpc calls when needed. This PR removes it, now it only make on demand connections when a profiling request comes.

Changes:

  • No longer listen to DataSource.agents.
  • On profiling request, make 1 rpc to InternalKV, to get IP and agent port. Then create a stub on the fly and use it to send outbound requests.
  • Changed InternalKV:
    • key: DASHBOARD_AGENT_PORT_PREFIX -> DASHBOARD_AGENT_ADDR_PREFIX
    • value: json.dumps([http_port, grpc_port]) -> json.dumps([ip, http_port, grpc_port])
  • Changed all profiling requests from giving param ip to give param node_id.
  • Added NodeID filter to GcsClient API.
  • Moved updates of DataSource.node_physical_stats to NodeHead.

After this PR, ReportHead no longer needs DataSource.

Smoke tested UI locally.

@rynewang rynewang requested a review from a team as a code owner January 16, 2025 00:58
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
If either of them are not found, return None.
"""
node_info, agent_port_json = await asyncio.gather(
self.gcs_aio_client.get_all_node_info(node_id=node_id),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as you have a gcs client, you have all nodes cached, we can just use that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That thing only have value if you subscribe to node change:

/// Note, the local cache is only available if `AsyncSubscribeToNodeChange`
/// is called before.

a python side gcs client by default does not have that value.

Comment on lines 382 to 384
virtual Status AsyncGetAll(std::optional<NodeID> node_id,
const MultiItemCallback<rpc::GcsNodeInfo> &callback,
int64_t timeout_ms);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: given filters are optional we can do

virtual Status AsyncGetAll(
                             const MultiItemCallback<rpc::GcsNodeInfo> &callback,
                             int64_t timeout_ms,
                             std::optional<NodeID> node_id = std::nullopt);

so you don't need to update all callers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we already changed, and there's only a few of them, let's just keep them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right we need to do the defaults; cpp tests have a lot of callsites

python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
@rynewang rynewang changed the title [dashboard] remove DataSource usage from ReportHead. [dashboard] Remove ReportHead usage of DataSource.agents. Jan 17, 2025
Signed-off-by: Ruiyang Wang <[email protected]>
@rynewang rynewang added the go add ONLY when ready to merge, run all tests label Jan 17, 2025
@rynewang
Copy link
Contributor Author

@alanwguo this PR changes some HTTP endpoints, and I have changed callsites in frontend accordingly. PTAL!

Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
node_id: The ID of the node.
"""
if "pid" not in req.query:
raise ValueError("pid is required")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems in the code code it's not required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pid is required, otherwise we don't know what pid to traceback

python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
python/ray/dashboard/modules/reporter/reporter_head.py Outdated Show resolved Hide resolved
rynewang and others added 5 commits January 16, 2025 17:21
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants