[Test] [history server] [collector] Ensure event type coverage #4343
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]> Co-authored-by: Jia-Wei Jiang <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
type rayEvent struct {
	EventID   string `json:"eventId"`
	EventType string `json:"eventType"`
}
We define this custom type (with specific fields) to represent the decoded Ray event for future extensibility. For example, we might want to do more fine-grained verification after the history server becomes stable.
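As a rough illustration of how a struct like this could be used, the sketch below decodes a small payload into rayEvent values. It assumes, hypothetically, that an uploaded event file holds a JSON array of event objects; the helper name decodeRayEvents and the sample payload are not from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rayEvent mirrors the struct shown in the diff: only the fields the test
// currently cares about are declared.
type rayEvent struct {
	EventID   string `json:"eventId"`
	EventType string `json:"eventType"`
}

// decodeRayEvents is a hypothetical helper. It assumes the input is a JSON
// array of event objects; fields not declared on rayEvent are ignored by
// encoding/json, so more fields can be added to the struct later.
func decodeRayEvents(data []byte) ([]rayEvent, error) {
	var events []rayEvent
	if err := json.Unmarshal(data, &events); err != nil {
		return nil, fmt.Errorf("decode ray events: %w", err)
	}
	return events, nil
}

func main() {
	// Sample payload invented for this sketch.
	sample := []byte(`[{"eventId":"e1","eventType":"TASK_DEFINITION_EVENT","extra":"ignored"}]`)
	events, err := decodeRayEvents(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(events[0].EventID, events[0].EventType)
}
```

Declaring only eventId and eventType keeps the test tolerant of schema changes while still leaving room for finer-grained checks later.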
  func verifyS3SessionDirs(test Test, g *WithT, s3Client *s3.S3, sessionPrefix string, nodeID string) {
- 	dirs := []string{"logs", "node_events"}
+ 	// TODO(jwj): Separate verification for logs and events.
+ 	dirs := []string{"logs"}
Is it possible to parse the --ray-root-dir argument? If not, it would be helpful to add a comment explaining why.
kuberay/historyserver/config/raycluster.yaml
Line 101 in 79b5c30
- --ray-root-dir=log
LogWithTimestamp(test.T(), "Verifying all %d event types are covered: %v", len(rayEventTypes), rayEventTypes)
g.Eventually(func(gg Gomega) {
	uploadedEvents := []rayEvent{}
	for _, dir := range []string{"node_events", "job_events/AgAAAA==", "job_events/AQAAAA=="} {
Is it deterministic?
Good catch! Not in this case. We've fixed this in 9f70a21 by listing all subdirectories under job_events/.
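The actual change in 9f70a21 is not shown in this thread. As a sketch of the general idea only: instead of hardcoding job ID directories, the immediate subdirectories under job_events/ can be derived from the object keys that were listed. The helper name subdirsUnder and the sample keys are hypothetical; a real implementation against S3 would more likely let the listing call group keys by using a "/" delimiter.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// subdirsUnder returns the sorted, de-duplicated names of the immediate
// subdirectories found under prefix among the given object keys.
// Hypothetical helper for illustration; it is not the code from the PR.
func subdirsUnder(keys []string, prefix string) []string {
	if !strings.HasSuffix(prefix, "/") {
		prefix += "/"
	}
	seen := map[string]struct{}{}
	for _, key := range keys {
		if !strings.HasPrefix(key, prefix) {
			continue
		}
		rest := key[len(prefix):]
		idx := strings.Index(rest, "/")
		if idx <= 0 {
			// Either a file directly under prefix or an empty path segment.
			continue
		}
		seen[rest[:idx]] = struct{}{}
	}
	dirs := make([]string, 0, len(seen))
	for dir := range seen {
		dirs = append(dirs, dir)
	}
	sort.Strings(dirs) // map iteration order is random, so sort for stable output
	return dirs
}

func main() {
	// Keys invented for this sketch; the directory names reuse the two
	// job IDs that were hardcoded in the diff above.
	keys := []string{
		"session_x/job_events/AgAAAA==/events_0",
		"session_x/job_events/AQAAAA==/events_0",
		"session_x/job_events/AQAAAA==/events_1",
		"session_x/node_events/events_0",
	}
	fmt.Println(subdirsUnder(keys, "session_x/job_events"))
}
```

Discovering the directories this way means the test no longer depends on which job IDs Ray happens to assign in a given run.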
combined to 1 PR.
Hi @Future-Outlier, please do a final pass! We've merged the master branch, and I think it's ready to merge.
Local E2E Test
time.sleep(5)
Can you help me try ray.shutdown()? Maybe it will work.
Created an issue here
ray-project/ray#60218
Future-Outlier
left a comment
LGTM, cc @rueian to merge
// rayEventTypes includes all potential event types defined in Ray:
// https://github.com/ray-project/ray/blob/3b41c97fa90c58b0b72c0026f57005b92310160d/src/ray/protobuf/public/events_base_event.proto#L49-L61
var rayEventTypes = []string{
	"ACTOR_DEFINITION_EVENT",
	"ACTOR_LIFECYCLE_EVENT",
	"ACTOR_TASK_DEFINITION_EVENT",
	"DRIVER_JOB_DEFINITION_EVENT",
	"DRIVER_JOB_LIFECYCLE_EVENT",
	"TASK_DEFINITION_EVENT",
	"TASK_LIFECYCLE_EVENT",
	"TASK_PROFILE_EVENT",
	"NODE_DEFINITION_EVENT",
	"NODE_LIFECYCLE_EVENT",
}
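To make the intended check concrete, here is a minimal sketch of how a list of expected event types could be compared against the decoded events. The helper name missingEventTypes is hypothetical and the sample data is invented; the test in this PR may structure the assertion differently (for example, directly inside g.Eventually).

```go
package main

import (
	"fmt"
	"sort"
)

// rayEvent mirrors the struct introduced in this PR.
type rayEvent struct {
	EventID   string `json:"eventId"`
	EventType string `json:"eventType"`
}

// missingEventTypes returns, in sorted order, every expected event type that
// does not appear in uploaded. Hypothetical helper for illustration only.
func missingEventTypes(expected []string, uploaded []rayEvent) []string {
	seen := make(map[string]bool, len(uploaded))
	for _, e := range uploaded {
		seen[e.EventType] = true
	}
	var missing []string
	for _, t := range expected {
		if !seen[t] {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// A shortened expected list and invented events, to keep the example small.
	expected := []string{"NODE_DEFINITION_EVENT", "NODE_LIFECYCLE_EVENT", "TASK_PROFILE_EVENT"}
	uploaded := []rayEvent{
		{EventID: "1", EventType: "NODE_DEFINITION_EVENT"},
		{EventID: "2", EventType: "NODE_DEFINITION_EVENT"},
	}
	fmt.Println("missing:", missingEventTypes(expected, uploaded))
}
```

Reporting the missing types, rather than only a pass/fail boolean, makes a failing e2e run easier to diagnose.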
Can we define this in event.go? Similar to how AllJobStatuses is defined here:
kuberay/ray-operator/apis/ray/v1/rayjob_types.go
Lines 17 to 33 in 910223a
const (
	JobStatusNew       JobStatus = ""
	JobStatusPending   JobStatus = "PENDING"
	JobStatusRunning   JobStatus = "RUNNING"
	JobStatusStopped   JobStatus = "STOPPED"
	JobStatusSucceeded JobStatus = "SUCCEEDED"
	JobStatusFailed    JobStatus = "FAILED"
)
var AllJobStatuses = []JobStatus{
	JobStatusNew,
	JobStatusPending,
	JobStatusRunning,
	JobStatusStopped,
	JobStatusSucceeded,
	JobStatusFailed,
}
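One way this suggestion could look, as a sketch only: the type name EventType, the constant names, and AllEventTypes are hypothetical and may not match what ends up in event.go. Only the ten string values are taken from the rayEventTypes list in this PR.

```go
package main

import "fmt"

// EventType is a hypothetical named string type for Ray event types,
// following the same pattern as JobStatus.
type EventType string

// Constant names are invented for this sketch; the values come from the
// rayEventTypes list in the PR.
const (
	EventTypeActorDefinition     EventType = "ACTOR_DEFINITION_EVENT"
	EventTypeActorLifecycle      EventType = "ACTOR_LIFECYCLE_EVENT"
	EventTypeActorTaskDefinition EventType = "ACTOR_TASK_DEFINITION_EVENT"
	EventTypeDriverJobDefinition EventType = "DRIVER_JOB_DEFINITION_EVENT"
	EventTypeDriverJobLifecycle  EventType = "DRIVER_JOB_LIFECYCLE_EVENT"
	EventTypeTaskDefinition      EventType = "TASK_DEFINITION_EVENT"
	EventTypeTaskLifecycle       EventType = "TASK_LIFECYCLE_EVENT"
	EventTypeTaskProfile         EventType = "TASK_PROFILE_EVENT"
	EventTypeNodeDefinition      EventType = "NODE_DEFINITION_EVENT"
	EventTypeNodeLifecycle       EventType = "NODE_LIFECYCLE_EVENT"
)

// AllEventTypes lists every constant above, mirroring AllJobStatuses.
var AllEventTypes = []EventType{
	EventTypeActorDefinition,
	EventTypeActorLifecycle,
	EventTypeActorTaskDefinition,
	EventTypeDriverJobDefinition,
	EventTypeDriverJobLifecycle,
	EventTypeTaskDefinition,
	EventTypeTaskLifecycle,
	EventTypeTaskProfile,
	EventTypeNodeDefinition,
	EventTypeNodeLifecycle,
}

func main() {
	for _, t := range AllEventTypes {
		fmt.Println(t)
	}
}
```

With a shared definition like this, the collector and the e2e test would read the same list, so a newly added event type only needs to be registered in one place.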
Why are these changes needed?
This PR improves the robustness of collector e2e tests by verifying that all 10 potential event types are present in the aggregated node_events/ and job_events/ directories. Previously, tests only confirmed the existence of non-empty files.
NOTE: This PR follows up on #4342, which fixed getJobID for job event collection.
Key Changes
- node_events/ and job_events/
- testCollectorSeparatesFilesBySession: Force kill the ray-worker container, which triggers event flushing instead of relying on automatic deletion
- sleep(5) in the RayJob manifest entrypoint
Related issue number
N/A
Related PR
#4342
Test Results
Checks