Skip to content

Conversation

@JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Jan 5, 2026

Why are these changes needed?

This PR improves the robustness of collector e2e tests by verifying that all 10 potential event types are present in the aggregated node_events/ and job_events/ directories. Previously, tests only confirmed the existence of non-empty files.

NOTE: This PR follows up on #4342, which fixed getJobID for job event collection.

Key Changes

  • Event type coverage: Ensure all 10 defined Ray event types are collected across node_events/ and job_events/
    • Events from these 2 directories are aggregated for verification due to their non-deterministic nature.
  • Improved testCollectorSeparatesFilesBySession: Force kill the ray-worker container, which triggers event flushing instead of relying on automatic deletion
  • Add sleep(5) in the RayJob manifest entrypoint
    • We need to wait for the core worker to send events to the aggregator agent via async gRPC. Without this delay, the event transmission may be cut off prematurely.

Related issue number

N/A

Related PR

#4342

Test Results

Screenshot 2026-01-06 at 11 46 37 PM

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Future-Outlier and others added 8 commits January 5, 2026 21:10
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Jia-Wei Jiang <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Comment on lines +59 to +63
type rayEvent struct {
EventID string `json:"eventId"`
EventType string `json:"eventType"`
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We define this custom type (with specific fields) to represent the decoded Ray event for future extensibility. For example, we might want to do more fine-grained verification after the history server becomes stable.

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as ready for review January 6, 2026 15:47
@JiangJiaWei1103 JiangJiaWei1103 changed the title [history server] [collector] Ensure event type coverage [Test] [history server] [collector] Ensure event type coverage Jan 7, 2026
func verifyS3SessionDirs(test Test, g *WithT, s3Client *s3.S3, sessionPrefix string, nodeID string) {
dirs := []string{"logs", "node_events"}
// TODO(jwj): Separate verification for logs and events.
dirs := []string{"logs"}
Copy link
Collaborator

@win5923 win5923 Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to parse the --ray-root-dir argument? If not, it would be helpful to add a comment explain.

- --ray-root-dir=log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @win5923,

Thanks for the review! We have another PR that focuses on log file verification: #4351. I’ll make the clarification there instead.

LogWithTimestamp(test.T(), "Verifying all %d event types are covered: %v", len(rayEventTypes), rayEventTypes)
g.Eventually(func(gg Gomega) {
uploadedEvents := []rayEvent{}
for _, dir := range []string{"node_events", "job_events/AgAAAA==", "job_events/AQAAAA=="} {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it deterministic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Not in this case. We've fixed this in 9f70a21 by listing all subdirectories under job_events/.

@Future-Outlier Future-Outlier self-assigned this Jan 13, 2026
@Future-Outlier
Copy link
Member

Future-Outlier commented Jan 14, 2026

combined to 1 PR.
it's actually better to separate, but due to limited review bandwidth, let's combine to 1
#4351

@JiangJiaWei1103
Copy link
Contributor Author

JiangJiaWei1103 commented Jan 14, 2026

Hi @Future-Outlier,

Please do a final pass! We've merged the master branch and I think it's ready for merge.

Local E2E Test

Screenshot 2026-01-14 at 8 34 20 PM

Comment on lines 39 to 40
time.sleep(5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you help me try ray.shutdown(), maybe it will work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it doesn't work either:

Screenshot 2026-01-15 at 7 59 52 PM

Maybe the error is on the Ray side, let's keep tracking this issue!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue here
ray-project/ray#60218

Signed-off-by: JiangJiaWei1103 <[email protected]>
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, cc @rueian to merge

Comment on lines 42 to 55
// rayEventTypes includes all potential event types defined in Ray:
// https://github.com/ray-project/ray/blob/3b41c97fa90c58b0b72c0026f57005b92310160d/src/ray/protobuf/public/events_base_event.proto#L49-L61
var rayEventTypes = []string{
"ACTOR_DEFINITION_EVENT",
"ACTOR_LIFECYCLE_EVENT",
"ACTOR_TASK_DEFINITION_EVENT",
"DRIVER_JOB_DEFINITION_EVENT",
"DRIVER_JOB_LIFECYCLE_EVENT",
"TASK_DEFINITION_EVENT",
"TASK_LIFECYCLE_EVENT",
"TASK_PROFILE_EVENT",
"NODE_DEFINITION_EVENT",
"NODE_LIFECYCLE_EVENT",
}
Copy link
Collaborator

@win5923 win5923 Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define in the event.go? Similar to how AllJobStatuses is defined here:

const (
JobStatusNew JobStatus = ""
JobStatusPending JobStatus = "PENDING"
JobStatusRunning JobStatus = "RUNNING"
JobStatusStopped JobStatus = "STOPPED"
JobStatusSucceeded JobStatus = "SUCCEEDED"
JobStatusFailed JobStatus = "FAILED"
)
var AllJobStatuses = []JobStatus{
JobStatusNew,
JobStatusPending,
JobStatusRunning,
JobStatusStopped,
JobStatusSucceeded,
JobStatusFailed,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @win5923, thanks for reviewing! Fixed at d94d02a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants