
Conversation

@finbarrtimbers
Copy link
Collaborator

Add eval/wait_time_between_evals metric to Wandb to measure the idle time between evaluation runs.
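A minimal sketch of what this could look like, assuming a timer wrapped around the eval launch (the variable and function names below are illustrative, not the actual ones in this PR):

```python
import time

last_eval_end_time = None  # hypothetical module-level state, not the PR's actual variable


def run_eval_and_log_wait(training_step: int):
    """Run an eval and log how long we sat idle since the previous eval finished."""
    global last_eval_end_time
    if last_eval_end_time is not None:
        eval_wait_time = time.perf_counter() - last_eval_end_time
        try:
            import wandb
            if wandb.run is not None:  # only log when a run is active
                wandb.log({"eval/wait_time_between_evals": eval_wait_time}, step=training_step)
        except ImportError:
            pass  # wandb not installed; skip logging
    # ... run the actual evaluation here ...
    last_eval_end_time = time.perf_counter()
```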

@hamishivi
Collaborator


I don't quite understand the purpose of tracking this? See comments :)

@@ -0,0 +1,109 @@
# Eval Wait Time Metric Implementation

Collaborator

Could we put this file into a documentation folder or similar? I also feel like we don't necessarily need the full file; maybe just an explanation of the flag.


Collaborator Author

Ah, yeah, sorry. I'm trying out Cursor's background agents, and it put this here. Let me mark the PR (and the other Cursor ones) as draft until I clean it up.

I totally agree with you.

try:
    import wandb
    # Only log when an active wandb run exists.
    if hasattr(wandb, 'run') and wandb.run is not None:
        wandb.log({"eval/wait_time_between_evals": eval_wait_time}, step=training_step)
except ImportError:
    pass  # wandb unavailable; skip logging (except clause assumed; not shown in this diff excerpt)

Collaborator

I don't quite understand why we want to track the time between evals. We can get an arbitrary number of training steps in between evals, so isn't this usually just telling us how long n training steps take?

@finbarrtimbers
Collaborator Author
Jul 14, 2025

yeah, that's fair. I'm trying to understand how we're doing evals generally. I think this is the wrong thing to measure.


Collaborator

Yeah, fair, it's a bit unclear. There are sort of two eval loops that run (copying from a Slack message I wrote recently):

  1. In-loop evals, which just re-use the generation and reward/verifier code. I'm not crazy about these, but they can be useful for observing how model generations change over training, and they can give you an idea of val reward. They're controlled by num_evals, which works out the total number of training steps from the episode count and then sets eval_freq accordingly (a rough sketch of that calculation is below): https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py#L1459. I'm actually not really a fan of this way of doing it but don't feel strongly enough to change it haha.

  2. oe-eval evals, which are downstream and launched as separate jobs, and are the actual final numbers we usually care about. These are tied to save_freq AND require setting try_launch_beaker_eval_jobs_on_weka to True (even on Augusta; we should rename this arg). We could add a further check at e.g. https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py#L1785. Note that we can't really untie this from save_freq, since the oe-eval jobs need some sort of path.
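
For reference, a rough sketch of how the num_evals to eval_freq calculation in (1) could work, assuming the total step count is derived from the episode count and batch size; the function name and arguments below are illustrative, not the actual grpo_fast.py code:

```python
def compute_eval_freq(total_episodes: int, batch_size: int, num_evals: int) -> int:
    """Illustrative: spread num_evals in-loop evals evenly over the whole run."""
    total_training_steps = total_episodes // batch_size
    # Never evaluate more often than once per training step.
    return max(1, total_training_steps // num_evals)


# e.g. 1,000,000 episodes, batch size 512, num_evals=10 -> eval every 195 steps
print(compute_eval_freq(1_000_000, 512, 10))
```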

@finbarrtimbers marked this pull request as draft July 14, 2025 17:36
