
Conversation

@finbarrtimbers
Copy link
Collaborator

Add eval/wait_time_between_evals metric to Wandb to measure the idle time between evaluation runs.
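A minimal sketch of what this could look like, assuming a timer wrapped around the eval launch (the variable and function names below are illustrative, not the actual ones in this PR):

```python
import time

last_eval_end_time = None  # hypothetical module-level state, not the PR's actual variable


def run_eval_and_log_wait(training_step: int):
    """Run an eval and log how long we sat idle since the previous eval finished."""
    global last_eval_end_time
    if last_eval_end_time is not None:
        eval_wait_time = time.perf_counter() - last_eval_end_time
        try:
            import wandb
            if wandb.run is not None:  # only log when a run is active
                wandb.log({"eval/wait_time_between_evals": eval_wait_time}, step=training_step)
        except ImportError:
            pass  # wandb not installed; skip logging
    # ... run the actual evaluation here ...
    last_eval_end_time = time.perf_counter()
```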

@hamishivi
Collaborator


I don't quite understand the purpose of tracking this? See comments :)

@@ -0,0 +1,109 @@
# Eval Wait Time Metric Implementation

Collaborator

Could we put this file into a documentation folder or similar? I also feel like we don't necessarily need the full file; maybe just an explanation of the flag.


Collaborator Author

Ah, yeah, sorry. I'm trying out Cursor's background agents, and it put this here. Let me mark the PR (and the other Cursor ones) as draft until I clean it up.

I totally agree with you.

try:
    import wandb
    # Only log when an active wandb run exists.
    if hasattr(wandb, 'run') and wandb.run is not None:
        wandb.log({"eval/wait_time_between_evals": eval_wait_time}, step=training_step)
except ImportError:
    pass  # wandb unavailable; skip logging (except clause assumed; not shown in this diff excerpt)

Collaborator

I don't quite understand why we want to track the time between evals. We can get an arbitrary number of training steps in between evals, so isn't this usually just telling us how long n training steps take?

@finbarrtimbers
Collaborator Author
Jul 14, 2025

yeah, that's fair. I'm trying to understand how we're doing evals generally. I think this is the wrong thing to measure.


Collaborator

Yeah, fair, it's a bit unclear. There are sort of two eval loops that run (copying from a Slack message I wrote recently):

  1. In-loop evals, which just re-use the generation and reward/verifier code. I'm not crazy about these, but they can be useful for observing how model generations change over training, and they can give you an idea of val reward. They're controlled by num_evals, which works out the total number of training steps from the episode count and then sets eval_freq accordingly (a rough sketch of that calculation is below): https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py#L1459. I'm actually not really a fan of this way of doing it but don't feel strongly enough to change it haha.

  2. oe-eval evals, which are downstream and launched as separate jobs, and are the actual final numbers we usually care about. These are tied to save_freq AND require setting try_launch_beaker_eval_jobs_on_weka to True (even on Augusta; we should rename this arg). We could add a further check at e.g. https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py#L1785. Note that we can't really untie this from save_freq, since the oe-eval jobs need some sort of path.
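
For reference, a rough sketch of how the num_evals to eval_freq calculation in (1) could work, assuming the total step count is derived from the episode count and batch size; the function name and arguments below are illustrative, not the actual grpo_fast.py code:

```python
def compute_eval_freq(total_episodes: int, batch_size: int, num_evals: int) -> int:
    """Illustrative: spread num_evals in-loop evals evenly over the whole run."""
    total_training_steps = total_episodes // batch_size
    # Never evaluate more often than once per training step.
    return max(1, total_training_steps // num_evals)


# e.g. 1,000,000 episodes, batch size 512, num_evals=10 -> eval every 195 steps
print(compute_eval_freq(1_000_000, 512, 10))
```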

@finbarrtimbers marked this pull request as draft July 14, 2025 17:36
