Add waiting time metric to Wandb #779
Conversation
Co-authored-by: finbarrtimbers <[email protected]>
I don't quite understand the purpose of tracking this? See comments :)
@@ -0,0 +1,109 @@
# Eval Wait Time Metric Implementation
Could we put this file into a documentation folder or similar? I also feel like we don't necessarily need the full file; maybe just an explanation of the flag.
Ah, yeah, sorry. I'm trying out Cursor's background agents, and it put this here. Let me mark the PR (and the other Cursor ones) as draft until I clean it up.
I totally agree with you.
try:
    import wandb
    if hasattr(wandb, 'run') and wandb.run is not None:
        wandb.log({"eval/wait_time_between_evals": eval_wait_time}, step=training_step)
I don't quite understand why we want to track time between evals. We can get an arbitrary number of training steps in between evals, so isn't this usually just telling us how long n training steps take?
yeah, that's fair. I'm trying to understand how we're doing evals generally. I think this is the wrong thing to measure.
Yeah, fair, it's a bit unclear. There are sort of two eval loops that run (copying from a Slack message I wrote recently):
- in-loop evals, which just re-use the generation and reward/verifier code. I'm not crazy about these, but they can be useful to observe how model generations change over training, and can give you an idea of val reward. Controlled by num_evals, which works out the total number of training steps via the episode count and then sets eval_freq accordingly (see the sketch after this list): https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py#L1459. I'm actually not really a fan of this way of doing it, but don't feel strongly enough to change it haha.
- oe-eval evals, which are downstream, launched as separate jobs, and are the actual final numbers we usually care about. This is tied to save_freq AND requires you to set try_launch_beaker_eval_jobs_on_weka to True (even if on Augusta; we should rename this arg). We could add a further check at e.g. https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py#L1785. Note that we can't really untie this, since the oe-eval jobs need some sort of path.
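For reference, a minimal, hypothetical sketch of the kind of scheduling calculation described in the first bullet; the function and variable names are illustrative assumptions, not the actual identifiers in grpo_fast.py, which derives eval_freq from the episode count and num_evals:

```python
# Illustrative only: deriving an in-loop eval frequency from a target number
# of evals and the total episode budget (names are assumptions, not the exact
# identifiers used in open_instruct/grpo_fast.py).
def compute_eval_freq(total_episodes: int, batch_size: int, num_evals: int) -> int:
    """Spread `num_evals` in-loop evaluations roughly evenly over training."""
    total_training_steps = total_episodes // batch_size  # assumed relationship
    return max(1, total_training_steps // max(1, num_evals))


# Example: 200,000 episodes with batch size 512 gives ~390 training steps,
# so 10 in-loop evals means evaluating roughly every 39 steps.
print(compute_eval_freq(200_000, 512, 10))  # -> 39
```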
Add eval/wait_time_between_evals metric to Wandb to measure the idle time between evaluation runs.
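A minimal sketch of how such an idle-time measurement could be wired up, assuming the same guarded wandb.log pattern quoted in the diff above; the helper name and the timestamp bookkeeping are hypothetical, not necessarily how the PR implements it:

```python
import time

import wandb

_last_eval_end_time: float | None = None  # hypothetical module-level bookkeeping


def run_eval_and_log_wait_time(training_step: int) -> None:
    """Log how long training ran between the end of one eval and the start of the next."""
    global _last_eval_end_time
    start = time.perf_counter()
    if _last_eval_end_time is not None and wandb.run is not None:
        wandb.log(
            {"eval/wait_time_between_evals": start - _last_eval_end_time},
            step=training_step,
        )
    # ... run the in-loop evaluation here (omitted) ...
    _last_eval_end_time = time.perf_counter()
```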