feat: add a new arg for HFResourceScanner callback #397
base: main
Conversation
Signed-off-by: Angel Luu <[email protected]>
Thanks for making a pull request! 😃
pyproject.toml
Outdated
@@ -38,6 +38,7 @@ dependencies = [
     "protobuf>=5.28.0,<6.0.0",
     "datasets>=2.15.0,<3.0",
     "simpleeval>=0.9.13,<1.0",
+    "HFResourceScanner",
@aluu317 where is this HFResourceScanner residing? HF Trainer already has some internal tools to measure resources, so I want to know what delta this provides.
@fabianlim that's the scanner: https://github.com/foundation-model-stack/hf-resource-scanner
Could you send me a link to the HF Trainer internal tool?
@fabianlim this is more of a lightweight solution (very little overhead in terms of runtime) and it also reports richer information than the internal trainer tool, such as a memory breakdown. Furthermore, there is more control through this repo to incrementally add new metrics as needed.
@aluu317 can we have this as part of the optional dependencies rather than as a default dependency?
@aluu317 you can see it via this arg for time and via TrainerMemoryTracker.
I see that hf-resource-scanner also uses torch.cuda calls to measure memory. This is the same as TrainerMemoryTracker, and it is known that there is some overhead. So why is this new implementation more lightweight? @kmehant
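For reference, a minimal sketch of the built-in HF Trainer alternative referenced above; `skip_memory_metrics` is the TrainingArguments flag that drives TrainerMemoryTracker, assuming a recent transformers version:

```python
# Minimal sketch: enabling HF Trainer's built-in TrainerMemoryTracker.
# skip_memory_metrics defaults to True (tracking disabled) precisely
# because the underlying torch.cuda measurements add some overhead.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    skip_memory_metrics=False,  # turn the built-in memory tracker on
)
# After trainer.train(), memory metrics such as
# train_mem_gpu_alloc_delta and train_mem_gpu_peaked_delta appear in
# the returned/logged metrics.
```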
@kmehant yes, I noticed that after I posted my original message, but I have edited it to ask a different thing now, please see.
@fabianlim Scanner is more general-purpose (not just time and memory, but other things as needed) and lightweight in that it instruments a single target step (not the whole run).
In general, it is not targeted for prod (though it can be) - mainly for testing/debugging etc.
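A hedged sketch of what that single-target-step usage could look like; `Scanner(output_fmt=...)` appears in this PR's diff, while the `target_step` parameter name is taken from the PR description below and the exact constructor signature is an assumption:

```python
# Hedged sketch: the scanner instruments one target step rather than
# the whole run. target_step here is an assumption based on the PR
# description; the real Scanner API may differ.
from HFResourceScanner import Scanner

scanner = Scanner(output_fmt="scanner_output.json", target_step=5)
# It is then attached like any other HF TrainerCallback, e.g.:
# trainer = Trainer(..., callbacks=[scanner])
```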
@ChanderG ok, then all the more it should not be the default behavior (cc @aluu317).
Do you mind sharing some thoughts on my comment about torch.cuda?
Not sure exactly what the HFTrainer version is doing, but I have benchmarked Scanner overheads and they are negligible.
@ChanderG it depends on the model. For the larger ones that consume a lot of GPU memory, I have observed the overheads to be non-negligible, in particular when doing multi-GPU training. For memory measurement, HFTrainer uses torch.cuda, just like your package.
Signed-off-by: Angel Luu <[email protected]>
tuning/sft_trainer.py
Outdated
if add_scanner_callback:
    # Third Party
    # pylint: disable=import-outside-toplevel
    from HFResourceScanner import Scanner
Can we check if the package exists and only then import it?
If the package does not exist, shall we fall back to not adding the scanner callback and reporting it to the user through a warning? WDYT?
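Something along these lines, as a hedged sketch (the identifiers come from this PR's diff; the warning text is illustrative):

```python
# Hedged sketch of the guarded import suggested above.
import logging

logger = logging.getLogger(__name__)

try:
    # Third Party
    # pylint: disable=import-outside-toplevel
    from HFResourceScanner import Scanner
except ImportError:
    Scanner = None
    logger.warning(
        "add_scanner_callback is set but HFResourceScanner is not "
        "installed; continuing without the scanner callback."
    )
```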
Yes, 100%. It must be optional. We need to do a check like this.
I added a commit to make HFResourceScanner a .[scanner-dev] optional package. Let me know if this is what you meant @kmehant @fabianlim
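For context, a sketch of the pyproject.toml wiring this implies; the extra name matches the `.[scanner-dev]` mentioned above, though the exact layout in the repo may differ:

```toml
# Hedged sketch of the optional-dependency entry in pyproject.toml.
[project.optional-dependencies]
scanner-dev = ["HFResourceScanner"]
# Installed with: pip install ".[scanner-dev]"
```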
yes this is ok @aluu317
@kmehant @fabianlim @ChanderG Should we proceed with this PR or revisit another way to scan for memory and time from tuning? Please review if it's ok.
@aluu317 I leave that to others to comment. I am not very clear on what the use cases for this resource scanner are. In particular, why do the current HF metrics not suffice and why is this needed? But if it's optional I am fine.
tuning/sft_trainer.py
Outdated
if training_args.output_dir:
    os.makedirs(training_args.output_dir, exist_ok=True)
    logger.info("using the output directory at %s", training_args.output_dir)
if add_scanner_callback:
    output_fmt = os.path.join(training_args.output_dir, "scanner_output.json")
    sc_callback = [Scanner(output_fmt=output_fmt)]
@aluu317
Can we check again if the package exists here? If it exists, we add the callback; if not, we can raise a ValueError or warn the user and fall back to not adding the scanner. Either way is fine.
Thank you.
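A hedged sketch of the runtime check being suggested, wrapped in a hypothetical helper so the callback is only added when the package is actually available (the names add_scanner_callback, training_args, and scanner_output.json come from this PR's diff; the helper itself is illustrative):

```python
# Hedged sketch: check for the package before building the callback.
import importlib.util
import logging
import os

logger = logging.getLogger(__name__)


def maybe_build_scanner_callbacks(training_args, add_scanner_callback):
    """Return [Scanner(...)] when requested and installed, else []."""
    if not add_scanner_callback:
        return []
    if importlib.util.find_spec("HFResourceScanner") is None:
        logger.warning(
            "add_scanner_callback was requested but HFResourceScanner is "
            "not installed; skipping the scanner callback."
        )
        return []
    # pylint: disable=import-outside-toplevel
    from HFResourceScanner import Scanner

    output_fmt = os.path.join(training_args.output_dir, "scanner_output.json")
    return [Scanner(output_fmt=output_fmt)]
```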
Sounds good, I added another check. Sorry for the delay, I was OOO for a few days. Please review again.
There are certain metrics, like memory at certain points and a few others, that are required for improving the estimator. It would be good to keep this optional like it is now, so that we can enable it for the environments where the data collection is required. I would suggest we go ahead with this and merge, if we are okay with this optional feature.
Force-pushed from 822e255 to b510c2e
LGTM! Thank you.
tuning/sft_trainer.py
Outdated
        training_args.output_dir, "scanner_output.json"
    )
    sc_callback = [Scanner(output_fmt=output_fmt)]
    logging.info(
nit: can we use logger instead of logging?
Ahh my bad, I didn't see this. Fixed.
Thank you
Signed-off-by: Angel Luu <[email protected]>
@aluu317 @kmehant @ashokponkumar just a thought... why not use the same ...? You can disable the ...
@dushyantbehl +1 to this approach. Apologies @aluu317 for multiple rounds of requests :( WDYT about @dushyantbehl's design? I wish to know your thoughts, thank you.
Description of the change
This PR adds a new argument to sft_trainer, add_scanner_callback, which is False by default. When enabled, we add a callback that uses HFResourceScanner to measure memory usage, tokens per second (tps), and time consumed for the training. This is an actual usage measurement, not an estimation.
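A hedged sketch of enabling the flag from the command line; only --add_scanner_callback and the scanner_output.json location come from this PR, the other arguments are illustrative:

```sh
# Hypothetical invocation; argument names other than
# --add_scanner_callback are illustrative.
python tuning/sft_trainer.py \
    --model_name_or_path <model> \
    --training_data_path <data.json> \
    --output_dir out \
    --add_scanner_callback true
# When enabled, scanner results are written to <output_dir>/scanner_output.json
```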
Q: I wonder if target_step should be 1 here so that the scanner fires regardless of the number of train epochs?
Related issue number
How to verify the PR
Was the PR tested