
feat: add a new arg for HFResourceScanner callback #397

Open · wants to merge 9 commits into main
Conversation

@aluu317 (Collaborator) commented Nov 27, 2024

Description of the change

This PR adds a new argument, add_scanner_callback, to sft_trainer, which is False by default.
When enabled, we add a callback that uses HFResourceScanner to measure memory usage, tokens per second, and time consumed during training. This is an actual usage measurement, not an estimation.

Q: I wonder if target_step should be 1 here so the scanner fires regardless of the number of train epochs?
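A minimal sketch of how such a flag can gate the callback. The helper `build_callbacks` is hypothetical, not the PR's actual code; the `Scanner(output_fmt=...)` call mirrors the diff shown later in this thread:

```python
import os


def build_callbacks(add_scanner_callback: bool = False, output_dir: str = "."):
    """Hypothetical helper: gate the optional Scanner callback behind a flag.

    Scanner is imported lazily so the dependency stays optional.
    """
    callbacks = []
    if add_scanner_callback:
        # pylint: disable=import-outside-toplevel
        from HFResourceScanner import Scanner

        output_fmt = os.path.join(output_dir, "scanner_output.json")
        callbacks.append(Scanner(output_fmt=output_fmt))
    return callbacks
```

With the default of False, the import never happens and training is unaffected.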

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass.

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Nov 27, 2024
pyproject.toml Outdated
```diff
@@ -38,6 +38,7 @@ dependencies = [
     "protobuf>=5.28.0,<6.0.0",
     "datasets>=2.15.0,<3.0",
     "simpleeval>=0.9.13,<1.0",
+    "HFResourceScanner",
```
Collaborator

@aluu317 where is this HFResourceScanner residing? HF Trainer already has some internal tools to measure resources, so I want to know what delta this provides.

Collaborator Author

@fabianlim https://github.com/foundation-model-stack/hf-resource-scanner that's the scanner.

Could you send me a link to the HF Trainer internal tool?

Collaborator

@fabianlim this is more of a lightweight solution (very little overhead in terms of runtime) and it also reports richer information than the internal trainer tool, like a memory breakdown. Furthermore, this repo gives us more control to incrementally add new metrics as needed.

Collaborator

@aluu317 can we have this as part of the optional dependencies rather than a default dependency?

Collaborator

@fabianlim fabianlim Nov 28, 2024

@aluu317 you can see it via these args for time, and via TrainerMemoryTracker.

I see that hf-resource-scanner also uses torch.cuda calls to measure memory. This is the same as TrainerMemoryTracker, and it is known to add some overhead. So why is this new implementation more lightweight? @kmehant

Collaborator

@kmehant yes, I noticed after I posted my original message, but I have edited it to ask a different thing now, please see.


@fabianlim Scanner is more general purpose (not just time and memory, but other things as needed) and lightweight in that it instruments only a single target step, not the whole run.

In general, it is not targeted for prod (though it can be); it is mainly for testing/debugging, etc.
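The "single target step" idea can be illustrated with a stripped-down timer that records measurements around one chosen step only. This is a schematic stand-in, not HFResourceScanner's actual implementation:

```python
import time


class SingleStepTimer:
    """Schematic: instrument exactly one target step, leave all others untouched."""

    def __init__(self, target_step: int = 5):
        self.target_step = target_step
        self._start = None
        self.step_time = None  # populated only for the target step

    def on_step_begin(self, step: int):
        if step == self.target_step:
            self._start = time.perf_counter()

    def on_step_end(self, step: int):
        if step == self.target_step and self._start is not None:
            self.step_time = time.perf_counter() - self._start
```

Because every other step is a no-op comparison, the per-step overhead of this pattern is negligible compared with instrumenting the whole run.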

Collaborator

@ChanderG ok, then all the more it should not be the default behavior (cc @aluu317).

Do you mind sharing some thoughts on my comment about torch.cuda?


Not sure exactly what the HF Trainer version is doing, but I have benchmarked Scanner overheads and they are negligible.

Collaborator

@fabianlim fabianlim Dec 3, 2024

@ChanderG it depends on the model. For the larger ones that consume a lot of GPU memory, I have observed the overheads to be non-negligible, in particular when doing multi-GPU training. For memory measurement, HF Trainer uses torch.cuda, just like your package.

```python
if add_scanner_callback:
    # Third Party
    # pylint: disable=import-outside-toplevel
    from HFResourceScanner import Scanner
```
Collaborator

Can we check if the package exists and only then import it?

If the package does not exist, shall we fall back to not adding the scanner callback and report it to the user through a warning? WDYT?
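One common way to do that check is `importlib.util.find_spec`, which looks the package up without importing it. A sketch (the helper name and warning text are illustrative, not the PR's actual code):

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def maybe_add_scanner(callbacks: list) -> list:
    """Append the Scanner callback only if HFResourceScanner is installed."""
    if importlib.util.find_spec("HFResourceScanner") is None:
        # Fall back gracefully: training proceeds, the user is just warned.
        logger.warning(
            "HFResourceScanner is not installed; skipping the scanner callback."
        )
        return callbacks
    # pylint: disable=import-outside-toplevel
    from HFResourceScanner import Scanner

    callbacks.append(Scanner())
    return callbacks
```

`find_spec` returns None for a missing package, so no ImportError is ever raised on the fallback path.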

Collaborator

Yes, 100%, it must be optional. We need to do a check like this.

Collaborator Author

I added a commit to make the HFResourceScanner a .[scanner-dev] optional package. Let me know if this is what you meant @kmehant @fabianlim

Collaborator

yes this is ok @aluu317

@aluu317 (Collaborator Author) commented Dec 4, 2024

@kmehant @fabianlim @ChanderG Should we proceed with this PR or revisit another way to scan for memory and time from tuning? Please review if it's ok.

@fabianlim (Collaborator) commented Dec 4, 2024

@aluu317 I leave that to others to comment. I am not very clear on the use cases for this resource scanner, in particular why the current HF metrics do not suffice and why this is needed. But if it's optional, I am fine.


```python
if training_args.output_dir:
    os.makedirs(training_args.output_dir, exist_ok=True)
    logger.info("using the output directory at %s", training_args.output_dir)
if add_scanner_callback:
    output_fmt = os.path.join(training_args.output_dir, "scanner_output.json")
    sc_callback = [Scanner(output_fmt=output_fmt)]
```
Collaborator

@aluu317
Can we check again if the package exists here? If it exists, we add the callback; if not, we can raise a ValueError, or warn the user and fall back to not adding the scanner. Either way is fine.

Thank you.

Collaborator Author

Sounds good, I added another check. Sorry for the delay, I was OOO for a few days. Please review again.

@ashokponkumar (Collaborator)

> @aluu317 i leave that to others to comment. I am not very clear what are the use cases for this resource scanner. Im particular why the current HF metrics do not suffice and why this is needed. But if its optional I am fine.

There are certain metrics, like memory at certain points and a few others, that are required for improving the estimator. It would be good to keep this optional like it is now, so that we can enable it for the environments where data collection is required.

I would suggest we go ahead with this and merge, if we are okay with this optional feature.

kmehant
kmehant previously approved these changes Dec 11, 2024
Collaborator

@kmehant kmehant left a comment

LGTM! Thank you.

```python
        training_args.output_dir, "scanner_output.json"
    )
    sc_callback = [Scanner(output_fmt=output_fmt)]
    logging.info(
```
Collaborator

nit: can we use `logger` instead of `logging`?

Collaborator Author

Ahh my bad, I didn't see this. Fixed.

Collaborator

Thank you

@kmehant kmehant self-requested a review December 11, 2024 03:28
@kmehant kmehant dismissed their stale review December 11, 2024 03:29

nitpick, once resolved we can merge.

Signed-off-by: Angel Luu <[email protected]>
@dushyantbehl (Contributor)

@aluu317 @kmehant @ashokponkumar just a thought: why not use the same tracker backend for the Scanner?

You can disable the track and set_params API of the tracker object by making them no-ops, but this would mean all the initialisation, package finding, etc. moves to a separate file, making the main code cleaner.
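A rough sketch of the suggested shape. The `Tracker` base class below is an illustrative stand-in for the repo's actual tracker interface, not its real API; method names and signatures are assumptions:

```python
class Tracker:
    """Illustrative stand-in for the repo's tracker interface."""

    def track(self, metric, name, stage):
        raise NotImplementedError

    def set_params(self, params, name):
        raise NotImplementedError

    def get_hf_callback(self):
        raise NotImplementedError


class ScannerTracker(Tracker):
    """Wrap HFResourceScanner as a tracker whose track/set_params are no-ops."""

    def track(self, metric, name, stage):
        # No-op: Scanner writes its own output file at the target step.
        pass

    def set_params(self, params, name):
        # No-op for the same reason.
        pass

    def get_hf_callback(self):
        # Lazy import keeps the dependency optional.
        # pylint: disable=import-outside-toplevel
        from HFResourceScanner import Scanner

        return Scanner()
```

This keeps the import check and initialisation inside the tracker module, so the main training code only sees the common tracker interface.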

@kmehant (Collaborator) commented Dec 12, 2024

#397 (comment)

@dushyantbehl +1 to this approach.

Apologies @aluu317 for the multiple rounds of requests :( WDYT about @dushyantbehl's design? I wish to know your thoughts, thank you.

@aluu317 aluu317 mentioned this pull request Dec 17, 2024
6 participants