
feat: add a new arg for HFResourceScanner callback #397

Open · wants to merge 9 commits into main
Conversation

@aluu317 (Collaborator) commented Nov 27, 2024

Description of the change

This PR adds a new argument, add_scanner_callback, to sft_trainer, which is False by default.
When enabled, we add a callback that uses HFResourceScanner to measure memory usage, tokens per second, and time consumed during training. This is an actual usage measurement, not an estimation.

Q: I wonder if target_step should be 1 here so the scanner fires regardless of the number of train epochs?
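A minimal sketch of how such a flag can gate the callback. The helper `build_callbacks` is hypothetical, not the PR's actual code; the `Scanner(output_fmt=...)` call mirrors the diff shown later in this thread:

```python
import os


def build_callbacks(add_scanner_callback: bool = False, output_dir: str = "."):
    """Hypothetical helper: gate the optional Scanner callback behind a flag.

    Scanner is imported lazily so the dependency stays optional.
    """
    callbacks = []
    if add_scanner_callback:
        # pylint: disable=import-outside-toplevel
        from HFResourceScanner import Scanner

        output_fmt = os.path.join(output_dir, "scanner_output.json")
        callbacks.append(Scanner(output_fmt=output_fmt))
    return callbacks
```

With the default of False, the import never happens and training is unaffected.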

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass.

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Nov 27, 2024
pyproject.toml Outdated
```diff
@@ -38,6 +38,7 @@ dependencies = [
     "protobuf>=5.28.0,<6.0.0",
     "datasets>=2.15.0,<3.0",
     "simpleeval>=0.9.13,<1.0",
+    "HFResourceScanner",
```
Collaborator

@aluu317 where is this HFResourceScanner residing? HF Trainer already has some internal tools to measure resources, so I want to know what delta this provides.

Collaborator Author

@fabianlim https://github.com/foundation-model-stack/hf-resource-scanner that's the scanner.

Could you send me a link to the HF Trainer internal tool?

Collaborator

@fabianlim this is more of a lightweight solution (very little overhead in terms of runtime) and it also reports richer information than the internal trainer tool, like a memory breakdown. Furthermore, this repo gives us more control to incrementally add new metrics as needed.

Collaborator

@aluu317 can we have this as part of the optional dependencies rather than a default dependency?

Collaborator

@fabianlim fabianlim Nov 28, 2024

@aluu317 you can see it via these args for time, and via TrainerMemoryTracker.

I see that hf-resource-scanner also uses torch.cuda calls to measure memory. This is the same as TrainerMemoryTracker, and it is known to add some overhead. So why is this new implementation more lightweight? @kmehant

Collaborator

@kmehant yes, I noticed after I posted my original message, but I have edited it to ask a different thing now, please see.


@fabianlim Scanner is more general purpose (not just time and memory, but other things as needed) and lightweight in that it instruments only a single target step, not the whole run.

In general, it is not targeted for prod (though it can be); it is mainly for testing/debugging, etc.
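The "single target step" idea can be illustrated with a stripped-down timer that records measurements around one chosen step only. This is a schematic stand-in, not HFResourceScanner's actual implementation:

```python
import time


class SingleStepTimer:
    """Schematic: instrument exactly one target step, leave all others untouched."""

    def __init__(self, target_step: int = 5):
        self.target_step = target_step
        self._start = None
        self.step_time = None  # populated only for the target step

    def on_step_begin(self, step: int):
        if step == self.target_step:
            self._start = time.perf_counter()

    def on_step_end(self, step: int):
        if step == self.target_step and self._start is not None:
            self.step_time = time.perf_counter() - self._start
```

Because every other step is a no-op comparison, the per-step overhead of this pattern is negligible compared with instrumenting the whole run.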

Collaborator

@ChanderG ok, then all the more it should not be the default behavior (cc @aluu317).

Do you mind sharing some thoughts on my comment about torch.cuda?


Not sure exactly what the HF Trainer version is doing, but I have benchmarked Scanner overheads and they are negligible.

Collaborator

@fabianlim fabianlim Dec 3, 2024

@ChanderG it depends on the model. For the larger ones that consume a lot of GPU memory, I have observed the overheads to be non-negligible, in particular when doing multi-GPU training. For memory measurement, HF Trainer uses torch.cuda, just like your package.

```python
if add_scanner_callback:
    # Third Party
    # pylint: disable=import-outside-toplevel
    from HFResourceScanner import Scanner
```
Collaborator

Can we check if the package exists and only then import it?

If the package does not exist, shall we fall back to not adding the scanner callback and report it to the user through a warning? WDYT?
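One common way to do that check is `importlib.util.find_spec`, which looks the package up without importing it. A sketch (the helper name and warning text are illustrative, not the PR's actual code):

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def maybe_add_scanner(callbacks: list) -> list:
    """Append the Scanner callback only if HFResourceScanner is installed."""
    if importlib.util.find_spec("HFResourceScanner") is None:
        # Fall back gracefully: training proceeds, the user is just warned.
        logger.warning(
            "HFResourceScanner is not installed; skipping the scanner callback."
        )
        return callbacks
    # pylint: disable=import-outside-toplevel
    from HFResourceScanner import Scanner

    callbacks.append(Scanner())
    return callbacks
```

`find_spec` returns None for a missing package, so no ImportError is ever raised on the fallback path.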

Collaborator

Yes, 100%, it must be optional. We need to do a check like this.

Collaborator Author

I added a commit to make the HFResourceScanner a .[scanner-dev] optional package. Let me know if this is what you meant @kmehant @fabianlim

Collaborator

yes this is ok @aluu317

@aluu317 (Collaborator Author) commented Dec 4, 2024

@kmehant @fabianlim @ChanderG Should we proceed with this PR or revisit another way to scan for memory and time from tuning? Please review if it's ok.

@fabianlim (Collaborator) commented Dec 4, 2024

@aluu317 I leave that to others to comment. I am not very clear on the use cases for this resource scanner, in particular why the current HF metrics do not suffice and why this is needed. But if it's optional, I am fine.


```python
if training_args.output_dir:
    os.makedirs(training_args.output_dir, exist_ok=True)
    logger.info("using the output directory at %s", training_args.output_dir)
if add_scanner_callback:
    output_fmt = os.path.join(training_args.output_dir, "scanner_output.json")
    sc_callback = [Scanner(output_fmt=output_fmt)]
```
Collaborator

@aluu317
Can we check again if the package exists here? If it exists, we add the callback; if not, we can raise a ValueError, or warn the user and fall back to not adding the scanner. Either way is fine.

Thank you.

Collaborator Author

Sounds good, I added another check. Sorry for the delay, I was OOO for a few days. Please review again.

@ashokponkumar (Collaborator)

> @aluu317 i leave that to others to comment. I am not very clear what are the use cases for this resource scanner. Im particular why the current HF metrics do not suffice and why this is needed. But if its optional I am fine.

There are certain metrics, like memory at certain points and a few others, that are required for improving the estimator. It would be good to keep this optional like it is now, so that we can enable it for the environments where data collection is required.

I would suggest we go ahead with this and merge, if we are okay with this optional feature.

kmehant
kmehant previously approved these changes Dec 11, 2024
Collaborator

@kmehant kmehant left a comment

LGTM! Thank you.

```python
        training_args.output_dir, "scanner_output.json"
    )
    sc_callback = [Scanner(output_fmt=output_fmt)]
    logging.info(
```
Collaborator

nit: can we use `logger` instead of `logging`?

Collaborator Author

Ahh my bad, I didn't see this. Fixed.

Collaborator

Thank you

@kmehant kmehant self-requested a review December 11, 2024 03:28
@kmehant kmehant dismissed their stale review December 11, 2024 03:29

nitpick, once resolved we can merge.

Signed-off-by: Angel Luu <[email protected]>
@dushyantbehl (Contributor)

@aluu317 @kmehant @ashokponkumar just a thought: why not use the same tracker backend for the Scanner?

You can disable the track and set_params API of the tracker object by making them no-ops, but this would mean all the initialisation, package finding, etc. moves to a separate file, making the main code cleaner.
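A rough sketch of the suggested shape. The `Tracker` base class below is an illustrative stand-in for the repo's actual tracker interface, not its real API; method names and signatures are assumptions:

```python
class Tracker:
    """Illustrative stand-in for the repo's tracker interface."""

    def track(self, metric, name, stage):
        raise NotImplementedError

    def set_params(self, params, name):
        raise NotImplementedError

    def get_hf_callback(self):
        raise NotImplementedError


class ScannerTracker(Tracker):
    """Wrap HFResourceScanner as a tracker whose track/set_params are no-ops."""

    def track(self, metric, name, stage):
        # No-op: Scanner writes its own output file at the target step.
        pass

    def set_params(self, params, name):
        # No-op for the same reason.
        pass

    def get_hf_callback(self):
        # Lazy import keeps the dependency optional.
        # pylint: disable=import-outside-toplevel
        from HFResourceScanner import Scanner

        return Scanner()
```

This keeps the import check and initialisation inside the tracker module, so the main training code only sees the common tracker interface.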

@kmehant (Collaborator) commented Dec 12, 2024

#397 (comment)

@dushyantbehl +1 to this approach.

Apologies @aluu317 for the multiple rounds of requests :( WDYT about @dushyantbehl's design? I wish to know your thoughts, thank you.

@aluu317 aluu317 mentioned this pull request Dec 17, 2024
6 participants