
Git Theta Low Memory Mode #234

Merged
merged 1 commit into r-three:main from feat/low-memory on Apr 24, 2024
Conversation

blester125
Collaborator

This PR adds a low-memory mode to Git Theta, where some concurrency is sacrificed to keep the memory footprint as low as possible.

The main issues it fixes are:

  1. During the clean filter, the DL-framework-native checkpoint data is piped into the filter by git. It is then passed into the checkpoint loader (for example, torch.load) and read. This can cause a transient state where roughly 2x the model size is in memory: the bytes in the stdin buffer plus the actual model as tensors. With GIT_THETA_LOW_MEMORY=True, stdin is first written to a temp file and then read from disk.

  2. During the parameter-cleaning process, there is a map from parameter name to parameter value. As each parameter value was cleaned, it was not removed from this map and therefore never garbage collected. This is a problem because serializing a parameter can transiently double its memory usage (the tensor itself plus the serialized version), which is especially apparent for things like embedding tables. This change removes parameter values from the map once they are serialized, so memory usage goes down as more of the model is written out.
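The two fixes above can be sketched in stdlib-only Python. This is an illustrative sketch, not Git Theta's actual code: `pickle` stands in for the framework serializer, and the function names (`spool_stdin_to_tempfile`, `clean_params`) are hypothetical.

```python
import io
import os
import pickle
import shutil
import tempfile


def spool_stdin_to_tempfile(stdin_buffer, chunk_size=1024 * 1024):
    """Fix 1: copy the checkpoint bytes git pipes in on stdin to a temp
    file in fixed-size chunks, so the full byte string never sits in
    memory alongside the deserialized tensors (~2x model size)."""
    fd, path = tempfile.mkstemp(suffix=".ckpt")
    with os.fdopen(fd, "wb") as tmp:
        shutil.copyfileobj(stdin_buffer, tmp, chunk_size)
    return path


def clean_params(params):
    """Fix 2: pop each parameter out of the name->value map before
    serializing it, so the original value can be garbage collected as
    soon as its serialized form has been written out."""
    metadata = {}
    for name in list(params):
        value = params.pop(name)  # only one copy stays live at a time
        serialized = pickle.dumps(value)  # real tensors would go here
        metadata[name] = {"num_bytes": len(serialized)}
        # ...write `serialized` to the object store, then drop it...
    return metadata


# Usage with a fake stdin and toy "parameters".
path = spool_stdin_to_tempfile(io.BytesIO(b"fake-checkpoint-bytes"))
with open(path, "rb") as f:
    restored = f.read()
os.remove(path)

params = {"embedding": [0.0] * 4, "bias": [1.0]}
meta = clean_params(params)  # params is empty afterwards
```

The key design point in both cases is the same: never hold two full copies (raw bytes plus tensors, or tensor plus serialization) of the whole model at once.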

One future update might be to move from a boolean to a numerical value, where something like level 1 low memory only uses the checkpoint temp file, and higher levels do things like reducing concurrency to allow releasing model parameters.
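A sketch of what such a numeric setting could look like, assuming the environment variable keeps its current name. This is speculative future work, not what the PR implements:

```python
import os


def low_memory_level(env=os.environ):
    """Hypothetical: interpret GIT_THETA_LOW_MEMORY as an integer level
    (0 = off, 1 = temp-file checkpoint spooling, 2+ = also reduce
    concurrency), while still accepting the current boolean spellings."""
    raw = env.get("GIT_THETA_LOW_MEMORY", "0").strip().lower()
    if raw in ("true", "yes", "on"):
        return 1  # old boolean usage maps to the lowest level
    if raw in ("", "false", "no", "off"):
        return 0
    try:
        return max(0, int(raw))
    except ValueError:
        return 0  # unrecognized values fail safe to "off"
```

Callers could then gate each memory-saving behavior on a threshold, e.g. `if low_memory_level() >= 2: max_workers = 1`.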

These kinds of changes (loading and then releasing subsets of parameters) will be important if we want to support really big models whose checkpoints can be streamed or lazily loaded.

There are also a few small changes: git_theta.py was renamed to git_theta_cli.py because the old name was causing weird import issues (the CLI command is still git-theta, as it is set by the console-script entry point); getting blobs from git was bugged for files that lived in subdirectories; and checking in real torch checkpoints whose parameters last lived on CUDA devices was bugged.
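The subdirectory bug is the kind of thing where a blob lookup must use a repo-relative, forward-slash path rather than the raw filesystem path. A minimal sketch of that fix, with hypothetical helper names that are not Git Theta's actual API:

```python
import os


def repo_relative_path(file_path, repo_root):
    """Git names blobs as <rev>:<repo-relative-path> with forward
    slashes, so a lookup built from the raw filesystem path breaks for
    files that live in subdirectories (or on Windows separators)."""
    rel = os.path.relpath(os.path.abspath(file_path), os.path.abspath(repo_root))
    return rel.replace(os.sep, "/")


def blob_ref(file_path, repo_root, rev="HEAD"):
    """Build the object name you would pass to `git cat-file blob`."""
    return f"{rev}:{repo_relative_path(file_path, repo_root)}"


# e.g. blob_ref("/repo/sub/dir/model.pt", "/repo") -> "HEAD:sub/dir/model.pt"
```

For the CUDA-checkpoint issue, the usual remedy is loading with torch.load's map_location argument (e.g. map_location="cpu") so GPU-resident tensors deserialize onto the CPU; whether this PR uses exactly that mechanism is not shown here.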

@blester125 blester125 requested review from craffel and nkandpa2 March 26, 2024 17:47
@blester125
Collaborator Author

closes #230
closes #231

This was linked to issues Mar 26, 2024
@blester125
Collaborator Author

This also skips some tensorflow tests because of #235

The original file, piped into the filter from git, is stored in a
temporary file before being read by the checkpoint plugin.

When cleaning parameter groups, we free the memory for the group after
it has been written to disk and converted to metadata.

This is about all we can do until the DL-native formats support
streaming.
@blester125 blester125 merged commit d067dd8 into r-three:main Apr 24, 2024
19 checks passed
@blester125 blester125 deleted the feat/low-memory branch April 24, 2024 21:46
Successfully merging this pull request may close these issues:

- Git Add can have high memory usage
- Pytorch Checkpoint reading