This PR adds a low-memory mode to git-theta, where some concurrency is sacrificed to keep the memory footprint as low as possible.
The main issues it fixes are:
1. During the clean filter, git pipes the DL-native checkpoint data into the filter on stdin. That data is then passed to the checkpoint loader (for example `torch.load`) and read. This can cause a transient ~2x model-size memory spike: the bytes sitting in the stdin buffer plus the actual model as tensors. With `GIT_THETA_LOW_MEMORY=True`, stdin is first written to a temp file and the checkpoint is then read from disk.
2. During the parameter-cleaning process, there is a map from parameter name to parameter value. Previously, a value was not removed from this map after it was cleaned, so it was never garbage collected. Serializing a parameter can transiently double its memory use (the tensor itself plus the serialized bytes), which is especially apparent for things like embedding tables. This change removes each parameter value from the map once it is serialized, so memory usage goes down as more of the model is written out.
One future update might be to move from a boolean to a numerical value, where something like level 1 low memory only does the checkpoint temp file, and higher levels also do things like reducing concurrency to allow releasing model parameters sooner.
These kinds of changes (loading, then releasing, subsets of parameters) will be important if we want to support really big models with streamable/lazy loading of checkpoints.
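The subset-at-a-time idea above can be sketched as a generator. `load_param` here is a hypothetical per-parameter loader that a streamable checkpoint format would provide; git-theta does not currently expose this.

```python
def stream_parameters(param_names, load_param):
    """Sketch of lazy checkpoint iteration: materialize one parameter at
    a time instead of the whole model, so peak memory stays at roughly a
    single parameter plus bookkeeping. The caller should drop each value
    before pulling the next one from the generator."""
    for name in param_names:
        value = load_param(name)  # only this parameter is resident
        yield name, value
        del value  # released before the next load
```

A consumer would then write each parameter out (or clean it) inside the loop, never holding more than one at a time.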
There are also a few small changes:
- `git_theta.py` -> `git_theta_cli.py`, as the naming was causing weird import issues (the CLI command is still `git-theta`, since that is set by the console-script entry point).
- Getting blobs from git was bugged for files that lived in subdirectories.
- Checking in real torch checkpoints whose parameters last lived on CUDA devices was bugged.