Adds fast gradient clipping support for the Embedding layer. #694
base: main
Conversation
The algorithm is described in the 'A Unified Fast Gradient Clipping Framework for DP-SGD' paper: https://proceedings.neurips.cc/paper_files/paper/2023/file/a45d344b28179c8da7646bc38ff50ad8-Paper-Conference.pdf. This reduces the memory needed to run DP-SGD over embedding layers, significantly reducing OOMs over large embedding layers.
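To give reviewers the gist: for an embedding layer, per-sample gradient norms can be computed from the layer's inputs and output gradients alone, without materializing the full per-sample weight gradients. Below is a minimal illustrative sketch of that computation; it is not the code in this PR, and the function name and the explicit per-sample loop are ours for clarity.

import torch

def embedding_per_sample_grad_norms(input_ids, backprops):
    # input_ids: (B, T) token indices fed to the embedding layer.
    # backprops: (B, T, D) gradients w.r.t. the embedding layer's output.
    B = input_ids.shape[0]
    D = backprops.shape[-1]
    norms = torch.zeros(B)
    for i in range(B):  # per-sample loop kept for readability; a real kernel would vectorize
        # The per-sample weight gradient is non-zero only on the rows whose ids occur
        # in this sample, so accumulate output gradients per unique id ...
        unique_ids, inverse = torch.unique(input_ids[i], return_inverse=True)
        summed = torch.zeros(unique_ids.numel(), D).index_add_(0, inverse, backprops[i])
        # ... and the Frobenius norm of the per-sample gradient is the norm of those
        # accumulated rows: O(r * D) memory, with r = number of unique ids in the sample.
        norms[i] = summed.norm()
    return norms

Compared with per-sample gradients of shape (B, vocab_size, dim), this keeps only a few accumulated rows per sample, which is where the memory savings described below come from.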
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thank you for the PR! Overall, the implementation looks good to me -- I left some minor comments. Additionally, I suggest merging the content of the file …
diff = flat_norms_normal - flat_norms_gc

logging.info(f"Diff = {diff}")
msg = "Fail: Gradients from vanilla DP-SGD and from fast gradient clipping are different"
msg should be "Fail: Gradient norms from vanilla DP-SGD and from fast gradient clipping are different"
expected_norms = torch.tensor(
    [0.0150, 0.0071, 0.0005, 0.0081, 0.0039], dtype=torch.float32
)
print("expected_norms: ", expected_norms)
Remove the print statement
# Manually set weights for the embedding layer for testing
embedding_layer.weight = nn.Parameter(
    torch.tensor([[0.1], [0.2], [0.3]], dtype=torch.float32)
)
Nitpicking: although it will not be used in the calculation, could we change the shape from [3, 1] to [3, 2] to match the embedding_dim=2? @pagarwl
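For concreteness, the suggested change could look something like the following; the values are placeholders, only the [3, 2] shape matters, and the nn.Embedding construction is shown just to make the snippet self-contained.

import torch
import torch.nn as nn

# Hypothetical test setup matching the suggestion: 3 rows, embedding_dim=2.
embedding_layer = nn.Embedding(num_embeddings=3, embedding_dim=2)
embedding_layer.weight = nn.Parameter(
    torch.tensor([[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]], dtype=torch.float32)
)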
input_ids = torch.tensor([[1, 1], [2, 0], [2, 0]], dtype=torch.long)

# Example gradients with respect to the embedding output (backprops).
# Shape: [6, 1]
Nitpicking: we should have shape [3, 2, 1] for backprops, correct? [6, 1] is for grad_values.
# Example gradients per input id, with embedding_dim=2.
# Shape: [6, 1, 1, 2]
grad_values = torch.tensor(
Similarly here: maybe we should distinguish between grad_values and backprops, since they have different shapes?
activations: [tensor([[1, 1],
                      [2, 0],
                      [2, 0]])]
backprops: tensor([[0.2000],
Wrong shape for backprops, which should be [3,2,1]. [6,1] is for grad_values
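To make the shape distinction concrete, here is a small hypothetical example (made-up values, and assuming grad_values is simply the per-(sample, position) flattening of backprops):

import torch

# Hypothetical: batch_size=3, seq_len=2, embedding_dim=1.
backprops = torch.tensor([[[0.2], [0.2]],
                          [[0.3], [0.1]],
                          [[0.4], [0.5]]])  # shape [3, 2, 1]
grad_values = backprops.reshape(-1, 1)      # shape [6, 1]: one row per (sample, position)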
Do you mind also fixing the lint error due to isort? https://github.com/pytorch/opacus/actions/runs/12342666465/job/34611577396
The algorithm used is described in the 'A Unified Fast Gradient Clipping Framework for DP-SGD' paper: https://proceedings.neurips.cc/paper_files/paper/2023/file/a45d344b28179c8da7646bc38ff50ad8-Paper-Conference.pdf.
Types of changes
Motivation and Context / Related issue
Previously, Ghost clipping was not supported in Opacus for the embedding layer. With the default DP-SGD implementation, training OOMs on large embedding layers with the large physical batch sizes that are useful for privacy. Regular DP-SGD needs O(Bnd) memory, where B = physical batch size, n = vocab size, d = embedding dimension. To give an example of the memory needed: we've seen embeddings with [vocab size = 1,000,000, dim = 5] (and higher) in real-world differential privacy applications. With a physical batch size of 16,000, the memory needed is 16,000 × 1,000,000 × 5 × 4 bytes ≈ 298.02 GiB.
With this change, we need significantly less memory: O(Br), where B is the physical batch size and r is the number of unique indices in the embedding input sequence. We could successfully run DP-SGD on the above example using < 8 GiB.
This is a good addition to Opacus, enabling larger embedding layers to be trained with DP-SGD over larger physical batch sizes.
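A quick back-of-the-envelope check of the figures quoted above (assuming float32 parameters, 4 bytes each):

B, n, d = 16_000, 1_000_000, 5   # physical batch size, vocab size, embedding dim
bytes_per_float32 = 4
per_sample_grads = B * n * d * bytes_per_float32  # bytes for O(Bnd) per-sample gradients
print(per_sample_grads / 2**30)                   # ≈ 298.02 GiB for regular DP-SGD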
How Has This Been Tested (if it applies)
Unit tests. Also exercised by training a large embedding layer in a real-world DP application.
Checklist