Functorch gradients: investigation and fix (#510)
Summary:

*The investigation part of this PR was done by alexandresablayrolles, thanks for figuring out the reason the tests were failing.*

## Background

The current implementation of functorch-based per-sample gradients fails on modules which have both trainable non-recursive parameters and standard submodules, e.g.:

```
class LinearWithExtraParam(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden_dim: int = 8):
        super().__init__()
        self.fc = nn.Linear(in_features, hidden_dim)
        self.extra_param = nn.Parameter(torch.randn(hidden_dim, out_features))

    def forward(self, x):
        x = self.fc(x)
        x = x.matmul(self.extra_param)
        return x
```

The reason is that the functorch hook actually computes gradients for recursive submodules too. The problem is that normal hooks are also attached to these submodules: GradSampleModule then sees two grad_sample tensors, thinks it needs to accumulate, and adds them together.

## Solution(s)

There are essentially two ways we can fix this: either make functorch compute per-sample gradients for non-recursive parameters only, or don't attach normal hooks to submodules whose parent module is handled by functorch. This diff implements the latter option (reasoning below); for demo purposes the former option can be seen in #531.

From a pure code perspective the former option (let's call it "non-recursive functorch") is more appealing to me. It better fits the existing paradigm and matches the behaviour of normal hooks: all of the existing code only deals with the immediate non-recursive parameters. However, it doesn't make much sense from an efficiency perspective. "Non-recursive functorch" would do all the work to compute per-sample gradients for its submodules, only for them to be filtered out at the very last stage. The alternative option (a.k.a. "functorch for subtrees") involves slightly more convoluted bookkeeping, but avoids that redundant computation.

This has a noticeable effect on performance. Below are the results of MNIST benchmarks with different configurations; I tested several configurations because, at the end of the day, the impact on performance depends on how deep the affected subtrees are.

* Standard model: our model from the MNIST example, standard layers only (2 conv + 2 linear). No overhead expected; functorch doesn't kick in.
* Mid-level model: leaf nodes (two linear layers) have one extra param and are computed with functorch. Overhead: 2x Linear hook.
* Extreme model: the root module has one extra param and needs to be handled by functorch. Overhead: 2x Linear hook + 2x Conv hook.

| Mode | non-recursive functorch | functorch for subtrees |
|:-----------------------:|:------------------------:|:-----------------------:|
| Standard model (CPU) | 138s | 136s |
| Standard model (GPU) | 149s | 150s |
| Mid-level model (CPU) | 157s | 150s |
| Mid-level model (GPU) | 100s | 97s |
| Extreme model (CPU) | 207s | 172s |
| Extreme model (GPU) | 101s | 94s |

Pull Request resolved: #510

Reviewed By: alexandresablayrolles

Differential Revision: D39579487

Pulled By: ffuuugor

fbshipit-source-id: 1b089bd04ab110174a1f2ebb371380eb2ce76054
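For illustration only, here is a minimal, hypothetical sketch of the "functorch for subtrees" traversal described in the commit message. It is not Opacus' actual GradSampleModule code; the `SUPPORTED_LAYERS` list, `plan_hooks`, and all other names are assumptions made for this example. The key property is that descendants of a functorch-handled module never also receive standard hooks, so each grad_sample is computed exactly once.

```python
import torch.nn as nn

# Layers assumed here to have dedicated per-sample-gradient hooks (illustrative list).
SUPPORTED_LAYERS = (nn.Linear, nn.Conv1d, nn.Conv2d, nn.GroupNorm, nn.LayerNorm)


def _has_trainable_direct_params(module: nn.Module) -> bool:
    # "Non-recursive" parameters: owned by this module itself, not by its children.
    return any(p.requires_grad for p in module.parameters(recurse=False))


def _is_under(name: str, ancestor: str) -> bool:
    # named_modules() names the root "", so every other module lives under it.
    return name != ancestor and (ancestor == "" or name.startswith(ancestor + "."))


def plan_hooks(root: nn.Module) -> dict:
    """Map each handled module name to "hook" or "functorch_subtree"."""
    plan, functorch_roots = {}, []
    for name, module in root.named_modules():
        # Skip descendants of a module whose whole subtree is handled by functorch.
        if any(_is_under(name, r) for r in functorch_roots):
            continue
        if isinstance(module, SUPPORTED_LAYERS):
            plan[name] = "hook"
        elif _has_trainable_direct_params(module):
            plan[name] = "functorch_subtree"
            functorch_roots.append(name)
    return plan


# Example with the LinearWithExtraParam module from the snippet above:
# model = nn.Sequential(nn.Conv1d(16, 8, 2), LinearWithExtraParam(8, 4))
# plan_hooks(model) -> {'0': 'hook', '1': 'functorch_subtree'}
# Note that '1.fc' is intentionally absent, so it gets no standard hook.
```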
1 parent 9ff6839, commit 7393ae4

Showing 6 changed files with 151 additions and 66 deletions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


# Built only from standard layers that have dedicated per-sample-gradient hooks.
class BasicSupportedModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=16, out_channels=8, kernel_size=2)
        self.gn = nn.GroupNorm(num_groups=2, num_channels=8)
        self.fc = nn.Linear(in_features=4, out_features=8)
        self.ln = nn.LayerNorm([8, 8])

    def forward(self, x):
        x = self.conv(x)
        x = self.gn(x)
        x = self.fc(x)
        x = self.ln(x)
        return x


# A linear layer re-implemented with raw nn.Parameters, so no standard hook applies.
class CustomLinearModule(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self._weight = nn.Parameter(torch.randn(out_features, in_features))
        self._bias = nn.Parameter(torch.randn(out_features))

    def forward(self, x):
        return F.linear(x, self._weight, self._bias)


# A single trainable parameter applied via torch.matmul.
class MatmulModule(nn.Module):
    def __init__(self, input_features: int, output_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(input_features, output_features))

    def forward(self, x):
        return torch.matmul(x, self.weight)


# A standard submodule plus an extra non-recursive parameter: the case this commit fixes.
class LinearWithExtraParam(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden_dim: int = 8):
        super().__init__()
        self.fc = nn.Linear(in_features, hidden_dim)
        self.extra_param = nn.Parameter(torch.randn(hidden_dim, out_features))

    def forward(self, x):
        x = self.fc(x)
        x = x.matmul(self.extra_param)
        return x
```
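To make the failure mode concrete, below is a small, hedged sketch (not part of the diff) that computes per-sample gradients for LinearWithExtraParam directly with functorch, assuming the functorch 0.x API (make_functional / grad / vmap; newer PyTorch releases expose a similar API under torch.func). The point is that the result contains per-sample gradients for fc.weight and fc.bias as well as extra_param, i.e. the whole subtree, which is why standard hooks must not also be attached to self.fc.

```python
import torch
import torch.nn.functional as F
from functorch import make_functional, vmap, grad  # functorch 0.x API

model = LinearWithExtraParam(in_features=4, out_features=2)
fmodel, params = make_functional(model)

x = torch.randn(16, 4)  # batch of 16 samples
y = torch.randn(16, 2)

def compute_loss(params, sample, target):
    # fmodel expects batched input, so re-add a batch dimension of size 1
    out = fmodel(params, sample.unsqueeze(0))
    return F.mse_loss(out, target.unsqueeze(0))

# One tensor per parameter, each with a leading batch dimension of 16:
# per-sample gradients for fc.weight and fc.bias as well as extra_param.
per_sample_grads = vmap(grad(compute_loss), in_dims=(None, 0, 0))(params, x, y)
for g in per_sample_grads:
    print(tuple(g.shape))
```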