In LightningDDTBlock, when use_rmsnorm=False, the code uses LayerNorm with elementwise_affine=False, so the normalization itself has no learnable scale or bias. The AdaLN modulation then supplies both the shift and the scale, which makes sense: they act as the conditional affine transformation in place of the standard LayerNorm parameters.
However, when use_rmsnorm=True, the RMSNorm layer typically includes its own learnable per-channel scale parameter (i.e., an affine weight). At the same time, the AdaLN modulation still applies an additional learned scale (and optionally shift) via DDTModulate on top of the RMSNorm output.
This seems to introduce a form of "double scaling": one static (but learnable) scale from RMSNorm itself, and another dynamic, conditionally modulated scale from AdaLN. In contrast, the LayerNorm path avoids this by disabling its internal affine parameters.
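To make the concern concrete, here is a minimal pure-Python sketch of how the two scales compose. It is not the repository's code: `rms_norm` and `modulate` are hypothetical stand-ins, the modulation is assumed to follow the common `x * (1 + scale) + shift` form (DDTModulate may differ), and all numeric values are made up for illustration.

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    """RMS-normalize a vector; optionally apply a static per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    x_hat = [v / rms for v in x]
    if weight is None:
        return x_hat
    return [w * v for w, v in zip(weight, x_hat)]

def modulate(x, shift, scale):
    """AdaLN-style modulation, ASSUMED to be x * (1 + scale) + shift."""
    return [v * (1.0 + s) + b for v, s, b in zip(x, scale, shift)]

# Hypothetical values, chosen only for illustration.
x = [1.0, -2.0, 3.0, 0.5]
g = [0.5, 2.0, 1.5, 1.0]        # static, learnable RMSNorm weight
scale = [0.1, -0.3, 0.2, 0.0]   # conditional AdaLN scale
shift = [0.0, 0.1, -0.1, 0.2]   # conditional AdaLN shift

# Path in question: RMSNorm with its own weight, then AdaLN modulation.
with_affine = modulate(rms_norm(x, weight=g), shift, scale)

# Equivalent: weight-free RMSNorm with an effective gain of g * (1 + scale).
x_hat = rms_norm(x)
folded = [v * gi * (1.0 + s) + b for v, gi, s, b in zip(x_hat, g, scale, shift)]

max_diff = max(abs(a - b) for a, b in zip(with_affine, folded))
print(max_diff)
```

Under these assumptions the two paths agree up to floating-point rounding, so the effective per-channel gain is the product `g * (1 + scale)`: the static weight rescales the conditional scale multiplicatively rather than adding an independent degree of freedom to the shift.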
Could the authors clarify the motivation behind keeping both the RMSNorm’s internal weight and the AdaLN scale modulation? For example, was this empirically found to improve performance, or is there another architectural consideration we might be missing?
Thanks for sharing your work!