support fuse layer norm grad for npu #10614

Merged
merged 13 commits into from
Jan 23, 2025

Conversation

ShawnXuan
Collaborator

No description provided.

@ShawnXuan ShawnXuan requested a review from hjchen2 as a code owner January 8, 2025 07:02
Contributor

github-actions bot commented Jan 8, 2025

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@ShawnXuan ShawnXuan changed the base branch from fuse_layer_norm_grad_for_npu to master January 22, 2025 01:54
@ShawnXuan ShawnXuan changed the title from "fix" to "support fuse layer norm grad for npu" Jan 22, 2025
Contributor

@crazy-JiangDongHua crazy-JiangDongHua left a comment

I think this looks fine now; merging in master did not change much either.

…low-Inc/oneflow into update_fuse_layer_norm_grad_for_npu
Contributor

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@ShawnXuan ShawnXuan requested review from oneflow-ci-bot and removed request for jackalcooper, hjchen2 and oneflow-ci-bot January 23, 2025 01:43
Contributor

Speed stats:
GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.2ms (= 4320.7ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.3ms (= 5729.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.33 (= 57.3ms / 43.2ms)

OneFlow resnet50 time: 26.5ms (= 2650.2ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.6ms (= 3764.1ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.42 (= 37.6ms / 26.5ms)

OneFlow resnet50 time: 18.4ms (= 3672.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 34.8ms (= 6961.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.90 (= 34.8ms / 18.4ms)

OneFlow resnet50 time: 18.1ms (= 3613.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 31.4ms (= 6288.8ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.74 (= 31.4ms / 18.1ms)

OneFlow resnet50 time: 17.3ms (= 3460.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.8ms (= 5954.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.72 (= 29.8ms / 17.3ms)

OneFlow swin dataloader time: 0.200s (= 40.072s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.526s / 200, num_workers=1)
Relative speed: 0.637 (= 0.128s / 0.200s)

OneFlow swin dataloader time: 0.055s (= 10.965s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.585s / 200, num_workers=4)
Relative speed: 0.601 (= 0.033s / 0.055s)

OneFlow swin dataloader time: 0.031s (= 6.224s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.343s / 200, num_workers=8)
Relative speed: 0.537 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 49.3ms (= 4934.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.0ms (= 6496.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 65.0ms / 49.3ms)

OneFlow resnet50 time: 36.6ms (= 3662.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.0ms (= 4695.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 47.0ms / 36.6ms)

OneFlow resnet50 time: 27.9ms (= 5573.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.9ms (= 7975.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.43 (= 39.9ms / 27.9ms)

OneFlow resnet50 time: 25.2ms (= 5044.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.5ms (= 7700.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 38.5ms / 25.2ms)

OneFlow resnet50 time: 24.7ms (= 4948.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.5ms (= 7702.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 38.5ms / 24.7ms)
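
The figures in the speed stats above appear to follow a simple pattern: each per-iteration time is the total time divided by the iteration count, and the relative speed is the PyTorch time divided by the OneFlow time, so values above 1 mean OneFlow was faster. A minimal sketch of that arithmetic, assuming the ratio is taken from the unrounded totals (the function names are hypothetical and are not taken from the OneFlow CI scripts):

```python
def per_iteration(total: float, iterations: int) -> float:
    """Average time per iteration, in the same unit as `total`."""
    return total / iterations


def relative_speed(oneflow_total: float, oneflow_iters: int,
                   pytorch_total: float, pytorch_iters: int) -> float:
    """PyTorch time per iteration divided by OneFlow time per iteration.

    A result above 1.0 means OneFlow needed less time per iteration.
    """
    return (per_iteration(pytorch_total, pytorch_iters)
            / per_iteration(oneflow_total, oneflow_iters))


if __name__ == "__main__":
    # resnet50, input_shape=[16, 3, 224, 224], totals in ms over 100 iterations
    print(round(per_iteration(4320.7, 100), 1))
    print(round(relative_speed(4320.7, 100, 5729.9, 100), 2))
    # swin dataloader, num_workers=1, totals in seconds over 200 iterations
    print(round(relative_speed(40.072, 200, 25.526, 200), 3))
```

Dividing the rounded dataloader times (0.128s / 0.200s) would give 0.640 rather than the reported 0.637, which is why this sketch works from the totals.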

Contributor

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@ShawnXuan ShawnXuan requested review from oneflow-ci-bot and removed request for oneflow-ci-bot January 23, 2025 05:25
Contributor

Speed stats:
GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.8ms (= 4376.2ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.5ms (= 5748.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.31 (= 57.5ms / 43.8ms)

OneFlow resnet50 time: 26.3ms (= 2627.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.7ms (= 3766.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.43 (= 37.7ms / 26.3ms)

OneFlow resnet50 time: 18.7ms (= 3739.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 32.2ms (= 6435.0ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.72 (= 32.2ms / 18.7ms)

OneFlow resnet50 time: 17.6ms (= 3526.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 33.0ms (= 6600.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.87 (= 33.0ms / 17.6ms)

OneFlow resnet50 time: 16.5ms (= 3305.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 30.1ms (= 6018.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.82 (= 30.1ms / 16.5ms)

OneFlow swin dataloader time: 0.200s (= 40.049s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.624s / 200, num_workers=1)
Relative speed: 0.640 (= 0.128s / 0.200s)

OneFlow swin dataloader time: 0.056s (= 11.170s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.531s / 200, num_workers=4)
Relative speed: 0.585 (= 0.033s / 0.056s)

OneFlow swin dataloader time: 0.030s (= 6.080s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.313s / 200, num_workers=8)
Relative speed: 0.545 (= 0.017s / 0.030s)

❌ OneFlow resnet50 time: 49.3ms (= 4927.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.8ms (= 6577.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 65.8ms / 49.3ms)

OneFlow resnet50 time: 37.0ms (= 3704.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 46.4ms (= 4636.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 46.4ms / 37.0ms)

OneFlow resnet50 time: 27.3ms (= 5456.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.1ms (= 8012.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.47 (= 40.1ms / 27.3ms)

OneFlow resnet50 time: 25.4ms (= 5073.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.7ms (= 7745.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 38.7ms / 25.4ms)

OneFlow resnet50 time: 24.7ms (= 4948.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.1ms (= 7219.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 36.1ms / 24.7ms)

@ShawnXuan ShawnXuan enabled auto-merge (squash) January 23, 2025 07:13
@ShawnXuan ShawnXuan added op and removed op labels Jan 23, 2025
@ShawnXuan ShawnXuan merged commit cb699cd into master Jan 23, 2025
20 of 21 checks passed
@ShawnXuan ShawnXuan deleted the update_fuse_layer_norm_grad_for_npu branch January 23, 2025 07:39
4 participants