support fuse layer norm grad for npu #10614

ShawnXuan · 2025-01-08T07:02:31Z

No description provided.

github-actions · 2025-01-08T07:03:51Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

crazy-JiangDongHua

I think there are no remaining issues, and merging master doesn't change much either.

github-actions · 2025-01-22T07:00:50Z

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10614/

github-actions · 2025-01-22T07:24:23Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2025-01-23T03:13:45Z

Speed stats:

GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.2ms (= 4320.7ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.3ms (= 5729.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.33 (= 57.3ms / 43.2ms)

OneFlow resnet50 time: 26.5ms (= 2650.2ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.6ms (= 3764.1ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.42 (= 37.6ms / 26.5ms)

OneFlow resnet50 time: 18.4ms (= 3672.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 34.8ms (= 6961.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.90 (= 34.8ms / 18.4ms)

OneFlow resnet50 time: 18.1ms (= 3613.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 31.4ms (= 6288.8ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.74 (= 31.4ms / 18.1ms)

OneFlow resnet50 time: 17.3ms (= 3460.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.8ms (= 5954.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.72 (= 29.8ms / 17.3ms)

OneFlow swin dataloader time: 0.200s (= 40.072s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.526s / 200, num_workers=1)
Relative speed: 0.637 (= 0.128s / 0.200s)

OneFlow swin dataloader time: 0.055s (= 10.965s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.585s / 200, num_workers=4)
Relative speed: 0.601 (= 0.033s / 0.055s)

OneFlow swin dataloader time: 0.031s (= 6.224s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.343s / 200, num_workers=8)
Relative speed: 0.537 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 49.3ms (= 4934.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.0ms (= 6496.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 65.0ms / 49.3ms)

OneFlow resnet50 time: 36.6ms (= 3662.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.0ms (= 4695.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 47.0ms / 36.6ms)

OneFlow resnet50 time: 27.9ms (= 5573.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.9ms (= 7975.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.43 (= 39.9ms / 27.9ms)

OneFlow resnet50 time: 25.2ms (= 5044.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.5ms (= 7700.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 38.5ms / 25.2ms)

OneFlow resnet50 time: 24.7ms (= 4948.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.5ms (= 7702.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 38.5ms / 24.7ms)
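
The relative-speed figures in the stats above are the PyTorch time divided by the OneFlow time, so a value above 1 means OneFlow was faster in that configuration. A minimal Python sketch of that arithmetic, using the totals from the first resnet50 entry (batch size 16); the function names are hypothetical and are not taken from the CI script:

```python
def per_iter_ms(total_ms: float, iters: int) -> float:
    """Average time per iteration, in milliseconds."""
    return total_ms / iters


def relative_speed(pytorch_ms: float, oneflow_ms: float) -> float:
    """PyTorch time over OneFlow time; a value > 1 means OneFlow is faster."""
    return pytorch_ms / oneflow_ms


# Totals copied from the log: 100 iterations, input_shape=[16, 3, 224, 224]
oneflow_ms = per_iter_ms(4320.7, 100)
pytorch_ms = per_iter_ms(5729.9, 100)
speed = relative_speed(pytorch_ms, oneflow_ms)

print(f"OneFlow {oneflow_ms:.1f}ms, PyTorch {pytorch_ms:.1f}ms, relative speed {speed:.2f}")
# prints: OneFlow 43.2ms, PyTorch 57.3ms, relative speed 1.33
```

The same ratio applied to the swin dataloader rows gives values below 1, meaning the PyTorch dataloader was faster in those runs.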

github-actions · 2025-01-23T05:19:52Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2025-01-23T06:00:16Z

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10614/

github-actions · 2025-01-23T06:39:48Z

Speed stats:

GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.8ms (= 4376.2ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.5ms (= 5748.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.31 (= 57.5ms / 43.8ms)

OneFlow resnet50 time: 26.3ms (= 2627.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.7ms (= 3766.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.43 (= 37.7ms / 26.3ms)

OneFlow resnet50 time: 18.7ms (= 3739.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 32.2ms (= 6435.0ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.72 (= 32.2ms / 18.7ms)

OneFlow resnet50 time: 17.6ms (= 3526.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 33.0ms (= 6600.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.87 (= 33.0ms / 17.6ms)

OneFlow resnet50 time: 16.5ms (= 3305.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 30.1ms (= 6018.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.82 (= 30.1ms / 16.5ms)

OneFlow swin dataloader time: 0.200s (= 40.049s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.624s / 200, num_workers=1)
Relative speed: 0.640 (= 0.128s / 0.200s)

OneFlow swin dataloader time: 0.056s (= 11.170s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.531s / 200, num_workers=4)
Relative speed: 0.585 (= 0.033s / 0.056s)

OneFlow swin dataloader time: 0.030s (= 6.080s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.313s / 200, num_workers=8)
Relative speed: 0.545 (= 0.017s / 0.030s)

❌ OneFlow resnet50 time: 49.3ms (= 4927.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.8ms (= 6577.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 65.8ms / 49.3ms)

OneFlow resnet50 time: 37.0ms (= 3704.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 46.4ms (= 4636.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 46.4ms / 37.0ms)

OneFlow resnet50 time: 27.3ms (= 5456.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.1ms (= 8012.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.47 (= 40.1ms / 27.3ms)

OneFlow resnet50 time: 25.4ms (= 5073.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.7ms (= 7745.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 38.7ms / 25.4ms)

OneFlow resnet50 time: 24.7ms (= 4948.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.1ms (= 7219.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 36.1ms / 24.7ms)

crazy-JiangDongHua and others added 3 commits December 31, 2024 11:42

add fuse_layer_norm_grad functor and op

46ffed0

auto format by CI

5737916

fix

8558e58

ShawnXuan requested a review from hjchen2 as a code owner January 8, 2025 07:02

auto format by CI

ce66dff

ShawnXuan changed the base branch from fuse_layer_norm_grad_for_npu to master January 22, 2025 01:54

ShawnXuan requested a review from jackalcooper as a code owner January 22, 2025 01:54

ShawnXuan changed the title from "fix" to "support fuse layer norm grad for npu" Jan 22, 2025

ShawnXuan requested review from crazy-JiangDongHua, mosout and oneflow-ci-bot January 22, 2025 02:08

ShawnXuan added the enhancement and op labels Jan 22, 2025

Merge branch 'master' into update_fuse_layer_norm_grad_for_npu

9a72438

crazy-JiangDongHua reviewed Jan 22, 2025

ShawnXuan added 2 commits January 22, 2025 06:23

fix

68f1531

Merge branch 'update_fuse_layer_norm_grad_for_npu' of github.com:Oneflow-Inc/oneflow into update_fuse_layer_norm_grad_for_npu

3490b48

ShawnXuan and others added 2 commits January 22, 2025 07:21

update error message

3ae03ef

auto format by CI

b3aa784

ShawnXuan requested review from oneflow-ci-bot and removed request for jackalcooper, hjchen2 and oneflow-ci-bot January 23, 2025 01:43

ShawnXuan added 2 commits January 23, 2025 02:57

update

802a153

update

76b8675

update

db0ff1e

auto format by CI

6b5dd6a

ShawnXuan requested review from oneflow-ci-bot and removed request for oneflow-ci-bot January 23, 2025 05:25

ShawnXuan enabled auto-merge (squash) January 23, 2025 07:13

ShawnXuan added op and removed op labels Jan 23, 2025

mosout approved these changes Jan 23, 2025

ShawnXuan merged commit cb699cd into master Jan 23, 2025
20 of 21 checks passed

ShawnXuan deleted the update_fuse_layer_norm_grad_for_npu branch January 23, 2025 07:39
