Some facts about the runtime speed.

Hi, I tried to reimplement an operation similar to yours in TensorFlow and found two facts about its runtime speed:

1. The GPU kernel that adds the filter gradients from one batch to another has almost no influence on speed. In fact, the original [MXNet implementation](https://github.com/msracver/Deformable-ConvNets/blob/f4e163719c8e63cfad7af1caaaab93d373750393/faster_rcnn/operator_cxx/deformable_convolution-inl.h#L221) also applies this idea.

2. Splitting the backpropagation for the different input variables into separate TF ops does help speed things up, but I observed only a 30% boost compared with wrapping them all into one TF op.
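For readers unfamiliar with the accumulation in point 1, here is a minimal pure-Python sketch of the idea, not the actual kernel. It assumes each sample's input has already been unrolled into a column matrix of shape `(K, P)` and that the output gradient has shape `(C_out, P)`; the helper names (`matmul`, `transpose`, `accumulate_filter_grad`) are made up for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def accumulate_filter_grad(cols_batch, grad_out_batch):
    """Sum per-sample filter gradients over the batch.

    cols_batch[n]     : (K, P) unrolled input columns for sample n (assumed layout)
    grad_out_batch[n] : (C_out, P) gradient w.r.t. the output for sample n
    Returns a (C_out, K) filter gradient.
    """
    total = None
    for cols, grad_out in zip(cols_batch, grad_out_batch):
        # Per-sample gradient: grad_out @ cols^T
        per_sample = matmul(grad_out, transpose(cols))
        if total is None:
            total = per_sample
        else:
            # This element-wise add is the "add from one batch to another" step.
            total = [[t + p for t, p in zip(tr, pr)]
                     for tr, pr in zip(total, per_sample)]
    return total
```

The add itself is a cheap element-wise operation over a `(C_out, K)` matrix, which would be consistent with it having little effect on the overall runtime.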

I think the bottleneck is most likely the im2col/col2im operation, which is implemented in pure CUDA code with little optimization (compared with cuDNN). The author of Deform Conv also admitted that the main downside of their implementation is that it does not use cuDNN for optimization (sorry, I cannot find the source).
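To make clear what im2col does, here is a rough pure-Python sketch for a plain convolution on a single channel with stride 1 and no padding (all of which are simplifying assumptions on my side). The deformable version additionally samples each input position at a learned offset with bilinear interpolation, which this sketch leaves out.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one column.

    Returns a (kh*kw, out_h*out_w) nested list.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = [[] for _ in range(kh * kw)]
    for i in range(out_h):
        for j in range(out_w):
            for di in range(kh):
                for dj in range(kw):
                    cols[di * kw + dj].append(image[i + di][j + dj])
    return cols


def conv_via_im2col(image, kernel):
    """Convolution (cross-correlation) as a dot product with the unrolled columns."""
    kh, kw = len(kernel), len(kernel[0])
    flat = [v for row in kernel for v in row]
    cols = im2col(image, kh, kw)
    num_positions = len(cols[0])
    return [sum(flat[k] * cols[k][p] for k in range(len(flat)))
            for p in range(num_positions)]
```

Each input value is copied once per kernel position that covers it, so the unrolled buffer is several times larger than the input. That extra memory traffic is one plausible reason a hand-written im2col/col2im can dominate the runtime.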

Hopefully these results are helpful for those who are also interested in implementing Deform Conv in TensorFlow, especially now that the [Deform Conv V2](https://arxiv.org/abs/1811.11168) paper has been released.

Any comments or further discussion are welcome, and Merry Christmas!
