Better design pattern for data_weight synchronization

The ready event used for dst_weight in neighbor_allreduce ensures that the data_weight computation has finished before communication starts, because PyTorch's CUDA stream is not synchronized with our CUDA stream.
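
For reference, the ordering guarantee a ready event provides can be sketched on the CPU with the standard library alone. This is only an analogy, not the actual implementation: two threads stand in for the two unsynchronized CUDA streams, `threading.Event` stands in for the CUDA event, and the names `run_with_ready_event`, `compute_stream`, and `comm_stream` are hypothetical. On the GPU the corresponding primitives would be `cudaEventRecord` on the producing stream and `cudaStreamWaitEvent` on the consuming stream.

```python
import threading
import time


def run_with_ready_event():
    """CPU analogy: two threads play the role of two unsynchronized streams."""
    buf = {"weighted": None}
    ready = threading.Event()  # stands in for the CUDA ready event
    sent = []

    def compute_stream():
        # Stand-in for the data_weight scaling done on PyTorch's stream.
        time.sleep(0.05)  # make the producer slow on purpose
        buf["weighted"] = [x * 0.5 for x in (2.0, 4.0, 6.0)]
        ready.set()  # analogous to recording the event after the computation

    def comm_stream():
        # Analogous to making the communication stream wait on the event.
        ready.wait()
        sent.extend(buf["weighted"])  # stand-in for the actual send

    # Start the consumer first to show that ordering comes from the event,
    # not from the order in which the work was launched.
    threads = [
        threading.Thread(target=comm_stream),
        threading.Thread(target=compute_stream),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent


if __name__ == "__main__":
    print(run_with_ready_event())
```

Without the `ready.wait()` call, the consumer could read the buffer before the weighted values are written, which is the hazard the ready event guards against.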