Are there results with other normalizations?
Hello, thanks for the awesome project and paper.
Are there any results with other normalizations (instance norm, layer norm, ...) instead of shuffle BN?
I found that shuffle BN takes about 20% of the time in
def forward(self, im_q, im_k)
in both the V100x4 and V100x2 settings.
In addition, the shuffle takes about 6x longer than the inference of the key features in https://github.com/facebookresearch/moco/blob/master/moco/builder.py#L133-L135
If replacing batch norm with another normalization does not hurt the results, training could be made faster.
Answer (KaimingHe):
I am not sure how you profiled your timing. Shuffling is not slow at all. It is much faster than alternatives such as SyncBN, because SyncBN happens at every layer, while shuffling is a one-time effort. Shuffling of the input can be further optimized into the dataloader (not in this code), making it virtually free; shuffling of the output feature is very cheap.
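For intuition, here is a minimal single-process sketch of the shuffle/unshuffle bookkeeping. This is not the repo's actual implementation (which in moco/builder.py shuffles GPU tensors across processes with all-gather and broadcast); the function names and the NumPy framing are illustrative only. The point is that the whole operation is a single O(batch) index permutation, applied once before the key encoder and undone once after it.

```python
import numpy as np

def batch_shuffle(x, rng):
    """Permute the batch dimension (illustrative sketch, not MoCo's code).

    Returns the shuffled array plus the indices needed to restore
    the original order after the key encoder's forward pass.
    """
    idx_shuffle = rng.permutation(x.shape[0])   # random permutation of sample indices
    idx_unshuffle = np.argsort(idx_shuffle)     # inverse permutation
    return x[idx_shuffle], idx_unshuffle

def batch_unshuffle(x, idx_unshuffle):
    """Undo batch_shuffle on features computed from the shuffled batch."""
    return x[idx_unshuffle]

rng = np.random.default_rng(0)
images = np.arange(12, dtype=float).reshape(6, 2)          # toy "batch" of 6 samples
shuffled, idx_unshuffle = batch_shuffle(images, rng)
restored = batch_unshuffle(shuffled, idx_unshuffle)
assert np.array_equal(restored, images)                    # order fully recovered
```

Because the permutation touches each sample exactly once, its cost does not grow with network depth, unlike SyncBN, which synchronizes statistics at every BN layer.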
We have also tried GroupNorm, and it is ~2% worse than BN.