Aleksander Holynski (holynski), Ph.D. student, University of Washington, Seattle. holynski.org

pull request comment facebookresearch/synsin

Memory leak in custom batch-norm

Sure, happy to help!

holynski

comment created time in 2 months

PR opened facebookresearch/synsin

Memory leak in custom batch-norm

Using the ResNetEncoder and ResNetDecoder models results in a substantial (~7 MB/s) CPU memory leak, which ultimately causes training to crash after a couple of days.

I traced it down to a few lines in the custom batch-norm. The variables self.stored_mean and self.stored_var seem to be used only during inference, but they are not protected from gradient computation, and thus extend the computation graph on each call to bn.forward().

An old issue on the main PyTorch repo seems to validate these findings: https://github.com/pytorch/pytorch/issues/20275

I've tested with this change, and the memory usage remains constant.
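The failure mode and the fix can be sketched with a minimal re-implementation (this is an illustrative module, not the synsin code; only the buffer names stored_mean and stored_var are taken from the PR description). Updating the running statistics outside the autograd graph keeps each forward() from appending new nodes to a graph that is never freed:

```python
import torch

class BatchNormSketch(torch.nn.Module):
    """Hypothetical minimal batch-norm illustrating the leak and the fix."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum = momentum
        self.eps = eps
        # Running statistics, used only at inference time.
        self.register_buffer("stored_mean", torch.zeros(num_features))
        self.register_buffer("stored_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            # The fix: update the running statistics under no_grad(). Without
            # this, each call links stored_mean/stored_var into a new chunk of
            # computation graph that autograd retains across iterations.
            with torch.no_grad():
                self.stored_mean = (1 - self.momentum) * self.stored_mean + self.momentum * mean
                self.stored_var = (1 - self.momentum) * self.stored_var + self.momentum * var
        else:
            mean, var = self.stored_mean, self.stored_var
        return (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)
```

After a training-mode forward pass, the buffers should have no grad_fn (they are detached from the graph), while the normalized output still carries gradients back to the input.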

+14 -13

0 comments

1 changed file

pr created time in 2 months

push event holynski/synsin

holynski

commit sha da12397a012078482de8dd2196e5fb29c7eb8c8c

Disabling graph computation for running statistics in custom batch norm


push time in 2 months

issue closed facebookresearch/synsin

ResNetDecoder default params

Hey -- thanks for releasing the code, it's been very useful.

I've been reading through to try to make sure I understand everything, and there's one part that has me a bit confused:

In the top-level training script, train.py, an instance of the main training model ZbufferModelPts is instantiated, which in turn instantiates all the sub-models, which include the depth estimator, the feature encoder, and the feature decoder.

In particular, the feature decoder is instantiated as: https://github.com/facebookresearch/synsin/blob/82ff948f91a779188c467922c8f5144018b40ac8/models/z_buffermodel.py#L35

which, if we've chosen to use the ResNet blocks (as is default in the parameters shown in train.sh), will return: https://github.com/facebookresearch/synsin/blob/82ff948f91a779188c467922c8f5144018b40ac8/models/networks/utilities.py#L31

Shouldn't the channels_in here be 64, since it's taking as input the re-projected features? Or am I missing something?
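To see concretely why the channel count matters (a hypothetical minimal repro, not the synsin code; the 64-channel feature shape is assumed from the question above): a decoder whose first convolution expects a channels_in different from the re-projected feature map fails immediately at forward time.

```python
import torch

# Assumed shape of the re-projected feature map: 64 channels.
reprojected = torch.randn(1, 64, 32, 32)

# A first decoder layer with matching channels_in works as expected.
conv_ok = torch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
out = conv_ok(reprojected)

# A mismatched channels_in (e.g. 3, as if expecting an RGB image) raises a
# RuntimeError as soon as the features are passed through.
conv_bad = torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
try:
    conv_bad(reprojected)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
```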

closed time in 2 months

holynski

issue comment facebookresearch/synsin

ResNetDecoder default params

Good catch, thanks!

It's probably worth making a pull request at some point to change all the default parameters to those used in the paper...

holynski

comment created time in 2 months

issue comment facebookresearch/synsin

ResNetDecoder default params

Thanks!

Yes, you're right that the default parameter for get_decoder() is UNetDecoder64, but in train.sh the parameters specifically override this default:

https://github.com/facebookresearch/synsin/blob/82ff948f91a779188c467922c8f5144018b40ac8/train.sh#L11

Also, I've been following that codepath, because from my understanding of the paper, the ResNet architecture is the one used in the proposed method. From the paper:

Our spatial feature network and refinement networks are composed of ResNet blocks. (Appendix B)

It also mentions:

We experimented with using a UNet architecture instead of a sequence of ResNet blocks for the spatial feature network and refinement network. This led to much worse results and was more challenging to train. (Appendix E)

Before I dug into the code, I assumed that the default parameters (when calling train.py directly, or importing a ZBufferModel in another script) would reproduce the architecture and parameters described in the paper, but it seems this is not the case.
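The default-versus-override behavior described above can be sketched with a small argparse example (the flag name and values here are placeholders, not necessarily those used by synsin): the code's built-in default selects one architecture, while a shell script like train.sh passes an explicit value that overrides it.

```python
import argparse

# Hypothetical flag mirroring the pattern: the code default is the UNet
# decoder, but the training script overrides it with the ResNet variant.
parser = argparse.ArgumentParser()
parser.add_argument("--decoder", default="unet")

args_plain = parser.parse_args([])                         # like calling train.py directly
args_script = parser.parse_args(["--decoder", "resnet"])   # like running via train.sh

print(args_plain.decoder)   # the code default, not the paper's choice
print(args_script.decoder)  # the overridden value
```

This is why reading only the argument defaults can be misleading: the configuration actually used in the paper's experiments lives in the shell script's explicit flags.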

holynski

comment created time in 3 months

issue opened facebookresearch/synsin

ResNetDecoder default params

(Issue body identical to the "issue closed" entry above.)

created time in 3 months
