Xinlei Chen (endernewton) | Facebook, Menlo Park | xinleic.xyz

endernewton/tf-faster-rcnn 3599

Tensorflow Faster RCNN for Object Detection

endernewton/iter-reason 262

Code for Iterative Reasoning Paper (CVPR 2018)

aayushbansal/MarrRevisited 93

Marr Revisited: 2D-3D Alignment via Surface Normal Prediction

endernewton/subdiscover 40

Subcategory Discovery and Segmentation for CVPR 2014 Paper

endernewton/c2board 32

Tensorboard for Caffe2

endernewton/neil-test 32

NEIL Test Code

lichengunc/mteqa 11

Multi-Target Embodied Question Answering

endernewton/webly-supervised 6

Code release for webly supervised paper (ICCV'15)

issue comment facebookresearch/simsiam

Failing on Reproducing ImageNet Linear Classification Results with SGD

Oh I see, for the released model we did not search over different SGD lrs. To reproduce the 4096-batch-size LARS result, you can also do gradient accumulation -- which is actually not that hard, given that linear eval is lightweight and BN-free.
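
A minimal sketch of that gradient-accumulation idea, assuming a generic PyTorch linear-eval loop (the backbone, classifier, loader, and optimizer below are illustrative stand-ins, not the repo's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins: a frozen backbone, a linear head, and toy data.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512)).eval()
linear_clf = nn.Linear(512, 1000)
optimizer = torch.optim.SGD(linear_clf.parameters(), lr=0.1, momentum=0.9)
loader = [(torch.randn(256, 3, 32, 32), torch.randint(0, 1000, (256,)))
          for _ in range(32)]

accum_steps = 4096 // 256  # 16 micro-batches of 256 -> effective batch 4096

optimizer.zero_grad()
for step, (images, targets) in enumerate(loader):
    with torch.no_grad():          # the backbone stays frozen in linear eval
        feats = backbone(images)
    loss = F.cross_entropy(linear_clf(feats), targets) / accum_steps
    loss.backward()                # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()           # one update per effective 4096 batch
        optimizer.zero_grad()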

ferreirafabio

comment created 13 days ago

issue closed facebookresearch/simsiam

Failing on Reproducing ImageNet Linear Classification Results with SGD

Hello,

I'm trying to reproduce the ImageNet results with SGD in a DDP setting with 8 GPUs and batch size 256, learning rate 30, weight decay 0, momentum 0.9, and 100 epochs. According to the paper, linear classification with SGD yields about 1% lower accuracy, which would be 68.1 - 1 = ~67.1%. However, with the provided bs-256 pre-training checkpoint I can only get to Acc@1 65.080 Acc@5 86.696. Any idea what I could do to match the described performance? Any hints on how to set the hyperparameters? I am using the original code from this repo.
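
For reference, the stated hyperparameters translate to an SGD setup like the following (a sketch; the linear head's shape is just illustrative for frozen ResNet-50 features):

import torch
import torch.nn as nn

# Linear eval settings as stated above: lr 30, momentum 0.9, weight decay 0.
classifier = nn.Linear(2048, 1000)  # illustrative head on frozen features
optimizer = torch.optim.SGD(classifier.parameters(),
                            lr=30.0, momentum=0.9, weight_decay=0.0)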

Here's the output of the final epoch and the validation run:

Epoch: [99][4980/5005]  Time  0.401 ( 0.217)    Data  0.362 ( 0.049)    Loss 1.6190e+00 (1.5670e+00)    Acc@1  56.25 ( 63.85)   Acc@5  87.50 ( 84.66)
Epoch: [99][4990/5005]  Time  0.060 ( 0.217)    Data  0.012 ( 0.049)    Loss 1.8884e+00 (1.5670e+00)    Acc@1  56.25 ( 63.85)   Acc@5  87.50 ( 84.66)
Epoch: [99][5000/5005]  Time  0.057 ( 0.217)    Data  0.017 ( 0.050)    Loss 1.1361e+00 (1.5666e+00)    Acc@1  68.75 ( 63.86)   Acc@5  90.62 ( 84.66)
Test: [  0/196] Time 12.220 (12.220)    Loss 8.8439e-01 (8.8439e-01)    Acc@1  77.34 ( 77.34)   Acc@5  94.92 ( 94.92)
Test: [ 10/196] Time  0.276 ( 2.376)    Loss 1.3778e+00 (1.0927e+00)    Acc@1  64.06 ( 72.16)   Acc@5  89.84 ( 91.73)
Test: [ 20/196] Time  0.302 ( 1.952)    Loss 1.1767e+00 (1.0888e+00)    Acc@1  78.12 ( 73.05)   Acc@5  87.50 ( 91.15)
Test: [ 30/196] Time  0.274 ( 1.867)    Loss 1.2236e+00 (1.0760e+00)    Acc@1  67.97 ( 73.44)   Acc@5  91.80 ( 91.33)
Test: [ 40/196] Time  0.280 ( 1.672)    Loss 1.2726e+00 (1.1910e+00)    Acc@1  67.58 ( 69.84)   Acc@5  92.97 ( 90.62)
Test: [ 50/196] Time  0.274 ( 1.728)    Loss 8.7860e-01 (1.1958e+00)    Acc@1  77.34 ( 69.55)   Acc@5  94.92 ( 90.83)
Test: [ 60/196] Time  0.275 ( 1.664)    Loss 1.4648e+00 (1.1892e+00)    Acc@1  64.84 ( 69.67)   Acc@5  88.28 ( 91.12)
Test: [ 70/196] Time  0.312 ( 1.662)    Loss 1.0541e+00 (1.1579e+00)    Acc@1  73.05 ( 70.44)   Acc@5  91.80 ( 91.43)
Test: [ 80/196] Time  0.274 ( 1.663)    Loss 1.9151e+00 (1.1739e+00)    Acc@1  51.17 ( 70.04)   Acc@5  80.47 ( 91.08)
Test: [ 90/196] Time  0.285 ( 1.715)    Loss 2.3642e+00 (1.2366e+00)    Acc@1  47.27 ( 68.99)   Acc@5  74.22 ( 90.19)
Test: [100/196] Time  0.279 ( 1.648)    Loss 2.1153e+00 (1.2950e+00)    Acc@1  51.17 ( 67.87)   Acc@5  75.78 ( 89.36)
Test: [110/196] Time  0.274 ( 1.647)    Loss 1.2629e+00 (1.3154e+00)    Acc@1  70.70 ( 67.53)   Acc@5  88.67 ( 89.03)
Test: [120/196] Time  0.286 ( 1.624)    Loss 1.7559e+00 (1.3332e+00)    Acc@1  64.06 ( 67.34)   Acc@5  81.64 ( 88.72)
Test: [130/196] Time  0.279 ( 1.634)    Loss 1.2278e+00 (1.3690e+00)    Acc@1  70.70 ( 66.59)   Acc@5  90.23 ( 88.21)
Test: [140/196] Time  0.282 ( 1.584)    Loss 1.6165e+00 (1.3958e+00)    Acc@1  63.67 ( 66.11)   Acc@5  86.72 ( 87.85)
Test: [150/196] Time  0.273 ( 1.574)    Loss 1.6683e+00 (1.4218e+00)    Acc@1  68.36 ( 65.76)   Acc@5  80.86 ( 87.37)
Test: [160/196] Time  0.275 ( 1.537)    Loss 1.2704e+00 (1.4396e+00)    Acc@1  70.70 ( 65.53)   Acc@5  88.28 ( 87.07)
Test: [170/196] Time  0.277 ( 1.514)    Loss 1.0841e+00 (1.4603e+00)    Acc@1  74.22 ( 65.04)   Acc@5  93.36 ( 86.76)
Test: [180/196] Time  0.275 ( 1.472)    Loss 1.5453e+00 (1.4728e+00)    Acc@1  63.28 ( 64.80)   Acc@5  88.28 ( 86.58)
Test: [190/196] Time  0.274 ( 1.446)    Loss 1.3819e+00 (1.4701e+00)    Acc@1  64.45 ( 64.85)   Acc@5  91.80 ( 86.61)
 * Acc@1 65.080 Acc@5 86.696

Thanks!

closed 13 days ago

ferreirafabio

issue comment facebookresearch/simsiam

Failing on Reproducing ImageNet Linear Classification Results with SGD

Oh I see, for SGD you may want to search over different lrs in order to get results that match LARS. You can try increasing or lowering the lr.
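
A minimal sketch of such a sweep, assuming the repo's main_lincls.py interface (the flag names mirror the command quoted later in this feed; the lr grid and paths are illustrative):

import subprocess

# Illustrative grid around the default lr of 30 used with batch size 256.
for lr in (10.0, 30.0, 60.0, 90.0):
    subprocess.run(
        ["python", "main_lincls.py",
         "-a", "resnet50", "--lr", str(lr),
         "--pretrained", "path/to/checkpoint.pth.tar",  # placeholder path
         "path/to/imagenet"],                           # placeholder folder
        check=True)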

ferreirafabio

comment created 13 days ago

issue comment facebookresearch/simsiam

Failing on Reproducing ImageNet Linear Classification Results with SGD

Did you refer to the provided log file and check if it matches?

ferreirafabio

comment created 13 days ago

push event endernewton/tf-faster-rcnn

Xinlei Chen

commit sha b4da911705925ec00e8359b76a2d7260f6d1d314

Update README.md

view details

pushed 23 days ago

issue closed facebookresearch/moco-v3

Transfer learning performance of MoCo v3 on more challenging downstream dense prediction tasks.

Thanks for your great work!

I believe a goal of un-/self-supervised learning is to learn transferable feature representations. I notice that MoCo v3 conducts a study on some smaller image classification datasets such as CIFAR-10/-100, and the performance is quite impressive.

But it seems that the performance of modern neural nets on these image classification datasets is somewhat saturated. I believe the community is more interested in more challenging downstream dense prediction tasks such as object detection and scene parsing. Task-specific decoder layers such as DETR (for object detection) and SETR (for semantic segmentation or scene parsing) can almost be used out of the box. I wonder whether there is a plan to study the transfer learning performance of MoCo v3 on downstream dense prediction tasks in the future?

closed a month ago

Yuxin-CV

issue comment facebookresearch/moco-v3

Transfer learning performance of MoCo v3 on more challenging downstream dense prediction tasks.

There are plans, but they are separate efforts from MoCo v3. We will have updates on this topic in the near future, so please stay tuned.

Yuxin-CV

comment created a month ago

push event facebookresearch/moco-v3

Xinlei Chen

commit sha 4881c3abb1dda14aca4f22e17b335b4bdb748dc4

update README to make the pointer to reference models.

view details

pushed a month ago

issue comment facebookresearch/moco-v3

What are the parameters used for linear classification in the resnet50 experiment?

They are listed in: https://github.com/facebookresearch/moco-v3/blob/main/CONFIG.md

yxchng

comment created a month ago

issue closed facebookresearch/moco-v3

What are the parameters used for linear classification in the resnet50 experiment?

specifically, what are the parameters for

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]

in the resnet50 experiment?

closed a month ago

yxchng

issue closed facebookresearch/moco-v3

Checkpoint release

Hi,

are you going to release the checkpoints?

closed a month ago

JACKHAHA363

issue comment facebookresearch/moco-v3

Checkpoint release

We have released the pre-trained models at https://github.com/facebookresearch/moco-v3/blob/main/CONFIG.md. Let us know if you think some important ones are missing.

JACKHAHA363

comment created a month ago

issue comment facebookresearch/moco-v3

Tensorflow version

The mmf team (https://github.com/facebookresearch/mmf), specifically @ronghanghu, has a good implementation of MoCo v3 using PyTorch XLA, which could be a good starting point on TPUs. It may be worth checking with them about their plan for releasing it.

wildphoton

comment created a month ago

issue closed facebookresearch/moco-v3

Tensorflow version

Thank you for open-sourcing the PyTorch implementation. I wonder if the original TensorFlow implementation has been released for the purpose of training on TPUs.

closed a month ago

wildphoton

issue comment facebookresearch/moco-v3

Tensorflow version

We have not released it, and there is no plan to release the TensorFlow code.

wildphoton

comment created a month ago

issue closed facebookresearch/moco-v3

ViT-Small model number of heads in attention differs from the original paper

Per the DeiT paper and timm's implementation, ViT-S uses 6 heads in the attention blocks. It seems the ViT-S here uses 12 heads. Is there a reason the number of heads is doubled?

closed a month ago

YellowPig-zp

issue comment facebookresearch/moco-v3

ViT-Small model number of heads in attention differs from the original paper

We empirically found that the 12-head ViT-S works better than the 6-head one, with the same computation; thus the decision. The DeiT folks later confirmed a similar observation.
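
To illustrate the head split with timm's VisionTransformer (a sketch for illustration; moco-v3 ships its own ViT definition): doubling the heads at a fixed width of 384 leaves parameters and FLOPs unchanged, since each head just becomes narrower (dim 32 instead of 64).

from timm.models.vision_transformer import VisionTransformer

# Standard ViT-S per DeiT: width 384, depth 12, 6 heads (head dim 64).
vit_s_6heads = VisionTransformer(embed_dim=384, depth=12, num_heads=6)
# The variant described above: same width and depth, 12 heads (head dim 32).
vit_s_12heads = VisionTransformer(embed_dim=384, depth=12, num_heads=12)

# Parameter counts match; only the attention is split into finer heads.
assert sum(p.numel() for p in vit_s_6heads.parameters()) == \
       sum(p.numel() for p in vit_s_12heads.parameters())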

YellowPig-zp

comment created a month ago

issue closed facebookresearch/simsiam

Checkpoints for COCO and VOC

Is it possible to share trained checkpoints for the SimSiam models finetuned on COCO and VOC?

closed a month ago

alexlioralexli

issue comment facebookresearch/simsiam

Checkpoints for COCO and VOC

I think you can directly do the fine-tuning on COCO/VOC using the released code? We do not have plans to release the trained Mask R-CNN models, but they should be easy to reproduce.

alexlioralexli

comment created a month ago

issue comment facebookresearch/simsiam

Learning rate scheduler setting for different epochs

We conducted separate experiments, each with its own max epoch setting. So option 2.
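
The distinction matters because a cosine schedule is parameterized by the max epoch: a 100-epoch run fully decays its lr by epoch 100, while an 800-epoch run has barely decayed by then. A minimal sketch (the base lr value is illustrative):

import math

def cosine_lr(base_lr, epoch, max_epochs):
    # Cosine decay from base_lr at epoch 0 down to 0 at max_epochs.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / max_epochs))

print(cosine_lr(0.05, 100, 100))  # 0.0    -- this schedule has finished
print(cosine_lr(0.05, 100, 800))  # ~0.048 -- this one has barely started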

BaohaoLiao

comment created a month ago

issue closed facebookresearch/simsiam

Learning rate scheduler setting for different epochs

Hi,

In Table 4 of your paper, you compare results for different epoch counts. May I ask what the difference is in the learning rate scheduler settings for these epochs?

I think there are two options we can use:

  1. You conduct only one experiment with the max epoch set to 800, then take checkpoints at 100, 200, 400, and 800 epochs.
  2. You conduct four experiments with the max epoch set to 100, 200, 400, and 800, respectively.

I want to make sure which one you used, since their lr schedules are different.

closed a month ago

BaohaoLiao

issue closed facebookresearch/simsiam

Negative Loss

I am getting a negative loss value during training. Is that normal? Why did you use a negative sign in the loss function?

closed a month ago

say2sarwar

issue comment facebookresearch/simsiam

Negative Loss

It is normal, because it is the negative cosine similarity. The negative sign makes it a proper "loss". Please see the details in the paper.
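
A minimal sketch of a negative cosine similarity criterion of this form, with the stop-gradient and symmetrization described in the paper (tensor shapes are illustrative):

import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    # Negative cosine similarity; z carries a stop-gradient, as in the paper.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

# Illustrative predictor outputs (p1, p2) and projector outputs (z1, z2).
p1, p2 = torch.randn(8, 2048), torch.randn(8, 2048)
z1, z2 = torch.randn(8, 2048), torch.randn(8, 2048)

# The symmetrized loss lies in [-1, 1], so negative values are expected;
# it approaches -1 as the representations of the two views align.
loss = 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)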

say2sarwar

comment created a month ago

issue comment facebookresearch/simsiam

What's the point if we do not gather all outputs in different GPUs to compute contrastive loss

The 4096 batch size being worse is due to the general difficulty of large-batch training; it is observed in other settings (e.g. supervised ImageNet training) as well.

BaohaoLiao

comment created 2 months ago

issue closed facebookresearch/simsiam

What's the point if we do not gather all outputs in different GPUs to compute contrastive loss

Hi,

this is really great work. However, I have a general question about the contrastive loss.

In your code, you use 8 GPUs for a total batch size of 256, which means 32 samples on each GPU. You first compute the contrastive loss of these 32 samples on the same GPU, then gather the losses from the different GPUs to compute the final gradient.

However, it makes little sense to me to increase the batch size this way. One challenge for the contrastive loss is finding hard negatives, and normally we increase the batch size on a single GPU to handle this, since a larger batch size offers more opportunities to find hard negatives. But if we use DDP, this kind of larger total batch size is not useful.

For example, if I use 16 GPUs for a total batch size of 512, this results in the same number of samples (32) on one GPU as above. Would it be better to gather all of the output embeddings from the different GPUs onto one GPU to compute the contrastive loss?

In Table 2 of your paper, how do you change the batch size? By increasing the samples on a single GPU with a fixed number of GPUs, or by increasing the number of GPUs with a fixed number of samples per GPU? The result is a little odd to me: the total batch size of 4096 is the worst.
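
For reference, a minimal sketch of the gather proposed above, assuming torch.distributed is initialized (a common pattern, not this repo's code; note that all_gather does not propagate gradients, so the local tensor is re-inserted to keep its gradient path):

import torch
import torch.distributed as dist

def gather_embeddings(local_z):
    # Collect embeddings from all ranks; all_gather returns detached tensors.
    world = dist.get_world_size()
    gathered = [torch.zeros_like(local_z) for _ in range(world)]
    dist.all_gather(gathered, local_z)
    # Re-insert this rank's own tensor so its gradients still flow.
    gathered[dist.get_rank()] = local_z
    return torch.cat(gathered, dim=0)  # shape: (world * local_batch, dim)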

closed 2 months ago

BaohaoLiao

issue comment facebookresearch/simsiam

What's the point if we do not gather all outputs in different GPUs to compute contrastive loss

Was this posted on the wrong repo? For SimSiam, we use an l2 loss (cosine similarity); we do not use a contrastive loss.

BaohaoLiao

comment created 2 months ago

issue comment facebookresearch/moco-v3

release pretrained weights

Some pretrained models can be found here: https://github.com/facebookresearch/moco-v3/blob/main/CONFIG.md. We will update it when we have more pretrained weights to share.

maliho0803

comment created 2 months ago

issue closed facebookresearch/moco-v3

release pretrained weights

Do you have plans to release your pretrained weights?

closed 2 months ago

maliho0803

issue comment facebookresearch/moco-v3

KNN curve code

Yes, this is planned. We will verify it on our side before publishing. It may take a while, though. Thanks for your patience!

kashif

comment created 2 months ago

issue comment facebookresearch/simsiam

Checkpoint with 200 epochs?

Okay, I can train one on my side.

lzyhha

comment created 2 months ago