Question: Pre-train time too long.

According to your results, pre-training for 200 epochs (ResNet-50 baseline) takes 53 hours on an 8×V100 machine. But the training speed on my 8×V100 machine is three to four times slower than that, and I don't know why. Maybe the environment configuration is different. Could you release your environment configuration? Thanks!

Here is the pre-training log: about 0.6 s per batch, i.e. roughly 3000 s (about 1 h) per epoch.

2020-07-16T09:20:06.867Z: [1,0]<stdout>:Epoch: [16][4000/5004]	Time  1.300 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.0633e+00 (1.2471e+00)	Acc@1 100.00 ( 95.40)	Acc@5 100.00 ( 97.76)
2020-07-16T09:20:12.016Z: [1,0]<stdout>:Epoch: [16][4010/5004]	Time  0.309 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.4829e+00 (1.2472e+00)	Acc@1  87.50 ( 95.40)	Acc@5  93.75 ( 97.76)
2020-07-16T09:20:18.283Z: [1,0]<stdout>:Epoch: [16][4020/5004]	Time  1.043 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.1532e+00 (1.2472e+00)	Acc@1  96.88 ( 95.40)	Acc@5  96.88 ( 97.75)
2020-07-16T09:20:24.301Z: [1,0]<stdout>:Epoch: [16][4030/5004]	Time  0.271 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.1201e+00 (1.2469e+00)	Acc@1  96.88 ( 95.40)	Acc@5 100.00 ( 97.75)
2020-07-16T09:20:30.259Z: [1,0]<stdout>:Epoch: [16][4040/5004]	Time  0.413 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.4439e+00 (1.2468e+00)	Acc@1  90.62 ( 95.40)	Acc@5  93.75 ( 97.75)
2020-07-16T09:20:36.487Z: [1,0]<stdout>:Epoch: [16][4050/5004]	Time  0.213 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.1293e+00 (1.2468e+00)	Acc@1  93.75 ( 95.40)	Acc@5 100.00 ( 97.76)
2020-07-16T09:20:42.951Z: [1,0]<stdout>:Epoch: [16][4060/5004]	Time  0.232 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.1727e+00 (1.2470e+00)	Acc@1 100.00 ( 95.40)	Acc@5 100.00 ( 97.75)
2020-07-16T09:20:48.433Z: [1,0]<stdout>:Epoch: [16][4070/5004]	Time  0.260 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.3516e+00 (1.2469e+00)	Acc@1  96.88 ( 95.40)	Acc@5  96.88 ( 97.75)
2020-07-16T09:20:54.556Z: [1,0]<stdout>:Epoch: [16][4080/5004]	Time  0.271 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.0669e+00 (1.2469e+00)	Acc@1  96.88 ( 95.40)	Acc@5 100.00 ( 97.76)
2020-07-16T09:21:01.362Z: [1,0]<stdout>:Epoch: [16][4090/5004]	Time  0.914 ( 0.684)	Data  0.000 ( 0.082)	Loss 1.3178e+00 (1.2468e+00)	Acc@1  90.62 ( 95.40)	Acc@5  96.88 ( 97.75)
2020-07-16T09:21:07.425Z: [1,0]<stdout>:Epoch: [16][4100/5004]	Time  0.215 ( 0.683)	Data  0.000 ( 0.082)	Loss 9.2172e-01 (1.2467e+00)	Acc@1 100.00 ( 95.40)	Acc@5 100.00 ( 97.75)
2020-07-16T09:21:14.707Z: [1,0]<stdout>:Epoch: [16][4110/5004]	Time  0.359 ( 0.684)	Data  0.000 ( 0.082)	Loss 1.3362e+00 (1.2468e+00)	Acc@1  96.88 ( 95.40)	Acc@5  96.88 ( 97.75)
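The running averages in parentheses in the log (0.685 s per batch over 5004 batches per epoch) can be turned into an epoch and total-runtime estimate; a quick sanity check, with the values taken from the log above:

```python
# Sanity check: estimate epoch and total pre-training time
# from the running-average batch time reported in the log.
batches_per_epoch = 5004     # from "[16][4000/5004]" in the log
avg_batch_time_s = 0.685     # running average "Time ... ( 0.685)"
epochs = 200
reported_hours = 53          # baseline time reported by the authors

epoch_time_s = batches_per_epoch * avg_batch_time_s
total_hours = epoch_time_s * epochs / 3600

print(f"~{epoch_time_s:.0f} s per epoch, ~{total_hours:.0f} h for {epochs} epochs")
print(f"vs. reported {reported_hours} h: {total_hours / reported_hours:.1f}x slower")
```

This comes out to roughly 190 h, i.e. about 3.6× the reported 53 h, consistent with the "three to four times slower" observation.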
➜  2020-7-16 nvidia-smi
Thu Jul 16 09:41:17 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   54C    P0   181W / 250W |   4802MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   56C    P0   109W / 250W |   4810MiB / 32480MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:0D:00.0 Off |                    0 |
| N/A   42C    P0   176W / 250W |   4808MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:13:00.0 Off |                    0 |
| N/A   43C    P0   172W / 250W |   4810MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  Off  | 00000000:83:00.0 Off |                    0 |
| N/A   56C    P0   197W / 250W |   4804MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   58C    P0   168W / 250W |   4810MiB / 32480MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  Off  | 00000000:8E:00.0 Off |                    0 |
| N/A   43C    P0    64W / 250W |   4810MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  Off  | 00000000:91:00.0 Off |                    0 |
| N/A   42C    P0   157W / 250W |   4808MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+

It seems that this problem is caused by the PyTorch version. This is my running environment:

pytorch1.3.1-py36-cuda10.0-cudnn7.0
facebookresearch/moco
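When comparing environments like this, a minimal snippet to print the exact framework and library versions (using standard PyTorch attributes) makes the comparison unambiguous:

```python
import torch

# Print the version info that most often explains speed differences
# between otherwise identical machines.
print("PyTorch :", torch.__version__)
print("CUDA    :", torch.version.cuda)            # toolkit PyTorch was built with
print("cuDNN   :", torch.backends.cudnn.version())
print("GPUs    :", torch.cuda.device_count())
print("cudnn.benchmark:", torch.backends.cudnn.benchmark)
```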

Answer from KaimingHe:

See also #33

I believe your issue is not particularly related to this repo. Please try running the official PyTorch ImageNet training example, on which this repo is based, to check the data-loading speed of your environment. See #5 for how a similar issue was addressed.
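One quick way to check whether data loading is the bottleneck is to time the `DataLoader` by itself, with no model in the loop. This is a hypothetical sketch (the `FakeImages` dataset is a stand-in; for a real check, swap it for the `ImageFolder` dataset the official ImageNet example uses):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

# Synthetic dataset standing in for ImageNet, to isolate
# data-loading cost from GPU compute.
class FakeImages(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # One 224x224 RGB "image" and a fake label
        return torch.randn(3, 224, 224), idx % 1000

if __name__ == "__main__":
    loader = DataLoader(FakeImages(), batch_size=32, num_workers=4)
    start = time.time()
    for images, labels in loader:
        pass  # the real training step would go here
    print(f"{(time.time() - start) / len(loader):.3f} s per batch (data only)")
```

If the real dataset's per-batch time here is much higher than with the synthetic one, the bottleneck is I/O or JPEG decoding rather than the GPUs; raising `num_workers` or moving the data to faster storage is the usual remedy.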

Source: https://uonfu.com/