Got the error `Input tensor shape: torch.Size([1024]). Additional info: {'b': 4}. Expected 2 dimensions, got 1`. Everything used to work fine, but now it keeps giving me this error, and it only happens with `--taming`.

```
deepspeed train_dalle.py --local_rank=0 --image_text_folder /home/valterjordan/DALLE-pytorch/datasets/train2017 --truncate_captions --deepspeed --distributed_backend deepspeed --fp16 --taming
[2021-06-29 20:20:36,222] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-06-29 20:20:36,232] [INFO] [runner.py:358:main] cmd = /home/valterjordan/miniconda3/envs/DALLE-pytorch/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 train_dalle.py --local_rank=0 --image_text_folder /home/valterjordan/DALLE-pytorch/datasets/train2017 --truncate_captions --deepspeed --distributed_backend deepspeed --fp16 --taming
[2021-06-29 20:20:36,528] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2021-06-29 20:20:36,529] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-06-29 20:20:36,529] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-06-29 20:20:36,529] [INFO] [launch.py:102:main] dist_world_size=1
[2021-06-29 20:20:36,529] [INFO] [launch.py:105:main] Setting CUDA_VISIBLE_DEVICES=0
/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/pytorch_lightning/metrics/__init__.py:44: LightningDeprecationWarning: pytorch_lightning.metrics.* module has been renamed to torchmetrics.* and split off to its own package (https://github.com/PyTorchLightning/metrics) since v1.3 and will be removed in v1.5
  "pytorch_lightning.metrics.* module has been renamed to torchmetrics.* and split off to its own package"
Using DeepSpeed for distributed execution
[2021-06-29 20:20:37,314] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
using pretrained VAE for encoding images to tokens
Working with z of shape (1, 256, 16, 16) = 65536 dimensions.
loaded pretrained LPIPS loss from taming/modules/autoencoder/lpips/vgg.pth
VQLPIPSWithDiscriminator running with hinge loss.
Loaded VQGAN from /home/valterjordan/.cache/dalle/vqgan.1024.model.ckpt and /home/valterjordan/.cache/dalle/vqgan.1024.config.yml
118287 image-text pairs found for training
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
[2021-06-29 20:20:43,289] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13+3352086, git-hash=3352086, git-branch=sparse_triton_support
[2021-06-29 20:20:44,498] [INFO] [engine.py:80:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-06-29 20:20:44,513] [INFO] [engine.py:598:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-06-29 20:20:44,513] [INFO] [engine.py:602:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-06-29 20:20:44,513] [INFO] [engine.py:612:_configure_optimizer] DeepSpeed Basic Optimizer = Adam
[2021-06-29 20:20:44,513] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 unfused optimizer with dynamic loss scale
[2021-06-29 20:20:44,513] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : False
[2021-06-29 20:20:44,522] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = Adam
[2021-06-29 20:20:44,522] [INFO] [engine.py:449:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-06-29 20:20:44,522] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2021-06-29 20:20:44,522] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0003], mom=[(0.9, 0.999)]
[2021-06-29 20:20:44,522] [INFO] [config.py:737:print] DeepSpeedEngine configuration:
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] activation_checkpointing_config {
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "partition_activations": false,
    "profile": false,
    "synchronize_checkpoint_boundary": false
}
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] allreduce_always_fp32 ........ False
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] amp_enabled .................. False
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] amp_params ................... {'opt_level': 'O1'}
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] checkpoint_tag_validation_enabled True
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] checkpoint_tag_validation_fail False
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] disable_allgather ............ False
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] dump_state ................... False
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] dynamic_loss_scale_args ...... None
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] elasticity_enabled ........... False
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] flops_profiler_config ........ { "detailed": true, "enabled": false, "module_depth": -1, "profile_step": 200, "top_modules": 1 }
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] fp16_enabled ................. True
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] global_rank .................. 0
[2021-06-29 20:20:44,522] [INFO] [config.py:741:print] gradient_accumulation_steps .. 1
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] gradient_clipping ............ 0.5
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] gradient_predivide_factor .... 1.0
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] initial_dynamic_scale ........ 4294967296
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] loss_scale ................... 0
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] memory_breakdown ............. False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] optimizer_legacy_fusion ...... False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] optimizer_name ............... None
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] optimizer_params ............. None
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] pld_enabled .................. False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] pld_params ................... False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] prescale_gradients ........... False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] scheduler_name ............... None
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] scheduler_params ............. None
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] sparse_attention ............. None
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] sparse_gradients_enabled ..... False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] steps_per_print .............. 10
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] tensorboard_enabled .......... False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] tensorboard_output_path ......
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] train_batch_size ............. 4
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] train_micro_batch_size_per_gpu 4
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] wall_clock_breakdown ......... False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] world_size ................... 1
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] zero_allow_untested_optimizer False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] zero_config .................. {
    "allgather_bucket_size": 500000000,
    "allgather_partitions": true,
    "contiguous_gradients": false,
    "cpu_offload": false,
    "cpu_offload_params": false,
    "cpu_offload_use_pin_memory": "cpu_offload_use_pin_memory",
    "elastic_checkpoint": true,
    "gather_fp16_weights_on_model_save": false,
    "load_from_fp32_weights": true,
    "max_live_parameters": 1000000000,
    "max_reuse_distance": 1000000000,
    "overlap_comm": false,
    "param_persistence_threshold": 100000,
    "prefetch_bucket_size": 50000000,
    "reduce_bucket_size": 500000000,
    "reduce_scatter": true,
    "stage": 0,
    "sub_group_size": 1000000000000
}
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] zero_enabled ................. False
[2021-06-29 20:20:44,523] [INFO] [config.py:741:print] zero_optimization_stage ...... 0
[2021-06-29 20:20:44,524] [INFO] [config.py:747:print] json = {
    "amp": {
        "enabled": false,
        "opt_level": "O1"
    },
    "flops_profiler": {
        "detailed": true,
        "enabled": false,
        "module_depth": -1,
        "output_file": null,
        "profile_step": 200,
        "top_modules": 1
    },
    "fp16": {
        "enabled": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 0.5,
    "train_batch_size": 4
}
Using /home/valterjordan/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/valterjordan/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.1738746166229248 seconds
[2021-06-29 20:20:45,005] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: dalle-ds-cp/global_step0/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/einops/einops.py", line 368, in reduce
    return recipe.apply(tensor)
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/einops/einops.py", line 205, in apply
    backend.shape(tensor))
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/einops/einops.py", line 150, in reconstruct_from_shape
    raise EinopsError('Expected {} dimensions, got {}'.format(len(self.input_composite_axes), len(shape)))
einops.EinopsError: Expected 2 dimensions, got 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_dalle.py", line 553, in <module>
    loss = distr_dalle(text, images, return_loss=True)
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 914, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/dalle_pytorch.py", line 485, in forward
    image = self.vae.get_codebook_indices(image)
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/vae.py", line 199, in get_codebook_indices
    return rearrange(indices, '(b n) () -> b n', b = b)
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/einops/einops.py", line 424, in rearrange
    return reduce(tensor, pattern, reduction='rearrange', **axes_lengths)
  File "/home/valterjordan/miniconda3/envs/DALLE-pytorch/lib/python3.7/site-packages/einops/einops.py", line 376, in reduce
    raise EinopsError(message + '\n {}'.format(e))
einops.EinopsError: Error while processing rearrange-reduction pattern "(b n) () -> b n".
 Input tensor shape: torch.Size([1024]). Additional info: {'b': 4}.
 Expected 2 dimensions, got 1
```
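The traceback points at the cause: the taming VQGAN's `encode` is returning the codebook indices as a flat 1-D tensor of shape `(b*n,)` (here 4 × 256 = 1024), while the `rearrange` pattern in `vae.py` expects a 2-D tensor of shape `(b*n, 1)` with a trailing singleton axis. A minimal sketch reproducing the mismatch, assuming the flattened shape is the only difference (the index values here are dummies):

```python
import torch
from einops import rearrange

b = 4
indices = torch.randint(0, 1024, (b * 256,))  # flat 1-D tensor, shape (1024,)

# The pattern '(b n) ()' expects two axes, the second a singleton,
# i.e. shape (b*n, 1) -- but the input has only one axis, hence the error.
try:
    rearrange(indices, '(b n) () -> b n', b=b)
except Exception as e:
    print(e)  # ... Expected 2 dimensions, got 1

# Dropping the '()' axis matches the flat layout:
codes = rearrange(indices, '(b n) -> b n', b=b)
print(codes.shape)  # torch.Size([4, 256])
```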

lucidrains/DALLE-pytorch

johngore123

Did you fix the issue?

Yes, it should be fixed in the latest version.
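For anyone stuck on an older checkout, a defensive `get_codebook_indices` along these lines handles both index layouts. This is only a sketch, not necessarily the exact upstream patch; it assumes `self.model` is the loaded taming VQGAN, whose `encode` returns the codebook indices as the last element of its info tuple:

```python
import torch
from einops import rearrange

@torch.no_grad()
def get_codebook_indices(self, img):
    b = img.shape[0]
    img = (2 * img) - 1  # VQGAN expects inputs scaled to [-1, 1]
    _, _, [_, _, indices] = self.model.encode(img)
    if indices.dim() == 1:
        # newer taming-transformers: flat (b*n,) index tensor
        return rearrange(indices, '(b n) -> b n', b=b)
    # older taming-transformers: (b*n, 1) with a trailing singleton axis
    return rearrange(indices, '(b n) () -> b n', b=b)
```

Upgrading DALLE-pytorch, as noted above, is the simpler route.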

