Even though I set the visible GPUs to 0, 2, 5, 7, I still run into insufficient GPU memory, and the error reports GPU 3. Here is my error traceback:
Traceback (most recent call last):
  File "main_moco_files_dataset_strong_aug.py", line 500, in <module>
    main()
  File "main_moco_files_dataset_strong_aug.py", line 187, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/disco/chenwei/lizijing/papercode/PatchSearch-main/main_moco_files_dataset_strong_aug.py", line 357, in main_worker
    train(train_loader, model, optimizer, scaler, summary_writer, epoch, args)
  File "/disco/chenwei/lizijing/papercode/PatchSearch-main/main_moco_files_dataset_strong_aug.py", line 406, in train
    loss = model(images[0], images[1], moco_m)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/disco/chenwei/lizijing/papercode/PatchSearch-main/moco/builder.py", line 144, in forward
    q2 = self.predictor(self.base_encoder(x2))
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/timm/models/vision_transformer.py", line 307, in forward
    x = self.forward_features(x)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/timm/models/vision_transformer.py", line 299, in forward_features
    x = self.blocks(x)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/timm/models/vision_transformer.py", line 177, in forward
    x = x + self.drop_path(self.attn(self.norm1(x)))
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/disco/anaconda3/envs/patch_search/lib/python3.7/site-packages/timm/models/vision_transformer.py", line 153, in forward
    attn = attn.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 456.00 MiB (GPU 3; 44.53 GiB total capacity; 42.48 GiB already allocated; 244.44 MiB free; 42.76 GiB reserved in total by PyTorch)
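One thing I am unsure about: since CUDA_VISIBLE_DEVICES remaps the device indices that PyTorch sees, does "GPU 3" in the message mean the fourth visible device (physical GPU 7), or is something actually running on physical GPU 3? A minimal check of the mapping as I understand it (not from my actual code, just illustrative):

import torch

# With CUDA_VISIBLE_DEVICES=0,2,5,7, PyTorch renumbers the visible devices
# as cuda:0..cuda:3, so device_count() should be 4 and index 3 should
# correspond to physical GPU 7.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))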
Here is my running config:
#!/usr/bin/env bash
set -x
set -e

OUTPUT_DIR='/disco/chenwei/lizijing/code_output_dir'
CODE_DIR='/disco/chenwei/lizijing/papercode/SSL-Backdoor-main/poison-generation/data'
EXPERIMENT_ID='HTBA_trigger_10_targeted_n02106550'
export CUDA_VISIBLE_DEVICES=0,2,5,7
EXP_DIR=$OUTPUT_DIR/$EXPERIMENT_ID/moco
EVAL_DIR=$EXP_DIR/linear
DEFENSE_DIR=$EXP_DIR/patch_search_iterative_search_test_images_size_1000_window_w_60_repeat_patch_1_prune_clusters_True_num_clusters_1000_per_iteration_samples_2_remove_0x25
FILTERED_DIR=$DEFENSE_DIR/patch_search_poison_classifier_topk_20_ensemble_5_max_iterations_2000_seed_4789
RATE='1.00'
SEED=4789

### STEP 1.1: pretrain the model
python main_moco_files_dataset_strong_aug.py \
    --seed $SEED \
    -a vit_base --epochs 200 -b 1024 \
    --stop-grad-conv1 --moco-m-cos \
    --multiprocessing-distributed --world-size 1 --rank 0 \
    --dist-url "tcp://localhost:$(( $RANDOM % 50 + 10000 ))" \
    --save_folder $EXP_DIR \
    $CODE_DIR/$EXPERIMENT_ID/train/loc_random_loc*_rate_${RATE}_targeted_True_*.txt
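For context, the launch path in the traceback is the usual mp.spawn pattern. A minimal sketch of what I assume the relevant part of main() does (names taken from the traceback; the body of main_worker is simplified to a placeholder, and the real script parses CLI args):

import argparse
import torch
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    # 'gpu' is the index among the *visible* devices (0..3 here),
    # not the physical GPU id; the physical ids are 0, 2, 5, 7.
    torch.cuda.set_device(gpu)
    print(f"worker {gpu} -> {torch.cuda.get_device_name(gpu)}")

def main():
    args = argparse.Namespace()  # placeholder for the real CLI args
    # device_count() respects CUDA_VISIBLE_DEVICES, so this should be 4
    ngpus_per_node = torch.cuda.device_count()
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))

if __name__ == "__main__":
    main()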
I can guarantee that the CUDA_VISIBLE_DEVICES environment variable has been set successfully: I added a print statement to the Python program being executed and confirmed that the variable's value is as expected. Given that, I expect PyTorch to assign the model only to the specified GPUs.
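The check I added is essentially the following (the exact wording of my print differs, but this is the idea):

import os

# Printed at the top of main_moco_files_dataset_strong_aug.py, before any
# CUDA call, to confirm the variable survives into the Python process.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
# Output: CUDA_VISIBLE_DEVICES = 0,2,5,7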