I have this simplified code snippet, which loads an image and feeds it to a model with a single CNN layer.
def main(cfg):
    model = Model().cuda()
    dataset = Dataset(cfg)
    optimizer = optim.AdamW(model.parameters(), lr=cfg.learning_rate)
    train_dataloader = DataLoader(
        dataset,
        batch_size=cfg.batch_size,
        num_workers=cfg.num_workers,
        shuffle=False,
        pin_memory=True,
    )
    p = next(model.parameters())
    for epoch in range(cfg.max_epochs):
        for idx, target in enumerate(train_dataloader, start=1):
            to_pil_image(target.squeeze(0)).save('test.jpg')
            print(p[0, 0, 0, 0])
            target = target.to('cuda')
            target = F.interpolate(target, (256, 256), mode='bilinear', align_corners=False)
            output = model(target)
            loss = F.mse_loss(target, output)
            loss.backward()
            # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0)
            optimizer.zero_grad()
            optimizer.step()
I noticed that when I call optimizer.step() and then optimizer.zero_grad(), the code works properly (the loss decreases and the model converges). But when I call zero_grad() and then step(), p.grad is 0 after zero_grad() (which is expected), yet p[0, 0, 0, 0] becomes nan after step().
Is this expected behaviour? To my understanding, calling zero_grad() before step() should have the effect of not updating the weights at all.