
`zero_grad` before `step` causes gradient explosion?

I have this simplified code snippet, which loads an image and feeds it to a model with a single CNN layer.

```python
import torch
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader
from torchvision.transforms.functional import to_pil_image

def main(cfg):
    model = Model().cuda()
    dataset = Dataset(cfg)
    optimizer = optim.AdamW(model.parameters(), lr=cfg.learning_rate)
    train_dataloader = DataLoader(
        dataset,
        batch_size=cfg.batch_size,
        num_workers=cfg.num_workers,
        shuffle=False,
        pin_memory=True,
    )
    p = next(model.parameters())
    for epoch in range(cfg.max_epochs):
        for idx, (target) in enumerate(train_dataloader, start=1):
            to_pil_image(target.squeeze(0)).save('test.jpg')
            print(p[0, 0, 0, 0])
            target = target.to('cuda')
            target = F.interpolate(target, (256, 256), mode='bilinear', align_corners=False)
            output = model(target)
            loss = F.mse_loss(target, output)
            loss.backward()
            # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0)
            optimizer.zero_grad()  # called before step() -- the order this question is about
            optimizer.step()
```

I noticed that when calling optimizer.step() and then optimizer.zero_grad(), the code works properly (the loss decreases and the model converges).

But when I call zero_grad() and then step(), p.grad is 0 after zero_grad() (which is expected), yet p[0, 0, 0, 0] becomes nan after step().

Is this expected behaviour? To my understanding, calling zero_grad() before step() should have the effect of not updating the weights at all.
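The expectation stated above can be illustrated with a toy, framework-free sketch (hypothetical names, not the author's model or PyTorch itself): a hand-rolled parameter with an accumulated gradient and a plain SGD update, mimicking the backward() / zero_grad() / step() sequence. For plain SGD, stepping after zeroing the gradient really is a no-op:

```python
class ToyParam:
    """A single scalar parameter with an accumulated gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def backward(p, grad):
    p.grad += grad          # accumulate, like loss.backward()

def step(p, lr=0.1):
    p.value -= lr * p.grad  # plain SGD update

def zero_grad(p):
    p.grad = 0.0

# Order A: step() then zero_grad() -- the weight moves.
a = ToyParam(1.0)
backward(a, 2.0)
step(a)                     # a.value = 1.0 - 0.1 * 2.0, i.e. 0.8
zero_grad(a)

# Order B: zero_grad() then step() -- the update is a no-op for plain SGD.
b = ToyParam(1.0)
backward(b, 2.0)
zero_grad(b)
step(b)                     # gradient is 0, so b.value stays 1.0

print(a.value, b.value)
```

Note this toy only captures vanilla SGD; a stateful optimizer such as AdamW keeps running moment estimates and applies decoupled weight decay on every step(), so its behaviour with zeroed (or None) gradients need not match this sketch.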
