Potential Solutions When Encountering the NaN Gradient Problem with PyTorch
When working with PyTorch, the NaN gradient problem is common. Here are some potential solutions that might work:
- First, make sure the inputs do not contain, and the loss is not, `inf` or `NaN` (e.g., check by printing them); a minimal check is sketched below.
- Make sure there is no division by zero anywhere in the computational graph. In particular, also check operations like `x.sqrt()` or `x.pow()`: values that are too small can cause mathematical errors, so add an epsilon (e.g., 1e-8) if that is the case (second sketch below).
- Sometimes the problem is caused by low numerical precision: for example, if the tensors involved in the computation are `torch.float16`, try switching to `float32` via `tensor.to(torch.float32)`. This can reduce numerical instability and potentially resolve the issue, though at the cost of more memory and compute (third sketch below).
- Make sure the learning rate is not too large if it is involved in the problem. Also try gradient clipping to prevent gradients from becoming too large (fourth sketch below).
- Calling `torch.autograd.set_detect_anomaly(True)` as a starting point may help you locate the operation that first produces a NaN (last sketch below).
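As a concrete version of the first tip, here is a minimal finiteness check. The helper name `assert_finite` and the commented training-step names are hypothetical, not from the original post:

```python
import torch

def assert_finite(name, tensor):
    # torch.isfinite is False for both inf and NaN entries,
    # so this catches either kind of bad value in one pass.
    if not torch.isfinite(tensor).all():
        raise ValueError(f"{name} contains inf or NaN values")

# Hypothetical usage inside a training step:
# assert_finite("inputs", inputs)
# loss = criterion(model(inputs), targets)
# assert_finite("loss", loss)
```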
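For the `sqrt`/`pow` tip, the issue is usually not the forward value but the gradient: the derivative of `sqrt` blows up at zero. A sketch, assuming a tensor `x` that may contain values at or near zero:

```python
import torch

eps = 1e-8  # the small epsilon suggested above

x = torch.tensor([0.0, 1e-12, 4.0], requires_grad=True)

# Unsafe: sqrt(0) is fine in the forward pass, but its derivative
# 0.5 / sqrt(x) is infinite at x = 0, so the backward pass yields inf/NaN.
# y = x.sqrt()

# Safer: keep the argument strictly positive before taking the root.
y = (x + eps).sqrt()  # alternatively: x.clamp(min=eps).sqrt()
y.sum().backward()
print(x.grad)  # all gradients are finite
```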
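The precision tip is a one-line cast. A sketch with a hypothetical half-precision tensor `t`:

```python
import torch

t = torch.randn(3).to(torch.float16)  # a hypothetical float16 tensor

# float16 normals span roughly 6e-5 to 65504, so intermediate values
# can under- or overflow easily; float32 has a far wider dynamic range.
t32 = t.to(torch.float32)
print(t.dtype, "->", t32.dtype)  # torch.float16 -> torch.float32
```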
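For the learning-rate and clipping tip, PyTorch provides `torch.nn.utils.clip_grad_norm_`, which rescales gradients in place after `backward()` and before `step()`. A minimal sketch with a placeholder model and data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # keep the lr modest

inputs = torch.randn(4, 10)   # placeholder batch
targets = torch.randn(4, 1)

optimizer.zero_grad()
loss = F.mse_loss(model(inputs), targets)
loss.backward()

# Rescale all gradients so their combined norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```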
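Finally, anomaly detection makes autograd raise an error at the backward operation that first produces a NaN, instead of letting it propagate silently. A sketch with a deliberately bad `sqrt`:

```python
import torch

torch.autograd.set_detect_anomaly(True)  # noticeably slows training; debug only

x = torch.tensor([-1.0], requires_grad=True)
y = x.sqrt()  # sqrt of a negative number produces NaN

try:
    y.backward()
except RuntimeError as err:
    # The error message names the backward function (SqrtBackward)
    # that returned NaN, pointing you at the offending operation.
    print(err)
```

The same check is also available as a context manager, `torch.autograd.detect_anomaly()`, if you only want it active around one section of code.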