A glimpse into PyTorch Autograd internals
Trying to understand some errors we get while using PyTorch
Intro
Here, we are going to discuss the internals of the PyTorch Autograd module. Most of us never have to know about this. I was the same, till I came across an error.
It came from executing code along the following lines:
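A minimal sketch of that pattern (the shapes, values, and loss here are illustrative, not the original snippet):

```python
import torch

# requires_grad is set on the tensor returned by rand(), but multiplying by
# 0.1 produces a NEW tensor, and that is what the name `a` ends up bound to.
a = torch.rand(3, requires_grad=True) * 0.1

loss = a.sum()       # some loss computed from `a`
loss.backward()

print(a.is_leaf)     # False
print(a.grad)        # None -- Autograd did not populate it

lr = 0.01
a = a - lr * a.grad  # blows up here: a.grad is None
```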
But why? We asked for the gradient of a to be calculated by setting requires_grad to True!
After some investigation, it turned out the error was due to a.grad being None. But why is that the case? When we multiplied a by 0.1, we internally created a new tensor that is intermediate. And by default, PyTorch only populates gradients for leaf tensors.
Notes
These are some notes to help understand the internals of PyTorch and its Autograd module, and why we got this error.
To know more about this in a more structured way, read this blog post here.
- requires_grad tells PyTorch whether it should save the forward results, so that it can use them to calculate the gradients.
- All tensors with requires_grad set to False are leaf tensors.
- Tensors with requires_grad set to True are leaves only if they were created by the user, and not, for example, as the result of a mathematical expression.
- When a tensor is created as the result of an operation on other tensors, it will be a leaf if and only if all the tensors used to generate it have requires_grad set to False; only then will requires_grad for the new tensor be set to False and, as mentioned above, only then will it be a leaf.
- For a tensor to have tensor.grad populated, it must have requires_grad set to True and it must be a leaf (see the sketch after this list).
- You can't change requires_grad for non-leaf tensors. It is set automatically according to the tensors used in the operation that created them.
- PyTorch doesn't allow in-place updates on leaf tensors whose requires_grad is set to True, as this causes trouble in the backward pass.
- What does it mean that Autograd will not populate the gradient for a tensor?
  - Here's what the backward graph does while calculating and propagating the gradient:
    - If the tensor has requires_grad set to False, then it is not part of the backward graph, and nothing is done for it.
    - If the tensor has requires_grad set to True and is_leaf set to False, this means it is an intermediate tensor that came out of an operation. In this case, Autograd doesn't populate the grad attribute for this tensor; it just propagates the gradient to the operation that generated it (to its grad_fn).
    - Lastly, if the tensor is a leaf and has requires_grad set to True, then Autograd calculates the accumulated gradient value and puts it in the tensor's grad field.
  - There's an amazing video explaining this here.
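A small sketch to make these rules concrete (a made-up example, not from the original post):

```python
import torch

x = torch.tensor([1.0, 2.0])                      # requires_grad=False -> leaf
w = torch.tensor([3.0, 4.0], requires_grad=True)  # created by the user -> leaf
y = (w * x).sum()                                 # result of an op on a tensor
                                                  # that requires grad -> non-leaf

print(x.is_leaf, w.is_leaf, y.is_leaf)  # True True False
print(y.requires_grad)                  # True (inherited from w)
print(y.grad_fn)                        # the op that created y (SumBackward0)

y.backward()
print(w.grad)  # populated: tensor([1., 2.]) -- leaf with requires_grad=True
print(x.grad)  # None -- requires_grad is False
print(y.grad)  # None -- non-leaf, the gradient was only propagated through it
```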
Now for the Solutions:
There are two solutions for this:
- Call a.retain_grad(), which tells PyTorch to keep the grad for this tensor anyway.
  - Note: this solution is not fast.
- Remove requires_grad from the initialization of the tensor, set it explicitly in a separate step, and wrap the weight update under torch.no_grad (I will explain why we need this last step below).
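A minimal sketch of what the second solution could look like (again with made-up values):

```python
import torch

# Initialize without requires_grad, then turn it on in a separate step,
# so `a` stays a leaf tensor.
a = torch.rand(3) * 0.1
a.requires_grad_()

loss = a.sum()            # illustrative loss
loss.backward()
print(a.is_leaf, a.grad)  # True tensor([1., 1., 1.])

# Wrap the (in-place) weight update in torch.no_grad().
lr = 0.01
with torch.no_grad():
    a -= lr * a.grad
a.grad.zero_()
```

The first solution would instead keep the original code and call a.retain_grad() on the intermediate tensor before calling backward(), so that a.grad gets populated even though a is not a leaf.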
Normal training
Here's another script, training a network with just one neuron. Our normal training loop would look something like this:
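A sketch of what such a naive loop might look like (one neuron, made-up data; the update line at the end is the important part):

```python
import torch

# Toy data: learn y = 2 * x with a single weight (one "neuron", no bias).
x = torch.tensor([[1.0], [2.0], [3.0]])
y = 2.0 * x

w = torch.randn(1, requires_grad=True)
lr = 0.1

for epoch in range(10):
    y_hat = x * w
    loss = ((y_hat - y) ** 2).mean()
    loss.backward()

    # This update rebinds `w` to a new, intermediate (non-leaf) tensor,
    # so Autograd will not populate .grad for it any more.
    w = w - lr * w.grad
    w.grad.zero_()  # fails right away: the new w's .grad is None
```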
This will not work!!
As we said, the weight update line will produce an intermediate tensor, whose grad will not be populated.
That's why we all wrap our weight update steps in a torch.no_grad() context and do the calculation in-place, which tells PyTorch not to keep track of these operations in the gradient calculations, and thus the weights tensor remains a leaf tensor.
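A corrected version of the same toy loop could look like this (reusing x, y, and lr from the sketch above):

```python
w = torch.randn(1, requires_grad=True)  # a fresh leaf weight

for epoch in range(10):
    y_hat = x * w
    loss = ((y_hat - y) ** 2).mean()
    loss.backward()

    with torch.no_grad():  # don't track the update in the graph
        w -= lr * w.grad   # in-place: w stays the same leaf tensor
        w.grad.zero_()     # reset the accumulated gradient
```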
- So we use torch.no_grad() for two main reasons:
  - We don't need to include the weight update operation in the gradient graph.
    - Although we zero our gradient before every backward pass, this weight update step would still add a new branch to the gradient calculations for the weights tensor, and this would mess with the calculations.
  - We need to be able to do the update in-place, so that our tensor remains a leaf and Autograd keeps populating its gradient, as the sketch below shows.
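A quick sketch of what goes wrong without torch.no_grad() (a made-up example, with the same toy data shape as above):

```python
import torch

x = torch.tensor([[1.0], [2.0], [3.0]])
y = 2.0 * x
lr = 0.1

w = torch.randn(1, requires_grad=True)
loss = ((x * w - y) ** 2).mean()
loss.backward()

# Out-of-place (w = w - lr * w.grad) would silently turn w into a non-leaf,
# as in the broken loop above. The in-place version outside no_grad() is
# rejected outright: PyTorch raises a RuntimeError because a leaf tensor
# that requires grad is being used in an in-place operation.
w -= lr * w.grad  # RuntimeError
```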
Conclusion
In the end, we usually use pre-built optimizers for the training step, or even a trainer, which makes things way easier.
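For example, the manual update above could be handled by a built-in optimizer such as torch.optim.SGD (again just a sketch of the same toy problem):

```python
import torch

x = torch.tensor([[1.0], [2.0], [3.0]])
y = 2.0 * x

w = torch.randn(1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for epoch in range(10):
    optimizer.zero_grad()             # zero the accumulated gradients
    loss = ((x * w - y) ** 2).mean()
    loss.backward()
    optimizer.step()                  # the optimizer takes care of the
                                      # no_grad / in-place update details
```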
This was just an attempt to go back to the basics and understand the roots of the problem.