
Gotchas

  • Think about axis ordering, and about which tensors actually need to be saved for the backward pass. (See here.) A generic sketch of the saving pattern is given below the list.

  • Every C++ function defining the forward/backward passes of an autograd.Function should start by detaching every torch.Tensor it is given. First, there is no point tracking gradients: our implementations typically use a lot of in-place operations. Second, not detaching raises errors, as the autograd framework tries to point this fact out to us. A sketch of this pattern is also given below the list.

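As a generic illustration of the saving point, here is a minimal sketch using libtorch's custom autograd function API (`torch::autograd::Function`). The operation and the names (`ScaleOp`, `x`, `scale`) are hypothetical, not this project's actual code; the point is only that the forward pass saves exactly the tensors the backward pass will need.

```cpp
#include <torch/torch.h>

using torch::autograd::AutogradContext;
using torch::autograd::variable_list;

// Hypothetical op: y = x * scale. Only `scale` is needed to compute
// grad_x, so that is the only tensor saved for the backward pass.
struct ScaleOp : public torch::autograd::Function<ScaleOp> {
    static torch::Tensor forward(AutogradContext* ctx,
                                 torch::Tensor x,
                                 torch::Tensor scale) {
        ctx->save_for_backward({scale});  // save only what backward needs
        return x * scale;
    }

    static variable_list backward(AutogradContext* ctx,
                                  variable_list grad_outputs) {
        torch::Tensor scale = ctx->get_saved_variables()[0];
        torch::Tensor grad_x = grad_outputs[0] * scale;
        // No gradient is computed for `scale` in this sketch.
        return {grad_x, torch::Tensor()};
    }
};

// Usage: torch::Tensor y = ScaleOp::apply(x, scale);
```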
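And a minimal sketch of the detach-first pattern, written as plain C++ functions of the kind that might back a Python-side autograd.Function. Again, the operation and the names (`example_forward`, `example_backward`) are illustrative assumptions rather than this project's real functions.

```cpp
#include <torch/extension.h>

// Illustrative forward implementation: y = x * scale, computed in-place
// on a copy. Detaching up front means autograd does not try to record
// the in-place operations below.
torch::Tensor example_forward(torch::Tensor x, torch::Tensor scale) {
    x = x.detach();
    scale = scale.detach();

    torch::Tensor out = x.clone();
    out.mul_(scale);  // in-place, safe: everything here is detached
    return out;
}

// Illustrative backward implementation: grad_x = grad_out * scale.
torch::Tensor example_backward(torch::Tensor grad_out, torch::Tensor scale) {
    grad_out = grad_out.detach();
    scale = scale.detach();
    return grad_out * scale;
}
```

Because everything is detached, the in-place `mul_` is invisible to autograd; the Python-side autograd.Function is then solely responsible for connecting these results back into the graph.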