It seems to me that in 2016 people did (have to) play a lot more tricks with the backpropagation than today. Back then it was common to meddle with gradients in between the gradient propagation.
For example, Alex Graves's (great! with attention) 2013 paper "Sequence Generation with Recurrent Neural Networks" has this line:
One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the
LSTM layers (before the sigmoid and tanh functions are applied) to lie within
a predefined range.
with this footnote:
In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
That said, backpropagation seems important enough to me that I once did a specialized videocourse just about PyTorch (1.x) autograd.