Attention role, mechanics and backprop


When I do backprop, larger constants (which correlate with larger activations) along the functions in the chain rule (if the functions are linear) result in larger adjustments. I say this simply because the computed derivatives are going to be larger.

When I apply attention, am I asking the attention calculator (the small additional neural network, which takes inputs s and a) to make faster adjustments to the attention focus via backprop (QUESTION A)? Or is the only purpose of the attention idea to make the forward pass more efficient (QUESTION B)? Does it work by silencing the non-useful inputs as if they were noise, or are the mechanics more complex (QUESTION C)?

I understand that attention focuses the operation of the NN on certain important inputs, but I am asking about some of the details. (I hope I have explained myself well enough; sorry, I am not a native English speaker, as you can infer!)

Thanks. Felipe.

I think these are different topics.

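Backpropagation is a method of computing the gradients of the cost with respect to the parameters, so that gradient descent can search for the solution with the minimum cost. It applies (behind the scenes) to every neural network trained by gradient descent, attention models included.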
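To make the gradient computation concrete, here is a minimal sketch of the chain rule on a single linear unit. It assumes a squared-error loss, and every name in it (`linear_grad`, `w`, `x`, `target`) is hypothetical and chosen only for illustration. It shows that, for the same output error, a larger input activation gives a larger gradient on the weight.

```python
# Minimal sketch: one linear unit y = w * x with loss L = 0.5 * (y - target)**2.
# All names here are hypothetical and used only for illustration.

def linear_grad(w, x, target):
    """Return dL/dw computed with the chain rule."""
    y = w * x              # forward pass
    dL_dy = y - target     # derivative of the loss w.r.t. the output
    dy_dw = x              # derivative of the output w.r.t. the weight
    return dL_dy * dy_dw   # chain rule: dL/dw = dL/dy * dy/dw

# Both cases produce the same output (y = 1.0) and the same error (dL/dy = 1.0);
# only the input activation x differs.
small = linear_grad(w=1.0, x=1.0, target=0.0)
large = linear_grad(w=0.25, x=4.0, target=0.0)
print(small, large)
```

The weight gradient is the output error multiplied by the input activation, so it grows in proportion to `x`. Whether that makes learning "faster" in practice also depends on the learning rate and the optimizer, which this sketch leaves out.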

Attention is a technique for characterizing a sequence of examples, so that the relationship between them can be learned.
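As a rough illustration of how attention weighs the elements of a sequence, here is a minimal sketch of softmax attention over scalar values. The scores are hard-coded stand-ins for the output of the small scoring network mentioned in the question, the values are scalars for simplicity, and all names (`softmax`, `attend`, `attention_grads`) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Return the attention weights and context = sum_i alpha_i * values[i]."""
    alphas = softmax(scores)
    context = sum(a * v for a, v in zip(alphas, values))
    return alphas, context

def attention_grads(alphas, values, context):
    """Gradients of the context w.r.t. the values and w.r.t. the scores."""
    d_values = list(alphas)  # d(context)/d(values[i]) = alpha_i
    d_scores = [a * (v - context) for a, v in zip(alphas, values)]
    return d_values, d_scores

# Hypothetical scores: the first input is judged far more relevant.
scores = [4.0, 0.0, 0.0]
values = [2.0, 5.0, -3.0]

alphas, context = attend(scores, values)
d_values, d_scores = attention_grads(alphas, values, context)
print(alphas, context)
print(d_values, d_scores)
```

In this sketch an input with a small weight contributes little to the forward pass and also receives a small gradient on its value, since that gradient equals its weight. The scores receive gradients too, so the weighting itself is learned by backprop rather than fixed in advance.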
