Confusion regarding activation and feature presence correlation

Say I have a classifier model that uses ReLU activations. The trend throughout the layers is that a higher activation value corresponds to a stronger presence of the feature the neuron has learned to look for. Why is that?

  1. Is it possible for the model to associate a lower activation with a stronger presence, since the parameters can self-tune?
  2. What if we use a sigmoid activation function throughout the model? Sigmoid outputs lie in (0, 1), so they are never negative; could the model instead associate a low activation with a stronger (or weaker) presence?
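Before answering, it helps to pin down what values these activations can actually take. A minimal NumPy sketch (hand-rolled `relu`/`sigmoid` helpers, not from any library) confirms that ReLU outputs are always ≥ 0 and sigmoid outputs stay strictly inside (0, 1):

```python
import numpy as np

def relu(z):
    # ReLU clamps negatives to zero, so outputs are always >= 0
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 1001)
print(relu(z).min(), relu(z).max())                 # 0.0 10.0
print(sigmoid(z).min() > 0, sigmoid(z).max() < 1)   # True True
```

So for both activations, "negative activation" is not something the forward pass can ever produce; the question is really whether *low* activation could encode strong presence.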

It took me a while to come up with this explanation:

Why the trend "higher output activation = stronger feature presence" holds:

  • When the model is initialized, weights are set to small random values. At this point there is no trend; everything is random. In the first forward pass, all neurons are roughly equally likely to activate because their weights are randomly initialized. However, due to the stochastic nature of initialization and the underlying data distribution, some neurons will coincidentally produce outputs that align better with the target labels.
  • During backpropagation, weights are updated based on how well the model's predictions match the actual labels. Each weight w follows the gradient descent rule w := w - α * dJ/dw, where dJ/dw is computed by the chain rule from the output layer back to the layer containing w. The gradient tells us how much the cost will change if w is increased slightly: a positive gradient means increasing w would increase the cost, so the update decreases w; a negative gradient means increasing w would decrease the cost, so the update increases w.
  • If a neuron's output strongly contributes to correct predictions, the gradients of its incoming weights will generally be negative (increasing those weights decreases the cost further), so those weights are increased in subsequent updates. The updated weights feed the next forward pass. If the update improved the model's performance (i.e., lowered the cost), the same neurons are more likely to activate again, and their weights are more likely to keep growing in future backpropagations, reinforcing this iterative cycle. Over many iterations the trend emerges: neurons that consistently help make correct predictions see their weights grow larger.
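The sign convention of the update rule w := w - α * dJ/dw can be sketched with two hypothetical weights and made-up gradient values:

```python
alpha = 0.1  # learning rate

# Positive gradient: the cost rises if w increases, so the update shrinks w.
w_pos = 0.5
grad_pos = 2.0
w_pos = w_pos - alpha * grad_pos   # 0.5 - 0.1 * 2.0 = 0.3

# Negative gradient: the cost falls if w increases, so the update grows w.
w_neg = 0.5
grad_neg = -2.0
w_neg = w_neg - alpha * grad_neg   # 0.5 + 0.1 * 2.0 = 0.7

print(w_pos, w_neg)  # 0.3 0.7
```

The weight with the helpful (negative) gradient grows, which is exactly the reinforcement loop described above.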
Conversely, neurons that contribute poorly (whose output doesn't help, or even hinders, the model in minimizing the loss) will see little to no change; with ReLU in particular, they can go "dead" and always output zero ⇒ zero gradient, so no update. In the beginning, all neurons are treated equally. Then, as different costs are computed, each neuron and its weights are updated differently, shrinking or growing. Contributions can be mixed: in a scenario where some weights contribute negatively but the overall prediction is nearly correct, the outcome depends on how the cost and gradients work out, and some of those weights may even grow if they help offset other, more dominant contributions pushing the prediction in the wrong direction. But each neuron feeds many neurons in the next layer, so its impact, and the accumulated gradient, is distributed, and there is some randomness involved. Growing weights produce larger and larger dZ/dA terms (there are many techniques to combat overgrowth), so they keep making more impact and receive stronger updates (as long as ŷ keeps landing near the right answer), while shrinking weights produce smaller and smaller dZ/dA terms (if ŷ keeps being far off), so eventually they make little impact on the prediction and effectively stop receiving updates. There can be some back and forth. (Rule of thumb: if a certain feature is important for correctly predicting the output label, then during backpropagation the weights corresponding to that feature will grow larger; if the feature isn't important, the weights will either shrink or stay small.) Now the weights that respond to important input activations have grown large, while the unimportant ones stay small.
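As an aside, the "dead ReLU" case mentioned above can be sketched with a single neuron and made-up numbers: once the pre-activation z is negative, the local gradient through ReLU is zero, so the incoming weight receives no update no matter what the upstream loss gradient is.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0

w, x, b = 0.4, 1.0, -2.0    # hypothetical weight, input, bias
z = w * x + b               # z = -1.6, a negative pre-activation
upstream = 5.0              # whatever dJ/d(output) happens to be

# Chain rule: dJ/dw = dJ/d(out) * d(out)/dz * dz/dw
dJ_dw = upstream * relu_grad(z) * x
print(relu(z), dJ_dw)       # 0.0 0.0 -> the neuron outputs 0 and w never updates
```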
If the feature is detected, the output activation gets large (large weights × big input values), scaled by how confident the input activations are; otherwise the products are (large weights × small values), (small weights × small values), or (… × 0), all of which yield a small or zero activation. And this explains the trend throughout the layers: a higher activation value corresponds to a stronger presence of the feature the neuron has learned to look for.
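The final claim can be sketched with hypothetical numbers: a neuron whose learned weights are large exactly on the inputs that encode its feature produces a large ReLU activation when those inputs fire strongly, and a small one otherwise.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical learned weights: large on the two "feature" inputs, tiny elsewhere
w = np.array([3.0, 2.5, 0.05, 0.05])

feature_present = np.array([0.9, 0.8, 0.1, 0.2])   # feature inputs fire strongly
feature_absent  = np.array([0.05, 0.1, 0.9, 0.7])  # feature inputs near zero

print(relu(w @ feature_present))  # large (~4.7): big weights x big numbers
print(relu(w @ feature_absent))   # small (~0.5): big weights x small numbers
```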