What exactly is activation?
In the videos it was mentioned that the activation is the probability output by a neuron, given certain input features. So what is this probability trying to predict? In the context of the example given, is it trying to predict whether the T-shirt will be a top seller or not?
The activation is the output of a node. In the next few videos, you will become familiar with the different types of activation function.
Actually, I am confused about whether the output of a neuron predicts the probability of being a top seller (considering certain features, e.g. price and shipping cost).
Is that probability what is referred to as affordability?
Or is it something different?
Activation functions are one of the most important parts of a neural network. See this post for a fuller explanation, which I would rather not repeat here.
In your specific case, I am assuming you are trying to predict a binary output (being a best seller, which would be 1, or not being a best seller, which would be 0). You can read that output as the probability of being a best seller (from 0% to 100%). For that purpose the usual choice of activation function is a sigmoid, which constrains the output to the range 0 to 1.
If you had a multiclass classification problem instead, you would use a softmax function, which assigns each class a probability, with the probabilities summing to 1.
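To make the difference between the two concrete, here is a minimal NumPy sketch. The input values (0.8 and the three scores) are made up purely for illustration and do not come from the course example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Binary case: one output neuron, one probability (e.g. "is a top seller").
p_top_seller = sigmoid(0.8)          # hypothetical pre-activation value

# Multiclass case: one score per class, one probability per class.
class_probs = softmax(np.array([2.0, 1.0, 0.1]))   # hypothetical scores

print(p_top_seller)
print(class_probs, class_probs.sum())
```

The sigmoid gives a single number between 0 and 1, while the softmax gives one number per class and those numbers always add up to 1.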
I think we are discussing the following slide. My answer would be the output of the neuron in the output layer predicts the probability of being a top-seller. Nothing more, and nothing less.
First, generally speaking, we can’t interpret the outcome of just any neuron. So we can’t just give a human understandable name to say, “OK, this neuron means exactly this or that”.
Second, there is a time when we can interpret the outcome of a neuron to carry a certain property, and that is when we explicitly constrain the neuron to carry that property.
For the example in the slide, when we train that neural network, (1) we use sigmoid as the activation for the neuron in the output layer AND the log loss as the loss function; and (2) our label is whether the product is a top seller. Together, (1) and (2) give the output of the output-layer neuron the meaning of a probability, and therefore we can interpret that single, particular neuron as the “probability of being a top seller”.
However, we never constrain the intermediate (hidden) layers. That is, we don’t have label data for affordability, and we never constrain any neuron’s output to be consistent with such labels. Without that constraint, we can’t interpret the hidden layers’ neurons as “affordability” or anything else. They are just neurons that are very helpful in providing inputs for the output-layer neuron to predict the probability of being a top seller. I believe this is also what Professor Andrew Ng was trying to demonstrate, but for the sake of understanding and discussion, giving the hidden-layer neurons names can be very helpful, and that’s why you see affordability, awareness and so on. So naming the hidden layers’ neurons is just a way to demonstrate the concept of a neural network. We can’t easily do this in practice without (1) the constraint described above, applied during training, or (2) an in-depth analysis of the neurons’ outputs after training.
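As a rough illustration of why the log loss pushes the output neuron towards behaving like a probability, here is a small sketch. The labels and predicted values are invented for this example; `log_loss` is written by hand here rather than taken from any library.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy (log loss), averaged over the examples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

# Hypothetical labels: 1 = top seller, 0 = not a top seller.
labels = np.array([1.0, 0.0, 1.0, 0.0])

# Two hypothetical sets of sigmoid outputs from the output neuron.
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_good = log_loss(labels, mostly_right)
loss_bad = log_loss(labels, mostly_wrong)
print(loss_good, loss_bad)
```

The loss is small only when the output is close to 1 for top sellers and close to 0 otherwise, which is what ties that one neuron to the label. Nothing comparable is applied to the hidden neurons.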
So is the concept of hidden layers an abstract one?
I mean, isn’t there any way to accurately determine the number of hidden layers and the number of neurons for our model? And how do these neurons pick the correct features automatically? How would they know that?
Is there any mathematics behind this?
Generally speaking, no, there is no such way.
Exception? Yes. If you go to the C2 W1 Lab “Coffee Roasting in Tensorflow”, you will find plots that try to visualize what each of the 3 trained neurons is doing.
On the other hand, you can also draw such a plot before training, just to show how the sample labels are distributed in the input feature space (Temperature and Duration). You will then find that the red crosses form a triangle, and given that insight, I would choose exactly 3 neurons, because we need only 3 linear boundaries to form a triangle.
However, in practice, we can’t always visualize our dataset like this, and that is one reason why, generally speaking, we can’t accurately choose the number of layers and neurons per layer.
So what do we do? That is a long story and is not covered in this MLS, but it is covered in Course 2 of the Deep Learning Specialization, where hyperparameter tuning is discussed. In short, the basic approach is to try different options, evaluate each trial, and pick the best one.
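Here is a toy sketch of that “try, evaluate, pick” loop, assuming a one-hidden-layer network written directly in NumPy. Everything in it is an assumption made for illustration: the synthetic data (a triangle-like region loosely inspired by the coffee roasting plot), the helper `train_and_validate`, the candidate sizes and the learning rate. It is not the procedure from the course, and real tuning would typically use a framework and more careful validation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train_and_validate(n_hidden, X_tr, y_tr, X_val, y_val,
                       steps=1000, lr=0.5, seed=0):
    """Train a 1-hidden-layer sigmoid network; return its validation log loss."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, size=(X_tr.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 1.0, size=n_hidden)
    b2 = 0.0
    n = len(y_tr)
    for _ in range(steps):
        # Forward pass.
        A1 = sigmoid(X_tr @ W1 + b1)      # hidden activations, shape (n, n_hidden)
        p = sigmoid(A1 @ W2 + b2)         # output probabilities, shape (n,)
        # Backward pass for the mean log loss.
        dz2 = (p - y_tr) / n
        dW2 = A1.T @ dz2
        db2 = dz2.sum()
        dz1 = np.outer(dz2, W2) * A1 * (1 - A1)
        dW1 = X_tr.T @ dz1
        db1 = dz1.sum(axis=0)
        # Gradient descent update.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    val_p = sigmoid(sigmoid(X_val @ W1 + b1) @ W2 + b2)
    return log_loss(y_val, val_p)

# Synthetic data: label 1 inside a triangle-like region of the unit square.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(300, 2))
y = ((X[:, 0] > 0.2) & (X[:, 1] > 0.2) & (X[:, 0] + X[:, 1] < 1.2)).astype(float)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

# Try several hidden-layer sizes and keep the one with the lowest validation loss.
candidates = [1, 3, 8]
val_losses = {h: train_and_validate(h, X_tr, y_tr, X_val, y_val) for h in candidates}
best_hidden = min(val_losses, key=val_losses.get)
print(val_losses, best_hidden)
```

Which size wins depends on the data, the random seed and the training settings, which is exactly why this is done by experiment rather than by formula.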
Instead of saying “pick”, I would say that the reason a trained NN knows how to transform the input features into the right abstract neuron outputs is gradient descent.
Our gradient descent algorithm optimizes our neurons by minimizing the log loss. So whatever final state you see in the neurons after training is merely a product of the minimized cost, given the training dataset. You can say “if the neurons look like that, then the cost is minimized, but if they look like something else, the cost becomes worse”.
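A minimal sketch of that idea for a single sigmoid neuron is below. The four data points (scaled price and shipping cost) and the labels are made up, as are the learning rate and the number of steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Mean log loss of a single sigmoid neuron."""
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical data: columns are scaled price and shipping cost;
# label 1 = top seller, 0 = not a top seller.
X = np.array([[0.2, 0.1],
              [0.3, 0.2],
              [0.8, 0.9],
              [0.9, 0.7]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(2)
b = 0.0
lr = 0.5
costs = []

for step in range(200):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cost w.r.t. w
    grad_b = np.mean(p - y)           # gradient of the cost w.r.t. b
    w -= lr * grad_w                  # step in the direction that lowers the cost
    b -= lr * grad_b
    costs.append(cost(w, b, X, y))

print(costs[0], costs[-1], w, b)
```

Nobody tells the neuron what its weights should mean. The update rule only ever asks which small change lowers the cost, and the final weights are whatever that process ends up with.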
Yes, it’s the maths that drives the change of the weights in a neuron, and you learnt that in Course 1 when we talked about gradient descent and gradients.
In summary, the maths drives the weights in the direction that minimizes the cost, but it doesn’t predict what the weights will finally look like, and it doesn’t predict how many layers and neurons are needed. No matter how many layers and neurons you give the neural network, the maths works in the same way: it tries to change the weights of all available neurons in order to minimize the cost. This is very mechanical.