Firstly, I would like to thank you for the brilliantly prepared lab (Coffee Roasting in TensorFlow).
Secondly, I want to describe the distinction I see between the algorithms we learnt in week 1 and what we have learnt so far in week 2.
In week 1 we had to come up with an appropriate form of the function (e.g. ax + bx^2 + cy) by ourselves. Then we used gradient descent to find suitable values for the parameters.
As a result, our trained model would enclose the data with boundaries similar to this:
The straight lines here represent the boundaries, which means we need just three sigmoid functions. If a coffee is not good enough, then at least one sigmoid function outputs a high value and the coffee is rejected in the final conclusion. I hope this gives you the intuition for why exactly 3 neurons are chosen in the hidden layer.
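To make this concrete, here is a tiny numpy sketch of that intuition. The weights are hand-picked purely for illustration (they are not the values the trained network in the lab actually learns):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical hand-picked weights, purely for illustration.
# x = [temperature, duration], both normalized to roughly 0..1.
W1 = np.array([[-12.0,  -1.0],   # unit 1 fires when the temperature is too low
               [ -1.0, -12.0],   # unit 2 fires when the duration is too short
               [ 10.0,  10.0]])  # unit 3 fires when both are too high together
b1 = np.array([7.0, 6.0, -15.0])

# Output unit: each firing "bad roast" detector drags the result towards 0.
W2 = np.array([-8.0, -8.0, -8.0])
b2 = 5.0

def predict(x):
    a1 = sigmoid(W1 @ x + b1)   # three sigmoid units, one per straight boundary
    return sigmoid(W2 @ a1 + b2)

print(predict(np.array([0.7, 0.6])))  # inside all boundaries  -> ~0.9, good roast
print(predict(np.array([0.1, 0.6])))  # temperature too low    -> ~0.0, rejected
print(predict(np.array([0.9, 0.9])))  # too hot for too long   -> ~0.1, rejected
```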
However, I am wondering how each neuron’s sigmoid function determines its own cost function? It must be a brilliant technique.
To conclude,
While week 1 presented us with a way to directly fit the function parameters based on a cost function, week 2 gives us a totally new glimpse into how similar results can be achieved using another method.
With logistic regression, you have to allow for a more complex boundary by creating more complex features by hand. The learning method then figures out the best weights to use to create an optimum fit.
With an NN, the non-linear function in the hidden layer allows it to create a complex hypothesis automatically through training. You do not have to create the new features yourself.
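To illustrate the contrast, here are two small sketches with made-up toy data (not the lab's dataset). First, the logistic-regression route, where the non-linearity has to come from hand-crafted polynomial features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Toy data: x = (temperature, duration), y = 1 for a good roast. Synthetic numbers.
X = np.array([[200, 13.5], [230, 14.0], [255, 12.5], [180, 15.0], [285, 16.0], [220, 11.0]])
y = np.array([0, 1, 1, 0, 0, 0])

# Hand-crafted non-linear features: x1, x2, x1^2, x1*x2, x2^2
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Gradient-based optimization only chooses weights for the features WE built.
clf = LogisticRegression(max_iter=10000).fit(X_poly, y)
```

And the neural-network route, where only the raw inputs go in and the hidden layer's non-linear units form the boundaries during training (layer sizes as in the lab; everything else here is simplified):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                      # just the raw (temperature, duration)
    tf.keras.layers.Dense(3, activation='sigmoid'),  # hidden units learn their own boundaries
    tf.keras.layers.Dense(1, activation='sigmoid'),  # probability of a good roast
])
```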
To answer your question: “However, I am wondering how each neuron’s sigmoid function determines its own cost function? It must be a brilliant technique.”
Yes, there is indeed a brilliant technique, and it is called backpropagation.
Going back to the basics: to update a weight parameter, what we need is not the cost itself, but rather \frac{\partial J}{\partial w}.
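Just to tie it back to the week 1 material, that partial derivative is what drives each gradient descent update (with learning rate \alpha): w_{i,j} := w_{i,j} - \alpha \frac{\partial J}{\partial w_{i,j}} and b := b - \alpha \frac{\partial J}{\partial b}.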
There is only a single cost value J for the overall network, which we are trying to minimize. But we can still find the derivative of that cost with respect to every single parameter of the network: \frac{\partial J}{\partial w_{i,j}} and \frac{\partial J}{\partial b} for every layer. In this manner, we do not just update the final layer's (w, b) parameters; we update all the (w, b) parameters in all the layers, all the way back to the first layer, such that the overall cost J is minimized. And this adds to the magic of Neural Networks!
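If it helps, here is a minimal TensorFlow sketch of exactly that point (a toy two-layer model with made-up inputs, not the lab's code): one backward pass gives \frac{\partial J}{\partial w} and \frac{\partial J}{\partial b} for every layer at once.

```python
import tensorflow as tf

# Toy two-layer model, mirroring the lab's layer sizes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(3, activation='sigmoid'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

X = tf.constant([[0.7, 0.6], [0.1, 0.9]])   # made-up normalized inputs
y = tf.constant([[1.0], [0.0]])

with tf.GradientTape() as tape:
    J = loss_fn(y, model(X))                 # one scalar cost for the whole network
grads = tape.gradient(J, model.trainable_variables)              # dJ/dW, dJ/db for every layer
optimizer.apply_gradients(zip(grads, model.trainable_variables)) # update them all at once
```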
If you want to know more about this, you can take a look here
Wow, backpropagation is a wild concept! So I suppose we again randomly set all the parameters in the network? And is there an algorithm that determines the right number of neurons in a layer and the number of layers that would be perfect for the case, or is it something that we should determine on our own?
The random initialization of the parameters happens only once, to get the process started. And then we forward propagate through each layer to finally arrive at the cost at the output layer - from there on, backpropagation kicks in. This cyclic process of Forward Prop → Backward Prop → Forward Prop → Backward Prop… continues until the network eventually converges.
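A minimal Keras sketch of that cycle (toy architecture and made-up data, not the lab's exact code): the parameters are randomly initialized once when the layers are built, and every epoch of fit then runs forward prop followed by backward prop.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([               # random initialization happens here, once
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(3, activation='sigmoid'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))

X = np.array([[0.7, 0.6], [0.1, 0.6], [0.9, 0.9], [0.6, 0.5]])   # made-up normalized data
y = np.array([[1.0], [0.0], [0.0], [1.0]])

# Each epoch: forward prop to get the cost J, then backprop to update every (w, b).
history = model.fit(X, y, epochs=500, verbose=0)
print(history.history['loss'][0], history.history['loss'][-1])   # J shrinking as the cycle repeats
```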
The number of units in a layer and the number of layers are still something that we control and decide - it doesn’t happen automatically.
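One common way to decide (hypothetical sketch, not from the lab) is simply to train a few candidate architectures and compare them on a validation set:

```python
import tensorflow as tf

def build(hidden_units):
    # The number of units (and the number of Dense layers) is OUR design choice.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(2,)),
        tf.keras.layers.Dense(hidden_units, activation='sigmoid'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])

for units in (2, 3, 8):            # candidate hidden-layer sizes we pick ourselves
    model = build(units)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=200, verbose=0)
    # X_train, y_train, X_val, y_val are placeholders; compare validation loss to choose.
```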