I have not been able to develop an intuition for the explanation in this lab; at this point I don't feel I understand neural networks fully, for that matter. In this example, layer 1 has 2 units, and the first unit separates classes 0 and 1, but how does it go about doing that?
If we just talk about the first unit, with z = wx + b and ReLU applied to the input data given in the example, does that neuron try to find values of w and b that fit a decision boundary, or does it find the values of w and b that minimize the cost? Not being able to visualize this is making it difficult to understand.
And how is unit 0 able to target only classes 0 and 1, and not classes 1 and 2? In other words, why does unit 0 target only those two classes and not the others?
NNs are trained so that the output layer minimizes the cost. Said another way, only the output layer is aware of needing to make any decisions between true and false.
The hidden layer learns whatever is needed to help minimize the cost.
Mathematically, it works fine. Intuitively, it isn’t intuitive at all.
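To make that concrete, here is a minimal sketch of the kind of model this lab describes (a 2-unit ReLU hidden layer feeding a 4-unit output layer); the exact layer names, optimizer, and settings below are my assumptions, not the lab's code:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Hidden layer: 2 units with ReLU. Output layer: one unit per class (4 classes here).
model = tf.keras.Sequential([
    Dense(2, activation='relu', name='L1'),
    Dense(4, activation='linear', name='L2'),   # logits, one per class
])

# The cost is defined only on the output layer's values; the hidden layer is
# never compared against the labels directly.
model.compile(
    loss=SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.01),
)
```

Nothing in this definition assigns classes 0 and 1 to a particular hidden unit; the only thing training optimizes is the loss at the output layer.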
In that specific example, what does it do mathematically with z = wx + b and the ReLU function? What is the cost function there? I'm not sure we discussed a cost function for ReLU in the lectures.
I'm just trying to understand: how is unit 0 able to target only classes 0 and 1, and not classes 1 and 2? In other words, why does unit 0 target only those two classes and not the others?
The cost function is only computed at the output layer.
The gradients of all of the weight and bias values (for all the hidden and output layers) are computed using calculus to take partial derivatives, starting from the output layer and “backpropagating” the errors into the hidden layers. We start from the output because that’s the only layer where we have labels for comparison.
The gradients are computed automatically by the TensorFlow layer definitions.
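Roughly, here is what one training step does under the hood, written out with tf.GradientTape instead of model.fit just to show that a single cost at the output yields gradients for every layer's weights (the data shapes below are made up):

```python
import numpy as np
import tensorflow as tf

# Toy batch: 8 samples with 2 features, integer labels 0..3 (invented shapes).
X = tf.constant(np.random.randn(8, 2), dtype=tf.float32)
y = tf.constant(np.random.randint(0, 4, size=(8,)))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation='relu'),
    tf.keras.layers.Dense(4, activation='linear'),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

with tf.GradientTape() as tape:
    logits = model(X)            # forward pass through both layers
    loss = loss_fn(y, logits)    # cost computed only from the output layer

# Backpropagation: partial derivatives of that single output loss with respect
# to every weight and bias, hidden layer included.
grads = tape.gradient(loss, model.trainable_variables)
for var, g in zip(model.trainable_variables, grads):
    print(var.name, g.shape)
```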
It isn’t. Why in particular do you believe this is how it works?
For each label, there is a corresponding unit in the output layer. That’s how an NN for classification is defined, and the data set is organized to support that. The gradients are computed for each network path from every output back into the hidden layers.
It is described in the explanation: “Unit 0 has separated classes 0 and 1 from classes 2 and 3.” So my question is why Unit 0 separates classes 0 and 1 from classes 2 and 3 while Unit 1 separates classes 0 and 2 from classes 1 and 3, and not the other way around. I am trying to understand how the line in the graph above is plotted by Unit 0, and then by Unit 1.
At this point in the course, gradients in NNs have not yet been covered, so my knowledge is based only on what has been taught so far, and my assumptions follow from that.
The form \vec{w} \cdot \vec{x} + b dictates that it is going to be a linear partition with respect to the dimensions in \vec{x}.
Every neuron computes \vec{w} \cdot \vec{x} + b, so each of them represents one linear partition with respect to the dimensions in its \vec{x}. Note that \vec{x} means the input to that neuron: the input is the same for all neurons in the same layer, but differs for neurons in different layers.
Since the partitions are linear, the best they can do is separate one or two classes from the rest; at worst they might not separate any classes at all (e.g. there are no samples on the left-hand side of the partition and all samples are on the right-hand side).
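To make the “linear partition” concrete, here is a tiny sketch for one hidden unit with invented values of \vec{w} and b: the line \vec{w} \cdot \vec{x} + b = 0 splits the input plane, and ReLU outputs zero for every sample on one side and a positive value on the other:

```python
import numpy as np

# Invented weights for one hidden unit; the lab's trained values will differ.
w = np.array([1.0, 1.0])
b = -3.0

# A few 2-D samples (also invented).
X = np.array([[0.5, 0.5],   # w·x + b < 0  -> ReLU output 0
              [1.0, 1.5],   # w·x + b < 0  -> ReLU output 0
              [2.5, 2.0],   # w·x + b > 0  -> positive output
              [3.0, 3.5]])  # w·x + b > 0  -> positive output

z = X @ w + b
a = np.maximum(0, z)        # ReLU
print(np.column_stack([z, a]))
# The partition boundary is the line where w·x + b = 0,
# i.e. x1 + x2 = 3 for these particular numbers.
```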
The partitions are initially randomized, so at the start they are probably in neither the best nor the worst case.
As gradient descent runs (which I won’t go into in detail since you have not learned it yet), the weights w and b are incrementally updated. Equivalently, the partitions are slowly moved. When the weights are finally optimized, it means the partitions have been moved to places where they are able to separate some classes from the rest.
The neurons are not assigned to separate specific classes at the beginning; we do not know it before we start the model training process. How we initialize the weights (the partitions) and how the gradient descent process updates the weights (moves the partitions) decide which neuron will separate which classes from the rest.
The partitions you have shared here are the result of the training process.
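That is also how the lines in the lab's plot can be drawn: after training, read each hidden unit's learned \vec{w} and b back out and plot the line \vec{w} \cdot \vec{x} + b = 0. A sketch, assuming a trained Keras model like the one above with the hidden layer named 'L1':

```python
import numpy as np

# Assumes `model` has already been trained. For a 2-feature input and 2 hidden
# units, W1 has shape (2, 2): one column of weights per unit, one bias per unit.
W1, b1 = model.get_layer('L1').get_weights()

for unit in range(W1.shape[1]):
    w = W1[:, unit]
    b = b1[unit]
    # Points (x1, x2) on the boundary satisfy w[0]*x1 + w[1]*x2 + b = 0,
    # so x2 = -(w[0]*x1 + b) / w[1]  (assuming w[1] != 0).
    x1 = np.linspace(0, 4, 50)
    x2 = -(w[0] * x1 + b) / w[1]
    print(f"unit {unit}: w = {w}, b = {b:.3f}")
    # Plotting x1 against x2 reproduces the line shown for that unit.
```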
Thanks. Sorry, I thought you were referring to the units in the hidden layer.
At the output, each unit is trained to recognize one label, via the “one-vs-all” method. In one-vs-all, each output label is converted into a one-hot representation, where the output for that label is “true” and all other labels are set to “false”.
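For example, with four labels the one-hot targets look like this (a quick illustration; the lab may instead keep integer labels and use a sparse cross-entropy loss, which amounts to the same thing):

```python
import numpy as np

labels = np.array([0, 2, 3, 1])      # integer class labels
one_hot = np.eye(4)[labels]          # one row per sample, one column per class
print(one_hot)
# [[1. 0. 0. 0.]    label 0: "true" for output unit 0, "false" for the rest
#  [0. 0. 1. 0.]    label 2
#  [0. 0. 0. 1.]    label 3
#  [0. 1. 0. 0.]]   label 1
```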
Here by linear partition do you mean the decision boundary?
So when you talk about separating classes, is it about creating that decision boundary? In one of the lectures (at 5:57) Prof. Ng talked about decision boundaries with the function wx + b = 0 when z = 0 (x1 + x2 = 3), then assumed values of w and b and drew a decision boundary. Is that how the neuron tries to separate classes? Does it assume certain w and b, try to draw a decision boundary, and then iterate on that with backpropagation?
Human initializes the weights to some random values.
Human trains the weights using gradient descent algorithms that involve backprop.
Human tries to interpret what a neuron has done by visualizing the boundary, and so you see a graph like the one below.
Machine does not assume w or b; they are randomly initialized as instructed by the human.
Machine is instructed, through the gradient descent algorithm, to keep updating w and b throughout the training process.
Machine has no visual of the boundary.
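Putting those roles together in runnable form, here is a toy single-neuron example with invented data (not the lab's code): the weights start random, gradient descent nudges them on every iteration, and the boundary \vec{w} \cdot \vec{x} + b = 0 drifts into place purely as a side effect of minimizing the cost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 2-D data: two clusters labeled 0 and 1.
X = np.vstack([rng.normal([1.0, 1.0], 0.3, size=(50, 2)),
               rng.normal([3.0, 3.0], 0.3, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# "Human initializes the weights to some random values."
w = rng.normal(size=2)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# "Machine ... keeps updating w and b throughout the training process."
alpha = 0.1
for step in range(201):
    a = sigmoid(X @ w + b)
    dw = X.T @ (a - y) / len(y)     # gradient of the logistic cost w.r.t. w
    db = np.mean(a - y)             # gradient of the logistic cost w.r.t. b
    w -= alpha * dw
    b -= alpha * db
    if step % 50 == 0:
        # The boundary w·x + b = 0 has moved a little; the machine never
        # "sees" it, but a human can plot it from w and b at any point.
        print(step, w, b)
```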