Hi, I was just wondering about the explanation given in this lab. Specifically, when you plot the final multiclass decision boundaries, it looks like this:
This looks like a ‘star’ shape with rays coming out at different angles from a central point. It’s not two straight lines because the lines ‘bend’ at the intersection point.
However, in the explanation, each unit is shown to divide the points linearly:
What I’m confused about is that the two first-layer units don’t seem to partition the groups that well, e.g. some points lie on or very close to the boundary, yet the decision boundaries in the output layer partition very well. And secondly, the output decision boundary is not a superposition of the two hidden-layer linear decision boundaries. But maybe that’s because of the softmax.
My best guess for what’s going on is that the hidden layer doesn’t actually need to partition the classes; the two neurons just need to learn ‘good enough’ partitionings of the input data (different from each other), such that the output layer can then learn a linear combination of the two hidden-layer outputs that works well for the overall classification. Is that right?
In the explanation graphs (below), the hidden-layer decision boundaries are depicted as partitioning the input data (more) perfectly, but if I’ve understood correctly, it doesn’t really matter how good this partitioning is. Or is a good partitioning necessary for the output layer to work?
Sorry if this question is confusing; it just feels like something in my understanding of the graphs is inconsistent, and I’m trying to figure out what it is.
The lines are not straight and continuous because they’re drawn as segments, to reduce the number of boundary points that need to be computed.
They look weird at the middle intersection because of quantization in how they’re drawn.
I do not clearly understand the remainder of your question.
I’m not talking about the squiggliness/jagged shape of the lines in the charts; I’m asking whether they are theoretically meant to be straight or not.
For some more concrete questions:
Clearly the hidden layer neurons have linear decision boundaries. Does the output layer have piecewise linear decision boundaries? They look ‘almost linear’ in the first chart.
Is it true that the output layer can learn to classify accurately even if the hidden layers themselves don’t perfectly partition the input data (e.g. as either {class0, class1} or {class2, class3})? This is suggested by the fact that the later charts of hidden layer neurons have some points from the same class lying on both sides of the decision boundary in certain cases.
Yes - and you pointed out the cause: the first chart is linear because it plots units in the first layer, while the second chart is not linear because it plots units in the output layer. Different layers, different behaviors. A first-layer unit’s boundary is linear because there is no non-linear activation between it and the input features: its pre-activation is an affine function of the inputs, and a monotonic activation doesn’t bend its level sets. The output layer, by contrast, sits behind the hidden layer’s non-linear activations, so its boundaries are no longer linear in the input space.
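To make that concrete (the notation here is mine, not the lab’s): for a first-layer unit with weight vector $w$, bias $b$, and a strictly monotonic activation $\sigma$ (e.g. the sigmoid), a decision threshold $t$ gives the boundary

$$\sigma(w^\top x + b) = t \quad\Longleftrightarrow\quad w^\top x + b = \sigma^{-1}(t),$$

a straight line in the 2-D input plane (with ReLU, the unit’s kink sits on the line $w^\top x + b = 0$, also straight). An output unit instead thresholds a weighted sum of hidden activations,

$$\sum_j v_j\,\sigma(w_j^\top x + b_j) + c = t,$$

which is not linear in $x$. If the hidden activation is a ReLU, which the bend-at-the-crossing pattern in your star chart suggests, it is piecewise linear: each ray changes direction exactly where a hidden unit’s own boundary is crossed. (With a softmax output, the boundary between two classes is where their logits are equal, which is again linear in the hidden activations, so the same piecewise-linear picture holds.) So the answer to your first concrete question is yes, under a ReLU hidden layer.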
There is no law that dictates quantitatively how well the earlier layers have to learn for the output layer to learn well. In practice, though, the better the earlier layers distinguish the classes, the easier it is for the later layers. Put it this way: you can construct a model of 3 layers instead, leave the first hidden layer untrainable, and train only the other layers. You end up with a model whose first hidden layer learnt nothing (because it was forced to be untrainable). There is a chance that the model still performs reasonably well (but I am not guaranteeing that it performs as well or better). A minimal sketch of this setup follows.
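For instance, assuming a TensorFlow/Keras workflow (the layer sizes, names, and the 2-feature / 4-class shapes here are my own illustration, not necessarily the lab’s):

```python
import tensorflow as tf

# Sketch: 3 Dense layers, with the first hidden layer frozen at its
# random initialization so only the later layers learn.
inputs = tf.keras.Input(shape=(2,))                        # 2-D input features
frozen = tf.keras.layers.Dense(2, activation="relu",
                               trainable=False,            # freeze BEFORE compiling
                               name="frozen_hidden")(inputs)
hidden = tf.keras.layers.Dense(4, activation="relu")(frozen)
output = tf.keras.layers.Dense(4, activation="softmax")(hidden)

model = tf.keras.Model(inputs, output)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=100)   # X_train / y_train: your lab data
```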
As for how well such a model can actually perform, I will leave that for you to try out yourself. You might first attempt a model of 2 layers, all trainable, then move on to a model of 3 layers with the first one untrainable, then 6 layers with the first four untrainable, and then maybe 20 layers with the first 18 untrainable, to see the effect of bad early layers on overall model performance. A parameterized builder makes this easy; see the sketch below.
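Something like this (the helper name, layer width, and class count are my own placeholders, not from the lab):

```python
import tensorflow as tf

def build_partially_frozen_model(depth, n_frozen, units=8, n_classes=4):
    """Build `depth` Dense layers (hidden layers + softmax output), with
    the first `n_frozen` hidden layers frozen at their random init."""
    inputs = tf.keras.Input(shape=(2,))
    x = inputs
    for i in range(depth - 1):                      # hidden layers
        x = tf.keras.layers.Dense(units, activation="relu",
                                  trainable=(i >= n_frozen))(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# The progression suggested above:
# build_partially_frozen_model(2, 0)    # all layers trainable
# build_partially_frozen_model(3, 1)    # first hidden layer frozen
# build_partially_frozen_model(6, 4)    # first four frozen
# build_partially_frozen_model(20, 18)  # first eighteen frozen
```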
Another experiment you might do is with a model of 3 layers. Say you think training it for 100 epochs is just right for it to perform well; then you might do the following (a sketch of the loop appears after the list):
1. Set n = 90.
2. Construct a new model of 3 layers and train the full model for n epochs, then set the first layer to be untrainable and train the partial model for the remaining 100 - n epochs.
3. Record the model’s performance.
4. Set n = 80 and repeat steps 2 and 3.
You might iterate this process for n = 100, 90, 80, ..., 0 to find out how performance changes with the maturity of the first layer.
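A sketch of that loop, reusing build_partially_frozen_model from the earlier snippet; X, y, X_test, y_test stand in for whatever data the lab gives you:

```python
results = {}
for n in range(100, -1, -10):                       # n = 100, 90, ..., 0
    model = build_partially_frozen_model(depth=3, n_frozen=0)
    if n > 0:
        model.fit(X, y, epochs=n, verbose=0)        # phase 1: train the full model

    # Freeze the first Dense layer (model.layers[0] is the Input layer)
    # and recompile -- changing `trainable` only takes effect on compile.
    model.layers[1].trainable = False
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    if n < 100:
        model.fit(X, y, epochs=100 - n, verbose=0)  # phase 2: remaining budget

    results[n] = model.evaluate(X_test, y_test, verbose=0)[1]  # test accuracy

print(results)
```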