The results and the figures in the ‘Multi-class classification’ video seem quite intuitive, since every element of Z^{[L]} goes through the same activation function, and Z^{[L]} is the result of a linear function (where Z^{[L]} = W^{[L]} \cdot A^{[L-1]} + b^{[L]}).
But how can this be proved formally? Can anyone help me?
The formula Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}, where all the inputs are matrices or vectors, is the definition of a linear (strictly speaking, affine) transformation. Of course softmax is not a linear function, so the full mapping, whether from the original inputs or just from the last layer through its activation, is not a linear function.
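As a quick numerical check (with made-up weights and inputs, just for illustration), the pre-activation step behaves like a linear map, but softmax does not even preserve scaling, so the composed mapping cannot be linear:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D array
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# Made-up weights and bias for a 3-class, 2-feature example
W = np.array([[1.0, 2.0],
              [3.0, -1.0],
              [0.5, 0.5]])
b = np.array([0.1, -0.2, 0.3])

x = np.array([1.0, 2.0])
z = W @ x + b

# Linearity would require softmax(2 * z) == 2 * softmax(z); it fails:
print(softmax(2 * z))
print(2 * softmax(z))
print(np.allclose(softmax(2 * z), 2 * softmax(z)))  # False
```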
Please help us by clarifying what more you want to prove here.
Oh it seems like I misunderstood something.
But I mean, look at the figures: why are those regions separated by straight lines?
Sorry, I will have to watch that lecture again to reconstruct what those examples are about. It may be tomorrow before I have time to do that. But anything you can graph in 2D like that is a pretty simplistic example.
Ok, it’s simple. As he explained in the lecture, those examples are the simplest possible case: it’s not really a neural network. It’s like Logistic Regression with only one layer, but with a softmax activation instead of sigmoid. So the inputs are points in the plane with coordinates (x_1, x_2) and the output is a softmax vector computed by taking:
z = W \cdot x + b
\hat{y} = softmax(z)
So let’s suppose that C = 4, meaning that we have 4 possible output “classes” to recognize. That means each input x is a 2 x 1 vector, W is a 4 x 2 matrix, and b is a 4 x 1 column vector. Both z and \hat{y} will be 4 x 1 column vectors for the case of processing a single input.
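To make the shapes concrete, here is a minimal numpy sketch of that setup for C = 4; the particular values of W, b, and x are arbitrary, chosen only to show the dimensions:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax applied column-wise
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

rng = np.random.default_rng(0)

C = 4                              # number of output classes
W = rng.normal(size=(C, 2))        # 4 x 2 weight matrix
b = rng.normal(size=(C, 1))        # 4 x 1 bias (column vector)
x = np.array([[0.5], [-1.2]])      # one input point, a 2 x 1 column vector

z = W @ x + b                      # 4 x 1
y_hat = softmax(z)                 # 4 x 1, entries sum to 1

print(z.shape, y_hat.shape, float(y_hat.sum()))  # (4, 1) (4, 1) and a sum of (approximately) 1.0
```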
Now think about what those lines mean in the graphs: they are the “decision boundaries” between two possible predicted classes. Let’s say it’s between class 1 and class 2. Then that line is the set of points (x_1, x_2) for which the \hat{y}_1 and \hat{y}_2 values are equal. But note that if the softmax values are equal, then the input z values are equal, because softmax is strictly monotonic: both outputs share the same denominator, and e^z is strictly increasing. That means they are the points which satisfy the following equation:
w_{11} * x_1 + w_{12} * x_2 + b_{11} = w_{21} * x_1 + w_{22} * x_2 + b_{21}
What I’ve done there is just write out the meaning of the matrix multiply followed by the addition of b.
Then if you simplify that by gathering like terms, you get this:
(w_{11} - w_{21} ) * x_1 + (w_{12} - w_{22}) * x_2 = b_{21} - b_{11}
Well, what is that? It’s the equation of a line in the plane, right?
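If you want to see that numerically, here is a sketch (again with arbitrary W and b) that picks points on the line derived above and confirms that classes 1 and 2 receive equal softmax scores there:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))    # arbitrary 4 x 2 weights
b = rng.normal(size=(4, 1))    # arbitrary 4 x 1 bias

# Boundary between class 1 and class 2 (rows 0 and 1 in zero-based indexing):
# (w11 - w21) * x1 + (w12 - w22) * x2 = b21 - b11
dw = W[0] - W[1]
db = b[1, 0] - b[0, 0]

for x1 in np.linspace(-3.0, 3.0, 5):
    x2 = (db - dw[0] * x1) / dw[1]        # solve the line equation for x2 (assumes dw[1] != 0)
    x = np.array([[x1], [x2]])
    y_hat = softmax(W @ x + b)
    # On the boundary, the two predicted probabilities agree (up to float error)
    print(np.isclose(y_hat[0, 0], y_hat[1, 0]))   # True on every iteration
```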
Prof Ng mentions again at the very end that this is the simplest case. He explicitly points out that if you add more layers to create a real neural network to do this type of softmax classification, then the decision boundaries that can be represented are much more complex than linear functions.
Maybe it’s also worth pointing out that this is just a generalization of the reason that Logistic Regression can only express a linear decision boundary, even when the inputs are vectors with more than 2 dimensions. In the Logistic Regression case, the output is a single number, rather than a vector as in the softmax case:
z = w \cdot x + b
\hat{y} = sigmoid(z)
It is making a prediction in the form of a probability between 0 and 1, so we interpret the prediction as “Yes” (It’s a cat) if \hat{y} > 0.5. So the decision boundary between “Yes” and “No” is expressed by this equation:
\hat{y} = 0.5
But we also know that sigmoid is strictly monotonic and sigmoid(0) = 0.5, so that equation is equivalent to:
z = 0
If you write that out, it becomes:
w \cdot x + b = 0
Or
\displaystyle \sum_{i = 1}^n w_i * x_i = - b
That is the equation of a plane (hyperplane) in the input space of all possible x vectors.
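And here is a corresponding sketch for the Logistic Regression case (arbitrary w and b, with n = 5 features): any point constructed to lie on the hyperplane w \cdot x + b = 0 gets the prediction \hat{y} = 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 5
w = rng.normal(size=n)     # arbitrary weight vector
b = rng.normal()           # arbitrary bias

# Build a point on the hyperplane w . x + b = 0:
# choose x_2, ..., x_n freely, then solve for x_1 (assumes w[0] != 0).
x = rng.normal(size=n)
x[0] = -(b + np.dot(w[1:], x[1:])) / w[0]

z = np.dot(w, x) + b
print(z, sigmoid(z))       # z is (numerically) 0, so the prediction is 0.5
```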
It’s a really good example. Thanks!