Thank you for such a well-organized course and for putting in the time to help everyone who posts questions.
I am sorry if this question sounds dumb but I am really trying to understand the details. Regarding logistic regression, it’s clear to me why we cannot simply use the linear equation, since this would also give us negative values as an output and in classification we need a 0 or 1 as an output. So the sigmoid helps us with that. What’s not clear to me is why we need to transpose the weight parameter when we multiply it with the features X. I understand X is a vector and not a simple integer; could it be that by doing W.T we actually force W to be a vector as well, allowing us to multiply two vectors with each other?
The answer to this is in the details of matrix multiplication.
The simplest way to think about this is to focus on the shapes of the vectors you are working with, making sure that the general rule for shapes in matrix multiplication is satisfied (vectors can be thought of as matrices with just one row or one column). To multiply two matrices, the one on the left has to have the same number of columns as the one on the right has rows.
Thinking about it this way, taking the transpose sometimes just allows us to multiply together two matrices that would otherwise not be allowed to be multiplied together. Maybe this alone will help you make sense of it; the transpose can be like a formalization that allows us to use matrix multiplication to do what we want with two vectors.
Moreover, and this is where things get more nuanced, there are often several different ways to multiply two matrices together when we have the transpose at our disposal. Which one we choose determines the shape of the resulting vector/matrix. For example, if X and W are both row vectors of length n, multiply(X, W.T) gives you a single number, whereas multiply(X.T, W) gives you an n by n matrix.
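To make the row-vector example concrete, here is a small NumPy sketch. The numbers and the choice n = 3 are made up purely for illustration, and `@` stands in for the generic multiply() used above.

```python
import numpy as np

# Hypothetical values: X and W are both row vectors of shape (1, n), with n = 3.
X = np.array([[1.0, 2.0, 3.0]])
W = np.array([[0.5, -1.0, 2.0]])

# (1, 3) @ (3, 1) -> (1, 1): columns on the left match rows on the right.
inner = X @ W.T

# (3, 1) @ (1, 3) -> (3, 3): also allowed, but the result is an n by n matrix.
outer = X.T @ W

print(inner.shape, outer.shape)

# X @ W on its own would raise an error, because (1, 3) @ (1, 3) breaks the shape rule.
```

Both products are legal; which one you want depends on whether you need a single number or a full matrix as the result.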
When you multiply a vector by another vector (or generally a matrix) on one side, you can think of this as representing some general linear function on that vector. So we can consider X as some input and W as some transformation matrix. Eventually, we want to do this to a bunch of samples in a dataset of vectors all at once, which we stack into a matrix. Due to the wonderful way matrix multiplication is defined, when we multiply a matrix by another matrix, we can think of it as applying that same linear function to all of the rows or columns of that matrix simultaneously.
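As a rough sketch of the "all at once" idea, the snippet below assumes the samples are stacked as the columns of X and that w is stored as a column vector; both choices, along with the sizes and random values, are assumptions made for illustration. It checks that one matrix product gives the same numbers as applying w to each sample separately.

```python
import numpy as np

n, m = 3, 4                      # hypothetical: 3 features, 4 samples
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))      # each column is one sample
w = rng.normal(size=(n, 1))      # the linear function, stored as a column vector

# One product handles every sample: (1, n) @ (n, m) -> (1, m).
Z = w.T @ X

# The same thing done one sample (column) at a time.
Z_loop = np.array([[(w.T @ X[:, [j]]).item() for j in range(m)]])

print(Z.shape, np.allclose(Z, Z_loop))
```

If the samples were stacked as rows instead, the product would be written X @ w and the result would have shape (m, 1); the shape rule tells you which form applies.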
There are obviously a lot of details I’m glossing over here; whether we are thinking of operating on rows or columns depends on which side of the ‘input’ matrix we multiply the ‘transformation’ matrix on, and the rules of matrix multiplication constrain the shapes of matrices that are allowed to be multiplied together. But again, everything can be simplified a lot if you just focus on the rules for matrix multiplication and make sure (1) that things fit and (2) that the output matrix has the shape that you want.
Welcome to the DL Specialization @jpredroanascimento. I do not understand your question, or which part of which week of the course you are referring to.
We have a feature matrix X which is of dimension (n, m); n is the number of features and m is the number of training examples. A single training example x is thus an n-dimensional column vector (i.e. dimension n x 1). Logistic regression for binary classification produces the probability that the example is a “positive case” (“cat” = 1) rather than a “negative case” (“not cat” = 0). So the output for that example is a scalar (1 x 1). In this course, the row dimension of the weight matrix W is equal to the number of outputs; the column dimension is the number of inputs, here the number of features n. Now, let’s just let M represent any matrix of constants and let z = Mx + b. Question: what are the dimensions of M necessary to produce a scalar value z? Next question: what matrix transformation of W is necessary to reach the same conclusion?
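One way to check your answers is to try the shapes in NumPy. The sketch below assumes the weights are stored as an (n, 1) column vector w, which is an assumption about the storage convention rather than something stated above; the values of x, w and b are made up, and sigmoid is a small helper defined here.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

n = 3                                    # hypothetical number of features
x = np.array([[0.2], [-1.0], [0.5]])     # one example, shape (n, 1)
w = np.array([[0.1], [0.4], [-0.2]])     # assumed storage: column vector, shape (n, 1)
b = 0.0

# w has shape (n, 1), so w @ x is not defined; w.T has shape (1, n) and works.
z = w.T @ x + b                          # (1, n) @ (n, 1) -> (1, 1)
a = sigmoid(z)                           # predicted probability of the positive case

print(z.shape, a.shape)
```

Under this assumption, the transpose is what turns the stored weights into the 1 x n matrix M from the question.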