C2-W3 Why do we have to transpose "logits" and "labels"

Continuing the discussion from Week 3 - compute_total_loss Incorrect:

Hi,
In the Week 3 assignment of Course 2, I needed to transpose “logits” and “labels” when passing them to “tf.keras.losses.categorical_crossentropy” to make the “compute_total_loss” function work. However, I’m not entirely sure why that transposition was necessary.

I’ve noticed that we frequently transpose other matrices in various functions throughout this assignment, and I’m feeling a bit confused.
Could someone please offer some guidance or clarification on this matter?

Your assistance would be greatly appreciated.


The particular case of compute_total_loss and why the transpose is required is discussed on this thread.
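To make that concrete, here is a minimal sketch with made-up shapes (not the assignment’s actual data): categorical_crossentropy reduces over its last axis, which it treats as the class axis, so tensors laid out “one column per example” have to be flipped to “one row per example” before the call.

```python
import tensorflow as tf

# Made-up shapes following the course convention: one COLUMN per example.
# 6 classes, 4 examples, so both logits and labels are (6, 4).
logits = tf.random.normal((6, 4))
labels = tf.one_hot(tf.constant([0, 2, 5, 1]), depth=6, axis=0)   # also (6, 4)

# categorical_crossentropy reduces over the last axis, which it treats as the
# class axis, so it wants (examples, classes); hence the transposes.
per_example_loss = tf.keras.losses.categorical_crossentropy(
    tf.transpose(labels),     # (4, 6): one row per example
    tf.transpose(logits),     # (4, 6)
    from_logits=True)

total_loss = tf.reduce_sum(per_example_loss)   # (4,) per-example losses summed to a scalar
print(per_example_loss.shape, total_loss)
```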

Each case will be determined by the particular circumstances: how the data is formatted and what the operations being used require. One other case I can think of was in the Logistic Regression discussions in DLS C1 W2. There we needed to transpose the weight vector w in order to make the linear activation work:

Z = w^T \cdot X + b

That was because Prof Ng uses the convention that standalone vectors are column vectors. So w has dimensions n_x x 1, and because X is defined to have dimensions n_x x m in that case (also related to the previous link), we need the transpose in order for the dot product to work.
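Here is a quick NumPy sketch (with arbitrary sizes) of why the shapes force that transpose:

```python
import numpy as np

n_x, m = 5, 10                 # made-up sizes: 5 features, 10 training examples
w = np.random.randn(n_x, 1)    # column vector, n_x x 1, per the course convention
X = np.random.randn(n_x, m)    # one column per example, n_x x m
b = 0.5

# np.dot(w, X) would fail: (n_x, 1) and (n_x, m) have mismatched inner dimensions.
# Transposing w gives (1, n_x) . (n_x, m) -> (1, m): one z value per example.
Z = np.dot(w.T, X) + b
print(Z.shape)                 # (1, 10)
```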


The other high level point here is that we’re always starting from mathematical formulas: that’s the root of everything. Then we need to translate those into linear algebra operations and finally express everything as algorithms written in Python. Exactly how all that plays out depends on the decisions we make along the way about how to encode and represent the data.

The choices of how Prof Ng represents the X matrix containing the input “sample” vectors and how he represents the w weight vector in Logistic Regression result in the particular linear algebra operations there. He could have chosen different representations, and then things would work out differently in the details of the expressions.

Hi @paulinpaloalto,

From the lectures in Course 1 Week 4, titled “Getting your matrix dimensions right,” I gathered that when we have the equation Z = W X + b, the shape of W1 should be (number of units x number of features), and X’s shape should be (number of features x m), which results in Z1 having a shape of (number of units x m).

With that in mind, I’m not entirely certain, but it seems that the convention in TensorFlow uses transposed matrices compared to Prof Ng’s convention, which would mean Z = XW + b rather than Z = WX + b.

Is my assumption correct? Is there any documentation on this topic available on the TensorFlow website?

Thank you so much @paulinpaloalto

TensorFlow is not involved yet in C1 W4. There and in C1 W3, Prof Ng gets to compose the W matrices however he wants, and he chooses to “stack” the transposed weight column vectors as the rows of the weight matrix. So you get:

Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}

No transpose required. The dimensions are:

W^{[l]} is n^{[l]} x n^{[l-1]}
A^{[l-1]} is n^{[l-1]} x m
b^{[l]} is n^{[l]} x 1

If you put all that together, you find that A^{[l]} will be n^{[l]} x m.
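A tiny NumPy check (with arbitrary layer sizes) confirms that bookkeeping: with the transposed weight vectors already stacked as rows, no further transpose is needed, and broadcasting takes care of b:

```python
import numpy as np

n_prev, n_l, m = 4, 3, 7              # made-up sizes: n^[l-1], n^[l], batch size
W = np.random.randn(n_l, n_prev)      # W^[l] is n^[l] x n^[l-1]
A_prev = np.random.randn(n_prev, m)   # A^[l-1] is n^[l-1] x m
b = np.random.randn(n_l, 1)           # b^[l] is n^[l] x 1, broadcast across the m columns

Z = np.dot(W, A_prev) + b             # no transpose needed with this layout
A = np.maximum(0, Z)                  # e.g. a ReLU activation
print(Z.shape, A.shape)               # both (3, 7), i.e. n^[l] x m
```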

Once we get to TensorFlow, it may well be that the tensors are arranged differently because of the “samples first” convention. You also have a choice of where to put the “channels” dimension when you’re dealing with ConvNets. Generally the shape of a batch of image tensors would be:

m x h x w x c

Where m is the number of samples, h is the height in pixels, w is the width in pixels and c is the number of channels.
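That also answers the earlier question about Z = XW + b: a Keras Dense layer does compute its output with the samples-first layout, so its kernel is stored as (n_in, n_out) rather than (n_out, n_in). A minimal sketch with made-up sizes:

```python
import tensorflow as tf

m, n_in, n_out = 8, 5, 3             # made-up batch and layer sizes
X = tf.random.normal((m, n_in))      # samples-first: one ROW per example

dense = tf.keras.layers.Dense(n_out)
A = dense(X)                         # builds the layer and applies it

print(dense.kernel.shape)            # (5, 3): W stored as (n_in, n_out)
print(dense.bias.shape)              # (3,)
print(A.shape)                       # (8, 3): one row of outputs per example, Z = X @ W + b
```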

Hi,

Thanks for explaining the “samples first” convention. It’s much clearer to me now.

Best regards.