Week 4: Triplet loss

Hello,

What is the point of using the max in J = \max(d_P - d_N, 0) in the triplet cost function? It seems like an unnecessary complication for the optimizer: if the initial weights make d_P - d_N negative, the loss saturates at zero and the gradient will be zero…

Mmm okay, but how does the training “focus” on the more difficult pairs?
The optimizer, as we learnt it, simply computes all the derivatives, no matter whether the values are zero or not. Wouldn't there have to be some hard-coded instruction to skip the derivative computations when the loss comes out as zero?

No, it is just plain math. For example,

\newcommand{\dv}[2]{\frac{\mathrm{d} #1}{\mathrm{d} #2}}
J = \sum_{i = 1}^m \max(0, w x_i)
\max(0, w x_i) = \begin{cases} w x_i & \text{ if } w x_i > 0 \\ 0 & \text{ if } w x_i \le 0 \end{cases}
\dv{\max(0, w x_i)}{w} = \begin{cases} x_i & \text{ if } w x_i > 0 \\ 0 & \text{ if } w x_i \le 0 \end{cases}
\dv{J}{w} = \sum_{i = 1}^m \dv{\max(0, w x_i)}{w} = \sum_{i = 1}^m \begin{cases} x_i & \text{ if } w x_i > 0 \\ 0 & \text{ if } w x_i \le 0 \end{cases}

No signal will affect the parameter w for training examples where we already have w x_i \le 0; the training algorithm only focuses on the more difficult examples where we still have w x_i > 0. We can think of the other training examples as discarded during this weight update: we are not trying to improve on examples where we are already doing well.
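You can check this numerically in numpy; here is a minimal sketch (the data is made up just for illustration, this is not from the labs):

```python
import numpy as np

# Numerical check of the piecewise derivative above (toy data, for illustration).
rng = np.random.default_rng(0)
m = 5
x = rng.normal(size=m)   # one scalar input x_i per training example
w = 0.3                  # a single scalar parameter

# Analytic gradient: each example contributes x_i if w * x_i > 0, else 0.
grad_analytic = sum(x_i if w * x_i > 0 else 0.0 for x_i in x)

# Finite-difference check of dJ/dw with J = sum_i max(0, w * x_i).
def J(w):
    return np.sum(np.maximum(0.0, w * x))

eps = 1e-6
grad_numeric = (J(w + eps) - J(w - eps)) / (2 * eps)

print(grad_analytic, grad_numeric)  # the two agree closely
```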

Thanks, I think I understand now.
The “hard coded” instruction to skip the computation is in your analytical expression of the derivative itself (0 if w x_i \le 0).

I imagined that the whole derivative expression (just x_i in this case) is first evaluated for the whole batch in a vectorized way and then multiplied by zero wherever the max is saturated. Something like np.sum(dwX/dw * (wX > 0), axis=1), where X = [x_1, x_2, ...] and (wX > 0) is the mask of zeros and ones, broadcast to the shape of the derivative, that multiplies the saturated terms by zero.
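Spelled out in numpy, I pictured something like this (toy data; the names like dwX_dw are made up):

```python
import numpy as np

# Roughly what I had in mind: evaluate the derivative for ALL examples first,
# then knock out the saturated terms with a mask (toy data for illustration).
rng = np.random.default_rng(1)
m = 5
X = rng.normal(size=(1, m))   # X = [x_1, x_2, ..., x_m] as a row vector
w = 0.3

dwX_dw = X                    # d(w x_i)/dw = x_i, evaluated for every example
mask = (w * X > 0)            # then the saturated max terms get zeroed out
dJ_dw = np.sum(dwX_dw * mask, axis=1)
print(dJ_dw)                  # shape (1,): the summed, masked gradient
```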

Does this mean that vectorization over the training batch examples is not common for CNNs? We haven't done a vectorized implementation in the labs, I think…

Yes, the max function acts as a derivative gate; using it like this is a common trick in machine learning.

I just illustrated the idea without vectorization. You can easily add that to my example and try it out for yourself. Vectorization or no vectorization doesn't matter: the derivative will get a zero signal from examples that satisfy w x_i \le 0 either way. I used a very simple example because it is easier to understand; a more complicated example still follows the same principles.

I am not sure how to vectorize it in such a way, to be honest. I would do np.sum(dwX/dw * (wX > 0), axis=1), but this doesn't avoid computing the gradient for the zero components as well. Any hints?

X \in \mathbb{R}^{1 \times m}
(wX > 0) \in \{0, 1\}^{1 \times m} \text{, which acts as a mask of ones and zeros}
\frac{dJ}{dw} = (wX > 0)\, X^T \in \mathbb{R}^{1 \times 1} \text{, since } (1 \times m)(m \times 1) = 1 \times 1

should do it :cowboy_hat_face:

Thanks for the reply; I think you will be able to see what I mean now.

Your formula in numpy is np.dot(wX > 0, X.T).

This is computationally better than what I wrote before, but you can see that you still haven't avoided evaluating \frac{\text{d}(w x_i)}{\text{d}w} for all the training examples i, no matter whether w x_i > 0 or not.
So where is the saving in the “effort”? You may as well update all of the pairs during gradient descent for the same computational cost using this approach.
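To make my point concrete (toy data again, just a sketch):

```python
import numpy as np

# The masked dot product gives the same gradient as an explicit per-example
# loop, but notice that both still touch every single x_i.
rng = np.random.default_rng(2)
m = 5
X = rng.normal(size=(1, m))
w = 0.3
wX = w * X

vectorized = np.dot(wX > 0, X.T)                  # (1, m) @ (m, 1) -> (1, 1)
looped = sum(x_i for x_i in X[0] if w * x_i > 0)  # skips the "easy" examples

print(vectorized.item(), looped)                  # identical up to rounding
```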

Yes, but we don't want to focus on examples where we are already good enough; we don't want those examples interfering with how we try to do better on the “hard” examples.

Ah ok, so it’s not about computational effort then.
I think I get it now. Perhaps the key point to realize is that a global solution to

\text{minimize } J_1=\sum_{i=1}^{m}(d_P-d_N)

may lead to weights where the predictions are extremely good for some samples, but quite bad for others.
We prefer weights that satisfy d_P-d_N+\alpha\leq0 for as many samples as possible, even if only barely, which we can achieve using

\text{minimize } J_2=\sum_{i=1}^{m}\max(d_P-d_N+\alpha, 0).

This is confirmed by noting that the global minimum of J_2 is J_2^*=0, reached when all the samples satisfy d_P-d_N+\alpha\leq0, at which point the computation can stop.
Maybe at that point, having found a feasible starting point, the user may even want to continue the optimization using J_1 under the constraints d_P-d_N+\alpha\leq0 \text{ for all } i to improve the performance further.
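A made-up numeric sketch of what I mean:

```python
import numpy as np

# Illustration of the J_1 vs J_2 difference, with alpha = 0.2 (all values
# invented). Each entry is d_P - d_N for one sample, under two candidate
# weight settings.
alpha = 0.2
A = np.array([-5.2, 0.3, 0.3])    # one sample extremely good, two violate the margin
B = np.array([-0.3, -0.3, -0.3])  # every sample satisfies d_P - d_N + alpha <= 0

def J1(d):
    return np.sum(d)

def J2(d):
    return np.sum(np.maximum(d + alpha, 0.0))

print(J1(A), J1(B))  # -4.6 vs -0.9: J_1 prefers A despite the margin violations
print(J2(A), J2(B))  #  1.0 vs  0.0: J_2 prefers B, where all samples pass
```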

yes :slight_smile:

It is unlikely that you will achieve this for every training example, so when you are done training, you are done. There is no need to relax the criterion and start again; you are already performing the best you can.