Hello,

what is the point in using the max in J=max(d_P - d_N, 0) in the triplet cost function? It seems like an unnecessary complication for the optimizer if the initial weights make the d_P - d_N positive (the gradient will be zero)…

Hello,

what is the point in using the max in J=max(d_P - d_N, 0) in the triplet cost function? It seems like an unnecessary complication for the optimizer if the initial weights make the d_P - d_N positive (the gradient will be zero)…

Mmm okay, but how does the training “focus” on the more difficult pairs?

The optimizer as we learnt it simply computes all the derivatives, no matter if the values are zero or not. There would have to be some hard coded instruction to skip the derivatives computations if the loss came out as zero?

No, it is just plain math. For example,

\newcommand{\dv}[2]{\frac{\mathrm{d} #1}{\mathrm{d} #2}}

J = \sum_{i = 1}^m \max(0, w x_i)

\max(0, w x_i) =
\begin{cases}
w x_i & \text{ if } w x_i > 0 \\
0 & \text{ if } w x_i \le 0
\end{cases}

\dv{\max(0, w x_i)}{w} =
\begin{cases}
x_i & \text{ if } w x_i > 0 \\
0 & \text{ if } w x_i \le 0
\end{cases}

\dv{J}{w} = \sum_{i = 1}^m \dv{\max(0, w x_i)}{w} = \sum_{i = 1}^m
\begin{cases}
x_i & \text{ if } w x_i > 0 \\
0 & \text{ if } w x_i \le 0
\end{cases}

No signal will affect the parameter w for training examples where we already have w x_i \le 0, i.e., the training algorithm only focuses on the more difficult pairs where we still have w x_i > 0. We can think of the other training examples as discarded during this weight update. We are not trying to improve on training examples where we already have w x_i \le 0, i.e.,

Thanks, I think I understand now.

The “hard coded” instruction to skip the computation is in your analytical expression of the derivative itself (`0 if wx_i <= 0`

).

I imagined that the whole derivative expression (just `x_i`

in this case) is firstly evaluated for the whole batch in a vectorized way and then multiplied by zero if the max is saturated. Something like `np.sum(dwX/dw * (wX>0), axis=1)`

, where `X=[x_1,x_2,...]`

and `wX`

is the mask to multiply by zero broadcasted to the length of `x_i`

.

Does this mean that the vectorization over the training batch examples is not common for CNNs? We haven’t done a vectorized implementation in the labs I think…

Yes, using the max function acts as a derivative gate and is a common trick in machine learning.

I just illustrated without using vectorization. You can as easily add that to my example as well. You can try it out for yourself. Vectorization or no vectorization doesnt matter, the derivative will get a zero signal from examples that satisfy w x_i \le 0 in our example. I just used a very simple example which is easier to understand. A more complicated example will still have the same principles applied.

I am not sure how to vectorize it in such way to be honest. I would do `np.sum(dwX/dw * (wX>0), axis=1)`

, but this doesn’t avoid the gradient computation also for the zero components. Any hints?

X \in \mathbb{R}^{1 \times m}

wX > 0 \in \mathbb{R}^{1 \times m} \text{ which acts as a mask of ones and zeros}

\frac{dJ}{dw} = (wX > 0) X^T \in \mathbb{R}^{(1 \times m) \times (m \times 1) = 1 \times 1}

should do it

Thanks for the reply, I think you will be able to see what I mean now.

\phantom{x}

Your formula in numpy is `np.dot(wX > 0, X.T)`

.

\phantom{x}

This is computationally better than what I wrote before, but you can see that you haven’t avoided evaluating
\frac{\text{d}wx_i}{\text{d}w}
for all the training examples i, no matter if wx_i\gt0 or not.

So where is the saving in the “effort”? You may as well update all of the pairs during the gradient descent for the same computational cost using this approach.

Yes, but we don’t want to focus on examples where we already are good enough, we don’t want those examples interfering with how we try to do better on the “hard” examples.

Ah ok, so it’s not about computational effort then.

I think I get it now. Perhaps the key point to realize is that a global solution to

\text{minimize } J_1=\sum_{i=1}^{m}(d_P-d_N)

may lead to weights where the predictions are extremely good for some samples, but quite bad for others.

We prefer weights that satisfy d_P-d_N+\alpha\leq0 for as many samples as possible, no matter how closely, which we can achieve using

\text{minimize }J_2=\sum_{i=1}^{m}\text{max}(d_P-d_N+\alpha, 0).

This is even confirmed noting that the global minimum of J_2 is J_2^*=0 when all the samples satisfy d_P-d_N+\alpha\leq0, and the computation can stop.

Maybe at that point, having found a feasible starting point, the user may even want to continue the optimization using J_1 and a constraint d_P-d_N+\alpha\leq0 \text{ for all }i to improve the performance.

yes

it is unlikely that you will achieve this for every training example, so when you are done training, you are done. no need to relax the criterion and start again. you are already performing the best you can.