Cost function, normalization, and mean layer questions

Q1. Can I use a cross-entropy-style cost function, like the one we used in the classification problem, to train the Siamese network?

[image: the proposed cross-entropy-style cost function]

where
A is the anchor sentence,
X_i is a candidate sentence,
Y_i is its label: 1 for positive, 0 for negative.

The cost function should pull all positive sentences close to the anchor while pushing all negative sentences away from it.
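
In symbols, I mean something like the following (my own notation, not exactly what is in the image; f is the shared encoder, sim is cosine similarity, and σ is the sigmoid):

$$
J = -\sum_i \Big[\, Y_i \log \sigma\big(\mathrm{sim}(f(A), f(X_i))\big) + (1 - Y_i) \log\big(1 - \sigma(\mathrm{sim}(f(A), f(X_i)))\big) \Big]
$$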

This cost function was not chosen for the course. Is that because its gradient is hard to compute, or did I miss something?

EDIT:

The FaceNet paper introduces the triplet loss function.
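
As I read the paper, the triplet loss is (f is the embedding, P_i and N_i are the positive and negative examples for anchor A_i, and α is the margin):

$$
\mathcal{L} = \sum_i \Big[\, \lVert f(A_i) - f(P_i) \rVert_2^2 - \lVert f(A_i) - f(N_i) \rVert_2^2 + \alpha \,\Big]_{+}
$$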

My understanding is that the cross-entropy cost is too aggressive for this training task.

The paper also mentions that their architecture has a normalization layer, but it doesn't seem to explain why.
[image: the FaceNet architecture, including its L2 normalization layer]

Q2. Why do we need normalization in the Siamese network?

We didn't need a normalization layer in previous weeks. Is it because using the cosine similarity function requires us to L2-normalize the vectors?
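
Here is a small sketch of what I think the normalization buys us (my own toy example, in plain NumPy rather than trax): once the vectors have unit length, cosine similarity is just a dot product, and every score stays in [-1, 1].

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Divide each vector by its Euclidean norm so it has unit length.
    return x / np.sqrt(np.sum(x * x, axis=axis, keepdims=True))

rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=(2, 8))  # two raw encoder outputs (toy stand-ins)

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
dot_of_normalized = np.dot(l2_normalize(v1), l2_normalize(v2))

print(np.allclose(cosine, dot_of_normalized))  # True: cosine == dot product of unit vectors
```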

Q3. Why is there a mean layer in the network?


```python
LSTM = tl.Serial(
    tl.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
    tl.LSTM(model_dimension),
    tl.Mean(axis=1),  # average the LSTM outputs over the time (sequence) axis
    tl.Fn('Normalize', lambda x: normalize(x))  # normalize(): L2-normalize each sentence embedding
)
```

EDIT: the LSTM layer's output contains the output y at every time step (see the screenshot below).
[screenshot: the LSTM layer's output at every time step]

The Mean layer averages the LSTM outputs across all time steps. But why not just use the last hidden state of the LSTM, which presumably contains the semantic features of the whole sentence?
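
A toy example (mine, not from the assignment) of the two options on a fake LSTM output of shape (batch, seq_len, d_model):

```python
import numpy as np

# Fake LSTM output: (batch, seq_len, d_model) -- one vector per token per sentence.
lstm_out = np.random.randn(2, 5, 4)

mean_pooled = lstm_out.mean(axis=1)   # what tl.Mean(axis=1) computes -> shape (2, 4)
last_step   = lstm_out[:, -1, :]      # the alternative: only the final time step -> shape (2, 4)

print(mean_pooled.shape, last_step.shape)
```

The shapes come out the same either way, so my question is really about which summary of the sequence is more informative.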
