C5W3 - Softmax with higher dimension

If you look into the utilities and helper functions provided for the course, you will discover that the neural machine translation project needed a new softmax function to be defined. I have some contributions and a couple of questions.

My contributions are:

  1. The new softmax function is defined so that it can handle inputs with more than two dimensions.
  2. Interpreting the function step by step has not been easy, and I think I would need better math skills to fully understand the code. However, I found a simpler way to write it using TensorFlow's built-in function:

import tensorflow as tf

def alt_softmax(x, axis=1):
    return tf.nn.softmax(x, axis=axis)
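
As a quick sanity check (the shape below is arbitrary), this one-liner appears to match the exponentiate-and-normalize computation that, as far as I can tell, the longer helper performs:

import numpy as np
import tensorflow as tf

x = tf.random.normal((4, 10, 3))  # arbitrary input with ndim > 2
# Exponentiate and normalize along the chosen axis; as far as I can tell this is
# what the longer helper computes (it also subtracts the max along the axis first
# for numerical stability, which does not change the result).
manual = tf.exp(x) / tf.reduce_sum(tf.exp(x), axis=1, keepdims=True)
print(np.allclose(alt_softmax(x, axis=1).numpy(), manual.numpy()))  # True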

Now my questions are:

  1. Regarding the concatenation that happens in the attention architecture, why do we concatenate along the last dimension, i.e. why Concatenate(axis=-1)? Basically, I am confused about the dimensions involved in that concatenation: how does the RepeatVector copy of s get concatenated with a into one big matrix? (See the shape sketch after these questions.)

  2. Why do we need a softmax that handles inputs with more than two dimensions (rather than just a matrix or a vector)?
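
For reference, here is a minimal sketch of the shapes I have in mind for question 1; the sizes m, Tx, n_a, n_s are placeholders I picked, and the interpretation of a as the pre-attention Bi-LSTM output is my reading of the assignment:

import tensorflow as tf
from tensorflow.keras.layers import Concatenate, RepeatVector

m, Tx, n_a, n_s = 4, 30, 32, 64            # placeholder sizes, not the assignment's exact values
a = tf.random.normal((m, Tx, 2 * n_a))     # pre-attention Bi-LSTM activations: one vector per input time step
s_prev = tf.random.normal((m, n_s))        # previous hidden state of the post-attention LSTM

s_rep = RepeatVector(Tx)(s_prev)           # (m, Tx, n_s): s_prev copied once for every input time step
concat = Concatenate(axis=-1)([a, s_rep])  # (m, Tx, 2*n_a + n_s): features stacked per time step
print(s_rep.shape, concat.shape)           # (4, 30, 64) (4, 30, 128)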

Thank you.

What is the advantage of defining a new alt_softmax() function that simply wraps a TensorFlow built-in? You could call tf.nn.softmax directly.

I think the normal softmax layer is coded to apply softmax to inputs of ndim > 2 across a fixed default axis (the last one).
A customized softmax function like this one handles inputs of ndim > 2 across an axis of your choice.
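
A small illustration of why the axis matters, assuming the attention energies have shape (m, Tx, 1) as I believe they do in this assignment: a softmax over the default last axis would make every weight 1.0, whereas a softmax over the Tx axis gives a proper distribution over the input time steps.

import tensorflow as tf

energies = tf.random.normal((2, 5, 1))        # (m, Tx, 1): one scalar energy per input time step

over_last = tf.nn.softmax(energies, axis=-1)  # softmax over a length-1 axis: every weight becomes 1.0
over_time = tf.nn.softmax(energies, axis=1)   # softmax over the Tx axis: weights sum to 1 per example

print(tf.reduce_sum(over_last, axis=1)[0, 0].numpy())  # 5.0, not a valid distribution over time steps
print(tf.reduce_sum(over_time, axis=1)[0, 0].numpy())  # 1.0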

The initial softmax that Andrew defined as a helper function was a little long and complicated. But this one is shorter and circumvents the need to understand the math behind it.

Although I don’t know why ndim would be greater than 2 in the first place; that is one of the questions I raised.