@Deepti_Prasad perhaps you can help me with this, as I know you are one of the few NLP mentors.
I get the point they make that the weights are ‘shared’, but does this sharing happen while we are still in the two-LSTM stage of the network, or are they saying it happens once we get to the cosine similarity step?
I’m not quite at the assignment yet, but I am finding this really confusing… I’d imagine you’d share the weights at the LSTM stage, but I have no idea how you’d actually do that.
Also, the args list cites d_model with a default of 128… I am presuming they mean ‘d_feature’, because otherwise I’m not sure where this value is coming from…
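Going back to the weight-sharing question, and just to make it concrete: the only way I can picture ‘sharing’ at the encoder stage is that there is really one set of parameters, and both questions are pushed through it before the cosine similarity is computed. Below is a rough NumPy sketch of that guess. It is not the course code: the plain tanh recurrent cell is a stand-in for the LSTM, `shared`, `encode`, `vocab_size` and the token ids are names and values I made up, and I am assuming d_model is the embedding/hidden size.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 128    # assumed to be the embedding / hidden size (the default of 128 from the args list)
vocab_size = 50  # hypothetical toy vocabulary size

# ONE set of parameters, created once. Both branches of the network use it.
shared = {
    "embed": rng.normal(scale=0.1, size=(vocab_size, d_model)),
    "W_x": rng.normal(scale=0.1, size=(d_model, d_model)),
    "W_h": rng.normal(scale=0.1, size=(d_model, d_model)),
}

def encode(token_ids, params):
    """Encode a token sequence with a plain tanh recurrent cell (a stand-in
    for the LSTM) and return the L2-normalised final hidden state."""
    h = np.zeros(d_model)
    for t in token_ids:
        x = params["embed"][t]                       # look up the token embedding
        h = np.tanh(x @ params["W_x"] + h @ params["W_h"])
    return h / np.linalg.norm(h)

# Two hypothetical tokenised questions.
q1 = [3, 17, 42, 8]
q2 = [3, 17, 42, 9]

# "Sharing" here just means the same `shared` dict is passed for both inputs.
v1 = encode(q1, shared)
v2 = encode(q2, shared)

# The vectors are unit length, so the dot product is the cosine similarity.
similarity = float(v1 @ v2)
print(similarity)
```

If this picture is right, the sharing is already in place at the encoder stage, and the cosine similarity step has no weights of its own to share. Is that what the lecture means?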