I’m on exercise 4 out of 8 on the Transformers lab. Obviously, having worked this hard, I would really like to pass this course. I’ve gotten here with some help, but without any use of AI or googling solutions, and without anyone actually showing me how to do anything, on this forum or outside of it.
Now I am half a lab away from finishing a roughly 20-week class. I’m not a cheater; I have never cheated on a homework or test problem in my life.
But I would honestly appreciate whatever help you are able to give me without actually crossing into cheating. Please. I’m so sorry to bother you.
I’ve got code written that passes the unit tests for the previous three parts. Thanks; clarifying what question one was asking did help. Exercise 4 honestly looks okay to me. I’ve checked it up, down, and backward, but it is giving me this error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-21-00617004b1af> in <module>
1 # UNIT TEST
----> 2 EncoderLayer_test(EncoderLayer)
~/work/W4A1/public_tests.py in EncoderLayer_test(target)
92 [[ 0.23017104, -0.98100424, -0.78707516, 1.5379084 ],
93 [-1.2280797 , 0.76477575, -0.7169283 , 1.1802323 ],
---> 94 [ 0.14880152, -0.48318022, -1.1908402 , 1.5252188 ]]), "Wrong values when training=True"
95
96 encoded = encoder_layer1(q, False, np.array([[1, 1, 0]]))
AssertionError: Wrong values when training=True
So I guess my first question: is it supposed to have training=training for every layer call, since it is in fact training? I would have assumed so, since backward propagation has to happen. That’s how it currently is in my Jupyter notebook, though I’ve tried zillions of variations.
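To be concrete about what I mean, here is a tiny standalone check I tried outside the assignment, with plain Keras layers (nothing copied from the notebook):

```python
import tensorflow as tf

# Dropout is the kind of layer whose behavior depends on the training flag;
# LayerNormalization behaves the same whether or not the flag is passed.
dropout = tf.keras.layers.Dropout(0.1)
x = tf.ones((1, 3, 4))

print(dropout(x, training=True))   # some entries zeroed, the rest scaled by 1/(1 - 0.1)
print(dropout(x, training=False))  # input passes through unchanged
```

So my current reading is that the dropout call is the one that really needs training=training, but maybe I’m wrong about what the grader expects.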
Secondly, based on the documentation for tf.keras.layers.MultiHeadAttention, I think I’m probably not supposed to modify the mask in this specific function, so that it stays as ones and zeros? Right now I have not modified it.
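By “not modified” I mean I hand the ones-and-zeros padding mask straight to attention_mask and let the layer deal with it, along the lines of this toy example (the shapes are made up to mirror the unit test; it is not my actual assignment code):

```python
import numpy as np
import tensorflow as tf

# Toy self-attention call with the raw 1/0 padding mask passed through untouched.
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)

q = tf.random.uniform((1, 3, 4))                  # (batch, seq_len, d_model)
pad_mask = np.array([[1, 1, 0]])                  # 1 = attend to this position, 0 = padding
attn_out = mha(q, q, q, attention_mask=pad_mask)  # query, value, key, mask passed as-is
print(attn_out.shape)                             # (1, 3, 4)
```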
Thirdly, all this routine is asking for is the MHA, followed by a layer normalization, followed by the FFN, followed by a dropout layer, followed by a layer normalization of a sum, right?
Am I missing anything here conceptually?
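Just to make sure we are picturing the same flow, here it is written out from scratch with placeholder names (a sketch of the standard encoder layer as I understand it, not pasted from my notebook, so the layer names, constructor arguments, and default sizes are my own inventions and may not match the assignment exactly):

```python
import tensorflow as tf

class EncoderLayerSketch(tf.keras.layers.Layer):
    """Sketch of the flow described above; all names are placeholders."""

    def __init__(self, d_model=4, num_heads=2, dff=8, dropout_rate=0.1):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout_ffn = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training, mask):
        attn_output = self.mha(x, x, x, attention_mask=mask)          # self-attention, mask untouched
        out1 = self.layernorm1(x + attn_output)                       # layer norm of (input + attention)
        ffn_output = self.ffn(out1)                                   # feed-forward network
        ffn_output = self.dropout_ffn(ffn_output, training=training)  # dropout, only active when training
        return self.layernorm2(out1 + ffn_output)                     # layer norm of a sum again
```

If that picture itself is wrong somewhere, that would explain the wrong values.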
Steven