In C4_W4, we learned the intricacies of implementing Locality-Sensitive Hashing (LSH) in self-attention to save memory. However, what I cannot see is how it is actually used in the ReformerLM.
shape11 = trax.shapes.ShapeDtype((1, 1), dtype=np.int32)

def attention(*args, **kwargs):
    kwargs['predict_mem_len'] = 120   # max length for predictions
    kwargs['predict_drop_len'] = 120  # never drop old stuff
    return tl.SelfAttention(*args, **kwargs)

model = ReformerLM(
    vocab_size=33000,
    n_layers=6,
    mode='predict',
    attention_type=attention,
)
All I see is a plain tl.SelfAttention. Please help me understand if I may be missing something here.
Thanks!
Hey @cmosguy,
Indeed, the ReformerLM used in the assignment doesn't use the LSH Self-Attention that was implemented in the Ungraded Lab 1: Reformer LSH. However, trax offers different implementations of LSH Self-Attention, which you can easily swap in at the lines of code below (can be found here):
def attention(*args, **kwargs):
    kwargs['predict_mem_len'] = 120   # max length for predictions
    kwargs['predict_drop_len'] = 120  # never drop old stuff
    return tl.SelfAttention(*args, **kwargs)
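For instance, a minimal sketch of that swap, assuming trax's tl.LSHSelfAttention layer (this is not the assignment's reference code, and the n_hashes and chunk_len values are illustrative picks, not tuned ones):

def lsh_attention(*args, **kwargs):
    kwargs['predict_mem_len'] = 120   # max length for predictions
    kwargs['predict_drop_len'] = 120  # never drop old stuff
    kwargs['n_hashes'] = 4            # number of LSH hash rounds (assumed value)
    kwargs['chunk_len'] = 64          # chunk size for bucketed attention (assumed value)
    return tl.LSHSelfAttention(*args, **kwargs)

model = ReformerLM(
    vocab_size=33000,
    n_layers=6,
    mode='predict',
    attention_type=lsh_attention,
)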
The only other things you need to change are the hyper-parameters and the pre-trained model (since it was most likely trained using tl.SelfAttention only). Feel free to do it as a self-exercise, and do share your results with the community. I hope this helps.
Cheers,
Elemento
Thanks @Elemento
I'll go and try to do this. Is the main reason LSH was not used that the data chunks are so small it made no sense for the authors of the notebook to implement LSH? LSH is designed to handle inputs on the order of 1M tokens, and the input sentences in this dataset are much shorter. Am I correct?
Hey @cmosguy,
Indeed, that could be one of the reasons. Another reason, I believe, which is more or less related to what you have mentioned, might be the absence of pre-trained models using LSH Self-Attention that have been trained on smaller datasets like the one used in the assignment. And lastly, perhaps the developers wanted the learners to focus more on the Reversible Layers instead of LSH Self-Attention, since it was already discussed in considerable depth in UGL 1, and hence they avoided the use of LSH Self-Attention. I hope this resolves your query.
Cheers,
Elemento