For LSH in the Reformer model (C4_W4 Reformer Efficient Attention: Ungraded Lab), could anyone explain the mechanism for locality-sensitive hashing using pseudo-random rotations? Note: there are some angular LSH animations that attempt an explanation, but I am still confused.
Also, why does the author mention that it "can store in space O(d) and evaluate in time O(d log d) using the Fast Hadamard Transform"?
As for LSH attention, why does it still need to check the previous chunk, given that all the information has been rearranged into entirely different buckets that are not in ascending order? And it seems to me that this LSH attention still needs to be auto-regressive?
An answer to your first two questions requires an understanding of the Fast Hadamard Transform, which is somewhat challenging and may be the result of a mathematical trial-and-error process rather than a purely logical deduction.
The best explanation I have been able to find so far can be found here. The original values of a matrix are transformed by means of a generated Hadamard matrix, which is symmetric. Next, an FHT algorithm is applied as described in the post, which limits the number of calculations required to N log₂ N instead of N². This appears to be another transformation that may well have been arrived at through mathematical trial and error (even if it can be explained after the fact one way or another, I am not aware of such an explanation).
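To make the N log₂ N claim concrete, here is a minimal sketch of the recursive FHT butterfly (my own illustration, not code from the lab), checked against an explicitly built Hadamard matrix:

```python
import numpy as np

def fht(x):
    """Fast Hadamard Transform of a vector of length n = 2^k.

    Uses the recursion H_{2m} = [[H_m, H_m], [H_m, -H_m]], so each
    level costs n additions/subtractions and there are log2(n) levels,
    giving n*log2(n) operations instead of the n^2 of a full multiply.
    """
    n = len(x)
    if n == 1:
        return x.copy()
    a = fht(x[: n // 2])
    b = fht(x[n // 2:])
    # Butterfly step: combine the two half-transforms
    return np.concatenate([a + b, a - b])

def hadamard(n):
    """Explicit (Sylvester) Hadamard matrix, for comparison only."""
    if n == 1:
        return np.array([[1.0]])
    h = hadamard(n // 2)
    return np.block([[h, h], [h, -h]])

x = np.arange(1.0, 9.0)  # length 8 = 2^3
assert np.allclose(fht(x), hadamard(8) @ x)
```

Note that the matrix is never materialized: storing only the input vector gives the O(d) space, and the butterfly recursion gives the O(d log d) time mentioned in the lab.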
As to your third question, LSH attention needs to check other chunks only if they contain vectors with the same hash values. This is shown in figure 12, where other chunks are attended to only if they share the same color code, which represents the hash value. This is done to keep the chunks at a uniform size. LSH attention only needs to be auto-regressive when it is used for causal attention; the discussion in the lab is about self-attention.
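On the hashing mechanism itself, a minimal sketch of angular LSH via random projections may help (my own simplification of the scheme described in the Reformer paper, not the lab's code): project each vector onto a few random directions and take the argmax over the concatenation [xR, -xR].

```python
import numpy as np

def lsh_hash(vecs, n_buckets, rng):
    """Simplified angular LSH: hash by 'pseudo-random rotation'.

    Each vector is projected onto n_buckets/2 random directions; the
    bucket is the index of the largest entry of [xR, -xR]. Vectors
    pointing in nearby directions on the sphere get the same bucket
    with high probability.
    """
    d = vecs.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))  # random projection
    xR = vecs @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
parallel = 2.0 * q  # same direction, different norm
hashes = lsh_hash(np.stack([q, parallel]), n_buckets=8, rng=rng)
assert hashes[0] == hashes[1]  # parallel vectors share a bucket
```

Because the hash depends only on direction, rescaling a vector never changes its bucket, which is why the lab normalizes queries and keys before hashing.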