Question on locality-sensitive hashing

Thank you for the wonderful course.
I have a question on locality-sensitive hashing.
In the course it was shown how to select a “bucket” when the space is divided into parts by different planes. However, how to calculate the coefficients of these planes was not discussed. Could you please shed light on this issue: how are these planes defined?
I will be grateful for your answer.
My best regards,
Vasyl.

1 Like

Hi!
The planes are generated randomly using np.random.normal. We only need the normal vector of each plane to determine on which side of the plane a vector V lies: the sign of the dot product of V with the plane normal P tells us the side.
Ideally, we would run KNN on all the vectors, but that is computationally expensive, so we use LSH to trade off accuracy for speed and only search vectors in certain buckets. These buckets, as you mention, are created by hyperplanes. Of course we want as few hyperplanes as possible to reduce computation, but at the same time we want more accurate KNN results. This is more of an engineering problem, where you try different numbers of planes and even create multiple sets of them (called a “universe” in the video and assignment) to get better results.
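As a rough sketch of what those randomly generated planes do (variable and function names here are my own, not the assignment's):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_planes = 5, 3

# Each column is the normal vector of one random hyperplane through the origin.
planes = rng.normal(size=(n_dims, n_planes))

def hash_vector(v, planes):
    # One bit per plane: on which side of the plane does v lie?
    sides = (v @ planes >= 0).astype(int)
    # Combine the bits into a single bucket id in [0, 2**n_planes).
    return int(sum(bit * 2**i for i, bit in enumerate(sides)))

v = rng.normal(size=n_dims)
bucket = hash_vector(v, planes)
```

Only the normals are needed; the dot-product signs are what select the bucket.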

Hope that answers your question.

1 Like

Thank you for the answer!
I have a further question, please: how is the number of planes connected to the number of buckets?
As in the figure below, the intersection of 3 planes produces 7 regions, so are there 7 buckets, which are then enumerated?
Or, according to the approach explained in the lecture, each point can be encoded by 1 bit with respect to each plane (0 = below, 1 = above), so the buckets are encoded by 3-bit numbers and there are 8 of them?
I will be grateful for the answer.
Vasyl.

1 Like

Hi!
I can't see your attached figure, so I can't answer the figure-related part of your question.
But the idea of a hyperplane is to divide the space into buckets. Each hyperplane divides the space into 2 parts, creating subspaces which we call buckets. This results in 2^n buckets if n is the number of hyperplanes.
If the buckets (as you mention) are encoded with 3 bits, that means there are 3 hyperplanes, resulting in 2^3 = 8 buckets.

1 Like

Thank you for the answer!
Sorry, here is the picture I meant:

[image: three lines (red, green, blue) dividing the plane into 7 regions]

So, we have 7 intersection areas but they are not the buckets, right?

1 Like

Sorry for the confusion earlier; a subregion is not a bucket in itself, but it creates one. What I mean is that since we assign either 0 or 1 per hyperplane, the total number of buckets becomes 2^n.
We do not enumerate per region, but per hyperplane.

1 Like

Thank you for the answer, Jadav!
But… what I want to say is that the number of non-empty buckets should be less than 2^n
(i.e. some buckets will always be empty).
For example, consider 3 lines in the plane. They create 7 regions. The hash value “6” is impossible to get, because there can be no point which is above the red & green lines but simultaneously below the blue line.
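This can be checked numerically; a quick sketch of mine, with made-up random lines that have offsets like the ones in the figure:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three random lines in 2D with offsets, as in the figure: n·x + b = 0
normals = rng.normal(size=(3, 2))
offsets = rng.normal(size=3)

# Sample many points and record which side of each line they fall on.
points = rng.uniform(-100, 100, size=(100_000, 2))
bits = (points @ normals.T + offsets >= 0).astype(int)
patterns = {tuple(row) for row in bits}

# Three lines in general position cut the plane into only 7 regions,
# so at most 7 of the 2**3 = 8 sign patterns can actually occur.
assert len(patterns) <= 7
```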

Please let me know your opinion.
My best regards,
Vasyl.

1 Like

Yes, I had also noticed that there were 7 regions in the diagram, and I worked through a couple of examples with different labelings of the + and - sides. I'm not sure I could prove it mathematically, but in every case one of the 8 combinations is not achievable. So the conjecture is that you end up with 2^n - 1 non-empty buckets.

We should try it with 4 lines and see if we end up with 15, but that will be a lot messier in that there are more ways to draw the diagram. :nerd_face:

1 Like

Hi @vasyl.delta

This should answer your question regarding how many regions we get after dividing the space.

For example, with 4 lines in 2 dimensions the most regions you would get is 11.
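If I recall the standard hyperplane-arrangement bound correctly, the maximum number of regions that n hyperplanes create in d dimensions is the sum of C(n, i) for i = 0..d; a small check (my own sketch):

```python
from math import comb

def max_regions(n_planes, dim):
    # Maximum regions n hyperplanes in general position create in R^dim.
    return sum(comb(n_planes, i) for i in range(dim + 1))

assert max_regions(3, 2) == 7    # 3 lines in the plane: 7 regions, not 8
assert max_regions(4, 2) == 11   # 4 lines: 11 regions, as mentioned above
assert max_regions(3, 3) == 8    # in 3D, 3 planes can realize all 2**3 patterns
```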

Btw, if I remember correctly, in the previous courses when dividing the space with hyperplanes, we drew them through the origin (we did not use the bias term), so your picture is a bit misleading.

The more accurate intuition would be that we have multiple randomly rotated lines that go through the origin and divide the space into multiple regions that way. This makes more sense with cosine similarity, since we are looking for vectors that point in the same direction. (The picture you have is better suited for Euclidean distance, when looking for points located nearby in Euclidean space.) In other words, we do not care about dividing the space into as many regions as possible, but about having regions that limit our search space in a way suitable for our similarity measure. And of course, we have multiple of these universes just in case :slight_smile:
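To illustrate the origin-centered picture with a sketch of mine (not from the course): n distinct lines through the origin cut the plane into 2n sectors, so 3 such lines give at most 6 non-empty buckets rather than 8.

```python
import numpy as np

rng = np.random.default_rng(2)
# Three random lines through the origin (no bias term): n·x = 0
normals = rng.normal(size=(3, 2))

points = rng.normal(size=(100_000, 2))
patterns = {tuple(row) for row in (points @ normals.T > 0).astype(int)}

# 3 lines through the origin make 2 * 3 = 6 sectors, so at most 6 of the
# 8 possible bit patterns occur; opposite sectors get opposite patterns.
assert len(patterns) <= 6
```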

Cheers

1 Like

Hi!
Additional Food for thought:

  1. I consistently used the word hyperplane and avoided the word line. What you drew here are lines, but in fact these are planes passing through your screen, and what we want is the dot product of their normals (which are perpendicular to your lines and denoted as P) with our data vector (V).
  2. Now, when you draw a normal to a plane, what tells you which side is positive and which is negative? There are 2 possible normals and both are correct. The +/- sign comes from the reference we choose.
    In our case, we assign random P vectors, so the signs are fixed by the normal (for example [3, -2]).
    The signs that you drew in your image cannot be just random; they have to be calculated from the dot product of these normals P with V. In that case, are you sure ++- is not possible? If that blue line were at the top, ++- would be possible for the same P values.

To avoid all this confusion (I am a visual learner too and would like to draw my vectors, but it adds to the confusion), it's easier to just think in terms of N-dimensional vectors.

As per my understanding:
For the count of buckets, refer to the video “Multiple planes” (timestamp 2:14/3:49), where Younes mentions that we need a single hash value to know which bucket to assign our V to. That means the total number of possible hash values tells us how many buckets we can have. For 3 hyperplanes we can have hash values 0-7, thus 8 buckets. (In some papers you might read the term ‘hash bucket’.) So buckets are nothing but the total possible hash values we can achieve with the hash function. For the LSH we use in the course, that is 2^n.
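In other words, the hash value is just the per-plane bits read as a binary number; a tiny sanity check of my own:

```python
from itertools import product

# For 3 hyperplanes the per-plane bits (h0, h1, h2) combine into
# hash = h0*1 + h1*2 + h2*4, which ranges over 0..7, i.e. 8 possible buckets.
hashes = {h0 + 2 * h1 + 4 * h2 for h0, h1, h2 in product((0, 1), repeat=3)}
assert hashes == set(range(8))
```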

Dear Jadav, thank you for the answer.
Just wanted to clarify:
In my picture the signs were not random. The first sign in the triple corresponds to whether the point is above the first line or not, etc.
The combination ++- is really impossible in this case, since there can be no points which are above the first two lines (red & green) and simultaneously below the third (blue) one.

Hi @jyadav202

You’re a bit wrong, please see my link above.

In particular, in 2D space we can divide it with 3 lines into 7 regions/buckets at most (it could be 6 if all the lines cross at the same point, or even fewer in extreme cases when lines overlap).
In a 3D space we would divide the space with a 2D plane.
In a nD space we would divide with a (n-1)D hyperplane and the number of regions/buckets would definitely not be 2^n but less.

Cheers

@vasyl.delta @arvyzukai I would like to refrain myself from further discussion on this topic. But I hope we all can go with our own understanding of what signs signify and what buckets mean in the context of LSH. :grin:

1 Like

@jyadav202 signs signify on what side of the hyperplane the point is (either above or below). I think everyone agrees on that one?

Dear Jadav, thank you for the patience with my questions!

@arvyzukai the OP question was about:

to which I explained that it's 2^n, but @vasyl.delta used an example of 3 lines in the plane and concluded that the number of buckets should be < 2^n. What I wanted to explain in my reply was that this explanation is not LSH in its totality, and the buckets are not just those the figure shows (7, as everyone counts). To be called LSH, we need to repeat this bisecting process multiple times, and then (and only then) we can see the total number of buckets to be 2^n. I saw that the discussion was turning into pure Cartesian geometry, from which I wanted to digress. (I accept that I am still learning how to convey my answers more clearly.)
To the best of my knowledge, the signs here do signify the side on which an N-dim vector lies relative to an (N+1)-dim hyperplane in an (N+1)-dim hyperspace, but I was using the terms ‘sign’, ‘line’, ‘point’ and ‘buckets’ with caution.

@vasyl.delta I would refer you to this video for more insights. Pay attention to the computation-cost calculation, where they calculate the total points in a bucket and divide that figure by the total number of buckets, 2^K. I wanted to find better reference material for the explanation of buckets using LSH to perform semantic similarity in vector space, but couldn't find one. The OG LSH paper is too mathematical and doesn't address this in a straightforward manner (link).

@jyadav202 thank you for not ignoring my question (since you wanted to refrain from further discussion and I kind of dragged you back into it). To explain my ignorance: the best case scenario of this discussion for me would be showing me where I’m wrong, that way I could learn something new. Second best scenario is where my understanding would be confirmed and the worst scenario is where I would be left with my wrong understanding :slight_smile:

So, if you don't mind, first thing: the OP question is very much in discussion. I claim that the number of real buckets you get when dividing space is not 2^n but less. The number of buckets you can create with such a hash function is 2^n, but there will be empty buckets (in the OP's case there would be no real ++- points, unless the space were cut in different places and the blue line were “higher”).
P.S. In practice this should not be a problem, since we usually have more dimensions than hyperplanes; when the number of hyperplanes is less than or equal to the number of dimensions of the hyperspace, the formula 2^n is perfectly fine.
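A quick check of that last point (a sketch of mine, using three orthogonal planes in 3D for simplicity): when the number of planes does not exceed the dimension, every sign pattern is realizable.

```python
import numpy as np

rng = np.random.default_rng(3)
# Three orthogonal planes through the origin in 3D: the coordinate planes.
normals = np.eye(3)

points = rng.normal(size=(10_000, 3))
patterns = {tuple(row) for row in (points @ normals.T >= 0).astype(int)}

# With 3 planes in 3 dimensions, all 2**3 = 8 octants are non-empty.
assert len(patterns) == 8
```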

I do not claim that the number of buckets is all there is to LSH, but it is an important part of how we reduce the search space.

Second thing - I’m not sure I understand what, in your view, is wrong with Cartesian geometry? Maybe that is related to your main objection (complex numbers or sparsity in high dimensional space or maybe something else)?

I'm not sure if this is a typo, but in my understanding the hyperplane is one dimension lower than the hyperspace; otherwise we could not divide the hyperspace into 2 subspaces (it should have been an n-dimensional hyperplane in an (n+1)-dimensional hyperspace). In other words, we can divide a line with a point (a 1-dimensional space with a 0-dimensional point), a plane with a line (a 2-dimensional space with a 1-dimensional line), a volume with a plane (a 3-dimensional space with a 2-dimensional plane), a hyperspace with a hyperplane (a 4-dimensional space with a 3-dimensional hyperplane), and so on.

Or are you saying that it is not the typo?

Looking forward to hearing from you.

Nothing wrong with it. As long as we understand the bigger picture of what LSH is.

Yes, you are right about the relative dimensions of the hyperspace and the hyperplane. But I used (N+1) dimensions intentionally, to follow what Younes said about how we use another n-dim plane to divide an n-dim space. But as I said, as long as we understand the intentions behind why it's done that way, all is well!