I am trying to understand the curse of dimensionality from this lecture: https://www.youtube.com/watch?v=BbYV8UfMJSA (the professor starts talking about it from the 2nd minute of the video)
And I didn’t really get what the professor means by the points being uniformly distributed in the cube. I thought a uniform distribution means that every outcome is equally likely, which in this context would mean every point is equally likely to be picked. But I don’t see how that fits this explanation, or how it would make the volume of the cube containing the k nearest neighbors of a point equal to k/n. Or does he maybe mean that the points in the cube are evenly spaced?
Unfortunately I don’t have the time to watch the whole video, but I did at least check part of it. I think what he is saying… or, as I like to see it: think of an amazing yet complicated swarm of flocking birds, say starlings (or any number of other animals you could pick). Individual behavior is amazingly complex, but there is at least some ‘sense’ of a center to the flock.
But not all data is like this… And like I said, unfortunately I have no time to watch the whole video, but an important question I’ve come to realize matters a lot is: are we even asking the right question?
All the data in the world, even with DL models, is not going to magically ‘solve’ your problem if your approach is wrong. Personally, I know the trend today has kind of been ‘throw everything at the wall until something sticks’. I disagree with that. General models are pretty impressive, but we still choose the data we use.
At least, thus far in my studies, I think this is a really important fact that has been under-considered.
So… this is at least my very quick interpretation of what he is saying.
I believe he is talking about the distribution of points in a d-dimensional space, not a probability distribution. He is focused on large dimensions and points out the non-intuitive insight that, for large d, the edge length l approaches 1, meaning the nearest neighbors are not very near at all, which is different from what you get from doing the analysis in low-dimensional space. Clearly it’s hard for our brains to grok what “near” means in 1,000-dimensional space.
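One way to see where l comes from: if the points are uniformly distributed, the expected fraction of the n points that falls inside any sub-cube equals that sub-cube’s volume, so a sub-cube expected to contain the k nearest neighbors has volume k/n and therefore edge length l = (k/n)^(1/d). Here is a quick sketch of those numbers (my own illustration, not from the video):

```python
# Expected edge length of the sub-cube holding the k nearest neighbors,
# assuming n points uniformly distributed in the unit cube: l^d = k/n.
def edge_length(k, n, d):
    return (k / n) ** (1 / d)

n, k = 1_000_000, 10
for d in (1, 2, 10, 100, 1000):
    print(f"d = {d:4d}   l = {edge_length(k, n, d):.5f}")
# d = 1 gives l = 0.00001, but d = 1000 gives l ~ 0.989: the "neighborhood"
# of the 10 nearest neighbors spans almost the entire unit cube.
```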
So what he means by uniformly distributed is that the points are evenly spaced? I guess that would justify why the volume of the cube containing the k nearest neighbors of a point would be k/n (as n approaches infinity I guess)
Also, do you know why cosine similarity works with embeddings even though they tend to be high-dimensional (maybe not that high; most of them don’t exceed 784 dimensions or so)? I’ve seen some people say that it is not sensitive to high dimensionality, but I’m not sure whether that’s correct and, if so, why. It is still a dot product between the two vectors, so in my opinion it should be sensitive to the number of dimensions.
That’s my understanding, yes. Even though that starting assumption leads to the weird conclusion that in the higher-dimensional examples the unit cube is mostly empty, and both the nearest neighbors and the non-nearest neighbors are all very close to the edge of the unit cube (which means it is very sparse). Apparently “edge” means something different when d is very large. Or maybe it’s the chalk drawing of the unit cube that creates the problem. I think of it a little like non-Euclidean geometry: just because it isn’t what our brains are good at visualizing doesn’t mean it’s not mathematically consistent given its postulates.
As discussed in the linked video and the section of the paper I linked above, in high dimensions all points have approximately the same Euclidean distance from one another, so that measure doesn’t contain usable information. By comparing directions (a relative measure) instead of absolute distances, cosine similarity still provides information useful for comparing two sparse high-dimensional vectors.
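A toy illustration of the relative-vs-absolute point (my own made-up word-count vectors, not from the video or paper): cosine similarity only looks at the direction of the vectors, so scaling one up, e.g. a document that repeats the same vocabulary ten times over, leaves it unchanged, while the Euclidean distance blows up.

```python
import numpy as np

# Toy bag-of-words counts over an 8-word vocabulary (made-up numbers).
doc_a = np.array([3, 0, 1, 0, 2, 0, 0, 1], dtype=float)
doc_b = 10 * doc_a                                         # same content, 10x longer
doc_c = np.array([0, 4, 0, 2, 0, 1, 3, 0], dtype=float)    # completely different vocabulary

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.linalg.norm(doc_a - doc_b))   # ~34.9: Euclidean distance says "far apart"
print(cosine(doc_a, doc_b))            # 1.0: same direction, so "identical"
print(cosine(doc_a, doc_c))            # 0.0: no overlap in vocabulary
```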
Thank you for the explanation and the article! Very useful!
I was going to ask why we assume the data is uniformly distributed, i.e., would the same conclusion hold if the data were more dense or more sparse? But I guess it does; the effect would simply appear more slowly or more quickly as a function of the number of dimensions.
As for the cosine distance, I asked about it because of this lecture, starting from this timestamp: NLPC3: Curse of Dimensionality [29:32] (no need to watch the video, I would say; the slide is what I wanted to share).
But I think that would be true when the features are noisy, as stated in the lecture. So when he says “while some claim ‘cosine is better in high dimensionality’, this is false”, I think that is true only if the data is noisy, which might be very common; I don’t really know, but I think it is the case. However, LLMs can handle typos, I believe, and are trained on lots of data, so maybe they are less prone to noise?
And I guess that, even though the text data used to train the LLM and thus generate the embeddings is very high-dimensional, it would still live in a lower-dimensional space than the original one, maybe? Because I would say not all possible combinations of words make sense. And if you try to use cosine similarity on an incoherent or garbage sentence, then the result would be garbage as well.
Sorry if I’m just spouting nonsense here; I still haven’t learned about LLMs and the math behind them yet, so I’m just going on intuition and what seems natural to me. And sorry, I keep throwing videos and lectures at you out of nowhere.
The uniform distribution was just an assumption to demonstrate a point about the counterintuitive behavior of distances in high dimensions. He then goes on (starting at about 11:20) to talk about why K nearest neighbors can actually work with “real” data, e.g. images, in which case the data are not uniformly distributed but form lower-dimensional manifolds embedded in the higher-dimensional space.
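If it helps to see that manifold point in numbers, here is a rough numpy sketch (my own toy construction, not from the lecture): for points drawn uniformly from a 1000-dimensional cube, each point’s nearest neighbor is almost as far away as its farthest one, but if the points lie on a 1-dimensional curve embedded in the same 1000-dimensional space, “near” becomes meaningful again.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 1000

def nearest_over_farthest(X):
    """Median ratio of each point's nearest- to farthest-neighbor distance."""
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    np.fill_diagonal(D, np.inf)
    nearest = D.min(axis=1)
    np.fill_diagonal(D, 0.0)
    farthest = D.max(axis=1)
    return np.median(nearest / farthest)

# Uniform in the 1000-D unit cube: ratio is close to 1 (distances concentrate).
print(nearest_over_farthest(rng.random((n, d))))

# Same ambient dimension, but the data lie on a 1-D curve embedded in it:
# the ratio is far smaller, so the nearest neighbors are genuinely near.
t = rng.random(n)
curve = np.column_stack([t, np.sin(6 * t), t ** 2]) @ rng.standard_normal((3, d))
print(nearest_over_farthest(curve))
```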
You should take DLS Course 5 and learn about word embeddings and Attention Models. The point is that embeddings are not used for full sentences: they are used for individual words. You train embedding models separately and then use them as part of the “encoding” phase of an attention model.
I have only gotten to about 17 minutes so far and haven’t yet had a chance to hear what he says about cosine similarities. It’s fascinating material. Thanks for the link. I hope to finish the rest of it tomorrow.
I do see that, similar to the way the difference between Euclidean distances approaches zero, in extremely high dimensions the angle between two vectors approaches 90 degrees, so the cosine similarity also fails to be informative.
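Here is a quick numerical check of that (my own simulation, not from the lecture), using pairs of independent random Gaussian vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

for d in (2, 10, 100, 10_000):
    sims = [cosine(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(2000)]
    print(f"d = {d:6d}   mean |cos| = {np.mean(np.abs(sims)):.3f}")
# As d grows, |cos| shrinks toward 0 (angles crowd around 90 degrees), so cosine
# similarity between unrelated random vectors stops carrying much information.
```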
My sense, though, is that while this might be provably true mathematically, maybe it isn’t as significant practically. OK, maybe you can’t directly compute the similarity of Shakespeare’s works (~900K words) with Proust’s In Search of Lost Time (~1.2M words), but there are applications where cosine similarity (or distance) does seem quite suitable and useful.
I concur with @paulinpaloalto that lower-dimensional spaces are where this is most true, and that manifolds are an important related concept. I never finished my topology course, but maybe some of the more math-literate readers of the thread will opine.
You can find more by searching on manifolds and machine learning.
I haven’t gotten to the part where he discusses cosine similarities, but maybe the point about the angles approaching 90 degrees in higher dimensions is analogous to the point about the volumes of the k/n cubes approaching the volume of the containing cube, which was the point of the first 10 minutes of the lecture. Meaning that it applies to the “general” case analogous to uniformly distributed data, e.g. two random vectors in whatever the dimension is. But we’re not really dealing with uniformly distributed data: since by definition there are no patterns in it, ML is going to be useless on it.
I’d have to go back and look at the various word embedding models that were covered and used in DLS C5, but I remember the cosine similarities worked just fine. But maybe the GloVe model with 50 dimensions is not considered “high dimensionality” in the sense the Prof is talking about here. Looking forward to hearing more of what he says.
I used to see this terminology in statistics. Multiple parameters/variables typically introduce a lot of noise in estimation. It is very hard to obtain good estimation results when the number of parameters is huge. Reducing the number of parameters may contribute to accuracy, but you need to know which ones are safe to drop and which are important to keep.
There are mathematical ways to do that. In the lecture the Professor refers to PCA (Principal Component Analysis) and SVD (Singular Value Decomposition) at one point as ways to reduce dimensions, although he doesn’t actually go down that path. SVD and PCA are covered in the original Stanford Machine Learning course by Prof Ng.
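For anyone who wants to see the mechanics, here is a minimal PCA-via-SVD sketch (my own illustration, not code from either course): center the data, take the SVD, and keep the coordinates along the top-k right singular vectors.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n samples x d features) onto its top-k principal directions."""
    X_centered = X - X.mean(axis=0)              # PCA needs mean-centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                 # new coordinates, shape (n, k)

rng = np.random.default_rng(0)
# Synthetic data that is essentially 3-dimensional but recorded in 100 features.
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 100))
X += 0.01 * rng.standard_normal(X.shape)         # a little measurement noise
Z = pca_reduce(X, k=3)
print(Z.shape)                                   # (500, 3): almost no variance is lost
```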
My introduction to this term was mathematical finance, discussed in the context of asset allocation, which depends on a covariance matrix that grows as the square of the number of assets, potentially becoming larger than the number of historical pricing data points. This leads to the use of factor analysis to reduce dimensionality. See for example in
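To put rough, made-up numbers on why that matters: with N assets the covariance matrix has N(N+1)/2 distinct entries to estimate, which quickly outgrows the length of the available price history.

```python
# Back-of-the-envelope counts (illustrative numbers only).
for n_assets in (50, 500):
    cov_params = n_assets * (n_assets + 1) // 2   # distinct covariance entries
    daily_obs = 10 * 252                          # ~10 years of daily returns
    print(f"{n_assets} assets: {cov_params} covariance parameters vs {daily_obs} return dates")
# 500 assets already means 125,250 parameters from ~2,520 dates, which is why
# factor models are used to shrink the effective dimensionality.
```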