Embedding matrix: connecting the method in Deep Learning Specialization Course 5 and the short course "Understanding and applying text embeddings"

Hi, two years ago I finished the Deep Learning Specialization. In Course 5, Andrew explains how an embedding matrix is created while training an NLP model. In short, each word is matched against a set of categories. I don’t remember exactly now, but say “fruit”, “person”, “vehicle” and so on. So the word “apple” would get a high value for “fruit”, but not so much for the other categories. Andrew also says that he uses these categories only to be able to explain the idea, and that in fact the dimensions do not make any sense to humans, as the embedding matrix is actually learned during training.
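(Just to make that intuition concrete for anyone reading along, here is a tiny toy example in Python. The category names and numbers are invented purely for illustration and are not from the course.)

```python
import numpy as np

# Toy embedding matrix with made-up, human-readable "categories".
# In a real model the columns are learned during training and have
# no interpretable meaning like this.
words = ["apple", "driver", "truck"]
#                fruit  person  vehicle   <- hypothetical categories
E = np.array([[0.95,  0.02,   0.01],    # apple
              [0.05,  0.90,   0.10],    # driver
              [0.02,  0.05,   0.92]])   # truck

# Looking up a word's embedding is just selecting its row.
apple_vec = E[words.index("apple")]
print(apple_vec)  # -> [0.95 0.02 0.01]
```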

In the short course “Understanding and applying text embeddings”, Andrew mentions a few different techniques for creating text embeddings. In that course, the dimension of the embedding vectors is 768.

Now my question: would that mean that the embedding matrix in the Deep Learning Specialization would have 768 “categories”?

I am just trying to connect what is being taught in the Specialization and the short course. If nothing else, thanks to DeepLearning.AI and Andrew Ng for all the awesome courses.

There aren’t actually any categories. An embedding matrix just associates words with other words that tend to be found nearby in a chunk of text.

@TMosh Right, but can an embedding vector of dimension 768 be connected to how this is explained in the Deep Learning Specialization? I.e., would the matrix Andrew uses to explain the embedding matrix in the Specialization have 768 rows?

Andrew uses a lot of intuitive explanations to help convey general concepts.
They aren’t necessarily accurate in all the details.

Since it’s been 2 years since you took DLS C5, it might be worth going back and re-listening to the Week 2 lectures about word embeddings. Prof Ng discusses in some detail both how to train word embedding models and how to use them. You can listen to the lectures in “audit” mode, so it doesn’t cost you anything.

The point is that the dimension of the word embeddings is a “hyperparameter”, meaning a choice you make as the system designer. So it’s basically arbitrary. And as both you and Tom have mentioned, the meaning of the individual dimensions in the learned embeddings is not really accessible, in the sense that you can’t say what they really “mean”. They are just learned by the algorithm, and they either work or they don’t when you then use them as part of the input to the training process for some other model, e.g. Attention or Transformers.

Maybe you could deduce some things by comparing the embeddings of particular words. E.g. if you compared the embedding values of “airplane”, “sailplane”, “pigeon” and “eagle”, maybe you could deduce that some combination of elements in the embedding signifies “degree of wingedness” or “degree of ability to fly”. One common technique is to use “cosine similarity” to find clusters of words that are close in meaning. Then you could examine the vectors of the various words to see which elements are relatively large in value and try to reason from there. But there is no guarantee you could discern anything meaningful from doing that.
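Here’s a minimal sketch of the cosine similarity idea, using made-up 4-dimensional vectors standing in for real 300- or 768-dimensional learned embeddings, purely to show the mechanics:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented low-dimensional "embeddings" purely for illustration.
embeddings = {
    "airplane":  np.array([0.9, 0.8, 0.1, 0.0]),
    "sailplane": np.array([0.8, 0.9, 0.2, 0.1]),
    "pigeon":    np.array([0.2, 0.7, 0.9, 0.3]),
    "banana":    np.array([0.0, 0.1, 0.2, 0.9]),
}

for word in ["sailplane", "pigeon", "banana"]:
    sim = cosine_similarity(embeddings["airplane"], embeddings[word])
    print(f"similarity(airplane, {word}) = {sim:.3f}")
```

With these toy numbers, “sailplane” comes out closest to “airplane” and “banana” furthest away, which is the kind of clustering you would hope to see with real learned embeddings.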

There are a number of pretrained embedding models, and some of them even come in different dimensions, which may be appropriate for different applications. E.g. you can find GloVe models with 100 dimensions and also with 300 dimensions, meaning there may be applications in which 100 dimensions are “good enough”, which saves you compute and memory when training your higher-level model based on the embeddings. Word2Vec uses 300, I think. I’ve seen references to other models having 1000 dimensions.
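If you want to poke at one of these pretrained models yourself, something along these lines should work with the gensim library (the model name refers to the 100-dimensional GloVe vectors available through gensim’s downloader; treat this as a sketch rather than a recipe):

```python
# pip install gensim
import gensim.downloader as api

# Download and load 100-dimensional GloVe vectors (Wikipedia + Gigaword).
glove = api.load("glove-wiki-gigaword-100")

vec = glove["airplane"]
print(vec.shape)  # (100,)

# Nearest neighbors by cosine similarity.
print(glove.most_similar("airplane", topn=5))
```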
