Have I missed the doc2vec topic?

Hi! Have we learned how to build vectors from whole documents, not only from words?

Other than building them from one-hot vectors, I mean.

Hi someone555777,

The basis for this is discussed in the first course of the specialization.

Can you point me to a specific video or lab? I understand tokenization, but I don’t remember learning how to build a compact vector for a long text specifically from word embeddings.

Creating document vectors from word embeddings is a more specific issue. I do not remember it being discussed in depth, and I am not sure a best practice has been established for it. A bit of googling should show which methods exist and which could be useful to you. It may be a subject worth adding to the course.
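
To make the idea concrete, one simple baseline that often comes up is to average the embeddings of the words in a document. The sketch below is only an illustration of that idea, not something taken from the course: the `EMBEDDINGS` table holds made-up 3-dimensional values, and `document_vector` and `cosine_similarity` are hypothetical helper names.

```python
import math

# Toy 3-dimensional word embeddings (made-up values, for illustration only).
# In practice these would come from a trained word embedding model.
EMBEDDINGS = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "rules": [0.5, 0.5, 0.0],
    "apple": [0.0, 0.1, 0.9],
    "pie":   [0.1, 0.0, 0.8],
}


def document_vector(text, embeddings=EMBEDDINGS):
    """Average the embeddings of the known words in `text` (mean pooling)."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in document")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_a = document_vector("king rules")
doc_b = document_vector("queen rules")
doc_c = document_vector("apple pie")

sim_ab = cosine_similarity(doc_a, doc_b)  # documents about similar things
sim_ac = cosine_similarity(doc_a, doc_c)  # documents about different things
print(sim_ab, sim_ac)
```

With these toy values the two "royalty" documents end up much closer to each other than to the "apple pie" one. Averaging ignores word order, so treat it as a starting point to compare other methods against.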

Hi @someone555777

If I remember correctly, the simple approach explained there was to represent documents as word counts. For example, for Shakespeare plays:
image

Then you can compare the documents using PCA or, even more simply, by looking at just two words:
image

* The images are from the excellent book Speech and Language Processing by Dan Jurafsky and James H. Martin (Chapter 6).
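
The word-count idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the three mini "documents" are made up (they are not the real plays or the counts from the book), the four-word vocabulary is arbitrary, and `count_vector` and `cosine_similarity` are hypothetical helper names.

```python
from collections import Counter
import math

# Made-up miniature documents, standing in for full texts.
documents = {
    "comedy_a": "the fool has wit and the fool is good",
    "comedy_b": "a good fool with good wit",
    "history_a": "the battle was a good battle after the battle",
}

# Arbitrary small vocabulary; each document becomes one count per word.
vocabulary = ["battle", "good", "fool", "wit"]


def count_vector(text, vocab):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = {name: count_vector(text, vocabulary) for name, text in documents.items()}

# Compare documents using the full count vectors.
sim_comedies = cosine_similarity(vectors["comedy_a"], vectors["comedy_b"])
sim_mixed = cosine_similarity(vectors["comedy_a"], vectors["history_a"])

# The "just two words" view: keep only the (battle, fool) coordinates.
two_word_view = {name: (v[0], v[2]) for name, v in vectors.items()}
print(vectors, sim_comedies, sim_mixed, two_word_view)
```

In this toy setup the two comedies share most of their counted words, so their cosine similarity is higher than the similarity between a comedy and the history text.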

Nowadays, more sophisticated approaches represent text with language models. There is a new free short course, Understanding and Applying Text Embeddings with Vertex AI, which I guess should explain a lot of this (I haven’t taken it yet, but I think it covers that).

Cheers