Have I missed doc2vec topic?

someone555777 · August 2, 2023, 7:13pm

Hi! Have we learned how to make vectors from whole documents, not only from words?

Other than building them from one-hot vectors, that is.

reinoudbosch · September 5, 2023, 4:34pm

Hi someone555777,

The basis for this is discussed in the first course of the specialization.

someone555777 · September 6, 2023, 4:11pm

Could you point me to an exact video or lab, maybe? I understand tokenization, but I don’t remember us learning how to build a compact vector for a large text specifically from word embeddings.

reinoudbosch · September 6, 2023, 4:47pm

Creating document vectors from word embeddings is a more specific issue. I do not remember this being discussed in depth, and I am not sure whether a best practice has been established for it. You can do a bit of googling to see which methods exist and which could be useful to you. It may be a subject to add to the course.
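For what it's worth, one simple baseline that comes up often is to average the word vectors of all words in the document (mean pooling). Below is a minimal sketch of that idea, not something taken from the course. The 3-dimensional `embeddings` table is made up for illustration, and `document_vector` is a hypothetical helper name; real embeddings would come from a trained model.

```python
# Toy embeddings, invented for illustration only (assumption).
embeddings = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.3],
}


def document_vector(text, embeddings, dim):
    """Average the embeddings of all known words in the text."""
    # Naive whitespace tokenization; words without an embedding are skipped.
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector.
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


doc_vec = document_vector("King and queen", embeddings, dim=3)
print(doc_vec)
```

Averaging ignores word order and weights every word equally, so it is only a starting point.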

arvyzukai · September 7, 2023, 6:05am

Hi @someone555777

If I remember correctly, the simple approach explained in the course was to represent each document by its word counts. For example, with Shakespeare plays:

Then you can compare the documents using PCA or, even simpler, using the counts of just two words:

* The images are from this excellent book - Speech and Language Processing by Dan Jurafsky and James H. Martin (Chapter 6).
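The word-count idea above can be sketched roughly as follows. This is only an illustration: the three short texts are made up (they are not the book's data), and `count_vector` and `cosine_similarity` are hypothetical helper names. Each document becomes a vector of counts over a two-word vocabulary, and the documents are compared with cosine similarity.

```python
from collections import Counter
import math


def count_vector(text, vocabulary):
    """Represent a text as the counts of each vocabulary word."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy documents (assumption), standing in for whole plays.
documents = {
    "comedy": "the fool has wit and the fool is good",
    "history": "the battle was good but the battle was long",
    "tragedy": "the battle ended and the soldier fell in battle",
}

# Compare documents using the counts of just two words.
vocabulary = ["battle", "fool"]
vectors = {name: count_vector(text, vocabulary) for name, text in documents.items()}

for name, vec in vectors.items():
    print(name, vec)
print(cosine_similarity(vectors["history"], vectors["tragedy"]))
print(cosine_similarity(vectors["comedy"], vectors["history"]))
```

With a larger vocabulary the same vectors could be passed to PCA for plotting instead of picking two words by hand.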

Nowadays, more sophisticated approaches use language models to represent text. There is a new free short course, Understanding and Applying Text Embeddings with Vertex AI, that I guess should explain a lot of this (I haven’t taken it yet, but I think it should be about that).

Cheers
