Hi, so I understand that the network trains an embedding layer which represents each word in an n-dimensional space.

My question is about the Mean layer that follows. My understanding is that it takes the mean along each dimension of the embeddings, reducing a tweet to a single vector representation.

Does this essentially create an “embedding” for the tweet as a whole? If not, how can I intuitively understand what the mean operation accomplishes? I am not sure if “embedding” is the right word here, but I mean some sort of latent-space representation.

If the mean layer does create an “embedding” for the whole tweet, then I can similarly represent tweets in a PCA plot (or just a straightforward plot, since in the assignment the embedding layer consists of only 2 dimensions). How do I interpret this representation of a tweet compared to the output of the network?

Sorry if this is a confusing question. If needed, I can try to share plots to clarify my question.

Does this essentially create then an “embedding” for the tweet as a whole?

Yes, but in the Course 3 Week 1 assignment the model still does not account for word order (it is similar to the CBOW approach, where a word’s position in the sentence does not influence the prediction). For example (you can try this yourself in the notebook):
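As a minimal illustration (a NumPy sketch with made-up 2-D word vectors, not the assignment’s trained embeddings), the mean of the word embeddings is identical for any permutation of the tweet, so the model literally cannot see word order:

```python
import numpy as np

# Hypothetical 2-D embeddings for a toy vocabulary (not the trained ones)
emb = {
    "i":    np.array([0.1, 0.3]),
    "love": np.array([0.9, 0.2]),
    "rain": np.array([-0.4, 0.5]),
}

def tweet_embedding(words):
    # Mean over the word axis -> one vector per tweet
    return np.mean([emb[w] for w in words], axis=0)

a = tweet_embedding(["i", "love", "rain"])
b = tweet_embedding(["rain", "love", "i"])  # same words, different order
print(np.allclose(a, b))  # True: the Mean layer is order-invariant
```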

If not, how can I intuitively understand what the mean operation accomplishes?

It does create an “embedding” for the tweet, but it is just a simple average over all embedding dimensions. Intuitively, you could imagine the oversimplified analogy of the first dimension being “bright vs. dark”, the second “light vs. heavy”, the third “near vs. far”, and so on. Each word is assigned a value along each embedding dimension, and the tweet embedding is just the average of all its words along these dimensions. Further down the network, the model’s Dense layer weighs (or accounts for) each of these dimensions and outputs two values (positive vs. negative).
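The whole pipeline can be sketched in a few lines of NumPy (toy sizes and random weights, not the assignment’s actual trax model, which uses a 9088 × 256 embedding):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim = 10, 4                        # toy sizes for illustration
embedding = rng.normal(size=(vocab_size, emb_dim)) # Embedding layer weights
W = rng.normal(size=(emb_dim, 2))                  # Dense weights (pos vs. neg)
b = np.zeros(2)

token_ids = np.array([3, 7, 1])                    # a "tweet" of three token ids
tweet_vec = embedding[token_ids].mean(axis=0)      # Mean layer: one vector per tweet
logits = tweet_vec @ W + b                         # Dense layer weighs each dimension
print(logits.shape)  # (2,)
```

So the “tweet embedding” is simply the row-wise lookup followed by a mean, and everything after it only sees that single vector.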

If the mean layer does create an “embedding” for the whole tweet, then I can similarly represent tweets in a PCA plot.

Yes, you can.

(or just a straightforward plot since in the assignment the embedded layer consists of only 2 dimensions).

Actually, only the temporary variables in the assignment have dimension 2; the trained model uses 256 (Embedding_9088_256, meaning each of the 9088 vocabulary words is assigned a 256-dimensional vector).

How do I interpret this representation of a tweet compared to the output of the network.

It’s tricky… You can use it to compare different words (or averaged tweet embeddings), but the results are not straightforward and have many intricacies: they depend on the dataset and pre-processing, and even with the same dataset, on weight initialization, random mini-batch assignment, etc. As a result you can get different graphs from run to run. So, in other words, interpretation is tricky.
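If you do want to plot the 256-dimensional tweet embeddings, here is one way to project them to 2-D with PCA (a NumPy-only sketch using random stand-in data, since the actual embeddings come from your trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for tweet embeddings: e.g. 100 tweets x 256 dimensions
X = rng.normal(size=(100, 256))

# PCA via SVD: center the data, then project onto the top-2 principal components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T  # 2-D coordinates, one point per tweet, ready to scatter-plot
print(X2.shape)  # (100, 2)
```

Coloring each point by the model’s predicted sentiment is a common way to eyeball whether the projection separates the two classes.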

For the last point, could mapping the embeddings of entire tweets give insight into why a tweet has a positive or negative sentiment? For example, with respect to tweets that have negative sentiment, could it cluster them by subject (like sports vs. news) or emotion (anger vs. sadness)?

The way it is implemented this week, that’s unlikely (the tweet embedding is just the mean of its individual word embeddings). Even the individual word embeddings are unlikely to reveal any interesting insights, because this week they are not context dependent. (You could find obvious correlations, like “:(” correlating with negative sentiment, but “sports vs. news” is unlikely.) Classical techniques (like k-means) would be a much easier way to look for such groupings.
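For reference, a bare-bones k-means (Lloyd’s algorithm) over tweet embeddings looks like this — a NumPy sketch on synthetic stand-in data, not the assignment’s code:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in tweet embeddings: two loose blobs in 256-D
X = np.vstack([rng.normal(0.0, 1.0, (50, 256)),
               rng.normal(3.0, 1.0, (50, 256))])

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points
        new_centers = []
        for j in range(k):
            pts = X[labels == j]
            new_centers.append(pts.mean(axis=0) if len(pts) else centers[j])
        centers = np.array(new_centers)
    return labels

labels = kmeans(X, k=2)
print(labels.shape)  # (100,): one cluster label per tweet
```

You could then inspect the tweets within each cluster to see whether any human-interpretable theme emerges.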

On the other hand, later in the course more sophisticated embeddings are introduced (like BERT). Running analysis on them (which is an entire research field on its own and is definitely not easy) could reveal interesting correlations or groups of tweets that offer insight into why a tweet is classified one way or another.

Thanks again for your insight. Could you point me to a paper or lab as an example of using transformers to embed and cluster entire texts? I appreciate your time helping me.