Hi, so I understand that the network trains an embedding layer which represents each word in an n-dimensional space.

My question is about the Mean layer that follows. My understanding is that it takes the mean along each dimension of the embeddings, reducing a tweet to a single vector representation.

Does this essentially create an “embedding” for the tweet as a whole? If not, how can I intuitively understand what the mean operation accomplishes? I am not sure if “embedding” is the right word here, but I mean some sort of latent-space representation.

If the mean layer does create an “embedding” for the whole tweet, then I can similarly represent tweets in a PCA plot (or just a straightforward plot, since in the assignment the embedding layer consists of only 2 dimensions). How do I interpret this representation of a tweet compared to the output of the network?

Sorry if this is a confusing question. If needed, I can try to share plots to clarify my question.

Does this essentially create then an “embedding” for the tweet as a whole?

Yes, but in the Course 3 Week 1 assignment the model still does not account for word order (it is similar to the CBOW approach, where a word’s position in the sentence does not influence the prediction). For example (you can try this yourself in the notebook):
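As a minimal illustration (a NumPy sketch with made-up 2-D word vectors, not the assignment’s trained embeddings), the mean of the word embeddings is identical for any permutation of the tweet, so the model literally cannot see word order:

```python
import numpy as np

# Hypothetical 2-D embeddings for a toy vocabulary (not the trained ones)
emb = {
    "i":    np.array([0.1, 0.3]),
    "love": np.array([0.9, 0.2]),
    "rain": np.array([-0.4, 0.5]),
}

def tweet_embedding(words):
    # Mean over the word axis -> one vector per tweet
    return np.mean([emb[w] for w in words], axis=0)

a = tweet_embedding(["i", "love", "rain"])
b = tweet_embedding(["rain", "love", "i"])  # same words, different order
print(np.allclose(a, b))  # True: the Mean layer is order-invariant
```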

If not, how can I intuitively understand what the mean operation accomplishes?

It does create an “embedding” for the tweet, but it is just a simple average over all embedding dimensions. Intuitively, you could imagine the oversimplified analogy of the first dimension being “bright vs. dark”, the second “light vs. heavy”, the third “near vs. far”, and so on. Each word is assigned a value along each embedding dimension, and the tweet embedding is just the average of all its words along these dimensions. Further down the network, the model’s Dense layer weighs (or accounts for) each of these dimensions and outputs two values (positive vs. negative).
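The whole pipeline can be sketched in a few lines of NumPy (toy sizes and random weights, not the assignment’s actual trax model, which uses a 9088 × 256 embedding):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim = 10, 4                        # toy sizes for illustration
embedding = rng.normal(size=(vocab_size, emb_dim)) # Embedding layer weights
W = rng.normal(size=(emb_dim, 2))                  # Dense weights (pos vs. neg)
b = np.zeros(2)

token_ids = np.array([3, 7, 1])                    # a "tweet" of three token ids
tweet_vec = embedding[token_ids].mean(axis=0)      # Mean layer: one vector per tweet
logits = tweet_vec @ W + b                         # Dense layer weighs each dimension
print(logits.shape)  # (2,)
```

So the “tweet embedding” is simply the row-wise lookup followed by a mean, and everything after it only sees that single vector.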

If the mean layer does create an “embedding” for the whole tweet, then I can similarly represent tweets in a PCA plot.

Yes, you can.

(or just a straightforward plot since in the assignment the embedded layer consists of only 2 dimensions).

Actually, only the temporary variables in the assignment have dimension 2; the trained model uses 256 (Embedding_9088_256, meaning each of the 9088 vocabulary words is assigned a 256-dimensional vector).

How do I interpret this representation of a tweet compared to the output of the network.

It’s tricky… You can use it to compare different words (or averaged tweet embeddings), but the results are not straightforward and have many intricacies: they depend on the dataset and pre-processing, and even with the same dataset, on weight initialization, random mini-batch assignment, etc. As a result you can get different graphs from run to run. So, in other words, interpretation is tricky.
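If you do want to plot the 256-dimensional tweet embeddings, here is one way to project them to 2-D with PCA (a NumPy-only sketch using random stand-in data, since the actual embeddings come from your trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for tweet embeddings: e.g. 100 tweets x 256 dimensions
X = rng.normal(size=(100, 256))

# PCA via SVD: center the data, then project onto the top-2 principal components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T  # 2-D coordinates, one point per tweet, ready to scatter-plot
print(X2.shape)  # (100, 2)
```

Coloring each point by the model’s predicted sentiment is a common way to eyeball whether the projection separates the two classes.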

For the last point, could mapping the embeddings of entire tweets give insight into why a tweet has a positive or negative sentiment? For example, with respect to tweets that have negative sentiment, could it cluster them by subject (like sports vs. news) or emotion (anger vs. sadness)?

The way it is implemented this week, that’s unlikely (the tweet embedding is just the mean of its individual word embeddings). Even the individual word embeddings are unlikely to reveal any interesting insights, because this week they are not context dependent. (You could find obvious correlations, like “:(” correlating with negative sentiment, but “sports vs. news” is unlikely.) Classical techniques (like k-means) would be a much easier way to look for such groupings.
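For reference, a bare-bones k-means (Lloyd’s algorithm) over tweet embeddings looks like this — a NumPy sketch on synthetic stand-in data, not the assignment’s code:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in tweet embeddings: two loose blobs in 256-D
X = np.vstack([rng.normal(0.0, 1.0, (50, 256)),
               rng.normal(3.0, 1.0, (50, 256))])

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points
        new_centers = []
        for j in range(k):
            pts = X[labels == j]
            new_centers.append(pts.mean(axis=0) if len(pts) else centers[j])
        centers = np.array(new_centers)
    return labels

labels = kmeans(X, k=2)
print(labels.shape)  # (100,): one cluster label per tweet
```

You could then inspect the tweets within each cluster to see whether any human-interpretable theme emerges.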

On the other hand, later in the course more sophisticated embeddings are introduced (like BERT). Running analysis on them (which is an entire research field on its own and is definitely not easy) could reveal interesting correlations or groups of tweets that offer insight into why a tweet is classified one way or another.

Thanks again for your insight. Could you point me to a paper or lab as an example of using transformers to embed and cluster entire texts? I appreciate your time helping me.