Features Vector

As the vocabulary is created using all the training tweets, every tweet in the training set has at least one word that definitely belongs in the vocabulary. So, what’s the point of “sparse representation”?


Could you be more specific about where “sparse representation” is mentioned? As far as I remember, the concept of sparse representation is not used in NLP Course 1 Week 1. In C1 W1 the tweets are represented by vectors of size 3 (which are not sparse).

In general, a sparse representation (of a word) is one where the vector representing the token (e.g., a word) is 0 almost everywhere. For example, one-hot encoding puts a 1 in the word’s position and 0s in every other word’s position (a concrete example representing some word: [0 0 0 … 0 0 1 0 0 … 0 0 0]). It’s just one of the ways to represent the “word”.
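To make that concrete, here is a minimal sketch of one-hot encoding over a toy vocabulary (the vocabulary and function name are illustrative, not from the course):

```python
# Toy vocabulary for illustration; a real one would have thousands of words.
vocab = ["happy", "sad", "learning", "love", "great"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's index and 0s everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("learning", vocab))  # [0, 0, 1, 0, 0]
```

With a 10,000-word vocabulary, 9,999 of those entries would be 0 — hence “sparse”.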


Actually I am referring to this particular part of the lecture.

As the video mentions, you have to encode the tweet somehow (that is, turn words into numbers). One way to do it is a sparse representation (not an efficient one, but it works).

It is not efficient because if you have a three-word tweet, the resulting vector would have three ones (in the positions of those words) and many 0s. For example, a tweet “I love learning” would be encoded as a vector [1, 1, 1, 0, 0 … 0] or [0 0 1 0 1 0 1 0 … ] (depending on the vocabulary), whose length is the size of the whole vocabulary (for example, 10,000).
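As a sketch, this is what that sparse encoding looks like over a toy vocabulary (the vocabulary below is made up for illustration; a real one would have ~10,000 entries):

```python
# Toy vocabulary; in practice this would be built from all training tweets.
vocab = ["i", "love", "learning", "nlp", "is", "fun"]

def encode_tweet(tweet, vocab):
    """Sparse encoding: 1 at the index of each word present in the tweet, 0 elsewhere."""
    words = set(tweet.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(encode_tweet("I love learning", vocab))  # [1, 1, 1, 0, 0, 0]
```

Note that the vector length is tied to the vocabulary size, not the tweet length, which is exactly why it scales poorly.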

Later in this week, another way is suggested, with a vector length of just 3 (a more efficient way: the bias, the combined count of “positive” appearances of the tweet’s words, and the combined count of “negative” appearances). This way the vector representing the tweet is reduced from a length of 10,000 + 1 to 2 + 1.
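A rough sketch of that 3-feature idea, assuming a frequency dictionary mapping (word, sentiment) pairs to counts (the `freqs` values below are made-up toy numbers, and `extract_features` is a hypothetical helper, not the course’s exact function):

```python
# Hypothetical toy frequency table: (word, sentiment) -> count in training tweets,
# where sentiment 1 = positive class, 0 = negative class.
freqs = {
    ("love", 1): 50, ("love", 0): 2,
    ("learning", 1): 30, ("learning", 0): 5,
    ("bad", 1): 1, ("bad", 0): 40,
}

def extract_features(tweet, freqs):
    """Return [bias, total positive count, total negative count] for a tweet."""
    words = tweet.lower().split()
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]  # 1 is the bias term

print(extract_features("I love learning", freqs))  # [1, 80, 7]
```

Whatever the vocabulary size, every tweet is now summarized by just three numbers.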

To add to @arvyzukai clear explanation, I would like to share the way I understood this:

I imagined that for each word I would have an on/off switch. If my vocab has 10 words, I would line up 10 “on”/“off” switches, all initialized to “off”. If I get a tweet with 3 words, I would flip 3 of the switches to “on”, and the other 7 would remain “off”.

If we add tweets and this grows our vocab to, say, 100 words, then I would have 100 switches. If I get another tweet with 3 words, I would flip 3 switches to “on”, and now 97 switches would be “off”.

See how, as the vocab increases, more of the switches in our line stay “off”? In other words, the vector of these hypothetical switches becomes more sparse.
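The switch picture above can be put in numbers with a tiny sketch (a toy calculation, not course code):

```python
def sparsity(vocab_size, words_in_tweet=3):
    """Fraction of 'off' switches (zeros) for a tweet of a given length."""
    off_switches = vocab_size - words_in_tweet
    return off_switches / vocab_size

print(sparsity(10))   # 0.7  -> 7 of 10 switches stay off
print(sparsity(100))  # 0.97 -> 97 of 100 switches stay off
```

At a realistic vocab size of 10,000, the vector is 99.97% zeros.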

Hope this helps!


@Juan_Olano Your explanation is way better :+1:

That is a very elegant and easy way to visualize the problem. I will certainly remember it and “re-use” :slight_smile: it in the future.


I’m glad I could help!