General Question about Vector Space

My understanding is that "vector space" refers to the word_embeddings that house all of the word vector representations. When those vectors were determined, was there preprocessing? Beyond just cleaning miscellaneous character symbols, was there stemming or lemmatization? Or do the vectors depend on keeping "run" and "running" separate?

Yes, there is preprocessing involved in creating the embeddings. One nice article I found about this is Embeddings. This is the general creation process (taken from the article):

Read the text → Preprocess text → Create (x, y) data points → Create one hot encoded (X, Y) matrices → train a neural network → extract the weights from the input layer
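The pipeline above can be sketched end-to-end in a few lines of NumPy. This is a toy skip-gram setup, not the exact method from the article: the corpus, window size (1), embedding dimension (5), learning rate, and iteration count are all illustrative choices.

```python
import numpy as np

# 1. Read the text / 2. Preprocess: lowercase and tokenize a toy corpus
corpus = "the cat sat on the mat the dog sat on the rug"
tokens = corpus.lower().split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# 3. Create (x, y) data points: each word predicts its neighbours (window = 1)
pairs = []
for i in range(len(tokens)):
    for j in (i - 1, i + 1):
        if 0 <= j < len(tokens):
            pairs.append((idx[tokens[i]], idx[tokens[j]]))

# 4. One-hot encoded (X, Y) matrices
X = np.zeros((len(pairs), V))
Y = np.zeros((len(pairs), V))
for n, (x, y) in enumerate(pairs):
    X[n, x] = 1.0
    Y[n, y] = 1.0

# 5. Train a neural network: linear hidden layer (5 dims) -> softmax output
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, 5))   # input-layer weights
W2 = rng.normal(scale=0.1, size=(5, V))
lr = 0.05
for _ in range(200):
    H = X @ W1                                      # hidden layer
    logits = H @ W2
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)               # softmax
    dlogits = (P - Y) / len(pairs)                  # cross-entropy gradient
    W2 -= lr * (H.T @ dlogits)
    W1 -= lr * (X.T @ (dlogits @ W2.T))

# 6. Extract the weights from the input layer: one 5-dim vector per word
embeddings = W1
print(embeddings[idx["cat"]])
```

After training, each row of `W1` is the embedding for one vocabulary word; real implementations just do this at a much larger scale with negative sampling instead of a full softmax.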

As you can see, it involves quite a few complex processes, which you will probably learn about as you keep doing the specialization.

Thanks for the link. I'm interested to give that a read. Follow-up question: assuming I'm a user of NLP but not a developer, would that mean that I use pre-existing word_embedding files? Are they already part of typical NLP packages like NLTK and spaCy? If a practitioner were looking to categorize some miscellaneous industry-specific texts (short ~100-word comments), I assume it's better to make a word embeddings file from the same industry's texts. Sorry for the open-ended questions…this helps me organize what the lessons are teaching, i.e. whether it's background or foreground information.

Hi,

My thoughts on these:

assuming I’m a user of NLP but not a developer, would that mean that I use pre-existing word_embedding files? - Most probably. There are older techniques that don't use them, but modern techniques rely on word embeddings.

Are they already part of typical NLP packages like NLTK & SpaCY? - I can't remember exactly right now, but if those packages ship models trained on a corpus, they most likely include word embeddings already.
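To make the "pre-existing word_embedding files" idea concrete, here is a sketch of reading a GloVe-style text file, where each line is a word followed by its vector components. The three-line string below stands in for a real downloaded file; note that "run" and "running" get separate vectors unless the text was normalized before training.

```python
import io
import numpy as np

# stand-in for a real pre-trained embeddings file (one word + vector per line)
fake_file = io.StringIO(
    "run 0.1 0.2 0.3\n"
    "running 0.1 0.25 0.28\n"
    "bank -0.4 0.9 0.0\n"
)

def load_embeddings(fh):
    """Parse GloVe-format lines into a {word: vector} dict."""
    vectors = {}
    for line in fh:
        word, *vals = line.split()
        vectors[word] = np.array(vals, dtype=float)
    return vectors

emb = load_embeddings(fake_file)
print(emb["run"])        # "run" and "running" each have their own vector
```

With spaCy, for comparison, the medium and large pretrained pipelines expose per-token vectors directly (e.g. `nlp("run")[0].vector` after loading a model), so you rarely parse the raw files yourself.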

If a practitioner was looking to categorize some misc. industry-specific texts (short ~100 word comments), I assume it’s better to make a word embeddings file from the same industry’s texts. - Yes, creating your own embeddings for your specific purpose would probably work better than a model trained on generic texts.
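As a minimal sketch of that idea: once you have domain embeddings, a short comment can be represented by averaging its word vectors and assigned to the nearest category centroid. The words, 2-dimensional vectors, and category names below are all made up for illustration; a real setup would use trained embeddings and a proper classifier.

```python
import numpy as np

# toy domain-specific embeddings (in practice, trained on your industry corpus)
emb = {
    "pump":    np.array([1.0, 0.0]),
    "valve":   np.array([0.9, 0.1]),
    "invoice": np.array([0.0, 1.0]),
    "payment": np.array([0.1, 0.9]),
}

def doc_vector(text):
    """Average the vectors of the known words in a short comment."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0)

# category centroids built from a few labelled example comments
centroids = {
    "maintenance": doc_vector("pump valve"),
    "billing":     doc_vector("invoice payment"),
}

def categorize(text):
    """Assign the comment to the category with the closest centroid."""
    v = doc_vector(text)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

print(categorize("valve pump pump"))   # → maintenance
```

Averaging word vectors is crude but works surprisingly well as a baseline for short texts; the main win from domain-trained embeddings is that industry jargon actually has vectors instead of being out-of-vocabulary.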

Hello Gent Spahiu,
Thanks for sharing your thoughts, and I also appreciate the link to the Medium article on Embeddings. All helpful.


I appreciate the question and the answer. I was confused in the course because these “word embeddings” started appearing and I could not recall how they were constructed. Does the course discuss how these embeddings are created? Did I miss something? Thank you.