Data issue


I have not been through the Week 4 material here, but note that the notebook says they did some significant preprocessing on the English and French embedding datasets to make them usable in the Coursera notebook environment. They mention subsetting them quite aggressively to reduce memory consumption, and it is also clear that they repackaged them using Pickle. So if your goal is a German embedding dataset you can drop into this assignment while keeping everything else the same, you will have to write code that performs the equivalent transformations on the full German dataset. Unless the course providers did this for you, that is a significant amount of work. For example, how do you compute the equivalent subset of the German dataset? Did they mention this issue anywhere in the Week 4 lectures? I have only looked at the assignment notebook, and I don't even see the link to the GitHub repo you give there, although maybe I just missed it.
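To give a feel for the kind of transformation I mean, here is a minimal sketch, assuming the full German embeddings come as a plain-text file with one word per line followed by its vector (the fastText `.vec` layout) and that you already have some list of German words to keep (choosing that subset is the open question). The file names and the word list are placeholders, not anything the course provides:

```python
import pickle

import numpy as np


def subset_embeddings(vec_path, keep_words, out_path):
    """Read a fastText-style .vec text file and pickle a {word: vector} dict
    restricted to keep_words, roughly mirroring what the course did for en/fr."""
    keep = set(keep_words)
    subset = {}
    with open(vec_path, encoding="utf-8") as f:
        next(f)  # first line of a .vec file is "<vocab_size> <dim>", skip it
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            if word in keep:
                subset[word] = np.array(parts[1:], dtype=np.float32)
    with open(out_path, "wb") as f:
        pickle.dump(subset, f)
    return subset


# Hypothetical usage -- paths and vocabulary are placeholders:
# de_subset = subset_embeddings("cc.de.300.vec", ["haus", "hund", "katze"],
#                               "de_embeddings_subset.p")
```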

Maybe we'll get lucky and one of the real NLP mentors will be able to give more guidance here. Sorry.

Thank you for your response. For clarification, this is the repository mentioned in the Week 4 lab: GitHub - vjstark/crosslingual_text_classification: cross lingual text classification on amazon reviews. In the lab, the subset was focused on French, and fortunately the repository also provides German word embeddings. I wanted to create a German subset that keeps all the dimensions.

Additionally, I came across another source that offers word embeddings in 300 dimensions, provided in both bin and text formats. In case someone needs it, here is the link: Wiki word vectors · fastText. Thank you for your help once again.
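For anyone who downloads those files, here is a minimal sketch of loading them, assuming gensim is installed for the text (`.vec`) format and the official fasttext package for the `.bin` format; the German file names are just the ones I would expect from that page, so adjust them to whatever you actually download:

```python
# Loading the text (.vec) format with gensim:
from gensim.models import KeyedVectors

# File name is whatever you downloaded, e.g. the German Wiki vectors.
de_vectors = KeyedVectors.load_word2vec_format("wiki.de.vec", binary=False)
print(de_vectors["haus"].shape)  # (300,)

# The binary (.bin) format needs the fasttext package instead, but it can
# also produce subword-based vectors for out-of-vocabulary words:
# import fasttext
# ft = fasttext.load_model("wiki.de.bin")
# print(ft.get_word_vector("haus").shape)  # (300,)
```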