I looked into the code provided for the assignment. I understand every part of the code below except this statement: word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
Can someone please explain what is going on here? That would help tremendously. Thanks!
CODE:
def read_glove_vecs(glove_file):
    with open(glove_file, 'r') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
For anyone else who sees this, it is one of the utility functions provided with the assignment. Its purpose is to read in the data for a pre-trained GloVe embedding map. It is not solution code, so it’s OK to show the source and discuss how it works.
You can print out some of the data as it’s being processed to see what is going on. The input file is a text file with 400000 lines. Each line has 51 whitespace-separated tokens: the first is a word, and the next 50 are floating point numbers written as ASCII text that form the 50 element GloVe embedding vector for that word. After line.strip().split(), line is a list of 51 strings. So what that statement does is slice off everything after the first token with line[1:], and np.array with dtype=np.float64 parses those 50 strings into a numpy array with 50 elements of type np.float64. It then assigns that array to the dictionary entry keyed by the word in string form, so the dictionary maps each word to its embedding.
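To see that one statement in isolation, here is a minimal sketch that applies it to a single line. The sample line is made up and shortened to 3 values instead of 50, so the numbers are only for illustration and are not taken from the real GloVe file.

```python
import numpy as np

# Hypothetical line in the GloVe text format, shortened to 3 values
# instead of 50; the numbers are invented for this example.
raw_line = "hello 0.25 -1.5 3.0\n"

tokens = raw_line.strip().split()   # ['hello', '0.25', '-1.5', '3.0']
curr_word = tokens[0]               # the word itself
# tokens[1:] is a list of strings; dtype=np.float64 makes NumPy parse them.
vector = np.array(tokens[1:], dtype=np.float64)

word_to_vec_map = {curr_word: vector}

print(tokens[1:])                   # still strings at this point
print(vector.dtype, vector.shape)   # numeric array after the conversion
print(word_to_vec_map["hello"])
```

The only difference in the real function is that this happens once per line of the file, so the dictionary ends up with one entry per word.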
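Then it returns three Python dictionaries that give you complete access to the mappings between words, index values and embedding vectors: words_to_index (word to integer index, assigned in sorted order starting from 1), index_to_words (the reverse mapping) and word_to_vec_map (word to embedding vector).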
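As a rough illustration of how the three dictionaries fit together, the sketch below rebuilds them from three made-up lines with 2-element vectors. It follows the same logic as read_glove_vecs but reads from a list instead of a file, so the data and sizes here are assumptions for the example only.

```python
import numpy as np

# Made-up three-word "embedding file" with 2-element vectors
# (the real file has 400000 words and 50-element vectors).
fake_lines = ["the 0.1 0.2", "cat 0.3 0.4", "sat 0.5 0.6"]

word_to_vec_map = {}
for line in fake_lines:
    tokens = line.strip().split()
    word_to_vec_map[tokens[0]] = np.array(tokens[1:], dtype=np.float64)

# Same indexing scheme as read_glove_vecs: sorted order, starting at 1.
words_to_index = {}
index_to_words = {}
for i, w in enumerate(sorted(word_to_vec_map), start=1):
    words_to_index[w] = i
    index_to_words[i] = w

# Go from word to index, back to the word, and on to its vector.
idx = words_to_index["sat"]
word = index_to_words[idx]
print(idx, word, word_to_vec_map[word])
```

Note that the index comes from the sorted order of the words, not from the order in which they appear in the file.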
Here’s my instrumented version of the code that I used to figure that out:
import numpy as np

def my_read_glove_vecs(glove_file):
    with open(glove_file, 'r') as f:
        words = set()
        word_to_vec_map = {}
        lines = 0  # number of lines read so far
        for line in f:
            line = line.strip().split()  # list of string tokens
            curr_word = line[0]          # first token is the word
            words.add(curr_word)
            # remaining tokens are the embedding values, parsed as floats
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
            my_vec = word_to_vec_map[curr_word]
            lines += 1
            # print a small sample of lines (101 through 109)
            if lines < 110 and lines > 100:
                print(f"line {lines} word {curr_word} vec {my_vec.shape}")
        print(f"total words {lines}")

        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
When I run that in the notebook, here’s what I get:
line 101 word so vec (50,)
line 102 word them vec (50,)
line 103 word what vec (50,)
line 104 word him vec (50,)
line 105 word united vec (50,)
line 106 word during vec (50,)
line 107 word before vec (50,)
line 108 word may vec (50,)
line 109 word since vec (50,)
total words 400000
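If you only want to look at a few lines without building all the dictionaries, one possible alternative is to read just a slice of the file. The sketch below uses itertools.islice for that. The helper name peek_glove_lines is hypothetical and not part of the assignment, and the in-memory "file" is made-up data standing in for the real GloVe file.

```python
import io
from itertools import islice

import numpy as np

def peek_glove_lines(f, start, stop):
    """Yield (line_number, word, vector_shape) for lines start..stop-1 (1-based).

    Hypothetical helper for inspection only; not part of the assignment code.
    """
    for n, line in enumerate(islice(f, start - 1, stop - 1), start=start):
        tokens = line.strip().split()
        vec = np.array(tokens[1:], dtype=np.float64)
        yield n, tokens[0], vec.shape

# Stand-in for the real file: five made-up lines with 2-element vectors.
fake_file = io.StringIO("a 0 1\nb 2 3\nc 4 5\nd 6 7\ne 8 9\n")
for n, word, shape in peek_glove_lines(fake_file, start=2, stop=4):
    print(f"line {n} word {word} vec {shape}")
```

With the real file you would pass the object returned by open(glove_file, 'r') in place of fake_file.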
If you’re going to post code on the forum, please use the “preformatted text” tool, so that the indentation is correct and your code doesn’t look like Markdown.