Course 5 Week 2 Assignment 1

Hello Mentors and learners,
Hope everyone is doing great!!!
A very silly doubt but I have to get it cleared.
So, can anyone explain this code snippet (line by line would be better, specifically the for loop)?

`import numpy as np

def read_glove_vecs(glove_file):
    with open(glove_file, 'r') as f:
        words = set()
        word_to_vec_map = {}

        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map

`

Hi @pushkarp6 ,

I’ll try to explain what is happening in the code snippet:

def read_glove_vecs(glove_file):
Defines the function; the file name is passed in as its parameter.

with open(glove_file, 'r') as f:
Opens the file for reading. Here ‘f’ is a file object used to access the file content, as you will see in the ‘for’ loop.

words = set()
Calls set() to create an empty set and assigns it to the variable ‘words’.
In Python, a set is an unordered collection of unique elements. Each element must be hashable (in practice, an immutable value such as a string or a number), but the set itself is mutable, so we can add and remove items.
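
A quick sketch of that set behaviour, using made-up sample words rather than anything read from the GloVe file:

```python
# Sample words are made up for illustration.
words = set()
words.add("the")
words.add("cat")
words.add("the")       # duplicate: the set keeps a single copy

print(len(words))      # 2
print("cat" in words)  # True

words.discard("cat")   # the set itself is mutable, so items can be removed
print(words)           # {'the'}
```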

word_to_vec_map = {}
Creates an empty dictionary and assigns it to word_to_vec_map.

for line in f:
Reads one line at a time from the file object f.

line = line.strip().split()
Removes leading and trailing whitespace (including the newline at the end), then splits the line on whitespace into individual tokens. After this, ‘line’ is a list of strings.
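
To make the strip/split step concrete, here is a minimal sketch on a shortened sample line in the same "word value value ..." layout (the variable names raw_line, stripped and tokens are only for illustration):

```python
# A shortened sample line: one word followed by a few numbers, ending in "\n".
raw_line = "the 0.418 0.24968 -0.41242\n"

stripped = raw_line.strip()   # drops leading/trailing whitespace, including "\n"
tokens = stripped.split()     # splits on runs of whitespace

print(repr(stripped))  # 'the 0.418 0.24968 -0.41242'
print(tokens)          # ['the', '0.418', '0.24968', '-0.41242']
print(type(tokens))    # <class 'list'>
```

Note that every element is still a string at this point, including the numbers.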

curr_word = line[0]
Sets curr_word to the first element of line (i.e. the first word).

words.add(curr_word)
Adds curr_word to the set.

word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
Converts the remaining elements of line (everything after the first word) into a NumPy array of type float64 and stores it in word_to_vec_map under the key curr_word.

So the ‘for’ loop goes through the file line by line and stops when it reaches the end of the file, i.e. when there is no more content to read.
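
If it helps, the whole function can be tried on a tiny stand-in file. This is only a sketch: the two-line file, its 3-dimensional vectors and the words "cat" and "dog" are made up, and the real GloVe file has many more lines and 50 numbers per word.

```python
import os
import tempfile

import numpy as np


def read_glove_vecs(glove_file):
    with open(glove_file, "r") as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
    return words, word_to_vec_map


# Hypothetical two-line file with 3-dimensional vectors (values made up).
sample = "cat 0.1 0.2 0.3\ndog -0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

words, word_to_vec_map = read_glove_vecs(path)
os.remove(path)  # clean up the temporary file

print(sorted(words))                    # ['cat', 'dog']
print(word_to_vec_map["dog"].tolist())  # [-0.4, 0.5, 0.6]
print(word_to_vec_map["dog"].dtype)     # float64
```

The loop body runs once per line, so after two lines the set holds two words and the dictionary holds two vectors.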

@Kic
That’s an awesome explanation, really helpful!!!
Thank you so much for your effort and time. I understood it in one go!!!

It was a nice explanation, but can anyone elaborate on this specific line:

word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

First, let me be clear about what I understand from this:
line[1:] takes all the words after the first word of the line (curr_word) and saves them as np.float64
(I am assuming the words have the datatype ‘string’).
So how is such a small snippet, a single line, doing so much without calling any conversion function? How is converting a string to a float type even possible here?
Or do curr_word and all the other words have some other datatype?
@ca_mentors_q1_2022
@Kic

I think it is better to see what is included in a “line”, which is a single entry in “glove.6B.50d.txt”.

When it is read from a file, it is like this.

'the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581\n'

It is a single string, but it contains two pieces of information. The first is the actual “word”, and the rest are the components of its vector.
Once this raw “line” is split, it becomes a list of 51 strings, like this.

['the',
 '0.418',
 '0.24968',
 '-0.41242',
 '0.1217',
      :
  '-0.11514',
 '-0.78581']

Now you can see what we need to do. The first entry is assigned to “curr_word”, and the rest are stored in “word_to_vec_map” under “curr_word”. One more thing: all the vector entries are still strings, so they need to be cast to “float64”, which is what np.array(..., dtype=np.float64) does.
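
A small sketch of that cast, using only the first few entries of the split line above (the names vector, as_strings and manual are just for illustration). NumPy parses each string as a number when a float dtype is requested, so no separate conversion call is needed:

```python
import numpy as np

# First few entries of the split line, all still strings.
line = ['the', '0.418', '0.24968', '-0.41242', '0.1217']

curr_word = line[0]
vector = np.array(line[1:], dtype=np.float64)  # parses each string as a float

print(curr_word)     # the
print(vector.dtype)  # float64
print(vector.shape)  # (4,)

# Without dtype, NumPy keeps the values as strings.
as_strings = np.array(line[1:])
print(as_strings.dtype.kind)  # U

# Roughly equivalent explicit conversion with the built-in float().
manual = np.array([float(x) for x in line[1:]])
print(np.allclose(vector, manual))  # True
```

So the conversion is not happening "without any function": np.array does it internally because of the dtype argument.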

Hope this helps.

Thank you @anon57530071. I don’t know how, but my “glove.6B.50d.txt” file is actually empty; I can’t find anything in it. That is what caused the confusion, I guess.
Thanks a lot for your help
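
For anyone hitting the same thing, a quick way to check whether a file is missing or empty before reading it is sketched below. The helper name describe_file is hypothetical, and the demonstration uses a temporary file; pass the path to your own copy of glove.6B.50d.txt instead.

```python
import os
import tempfile


def describe_file(path):
    """Return a short status string for the file at `path` (hypothetical helper)."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


# Demonstration on a temporary empty file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    empty_path = tmp.name

status_before = describe_file(empty_path)
print(status_before)  # empty

os.remove(empty_path)
status_after = describe_file(empty_path)
print(status_after)   # missing
```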