Three general questions on starting the second week assignment

Why does f.read().split('\n') appear to sort the vocabulary, as below?
image

And why do we need to use sorted() again in the next code block if the list is already sorted?
image

Why is there a space character as the first item of the list, but not when we check the first 50 entries of the vocabulary list?
image

Hi.

  • f.read().split('\n') does not sort the vocabulary - it just reads the lines of hmm_vocab.txt in the order they appear in the file.
  • when we sort the vocabulary, the first item is actually not a space - it is '' (the empty string, line 7 in hmm_vocab.txt). I am not really sure why they include an empty string in the vocabulary, but in practice you would need to decide for yourself what should be in your vocabulary and what should not.
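
To make both points concrete, here is a minimal sketch. It does not read the real hmm_vocab.txt; the file contents below are made up, with one blank line in the middle to stand in for the empty entry.

```python
import io

# Hypothetical stand-in for hmm_vocab.txt (made-up contents, one blank line).
fake_file = io.StringIO("the\n!\n\nApple\n")

# split('\n') keeps the file order; a blank line (and the trailing newline)
# each produce an empty string ''.
vocab = fake_file.read().split('\n')
print(vocab)         # ['the', '!', '', 'Apple', '']

# sorted() puts '' first, because the empty string compares lower
# than any non-empty string (including ' ').
sorted_vocab = sorted(vocab)
print(sorted_vocab)  # ['', '', '!', 'Apple', 'the']
```

So if the list looks sorted right after split('\n'), that is only because the file was already written in that order.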

Thank you so much Arvyzukai. So does that mean hmm_vocab.txt has already been sorted?

hmm_vocab.txt is “kind of” sorted - by default Python sorts strings by Unicode code point (which gives the same order as comparing their UTF-8 bytes), for example:

In: sorted(['C', 'B', 'A', 'b', 'c', 'a', '1'])
Out: ['1', 'A', 'B', 'C', 'a', 'b', 'c']

Note: digits come first, then capital letters, then lowercase letters
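
If it helps to see why, this small sketch prints the code point of each character with ord(). The list of characters is just an illustration (I added z and é to the example above).

```python
# Characters from the example above, plus 'z' and 'é' (U+00E9) for illustration.
chars = ['C', 'B', 'A', 'b', 'c', 'a', '1', 'z', '\u00e9']

# ord() gives the Unicode code point that sorted() compares by default.
code_points = {ch: ord(ch) for ch in chars}
print(code_points['1'], code_points['A'], code_points['a'])  # 49 65 97

ordered = sorted(chars)
print(ordered)  # ['1', 'A', 'B', 'C', 'a', 'b', 'c', 'z', 'é']
```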

It’s “safer” for consistency to sort the list again when you read a file from an unknown computer (it may have been written with a different encoding or sorted under a different locale; in Python's default order é comes after z, but a locale-aware sort may place it next to e). Actually, the words in the vocabulary do not need to be sorted, but their order must be consistent through the whole process of achieving your goal (during training, inference, etc.) - you cannot have an inconsistent mapping (for example, the word ! mapping to 0 somewhere in your code and to 1 somewhere else).
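
A rough sketch of the consistency point: build_vocab_index is a hypothetical helper of mine (not the assignment's code), and the word lists are made up. Sorting before assigning indices means the mapping does not depend on the order the words arrived in.

```python
def build_vocab_index(words):
    """Map each distinct word to an index, using sorted order so the
    result is the same no matter how the input list was ordered."""
    return {word: i for i, word in enumerate(sorted(set(words)))}

# The same (made-up) words read in two different orders.
words_a = ['the', '!', 'Apple']
words_b = ['Apple', 'the', '!']

index_a = build_vocab_index(words_a)
index_b = build_vocab_index(words_b)

print(index_a)             # {'!': 0, 'Apple': 1, 'the': 2}
print(index_a == index_b)  # True
```

Without the sorted() call, words_a and words_b would give two different mappings, which is exactly the inconsistency to avoid between training and inference.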