I’m now on the last section:
label_sequences, label_word_index = tokenize_labels(labels)
print(f"Vocabulary of labels looks like this {label_word_index}\n")
print(f"First ten sequences {label_sequences[:10]}\n")
It prints out a LOT of values:
Vocabulary of labels looks like this {'<OOV>': 1, 's': 2, 'said': 3, 'will': 4, 'not': 5, 'mr': 6, 'year': 7, 'also': 8, 'people': 9, 'new': 10, 'us': 11, 'one': 12, 'can': 13, 'last': 14, 'first': 15, 't': 16, 'time': 17, 'two': 18, ..... all the way to 'allocating': 29713, 'heerenveen': 29714}
First ten sequences [[96, 176, 1157, 1220, 54, 1122, 742, 5211, 85, 1074, 4267, 147, 184, 4127, 1344, 1311, 1595, 47, 9, 949, 96, 4, 6516, 329, 92, 23, 17, 140, 3128, 1330, 2519, 576, 419, 1277, 72, 2963, 3046, 1755, 10, 894, 4, 755, 12, … all the way to 14996, 14997, 6527, 4802, 31, 5813, 10942, 19540, 19541, 19542, 19543, 162, 59, 949, 27, 4003, 8836, 5003, 3, 30, 63, 2884, 4420, 2, 63, 4004]]
So I’m clearly not getting the “Expected Output” of:
Vocabulary of labels looks like this {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
First ten sequences [[4], [2], [1], [1], [5], [3], [3], [1], [1], [5]]
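For what it’s worth, the symptoms point at what the tokenizer was fit on: a word index of ~29k entries containing '<OOV>' means it was almost certainly fit on the article texts (with an oov_token) rather than on the label list, which is small and closed and needs no OOV entry. Here is a minimal pure-Python sketch of what tokenize_labels should produce (the function name comes from the post; the implementation is my own illustration, mimicking Keras Tokenizer's 1-indexed, frequency-ordered word index):

```python
from collections import Counter

def tokenize_labels(labels):
    # The label set is small and closed, so no OOV token is needed.
    # Mimic Keras Tokenizer: indices start at 1, most frequent label first.
    counts = Counter(labels)
    word_index = {label: i + 1 for i, (label, _) in enumerate(counts.most_common())}
    # Each label becomes a one-element sequence, e.g. 'sport' -> [1]
    sequences = [[word_index[label]] for label in labels]
    return sequences, word_index

seqs, wi = tokenize_labels(['sport', 'business', 'politics', 'politics', 'tech'])
print(wi)    # {'politics': 1, 'sport': 2, 'business': 3, 'tech': 4}
print(seqs)  # [[2], [3], [1], [1], [4]]
```

With the real Keras Tokenizer, the equivalent fix is to call fit_on_texts on the labels list (and drop the oov_token argument), not on the training sentences.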