in the code here:
total_words = len(tokenizer.word_index) + 1
I don’t understand why the +1, anyone help?
Regards
Because to_categorical indexes classes starting from 0 by default. Have you looked at the definition of tf.keras.utils.to_categorical?
Anyway, let's do a tiny bit of hacking. Here is the source of to_categorical:
def to_categorical(y, num_classes=None, dtype='float32'):
    y = np.array(y, dtype='int')
    input_shape = y.shape
    if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
        input_shape = tuple(input_shape[:-1])
    y = y.ravel()
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    categorical = np.zeros((n, num_classes), dtype=dtype)
    print(y)  # <-----------
    categorical[np.arange(n), y] = 1
    output_shape = input_shape + (num_classes,)
    categorical = np.reshape(categorical, output_shape)
    return categorical
To simplify everything, let's use a simple example:
b = tf.keras.utils.to_categorical([1, 2, 3, 4], num_classes=4) # Problem here!!!!
b
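To see the failure concretely, here is a minimal numpy-only sketch (no TensorFlow needed) that reproduces the indexing step from the source above with num_classes=4 and labels [1, 2, 3, 4]:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
num_classes = 4
n = y.shape[0]
categorical = np.zeros((n, num_classes), dtype='float32')

try:
    # this is the assignment line from to_categorical's source;
    # label 4 wants column 4, but valid columns are only 0..3
    categorical[np.arange(n), y] = 1
except IndexError as e:
    print('IndexError:', e)
```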
Suppose that the tokenizer has produced the indices [1, 2, 3, 4] only.
As you can see, categorical was set up for four classes, which is what we asked for. But the print shows the labels 1, 2, 3, 4, and the 4 is invalid for indexing: it points at a 5th class that doesn't exist, since valid column indices are 0..3. To compensate for this, we fake a class 0 at index 0, and that is why the +1 was used.
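The same point can be checked without TensorFlow. Below is a small sketch where the word_index dict is a made-up stand-in for tokenizer.word_index (Keras assigns indices starting at 1, never 0):

```python
import numpy as np

# hypothetical word_index, standing in for tokenizer.word_index
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'mat': 4}

total_words = len(word_index) + 1  # 5, so index 4 maps to a valid column
labels = np.array(sorted(word_index.values()))

# one-hot encode via an identity matrix: row i picks out column labels[i]
one_hot = np.eye(total_words, dtype='float32')[labels]
print(one_hot.shape)  # (4, 5): four words, five columns including class 0
```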
When you pass num_classes=5, the return is:
array([[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]], dtype=float32)
As you can see there is an extra class 0; in other words, each row contains 5 columns.
Notice: because the tokenizer never assigns index 0, class 0 is never returned, which means you will never see an all-zero row like [0, 0, 0, 0, 0].
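Both observations can be verified quickly with numpy (the labels here are an assumed example of tokenizer output, which never contains 0):

```python
import numpy as np

labels = np.array([1, 2, 3, 4])          # tokenizer indices start at 1
one_hot = np.eye(5, dtype='float32')[labels]

print(one_hot[:, 0].sum())               # 0.0 -> class-0 column never used
print(bool((one_hot.sum(axis=1) == 1).all()))  # True -> no all-zero rows
```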
Hope this helps.
What's the problem with the following?
array([[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.]], dtype=float32)
There are already 4 categories here, so why do we need the class 0?
I think it was mentioned that the 1 was added to account for Out of Vocab tokens.
Hello Dhruv,
I too think that the +1 accounts for OOV (out of vocabulary) tokens.
What do you think @Dan_Reed sir?
Thanks and Regards,
Mayank Ghogale