In Lab1, what's the rationale for adding 1 to the 'total_words' variable

in the code here:
total_words = len(tokenizer.word_index) + 1

I don’t understand why the +1 is there. Can anyone help?

Regards

@Dan_Reed

Because to_categorical treats its labels as zero-based class indices by default, so class 0 always gets a column. Have you looked at the definition of tf.keras.utils.to_categorical?
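To make the zero-based point concrete, here is a minimal sketch in plain Python. `build_word_index` is a hypothetical stand-in for the tokenizer, not its real implementation; it is assumed to share only one property with it, namely that word indices start at 1 and index 0 is never given to a word.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical stand-in for tokenizer.word_index: indices start at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent words first; ties keep first-seen order.
    ordered = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ordered, start=1)}

word_index = build_word_index(["the cat sat", "the dog sat down"])
largest_index = max(word_index.values())

# Zero-based indexing needs slots 0 .. largest_index, hence the +1.
total_words = len(word_index) + 1

print(word_index)
print(largest_index, total_words)
```

The largest index equals len(word_index), so a zero-based array that can hold it needs one more slot than there are words.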

Anyway, let's do a little hacking.

Here is the source of to_categorical, with a print added:

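import numpy as np

def to_categorical(y, num_classes=None, dtype='float32'):
  # Treat the labels as integer class indices, counted from 0.
  y = np.array(y, dtype='int')
  input_shape = y.shape
  # Drop a trailing dimension of size 1, e.g. (n, 1) -> (n,).
  if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
    input_shape = tuple(input_shape[:-1])

  y = y.ravel()

  # Default: enough columns for indices 0 .. max(y).
  if not num_classes:
    num_classes = np.max(y) + 1
  n = y.shape[0]
  categorical = np.zeros((n, num_classes), dtype=dtype)

  print(y)  # <----------- added to show the indices being used

  # Each label selects the column that receives the 1.
  categorical[np.arange(n), y] = 1
  output_shape = input_shape + (num_classes,)
  categorical = np.reshape(categorical, output_shape)
  return categorical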

To keep things simple, let’s use a small example:

import tensorflow as tf

b = tf.keras.utils.to_categorical([1, 2, 3, 4], num_classes=4)  # Problem here: index 4 is out of bounds for 4 columns
b

Suppose the tokenizer has produced only the indices [1, 2, 3, 4].

As you can see, categorical is allocated with four columns, which is what we asked for. But the print shows the labels 1, 2, 3, 4, and 4 is not a valid column index: the columns are numbered 0 to 3, so label 4 is looking for a fifth column that doesn’t exist. To compensate, we add a placeholder class 0, and that is why the +1 is used.
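You can reproduce the failure without TensorFlow. The sketch below uses a hypothetical `one_hot` helper, a plain-Python simplification of the indexing step in to_categorical (not the library function itself), to show that label 4 only fits once there are 5 columns.

```python
def one_hot(labels, num_classes):
    """Simplified stand-in for to_categorical: labels are zero-based column indices."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # IndexError when label >= num_classes
        rows.append(row)
    return rows

labels = [1, 2, 3, 4]

try:
    one_hot(labels, num_classes=4)
    failed = False
except IndexError:
    failed = True  # label 4 would need a fifth column

# One extra column, for the unused class 0, makes every label fit.
encoded = one_hot(labels, num_classes=max(labels) + 1)

print(failed)
for row in encoded:
    print(row)
```

With four columns the call fails on label 4; with max(labels) + 1 columns it succeeds and column 0 simply stays empty.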

With num_classes=5, the result is:

array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]], dtype=float32)

As you can see, there is now an extra class 0; in other words, each row has 5 columns.

Note: because the tokenizer never produces index 0, class 0 is never selected, which means you will never see a row with its 1 in the first column: [1,0,0,0,0].

Hope this helps!


What’s the problem with the following?

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)

There are four categories here, so why do we need a class 0?
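One way to read that matrix: it is what you would get if the labels were shifted down by one before encoding, so that they start at 0. A minimal sketch, again with a hypothetical plain-Python `one_hot` helper rather than to_categorical itself. The assumption here is that the labels are the tokenizer indices [1, 2, 3, 4] from the example above.

```python
def one_hot(labels, num_classes):
    """Hypothetical helper: each label is a zero-based column index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

tokens = [1, 2, 3, 4]               # indices as the tokenizer numbers them
shifted = [t - 1 for t in tokens]   # 0, 1, 2, 3
encoded = one_hot(shifted, num_classes=4)

# Going back: the column holding the 1, plus one, is the original index.
decoded = [row.index(1.0) + 1 for row in encoded]

for row in encoded:
    print(row)
print(decoded)
```

So the four-column version is not wrong in itself, but it only works if every label is shifted before encoding and every predicted class is shifted back afterwards. Keeping the indices as they are and adding one unused column avoids that bookkeeping.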

I think it was mentioned that the 1 was added to account for Out of Vocab tokens.

Hello Dhruv,
I also think that the 1 accounts for OOV (out-of-vocabulary) tokens.
What do you think @Dan_Reed sir?
Thanks and Regards,
Mayank Ghogale
