In Lab1, what's the rationale for adding 1 to the 'total_words' variable?

Dan_Reed · December 26, 2021, 11:56am

In the code here:
total_words = len(tokenizer.word_index) + 1

I don’t understand why the +1 is there. Can anyone help?

Regards

Chris.X · March 21, 2022, 9:52am

Because to_categorical starts counting classes at index 0 by default. Have you looked at the definition of tf.keras.utils.to_categorical?

Anyway, let’s do a tiny bit of hacking:

Here is the source of to_categorical, with a print added:

import numpy as np

def to_categorical(y, num_classes=None, dtype='float32'):
  y = np.array(y, dtype='int')
  input_shape = y.shape
  # Drop a trailing dimension of size 1, e.g. (n, 1) -> (n,)
  if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
    input_shape = tuple(input_shape[:-1])

  y = y.ravel()

  if not num_classes:
    num_classes = np.max(y) + 1  # classes are assumed to start at 0
  n = y.shape[0]
  categorical = np.zeros((n, num_classes), dtype=dtype)

  print(y) # <----------- added to inspect the labels

  # Each label is used directly as a column index
  categorical[np.arange(n), y] = 1
  output_shape = input_shape + (num_classes,)
  categorical = np.reshape(categorical, output_shape)
  return categorical

To simplify everything, let’s use a simple example:

b = tf.keras.utils.to_categorical([1, 2, 3, 4], num_classes=4) # Problem here: label 4 is out of range
b

Suppose the tokenizer has produced only the indices [1, 2, 3, 4].

The categorical array was allocated for four classes, as we asked. But the print shows the labels 1, 2, 3, 4, and with num_classes=4 the only valid column indices are 0 to 3. The label 4 is therefore invalid for indexing (it looks for a fifth column that doesn’t exist), and the call fails with an IndexError. To compensate, we reserve an extra class 0, which is why the +1 is used.

With num_classes=5, the return value is:

array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]], dtype=float32)

As you can see, there is an extra class 0; in other words, each row contains 5 columns.

Note: because class 0 never occurs in the labels, column 0 is always zero, which means you will never see the row [1, 0, 0, 0, 0].
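
The same behaviour can be reproduced without TensorFlow. The snippet below is only a rough sketch: one_hot is a hypothetical, simplified stand-in for to_categorical written in plain Python, and token_ids is an assumed tokenizer output. It shows the failure with four columns and the always-zero column 0 with five.

```python
def one_hot(labels, num_classes):
    """Hypothetical, simplified stand-in for to_categorical (plain Python)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # the label is used directly as a column index
        rows.append(row)
    return rows


token_ids = [1, 2, 3, 4]  # assumed tokenizer output; ids start at 1, not 0

# Four columns only cover indices 0..3, so id 4 is out of range.
try:
    one_hot(token_ids, num_classes=4)
    failed = False
except IndexError:
    failed = True

# One extra column makes room for the unused index 0.
encoded = one_hot(token_ids, num_classes=len(token_ids) + 1)
column_zero = [row[0] for row in encoded]

print(failed)
print(column_zero)
```

Here failed ends up True, and column_zero contains only zeros because no label is ever 0.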

Hope this helps.

enzii · May 7, 2022, 11:43am

What’s the problem with the following?

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)

There are four categories, so why do we need a class 0?

Dhruv_Verma · May 25, 2022, 12:25pm

I think it was mentioned that the 1 was added to account for Out of Vocab tokens.

MayankGhogale · May 27, 2022, 3:34am

Hello Dhruv,
I also think that the 1 accounts for OOV (out-of-vocabulary) tokens.
What do you think @Dan_Reed sir?
Thanks and Regards,
Mayank Ghogale
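
A note on the OOV idea: as far as I know, when an oov_token is passed to the Keras Tokenizer, that token is stored inside word_index itself (with id 1), so len(word_index) already counts it. The +1 would then still be for the unused index 0, not for OOV. The sketch below is not the Keras code; build_word_index is a hypothetical helper that only mimics that numbering, and the corpus is made up.

```python
from collections import Counter


def build_word_index(texts, oov_token=None):
    """Hypothetical mimic of the Tokenizer numbering: ids start at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        vocab = [oov_token] + vocab  # the OOV token gets the first id
    return {word: i for i, word in enumerate(vocab, start=1)}


corpus = ["the cat sat", "the dog sat"]  # made-up example corpus
word_index = build_word_index(corpus, oov_token="<OOV>")

# The largest id equals len(word_index); id 0 is never assigned,
# so covering ids 0..max needs len(word_index) + 1 classes.
total_words = len(word_index) + 1

print(word_index)
print(total_words)
```

With this numbering, the OOV token is an ordinary entry of word_index and total_words is one more than the largest id.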
