C3W2 fit_label_encoder Error

Hi,
In this assignment, in the function “fit_label_encoder”, when we apply

tf.keras.layers.StringLookup(oov_tokens=None)

whether I pass the argument oov_tokens=None or leave it out entirely, I am not able to remove the '[UNK]' tokens from the output. Any idea what I am missing?

Failed test case: Got the wrong vocabulary to encode labels.
Expected:
[‘sport’, ‘business’, ‘politics’, ‘tech’, ‘entertainment’],
but got:
[None, ‘tech’, ‘sport’, ‘politics’, ‘entertainment’, ‘business’].

StringLookup doesn’t support a parameter called oov_tokens. Please see the docs to locate a parameter that allows you to set the number of oov tokens to 0.

>>> import tensorflow as tf
>>> l = tf.keras.layers.StringLookup(oov_tokens=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/myhome/anaconda3/envs/tf2/lib/python3.10/site-packages/keras/src/layers/preprocessing/string_lookup.py", line 320, in __init__
    super().__init__(
  File "/home/myhome/anaconda3/envs/tf2/lib/python3.10/site-packages/keras/src/layers/preprocessing/index_lookup.py", line 193, in __init__
    raise ValueError(f"Unrecognized keyword argument(s): {kwargs}")
ValueError: Unrecognized keyword argument(s): {'oov_tokens': None}
>>> tf.__version__
'2.17.0'
>>> tf.keras.__version__
'3.4.1'

Hi Balaji,
yes, there is a parameter in the StringLookup function called ‘oov_token’; I just added an ‘s’ at the end by mistake. I tried both leaving StringLookup with its default parameters and setting oov_token to None, but the result is the same. This is the only function in the assignment that gives me an error, and because of it I get this result:

Shape of the train batch: (32, 2, 120)
Shape of the validation batch: (32, 2, 120)

Expected output:

Shape of the train batch: (32, 120)
Shape of the validation batch: (32, 120)

oov_token is for setting a custom unknown token (instead of [UNK]). Here’s an example:

>>> import tensorflow as tf
>>> vocab = ["a", "b", "c", "d"]
>>> l = tf.keras.layers.StringLookup(oov_token="[UNKNOWN]", vocabulary=vocab, invert=True)
>>> data = [[1, 2, 3, 4, 5]]
>>> l(data)
<tf.Tensor: shape=(1, 5), dtype=string, numpy=array([[b'a', b'b', b'c', b'd', b'[UNKNOWN]']], dtype=object)>

Don’t worry about oov_token. Look for another constructor argument that sets the number of OOV tokens to 0. By default, that parameter’s value is 1, which is why we have an [UNK] token.
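
For reference, here is a minimal sketch of how such an argument behaves, assuming num_oov_indices is the one meant (it defaults to 1); the label list below is just a stand-in, and the assignment may build its vocabulary differently (e.g. via adapt):

>>> import tensorflow as tf
>>> labels = ["sport", "business", "politics", "tech", "entertainment"]
>>> # with the OOV count set to 0, no [UNK] slot is reserved in the vocabulary
>>> encoder = tf.keras.layers.StringLookup(vocabulary=labels, num_oov_indices=0)
>>> encoder.get_vocabulary()
['sport', 'business', 'politics', 'tech', 'entertainment']
>>> encoder(["tech", "sport"])
<tf.Tensor: shape=(2,), dtype=int64, numpy=array([3, 0])>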


Thanks Balaji,
I get it, but the output shape issue remains:
Shape of the train batch: (32, 2, 120)
Shape of the validation batch: (32, 2, 120)
and because of this my create_model() function is not working, since the output shape is not what is expected.

When it comes to encoding text data, here’s what the dimensions of form (BATCH_SIZE, TOKENS) mean for this problem:

  1. BATCH_SIZE: This refers to the number of rows of the text data that’s encoded.
  2. TOKENS: These are the numeric representations of the underlying vocabulary. The reason for encoding text as integers is so they can act as lookups into the embedding layer, which maps each word in the vocabulary to a dense vector.

Padding is useful when the output of an embedding layer is followed by a Dense layer, since the weights need to be fixed when the model is built. We can’t have varying sequence lengths within a single batch or across batches for this reason. Some folks leave this logic in place since they are experimenting with different NN architectures.
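
To make the shape bookkeeping concrete, here is a minimal sketch (the vocabulary size and embedding dimension are made up for the example). A batch of shape (BATCH_SIZE, TOKENS) comes out of the embedding layer as (BATCH_SIZE, TOKENS, EMBEDDING_DIM):

>>> import tensorflow as tf
>>> # hypothetical sizes: 1000-token vocabulary, 16-dimensional embeddings
>>> embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)
>>> batch = tf.zeros((32, 120), dtype=tf.int64)  # (BATCH_SIZE, TOKENS)
>>> embedding(batch).shape
TensorShape([32, 120, 16])

An extra axis in the input, e.g. (32, 2, 120), would carry through as (32, 2, 120, 16) and break the layers that follow, which matches the symptom above.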

Does this help figure out where shapes went wrong?

As far as I understand, you are referring to the ‘Embedding’ layer?

tf.keras.layers.Embedding([moderator edit - code removed])
These are the arguments I passed to the Embedding layer.

You are correct.

The details were provided for you to understand how integer token indices are used.
