C3W2 fit_label_encoder Error

Hi,
In this assignment, in the function “fit_label_encoder”, when we apply

tf.keras.layers.StringLookup(oov_tokens=None)

whether I pass the argument oov_tokens=None or leave it out entirely, I am not able to remove the '[UNK]' tokens from the output. Any idea what I am missing?

Failed test case: Got the wrong vocabulary to encode labels.
Expected:
[‘sport’, ‘business’, ‘politics’, ‘tech’, ‘entertainment’],
but got:
[None, ‘tech’, ‘sport’, ‘politics’, ‘entertainment’, ‘business’].

StringLookup doesn’t support a parameter called oov_tokens. Please see the docs to locate a parameter that allows you to set the number of oov tokens to 0.

>>> import tensorflow as tf
>>> l = tf.keras.layers.StringLookup(oov_tokens=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/myhome/anaconda3/envs/tf2/lib/python3.10/site-packages/keras/src/layers/preprocessing/string_lookup.py", line 320, in __init__
    super().__init__(
  File "/home/myhome/anaconda3/envs/tf2/lib/python3.10/site-packages/keras/src/layers/preprocessing/index_lookup.py", line 193, in __init__
    raise ValueError(f"Unrecognized keyword argument(s): {kwargs}")
ValueError: Unrecognized keyword argument(s): {'oov_tokens': None}
>>> tf.__version__
'2.17.0'
>>> tf.keras.__version__
'3.4.1'

Hi Balaji,
yes, there is a parameter in the StringLookup function called ‘oov_token’; I just added an ‘s’ at the end by mistake. I tried both leaving StringLookup with its default parameters and setting oov_token to None, but the result is the same. This is the only function in the assignment that gives me an error, and because of it I get this result:

Shape of the train batch: (32, 2, 120)
Shape of the validation batch: (32, 2, 120)

Expected output:

Shape of the train batch: (32, 120)
Shape of the validation batch: (32, 120)

oov_token is for setting a custom unknown token (instead of [UNK]). Here’s an example:

>>> import tensorflow as tf
>>> vocab = ["a", "b", "c", "d"]
>>> l = tf.keras.layers.StringLookup(oov_token="[UNKNOWN]", vocabulary=vocab, invert=True)
>>> data = [[1, 2, 3, 4, 5]]
>>> l(data)
<tf.Tensor: shape=(1, 5), dtype=string, numpy=array([[b'a', b'b', b'c', b'd', b'[UNKNOWN]']], dtype=object)>

Don’t worry about oov_token. Look for another constructor argument that sets the number of OOV tokens to 0. By default, that parameter’s value is 1, which is why we have an [UNK] token.
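
For reference, here is a minimal sketch of how such an argument behaves, assuming num_oov_indices is the one meant (it defaults to 1); the label list below is just a stand-in, and the assignment may build its vocabulary differently (e.g. via adapt):

>>> import tensorflow as tf
>>> labels = ["sport", "business", "politics", "tech", "entertainment"]
>>> # with the OOV count set to 0, no [UNK] slot is reserved in the vocabulary
>>> encoder = tf.keras.layers.StringLookup(vocabulary=labels, num_oov_indices=0)
>>> encoder.get_vocabulary()
['sport', 'business', 'politics', 'tech', 'entertainment']
>>> encoder(["tech", "sport"])
<tf.Tensor: shape=(2,), dtype=int64, numpy=array([3, 0])>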


Thanks Balaji,
I get it, but the output shape issue remains:
Shape of the train batch: (32, 2, 120)
Shape of the validation batch: (32, 2, 120)
and because of this my create_model() function is not working, since the output shape is not what is expected.

When it comes to encoding text data, here’s what the dimensions of form (BATCH_SIZE, TOKENS) mean for this problem:

  1. BATCH_SIZE: This refers to the number of rows of the text data that’s encoded.
  2. TOKENS: These are the numeric representations of the underlying vocabulary. The reason for encoding text as integers is so they can act as lookups into the embedding layer, which maps each word in the vocabulary to a dense vector.

Padding is useful when the output of an embedding layer is followed by a Dense layer, since the weights need to be fixed when the model is built. We can’t have varying sequence lengths within a single batch or across batches for this reason. Some folks leave this logic in place since they are experimenting with different NN architectures.
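
To make the shape bookkeeping concrete, here is a minimal sketch (the vocabulary size and embedding dimension are made up for the example). A batch of shape (BATCH_SIZE, TOKENS) comes out of the embedding layer as (BATCH_SIZE, TOKENS, EMBEDDING_DIM):

>>> import tensorflow as tf
>>> # hypothetical sizes: 1000-token vocabulary, 16-dimensional embeddings
>>> embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)
>>> batch = tf.zeros((32, 120), dtype=tf.int64)  # (BATCH_SIZE, TOKENS)
>>> embedding(batch).shape
TensorShape([32, 120, 16])

An extra axis in the input, e.g. (32, 2, 120), would carry through as (32, 2, 120, 16) and break the layers that follow, which matches the symptom above.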

Does this help figure out where shapes went wrong?

As far as I understand, you are referring to the ‘Embedding’ layer?

tf.keras.layers.Embedding([moderator edit - code removed])
These are the arguments I passed to the Embedding layer.

You are correct.

The details were provided for you to understand how integer token indices are used.
