C3 W2 Lab3 - Subword Tokenization

The code works in the Coursera notebook. However, I like to follow along in my own Jupyter notebook to practice importing datasets, etc.

When I run through the subword tokenization section I get the following error:

```text
----> 5 sequences = sequences.ragged_batch(batch_size=sequences.cardinality())

AttributeError: 'MapDataset' object has no attribute 'ragged_batch'
```

I played around with ChatGPT to find a workaround, and it suggested the following:

How to Handle Ragged Data Correctly

Since you seem to be working with variable-length data and are perhaps looking to batch it efficiently, here's how you can handle ragged data without using the non-existent ragged_batch method:

Option 1: Using padded_batch to Pad Sequences

This is the most straightforward method if your model does not natively support ragged tensors:

```python
# Suppose `train_data` is your dataset of raw text examples
train_sequences = train_data.map(lambda text: vectorize_layer(text))

# Pad every sequence to a fixed length and batch
train_padded = train_sequences.padded_batch(
    batch_size=32, padded_shapes=(MAX_LENGTH,), padding_values=0
)
```

Option 2: Convert to Ragged Tensor After Batching

If you prefer to keep the data in its ragged form and handle it as such within your model:

```python
# Plain .batch() fails on variable-length elements, so pad-batch first
# (padded_shapes is inferred; each batch is padded to its longest sequence)
train_batched = train_sequences.padded_batch(32)

# Strip the padding back off, converting each batch to a ragged tensor
train_ragged = train_batched.map(lambda x: tf.RaggedTensor.from_tensor(x, padding=0))
```

The problem is that the padded_batch approach then gives the following error:

```text
ValueError: The padded shape (120,) is not compatible with the shape () of the corresponding input component.
```

For my own understanding, can you provide some guidance on what's happening on my machine? Am I using the wrong version or missing a download?

Thanks

Josh

Hi @joshuasyoung

Maybe your TensorFlow version doesn't include ragged_batch; it's a newer method and was experimental in some releases. Check the TF version of the Coursera environment and make sure yours matches.
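For example, you can run this minimal check in both notebooks to compare (ragged_batch is a method of tf.data.Dataset in versions that provide it):

```python
import tensorflow as tf

# Print the installed TensorFlow version
print(tf.__version__)

# Check whether this version exposes Dataset.ragged_batch
print(hasattr(tf.data.Dataset, "ragged_batch"))
```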

Hope it helps! Feel free to ask if you need further assistance.

Thanks Alireza. I confirmed the versions used: Coursera is using TF 2.16.1 while my notebook has TF 2.18.0.

Since mine is the newer version, it seems this functionality may have changed. Do you know the best way to do this task in the newer version?

You're welcome! Happy to help :raised_hands:

I think you can either use padded_batch (the most straightforward option) or batch and then convert each batch to a RaggedTensor. A sketch of both approaches is below.
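Here is a minimal sketch of both options, assuming `sequences` is a tf.data.Dataset of variable-length 1-D integer sequences; the toy data below stands in for the lab's vectorized text, and `MAX_LENGTH` is a placeholder to adapt to your notebook:

```python
import tensorflow as tf

MAX_LENGTH = 120  # placeholder; match the value used in the lab

# Toy stand-in for the lab's vectorized dataset: variable-length int sequences
sequences = tf.data.Dataset.from_tensor_slices(
    tf.ragged.constant([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
)

# Option 1: pad to a fixed length and batch. Note that padded_shapes must
# describe every component of an element; if your elements are (text, label)
# pairs, the scalar label needs its own shape, e.g. ([MAX_LENGTH], []).
padded = sequences.padded_batch(
    batch_size=2, padded_shapes=(MAX_LENGTH,), padding_values=0
)

# Option 2: pad-batch (shapes inferred, padded to the longest in each batch),
# then strip the padding back off into a RaggedTensor
ragged = sequences.padded_batch(2).map(
    lambda x: tf.RaggedTensor.from_tensor(x, padding=0)
)

# On TF versions that provide it, Dataset.ragged_batch does this in one step:
# ragged = sequences.ragged_batch(batch_size=2)

for batch in ragged.take(1):
    print(batch)  # <tf.RaggedTensor [[1, 2, 3], [4, 5]]>
```

If you still see the shape () ValueError from padded_batch, it usually means a padded shape was applied to a scalar component (e.g. a label riding along with each sequence); passing a per-component padded_shapes structure, or omitting padded_shapes so it is inferred, should resolve it. And if you need to match an older environment, tf.data.experimental.dense_to_ragged_batch (used via dataset.apply(...)) is an earlier equivalent of ragged_batch.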