C3 W2 Lab3 - Subword Tokenization

The code works in the Coursera notebook. However, I like to follow along in my own Jupyter notebook to practice importing datasets, etc.

When I run through the subword tokenization section I get the following error:

```text
----> 5 sequences = sequences.ragged_batch(batch_size=sequences.cardinality())

AttributeError: 'MapDataset' object has no attribute 'ragged_batch'
```

I played around with ChatGPT to find a workaround, and it suggested the following:

How to Handle Ragged Data Correctly

Since you seem to be working with variable-length data and are perhaps looking to batch it efficiently, here's how you can handle ragged data without using the non-existent ragged_batch method:

Option 1: Using padded_batch to Pad Sequences

This is the most straightforward method if your model does not natively support ragged tensors:

```python
# Suppose `train_data` is your dataset of raw text examples
train_sequences = train_data.map(lambda text: vectorize_layer(text))

# Pad every sequence to a fixed length and batch
train_padded = train_sequences.padded_batch(
    batch_size=32, padded_shapes=(MAX_LENGTH,), padding_values=0
)
```

Option 2: Convert to Ragged Tensor After Batching

If you prefer to keep the data in its ragged form and handle it as such within your model:

```python
# Plain .batch() fails on variable-length elements, so pad-batch first
# (padded_shapes is inferred; each batch is padded to its longest sequence)
train_batched = train_sequences.padded_batch(32)

# Strip the padding back off, converting each batch to a ragged tensor
train_ragged = train_batched.map(lambda x: tf.RaggedTensor.from_tensor(x, padding=0))
```

The problem is that the padded_batch approach then gives the following error:

```text
ValueError: The padded shape (120,) is not compatible with the shape () of the corresponding input component.
```

For my own understanding, can you provide some guidance on what's happening on my machine? Am I using the wrong version or missing a download?

Thanks

Josh

Hi @joshuasyoung

Maybe your TensorFlow version doesn't include ragged_batch; it's a newer method and was experimental in some releases. Check the TF version of the Coursera environment and make sure yours matches.
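For example, you can run this minimal check in both notebooks to compare (ragged_batch is a method of tf.data.Dataset in versions that provide it):

```python
import tensorflow as tf

# Print the installed TensorFlow version
print(tf.__version__)

# Check whether this version exposes Dataset.ragged_batch
print(hasattr(tf.data.Dataset, "ragged_batch"))
```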

Hope it helps! Feel free to ask if you need further assistance.

Thanks Alireza. I confirmed the versions used: Coursera is using TF 2.16.1 while my notebook has TF 2.18.0.

Since mine is the newer version, it seems this functionality may have changed. Do you know the best way to do this task in the newer version?

You're welcome! Happy to help :raised_hands:

I think you can either use padded_batch (the most straightforward option) or batch and then convert each batch to a RaggedTensor. A sketch of both approaches is below.
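Here is a minimal sketch of both options, assuming `sequences` is a tf.data.Dataset of variable-length 1-D integer sequences; the toy data below stands in for the lab's vectorized text, and `MAX_LENGTH` is a placeholder to adapt to your notebook:

```python
import tensorflow as tf

MAX_LENGTH = 120  # placeholder; match the value used in the lab

# Toy stand-in for the lab's vectorized dataset: variable-length int sequences
sequences = tf.data.Dataset.from_tensor_slices(
    tf.ragged.constant([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
)

# Option 1: pad to a fixed length and batch. Note that padded_shapes must
# describe every component of an element; if your elements are (text, label)
# pairs, the scalar label needs its own shape, e.g. ([MAX_LENGTH], []).
padded = sequences.padded_batch(
    batch_size=2, padded_shapes=(MAX_LENGTH,), padding_values=0
)

# Option 2: pad-batch (shapes inferred, padded to the longest in each batch),
# then strip the padding back off into a RaggedTensor
ragged = sequences.padded_batch(2).map(
    lambda x: tf.RaggedTensor.from_tensor(x, padding=0)
)

# On TF versions that provide it, Dataset.ragged_batch does this in one step:
# ragged = sequences.ragged_batch(batch_size=2)

for batch in ragged.take(1):
    print(batch)  # <tf.RaggedTensor [[1, 2, 3], [4, 5]]>
```

If you still see the shape () ValueError from padded_batch, it usually means a padded shape was applied to a scalar component (e.g. a label riding along with each sequence); passing a per-component padded_shapes structure, or omitting padded_shapes so it is inferred, should resolve it. And if you need to match an older environment, tf.data.experimental.dense_to_ragged_batch (used via dataset.apply(...)) is an earlier equivalent of ragged_batch.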