I want to ask: is everything we do to generate the data in the C3_W3_Assignment.ipynb assignment the same as what we will do in the last assignment of the attention course, C4_W4_Assignment?
# trax allows us to use combinators to build our data pipeline
data_pipeline = trax.data.Serial(
    # randomize the stream
    trax.data.Shuffle(),
    # tokenize the data
    trax.data.Tokenize(vocab_dir=VOCAB_DIR,
                       vocab_file=VOCAB_FILE),
    # filter out sequences that are too long
    trax.data.FilterByLength(2048),
    # bucket by length
    trax.data.BucketByLength(boundaries=[128, 256, 512, 1024],
                             batch_sizes=[16, 8, 4, 2, 1]),
    # add loss weights, but not to the padding tokens (i.e. 0)
    trax.data.AddLossWeights(id_to_mask=0)
)
# apply the data pipeline to our train and eval sets
train_stream = data_pipeline(stream(train_data))
eval_stream = data_pipeline(stream(eval_data))
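As I understand it, Serial just composes each step over the underlying generator, which is why applying the pipeline to a stream works like a function call. A minimal pure-Python sketch of that idea (toy steps standing in for Shuffle/Tokenize/etc., not the real trax API):

```python
def serial(*steps):
    """Compose generator transforms left to right, like trax.data.Serial."""
    def pipeline(stream):
        for step in steps:
            stream = step(stream)
        return stream
    return pipeline

# toy steps: each one takes a stream and yields a transformed stream
double = lambda stream: (x * 2 for x in stream)
keep_small = lambda stream: (x for x in stream if x < 10)

pipe = serial(double, keep_small)
result = list(pipe(iter([1, 2, 3, 4, 5, 6])))
# -> [2, 4, 6, 8]
```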
So, can we delete that huge function data_generator(batch_size, x, y, pad, shuffle=False, verbose=False) from C3_W3_Assignment.ipynb if we use this code above, for example?
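For comparison, the batching work that data_generator did by hand (padding each batch and masking the loss on padding tokens) is roughly this, in a minimal numpy sketch with hypothetical helper names, not the actual assignment code:

```python
import numpy as np

PAD_ID = 0  # assumption: 0 is the padding token id, matching id_to_mask=0 above

def pad_batch(sequences):
    """Pad a list of token-id lists to the longest length with PAD_ID."""
    max_len = max(len(seq) for seq in sequences)
    return np.array([seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences])

def add_loss_weights(batch, id_to_mask=PAD_ID):
    """Return (batch, weights): weight 1.0 for real tokens, 0.0 for padding."""
    weights = (batch != id_to_mask).astype(np.float32)
    return batch, weights

batch = pad_batch([[5, 6, 7], [8, 9]])
batch, weights = add_loss_weights(batch)
# batch   -> [[5 6 7]
#             [8 9 0]]
# weights -> [[1. 1. 1.]
#             [1. 1. 0.]]
```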
I think we could just delete trax.data.FilterByLength(2048), but maybe it can be useful in some cases too.
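For reference, FilterByLength just drops examples longer than the cap, which mainly matters when memory is tight. A rough pure-Python equivalent (an illustration, not trax's implementation):

```python
def filter_by_length(max_length):
    """Drop token sequences longer than max_length, like trax.data.FilterByLength."""
    def fn(stream):
        return (seq for seq in stream if len(seq) <= max_length)
    return fn

keep = filter_by_length(4)
result = list(keep(iter([[1, 2], [1, 2, 3, 4, 5], [7]])))
# -> [[1, 2], [7]]
```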