How to use tf.repeat and other built-in high-level functions on a Dataset?

I want to do something like the code below for an NER task, which aligns the WordPieces of each word with that word's tag:

import tensorflow as tf

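# Two words: the first has one WordPiece id, the second has three.
tokens = tf.ragged.constant([[4], [2, 5, 9]], dtype=tf.int32)
# One tag per word.
tags = tf.constant([3, 5], dtype=tf.int32)

# Flatten the WordPiece ids and repeat each tag once per WordPiece of its word.
flat_tokens = tokens.flat_values
duplicated_tags = tf.repeat(tags, tokens.row_lengths())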

print(flat_tokens.numpy())  # -> [4 2 5 9]
print(duplicated_tags.numpy())  # -> [3 5 5 5]

But the tokens and tags passed to tf.repeat should be datasets, namely the outputs of TextLineDataset. Is there a minimal way to do this?
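One possible approach, offered as a minimal sketch rather than a definitive answer: zip the two TextLineDataset objects and do the alignment inside Dataset.map, using the row lengths of the ragged split instead of a Python loop (which would not work in the graph mode that map uses). The file format here is an assumption made only for illustration: each line is one sentence, words are separated by spaces, and the WordPiece ids inside a word are separated by commas. The helper name align_tags and the temporary files are hypothetical as well.

```python
import os
import tempfile

import tensorflow as tf


def align_tags(token_line, tag_line):
    """Flatten WordPiece ids and repeat each word tag once per WordPiece.

    Assumed (hypothetical) line format: "4 2,5,9" for tokens, "3 5" for tags.
    """
    words = tf.strings.split(token_line)        # 1-D tensor, one entry per word
    pieces = tf.strings.split(words, sep=",")   # RaggedTensor: WordPieces per word
    flat_tokens = tf.strings.to_number(pieces.flat_values, out_type=tf.int32)
    tags = tf.strings.to_number(tf.strings.split(tag_line), out_type=tf.int32)
    # row_lengths() gives the number of WordPieces in each word.
    repeated_tags = tf.repeat(tags, pieces.row_lengths(), axis=0)
    return flat_tokens, repeated_tags


# Write two small demo files so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
tokens_path = os.path.join(tmp_dir, "tokens.txt")
tags_path = os.path.join(tmp_dir, "tags.txt")
with open(tokens_path, "w") as f:
    f.write("4 2,5,9\n7,1 8\n")
with open(tags_path, "w") as f:
    f.write("3 5\n2 6\n")

dataset = tf.data.Dataset.zip((
    tf.data.TextLineDataset(tokens_path),
    tf.data.TextLineDataset(tags_path),
)).map(align_tags)

results = [(t.tolist(), g.tolist()) for t, g in dataset.as_numpy_iterator()]
print(results[0])  # -> ([4, 2, 5, 9], [3, 5, 5, 5])
```

Each element of the resulting dataset is a pair of 1-D tensors of equal length, so sentences of different lengths would still need something like padded_batch before batching.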

Please move this question to General Discussion category since it’s not course related.

It is directly related to the NLP in TensorFlow course. This topic should be included when you update the course because of the problems we discussed.
It is directly connected to the tokenization topic. Most modern NLP tokenizers are WordPiece-based.

What I meant was that concepts like ragged tensors and wordpiece tokenization aren’t covered in the course material.

And I suggest including them :) At least WordPiece tokenization. As I recall, ragged tensors are discussed in another of your TensorFlow courses.

The staff have been notified about your suggestion.

Hi Mihail. Thank you for the suggestion. Unfortunately, we can’t tell you anytime soon whether this can be included in the course itself, but we’ll take note of it. Thanks again!