Recommended way to tokenize text in new code

The documentation page for the tf.keras.preprocessing.text.Tokenizer class recommends not using it for new code:

Deprecated: tf.keras.preprocessing.text.Tokenizer does not operate on tensors and is not recommended for new code. Prefer tf.keras.layers.TextVectorization which provides equivalent functionality through a layer which accepts tf.Tensor input. See the text loading tutorial for an overview of the layer and text handling in TensorFlow.
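For context, here is a rough sketch of the deprecated API being replaced (the num_words and oov_token values are just illustrative, and the two-sentence corpus is a stand-in for the one below); note that the old Tokenizer provided a direct inverse, sequences_to_texts:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog', 'I love my cat']  # stand-in corpus

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)                    # build the vocabulary
sequences = tokenizer.texts_to_sequences(['i love tensorflow'])
print(tokenizer.sequences_to_texts(sequences))       # ['i love <OOV>']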

I’ve been trying to port the basic examples from the first and second labs to the new approach. I’d appreciate some help with the last part (getting the text back from the tokens):

import tensorflow as tf
import numpy as np

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=100,
    standardize='lower_and_strip_punctuation',
    output_sequence_length=3  # pad or truncate every output to 3 tokens
)

# Build the vocabulary from the corpus
vectorize_layer.adapt(sentences)

# Index 0 is the padding token '' and index 1 is the OOV token '[UNK]'
print(vectorize_layer.get_vocabulary())

# 'tensorflow' and 'playing' are not in the vocabulary, and the second
# sentence is truncated to its first 3 tokens
sentences_to_tokens = vectorize_layer([
    'i love tensorflow',
    'i love playing with this'
])

# Is there a better way to do this?
def textualize(tokenized_sentence):
    vocabulary = vectorize_layer.get_vocabulary()  # fetch once, not per token
    sentence = []
    for token in tokenized_sentence:
        sentence.append(vocabulary[token])
    return ' '.join(sentence)

for tokenized_sentence in sentences_to_tokens:
    print(textualize(tokenized_sentence.numpy()))

# i love [UNK]
# i love [UNK]
>>> from tensorflow.keras.layers import StringLookup
>>> string_lookup = StringLookup(vocabulary=vectorize_layer.get_vocabulary(include_special_tokens=False), invert=True)
>>> string_lookup(sentences_to_tokens - 1)
<tf.Tensor: shape=(2, 3), dtype=string, numpy=
array([[b'i', b'love', b'[UNK]'],
       [b'i', b'love', b'[UNK]']], dtype=object)>
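The - 1 offset works because TextVectorization reserves index 0 for the padding token '' and index 1 for '[UNK]', while a StringLookup built with invert=True from the vocabulary without special tokens maps index 0 to '[UNK]' and index 1 to the first real word. One caveat: any padding ids (0) become -1 under this scheme. A sketch of an alternative that avoids the offset (and maps padding back to the empty string) is to index the full vocabulary with tf.gather:

vocab = tf.constant(vectorize_layer.get_vocabulary())  # includes '' and '[UNK]'
words = tf.gather(vocab, sentences_to_tokens)          # token ids -> strings
print(tf.strings.reduce_join(words, separator=' ', axis=-1))
# tf.Tensor([b'i love [UNK]' b'i love [UNK]'], shape=(2,), dtype=string)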
