Recommended way to tokenize text in new code

The documentation page for the tf.keras.preprocessing.text.Tokenizer class recommends not using it for new code:

Deprecated: tf.keras.preprocessing.text.Tokenizer does not operate on tensors and is not recommended for new code. Prefer tf.keras.layers.TextVectorization which provides equivalent functionality through a layer which accepts tf.Tensor input. See the text loading tutorial for an overview of the layer and text handling in TensorFlow.
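For context, here is a rough sketch of the deprecated API being replaced (the num_words and oov_token values are just illustrative, and the two-sentence corpus is a stand-in for the one below); note that the old Tokenizer provided a direct inverse, sequences_to_texts:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog', 'I love my cat']  # stand-in corpus

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)                    # build the vocabulary
sequences = tokenizer.texts_to_sequences(['i love tensorflow'])
print(tokenizer.sequences_to_texts(sequences))       # ['i love <OOV>']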

I’ve been trying to port the basic examples from the first and second labs to the new approach. I’d appreciate some help with the last part (getting the text back from the tokens):

import tensorflow as tf
import numpy as np

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=100,
    standardize='lower_and_strip_punctuation',
    output_sequence_length=3  # pad or truncate every output to 3 tokens
)

# Build the vocabulary from the corpus
vectorize_layer.adapt(sentences)

# Index 0 is the padding token '' and index 1 is the OOV token '[UNK]'
print(vectorize_layer.get_vocabulary())

# 'tensorflow' and 'playing' are not in the vocabulary, and the second
# sentence is truncated to its first 3 tokens
sentences_to_tokens = vectorize_layer([
    'i love tensorflow',
    'i love playing with this'
])

# Is there a better way to do this?
def textualize(tokenized_sentence):
    vocabulary = vectorize_layer.get_vocabulary()  # fetch once, not per token
    sentence = []
    for token in tokenized_sentence:
        sentence.append(vocabulary[token])
    return ' '.join(sentence)

for tokenized_sentence in sentences_to_tokens:
    print(textualize(tokenized_sentence.numpy()))

# i love [UNK]
# i love [UNK]
>>> from tensorflow.keras.layers import StringLookup
>>> string_lookup = StringLookup(vocabulary=vectorize_layer.get_vocabulary(include_special_tokens=False), invert=True)
>>> string_lookup(sentences_to_tokens - 1)
<tf.Tensor: shape=(2, 3), dtype=string, numpy=
array([[b'i', b'love', b'[UNK]'],
       [b'i', b'love', b'[UNK]']], dtype=object)>
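The - 1 offset works because TextVectorization reserves index 0 for the padding token '' and index 1 for '[UNK]', while a StringLookup built with invert=True from the vocabulary without special tokens maps index 0 to '[UNK]' and index 1 to the first real word. One caveat: any padding ids (0) become -1 under this scheme. A sketch of an alternative that avoids the offset (and maps padding back to the empty string) is to index the full vocabulary with tf.gather:

vocab = tf.constant(vectorize_layer.get_vocabulary())  # includes '' and '[UNK]'
words = tf.gather(vocab, sentences_to_tokens)          # token ids -> strings
print(tf.strings.reduce_join(words, separator=' ', axis=-1))
# tf.Tensor([b'i love [UNK]' b'i love [UNK]'], shape=(2,), dtype=string)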
