This question started as a question about how the function tft.compute_and_apply_vocabulary works across a dataset, but really my main uncertainty centers on the broader preprocessing_fn in TFX Transform, which takes inputs as its sole argument.
For clarity, I’m referencing this function in the Transform component.
I had thought that inputs is a dictionary mapping keys to the values of a single example (like the DictReader output), given how the values are referenced by their keys. But when inputs is used in the preprocessing_fn, for example with tft.scale_to_0_1(inputs[key]), is inputs[key] a single example or a tensor of all the data for that key?
It seems that functions like tft.compute_and_apply_vocabulary and tft.scale_to_0_1 must initially be called across the whole dataset in order to build the string-to-integer mapping or to know the values to scale the tensors by.
So I’m just confused whether inputs[key] is a single example or a tensor with all the examples for a given key.
Thanks for the help!
Thanks for your question!
The inputs is indeed a dictionary that maps the feature keys to the raw, untransformed features.
I hope the following snippet from this link answers your question about the input tensors:
The core tf.Transform API requires a user to construct a
“preprocessing function” that accepts and returns Tensors. This function is
built by composing regular functions built from TensorFlow ops, as well as
special functions we refer to as Analyzers. Analyzers behave similarly to
TensorFlow ops but require a full pass over the whole dataset to compute their
output value. The analyzers are defined in analyzers.py, while this module
provides helper functions that call analyzers and then use the results of the
analyzers to transform the original data.
The user-defined preprocessing function should accept and return Tensors that
are batches from the dataset, whose batch size may vary. For example, the
following preprocessing function centers the input ‘x’ while returning ‘y’.
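The code example from that docstring isn’t reproduced here, but the analyzer-vs-mapper idea can be illustrated with a small pure-Python sketch (no actual tensorflow_transform; the helper names full_pass_mean and center_batch are hypothetical stand-ins for an analyzer like tft.mean and the subtraction applied in a preprocessing_fn):

```python
# "Analyzer": needs a full pass over the whole dataset to compute one constant.
def full_pass_mean(dataset_batches):
    values = [v for batch in dataset_batches for v in batch]
    return sum(values) / len(values)

# "Mapper": applies the analyzer's constant one batch at a time,
# analogous to x - tft.mean(x) inside a preprocessing_fn.
def center_batch(batch, mean):
    return [v - mean for v in batch]

# Batches may have varying sizes, as in the docstring.
batches = [[1.0, 2.0], [3.0, 4.0, 5.0]]
mean = full_pass_mean(batches)                     # full pass -> 3.0
centered = [center_batch(b, mean) for b in batches]
```

So inputs[key] inside the preprocessing_fn is a batch of examples, while the analyzer internally sees the whole dataset.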
Please note that during serving, single examples are transformed using the constants stored in the Transform graph, so there is no longer any need to make a full pass over a batch before doing the transform.
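In other words, the full-pass statistics become constants baked into the graph, and serving applies them to one example at a time. A rough pure-Python illustration (not real tf.Transform APIs; the min/max here just mimic what tft.scale_to_0_1 would compute during analysis):

```python
# Analysis phase (training time): one full pass computes the constants,
# like the min and max that tft.scale_to_0_1 needs.
dataset = [4.0, 8.0, 6.0, 2.0, 10.0]
x_min, x_max = min(dataset), max(dataset)   # constants saved in the Transform graph

# Serving time: a single incoming example is transformed with the stored
# constants -- no pass over the dataset is required anymore.
def serve_transform(example):
    return (example - x_min) / (x_max - x_min)
```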
You can also have another look at the TensorFlow Transform video from 07:30 onwards, where this concept is explained.
Fantastic, @mjsmid! Thank you very much for that help! I should have thought to review the Transform lecture again. I now recall the difference between the TF ops and Analyzers.
And digging into the link you shared to the mappers.py file, I see that tft.compute_and_apply_vocabulary does indeed call the analyzers.vocabulary function (link), which is how tft.compute_and_apply_vocabulary scans the whole dataset to compute the vocabulary.
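To make that concrete, here is a toy sketch of what a vocabulary analyzer plus mapper does conceptually (pure Python with hypothetical helper names; not the actual analyzers.vocabulary implementation):

```python
# Full-pass "analyzer": scan every example to build the string-to-integer
# vocabulary, ordered by descending frequency.
def build_vocab(all_values):
    counts = {}
    for v in all_values:
        counts[v] = counts.get(v, 0) + 1
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}

# "Mapper": apply the finished vocabulary one batch at a time; unseen
# values map to -1, mirroring a simple out-of-vocabulary default.
def apply_vocab(batch, vocab):
    return [vocab.get(v, -1) for v in batch]

data = ["cat", "dog", "cat", "bird", "cat", "dog"]   # the "whole dataset"
vocab = build_vocab(data)
ids = apply_vocab(["dog", "cat", "fish"], vocab)
```

The key point is the same as for scaling: building the vocabulary requires the full pass, while applying it is a per-batch (or, at serving time, per-example) lookup.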
Thank you for that additional point regarding serving.
This was exactly the information I needed to help solidify my understanding. Thanks again!!