Question on size of inputs in Transform preprocessing_fn

This question started as a question about how the function tft.compute_and_apply_vocabulary works across a Dataset. But my main uncertainty is really about the broader preprocessing_fn in TFX Transform, which takes inputs as its sole argument.

For clarity, I’m referencing this function in the Transform component.

def preprocessing_fn(inputs):

I had thought that inputs is a dictionary mapping keys to values for a single example (like the DictReader output), given how the values are referenced by their keys.

When inputs is used in the preprocessing_fn, as in tft.scale_to_0_1(inputs[key]), is inputs[key] a single example or a tensor of all the data for that key?

The tft.compute_and_apply_vocabulary or tft.scale_to_0_1 functions must initially be called across the whole dataset to be able to effectively create the string-to-integer mapping or know the value to scale the tensors by.

I’m just unsure whether inputs[key] is a single example or a tensor with all the examples (for a given key).

Thanks for the help!

Hi @dzubke,
Thanks for your question!

inputs is indeed a dictionary that maps the feature keys to the raw, untransformed features.

I hope the following snippet from this link answers your question on the input tensor:

The core tf.Transform API requires a user to construct a
“preprocessing function” that accepts and returns Tensors. This function is
built by composing regular functions built from TensorFlow ops, as well as
special functions we refer to as Analyzers. Analyzers behave similarly to
TensorFlow ops but require a full pass over the whole dataset to compute their
output value. The analyzers are defined in analyzers.py, while this module
provides helper functions that call analyzers and then use the results of the
analyzers to transform the original data.
The user-defined preprocessing function should accept and return Tensors that
are batches from the dataset, whose batch size may vary. For example the
following preprocessing function centers the input ‘x’ while returning ‘y’
unchanged.

Please note that during serving, single examples are transformed using the constants stored in the Transform graph, so a full pass over the dataset is no longer needed before applying the transform.
You can also have another look at the TensorFlow Transform video from 07:30 onwards, where this concept is explained.
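
To make the two phases concrete, here is a minimal plain-Python sketch of the idea. It does not use tf.Transform itself: the helper names analyze_min_max and scale_batch are hypothetical, the toy dataset is made up, and it assumes the maximum is strictly greater than the minimum. The "analyzer" makes one full pass over all batches to get constants, and the transform is then applied to batches of varying size, or to a single example at serving time.

```python
from typing import Dict, Iterable, List

Batch = Dict[str, List[float]]


def analyze_min_max(batches: Iterable[Batch], key: str) -> Dict[str, float]:
    """Analyze phase: one full pass over every batch to find the global min and max."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch[key]))
        hi = max(hi, max(batch[key]))
    return {"min": lo, "max": hi}


def scale_batch(batch: Batch, key: str, constants: Dict[str, float]) -> List[float]:
    """Transform phase: scale one batch using only the precomputed constants."""
    span = constants["max"] - constants["min"]  # assumed to be non-zero here
    return [(value - constants["min"]) / span for value in batch[key]]


# Toy dataset split into batches of different sizes (hypothetical values).
dataset = [
    {"x": [2.0, 4.0, 6.0]},
    {"x": [10.0]},
    {"x": [0.0, 8.0]},
]

# Full pass over the dataset happens once, in the analyze phase.
constants = analyze_min_max(dataset, "x")

# Each batch is transformed independently with the same constants.
scaled_batches = [scale_batch(batch, "x", constants) for batch in dataset]

# At serving time a single example needs only the stored constants.
serving_result = scale_batch({"x": [5.0]}, "x", constants)
```

In this sketch inputs[key] corresponds to batch["x"]: a batch of values, not a single example and not the whole dataset.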

Best regards,
Maarten


Fantastic, @mjsmid! Thank you very much for that help! I should have thought to review the Transform lecture again. I now recall the difference between the TF ops and Analyzers.

And digging into the mappers.py file you linked, I see that the tft.compute_and_apply_vocabulary function does indeed call the analyzers.vocabulary function, which is how tft.compute_and_apply_vocabulary scans across the whole dataset to compute the vocabulary.
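
As a rough illustration of that compute-then-apply split, here is a plain-Python sketch. It is not the tf.Transform implementation: compute_vocabulary and apply_vocabulary are hypothetical helpers, and the frequency-based ordering, the alphabetical tie-break, and the -1 default for unseen tokens are choices made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def compute_vocabulary(batches: Iterable[List[str]]) -> Dict[str, int]:
    """Analyzer-like step: count tokens over the whole dataset, then assign ids."""
    counts: Counter = Counter()
    for batch in batches:
        counts.update(batch)
    # Most frequent token gets id 0; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


def apply_vocabulary(batch: List[str], vocab: Dict[str, int], default: int = -1) -> List[int]:
    """Mapper-like step: look up each token of one batch in the fixed vocabulary."""
    return [vocab.get(token, default) for token in batch]


# Toy batches of string features (hypothetical values).
batches = [["cat", "dog"], ["dog"], ["bird", "dog", "cat"]]

vocab = compute_vocabulary(batches)  # requires the full pass
encoded = [apply_vocabulary(batch, vocab) for batch in batches]  # per batch
unseen = apply_vocabulary(["fish", "dog"], vocab)  # token not in the vocabulary
```

The vocabulary can only be built after every batch has been seen, but once it exists, each batch is encoded on its own.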

Thank you for that additional point regarding serving.

This was exactly the information I needed to help solidify my understanding. Thanks again!!
