Question on size of inputs in Transform preprocessing_fn

dzubke · August 22, 2021, 3:50pm

This question started as a question about how the function tft.compute_and_apply_vocabulary works across a Dataset. But my main uncertainty is really about the broader preprocessing_fn in TFX Transform, which takes inputs as its sole argument.

For clarity, I’m referencing this function in the Transform component.

def preprocessing_fn(inputs):

I had thought that inputs is a dictionary mapping keys to values for a single example (like the DictReader output), given how the values are referenced by their keys.

When inputs is used in the preprocessing_fn, as in
tft.scale_to_0_1(inputs[key]), is inputs[key] a single example or a tensor of all the data for that given key?

The tft.compute_and_apply_vocabulary or tft.scale_to_0_1 functions must initially be called across the whole dataset to be able to effectively create the string-to-integer mapping or know the value to scale the tensors by.

I’m just confused about whether inputs[key] is a single example or a tensor with all the examples (for a given key).

Thanks for the help!

mjsmid · August 23, 2021, 5:42pm

Hi @dzubke,
Thanks for your question!

inputs is indeed a dictionary that maps the feature keys to the raw, untransformed features.

I hope the following snippet from this link answers your question on the input tensor:

The core tf.Transform API requires a user to construct a
“preprocessing function” that accepts and returns Tensors. This function is
built by composing regular functions built from TensorFlow ops, as well as
special functions we refer to as Analyzers. Analyzers behave similarly to
TensorFlow ops but require a full pass over the whole dataset to compute their
output value. The analyzers are defined in analyzers.py, while this module
provides helper functions that call analyzers and then use the results of the
analyzers to transform the original data.
The user-defined preprocessing function should accept and return Tensors that
are batches from the dataset, whose batch size may vary. For example the
following preprocessing function centers the input ‘x’ while returning ‘y’
unchanged.
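The example that the quoted passage refers to is not reproduced here. As a rough stand-in, below is a plain-Python sketch of the same idea (it is not tf.Transform code, and the helper names analyze_mean and center_batch are made up for illustration). The analyzer step needs one full pass over every batch to get the mean of 'x'; the transform step then works on one batch at a time, and the batch sizes are allowed to differ.

```python
from typing import Dict, List

Batch = Dict[str, List[float]]


def analyze_mean(batches: List[Batch], key: str) -> float:
    """Hypothetical analyzer: one full pass over all batches to get a mean."""
    total = 0.0
    count = 0
    for batch in batches:
        total += sum(batch[key])
        count += len(batch[key])
    return total / count


def center_batch(batch: Batch, x_mean: float) -> Batch:
    """Per-batch transform: centers 'x' with the precomputed mean, keeps 'y'."""
    return {
        "x_centered": [x - x_mean for x in batch["x"]],
        "y": list(batch["y"]),
    }


# Three batches with different batch sizes (3, 1 and 3 examples).
batches = [
    {"x": [1.0, 2.0, 3.0], "y": [0.0, 1.0, 0.0]},
    {"x": [4.0], "y": [1.0]},
    {"x": [5.0, 6.0, 7.0], "y": [1.0, 0.0, 1.0]},
]

# Phase 1: the analysis pass sees the whole dataset.
x_mean = analyze_mean(batches, "x")

# Phase 2: each batch is transformed independently using the constant.
transformed = [center_batch(b, x_mean) for b in batches]
```

In this picture, inputs[key] inside preprocessing_fn corresponds to one batch["x"], not to a single example and not to the whole dataset.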

Please note that during serving, single examples are transformed using the constants stored in the Transform Graph, so there is no longer any need to make a full pass over the dataset before applying the transform.
You can also have another look at the TensorFlow Transform video from 07:30 onwards, where this concept is explained.
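To make the serving point concrete, here is a small plain-Python sketch (again not the tf.Transform implementation; analyze_min_max and make_scaler are hypothetical names). The minimum and maximum are computed once over the training data and then baked into a function that can scale a single example without seeing any other data.

```python
from typing import Callable, List, Tuple


def analyze_min_max(values: List[float]) -> Tuple[float, float]:
    """Hypothetical analysis step: full pass over the training values."""
    return min(values), max(values)


def make_scaler(lo: float, hi: float) -> Callable[[float], float]:
    """Returns a function that scales one value using fixed constants."""
    def scale(x: float) -> float:
        return (x - lo) / (hi - lo)
    return scale


# Training time: the constants come from the whole dataset.
training_values = [10.0, 20.0, 30.0, 50.0]
lo, hi = analyze_min_max(training_values)

# Serving time: only the stored constants are needed for a single example.
scale = make_scaler(lo, hi)
scaled = scale(30.0)
```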

Best regards,
Maarten

dzubke · August 24, 2021, 4:23pm

Fantastic, @mjsmid! Thank you very much for that help! I should have thought to review the Transform lecture again. I now recall the difference between the TF ops and Analyzers.

And digging into the mappers.py file you linked, I see that the tft.compute_and_apply_vocabulary function does indeed call the analyzers.vocabulary function, which is how tft.compute_and_apply_vocabulary scans across the whole dataset to compute the vocabulary.
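For my own notes, here is how I picture that split in plain Python (a sketch only, not what mappers.py actually does; build_vocabulary and apply_vocabulary are names I made up, and ordering by frequency with -1 for unknown tokens is an assumption for illustration). The vocabulary is built from a pass over every batch, and then applied to each batch separately.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(batches: List[List[str]]) -> Dict[str, int]:
    """Hypothetical analyzer: counts tokens over the whole dataset."""
    counts = Counter(token for batch in batches for token in batch)
    # Most frequent token gets index 0; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {token: index for index, (token, _) in enumerate(ordered)}


def apply_vocabulary(batch: List[str], vocab: Dict[str, int]) -> List[int]:
    """Per-batch mapper: looks up each token, -1 if it was never seen."""
    return [vocab.get(token, -1) for token in batch]


batches = [["cat", "dog", "cat"], ["bird", "cat", "dog"]]

# Full pass over the dataset to get the string-to-integer mapping.
vocab = build_vocabulary(batches)

# The mapping is then applied batch by batch, including to unseen tokens.
ids = apply_vocabulary(["dog", "fish"], vocab)
```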

Thank you for that additional point regarding serving.

This was exactly the information I needed to help solidify my understanding. Thanks again!!
