Limitations of EmbeddingBag and Manual Pooling Classifiers

Hi, developers of this course,

Video Title and Ungraded Lab Title: Building a Simple Text Classifier in PyTorch

I was going through the video and lab explanation of the simple text classifier, which uses EmbeddingBag and manual pooling techniques to classify recipes into distinct fruit or vegetable classes.

I just wanted to add that both of these techniques would fail when it comes to building a text classifier for synonyms or opposite words, since the model works only on word embeddings pooled by mean, max, or sum.
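To show what I mean by the pooling step, here is a minimal sketch (the toy vocabulary and dimensions are purely my own assumptions, not the lab's code): once the title is averaged into a single vector, word order and context are gone, so "roasted chicken" cannot modulate "lemon".

```python
import torch
import torch.nn as nn

# Minimal sketch with an assumed toy vocabulary (not the lab's actual code):
# pooling collapses a title into one order-free vector.
vocab = {"<pad>": 0, "lemon": 1, "herb": 2, "roasted": 3, "chicken": 4, "toast": 5}
bag = nn.EmbeddingBag(num_embeddings=len(vocab), embedding_dim=8, mode="mean")

title = torch.tensor([[vocab["lemon"], vocab["herb"], vocab["roasted"], vocab["chicken"]]])
pooled = bag(title)                 # shape (1, 8): one averaged vector per title

# Manual pooling does the same thing with an ordinary nn.Embedding:
emb = nn.Embedding(len(vocab), 8)
vectors = emb(title)                # shape (1, 4, 8)
mean_pool = vectors.mean(dim=1)     # same idea as EmbeddingBag(mode="mean")
max_pool = vectors.max(dim=1).values
sum_pool = vectors.sum(dim=1)
```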

I understand you probably wanted to explain the two concepts here, but mentioning the limitations of these two techniques would also have added to the quality of the course content.

For example, lemon is correctly labelled as a fruit, but the recipe **Lemon and Herb Roasted Chicken** is labelled as a fruit recipe. That is poor labelling in terms of realistic contextual understanding of the recipe: lemon is a flavouring and the main ingredient is the chicken, so this recipe should fall under neither the fruit-recipe nor the vegetable-recipe category, but rather a separate mixed-recipe category.

So the model is clearly just working based on the encoded labels rather than on EmbeddingBag pooling or manual pooling.

Even the Ungraded Lab: Fine Tuning Pre-Trained Text Classifier has the same labelling issue: Avocado Toast is labelled as Vegetable there, whereas avocado is considered and labelled as a fruit in the EmbeddingBag pooling and manual pooling text classifier (please compare the two images).

Hi @Deepti_Prasad,

Thank you for your keen observation regarding the limitations of EmbeddingBag and manual pooling, particularly how they struggle with deeper semantic relationships like synonyms. You are absolutely right that these techniques often fall short compared to more advanced architectures when dealing with complex context.

However, regarding the specific labeling issues you noticed (e.g., “Lemon and Herb Roasted Chicken” or the “Avocado Toast” inconsistency), it is important to note that the primary focus of these labs is to teach the mechanics of the pipeline, such as preprocessing, batching with collate_fn, and fine-tuning, rather than to build a production-ready food classifier.
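To be concrete about what I mean by "mechanics", here is an illustrative sketch of the kind of collate_fn the labs focus on (not the exact lab code; the tiny in-memory dataset below is just for demonstration): it pads the variable-length token sequences in a batch and stacks the labels.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Illustrative sketch only: pad variable-length titles and batch the labels.
def collate_fn(batch):
    token_ids, labels = zip(*batch)            # batch of (tensor, int) pairs
    padded = pad_sequence(token_ids,
                          batch_first=True,
                          padding_value=0)     # pad to the longest title in the batch
    return padded, torch.tensor(labels)

# Toy usage with made-up (token_ids, label) pairs.
toy_data = [(torch.tensor([1, 2, 3, 4]), 0), (torch.tensor([5, 6]), 1)]
loader = DataLoader(toy_data, batch_size=2, collate_fn=collate_fn)
```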

The “poor labeling” is actually a known constraint of the dataset used for this pedagogical exercise, and the notebooks explicitly warn about this difference between the training labels (derived from ingredients) and the model input (restricted to titles).

In the “Building a Simple Text Classifier” Lab: The notebook explicitly sets this expectation in the Testing the Best Model on New Examples section:

Note: Remember, the model’s predictions are based only on the words in the recipe’s name. It was never shown the ingredients list, so it has no knowledge of whether fruits or vegetables are the dominant ingredient. A recipe’s name can sometimes be misleading, and the model’s classification will reflect only what it has learned from the title’s text.

It also explains earlier that the ground truth was generated by scanning ingredients (which the model never sees), creating the very disconnect you observed:

…It scanned each recipe’s ingredients for a predefined list of common fruit and vegetable keywords…

In the “Fine Tuning Pre-Trained Text Classifier” Lab: This context is reiterated. The Revisiting Recipe Dataset section reminds us we are using the same subset, and the Testing the Fine-tuned BERT Model section includes the exact same warning:

A recipe’s name can sometimes be misleading, and the model’s classification will reflect only what it has learned from the title’s text.

The takeaway here is the successful implementation of the process (building the dataset class, handling padding, training loops), rather than the quality of the model’s output on this specific, noisy dataset. The fact that the models struggle with “Lemon Chicken” or “Avocado Toast” effectively demonstrates exactly why we need the advanced techniques covered later in the courses.

Best,
Mubsi


Hi @Mubsi

How can one treat the basic correlation between the dataset and the correct model output as separate entities and summarise it as a successful implementation of the process (building the dataset class, handling padding, training loops)?

With respectful consideration, I strongly disagree with this view, as the exercise is not only about explaining various data-handling techniques, from embeddings to manual pooling to fine-tuning, but also about a successful outcome for the model, i.e. detecting the correct ingredient type.

Here is my take on how this issue could have been addressed. In the model architecture in both labs, only one linear layer was used to classify the two recipe types, which contributes to this incorrect output (of course, the core issue is the improper keyword scan, because I don't know how the scanning detected lemon in a lemon-flavoured roasted chicken; could it smell it?). This is where the correlation between a fruit and the complete recipe statement and/or image should have been addressed, for example by adding another hidden layer for adjunct/main ingredients that could detect chicken (but if, as you say, the ground-truth scanning could not detect chicken, then surely the data-processing step for these labs needs to be improved).

Another linear hidden layer would have helped correlate the fruit or vegetable type with the adjunct ingredient, so that when a lemon flavour was mentioned alongside chicken the model could have labelled it as belonging to neither the fruit nor the vegetable category.
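To illustrate my suggestion, here is a rough sketch (the dimensions and class count are hypothetical, not taken from the lab): replace the single linear layer with a small MLP head and widen the output so a mixed/other category becomes possible.

```python
import torch.nn as nn

# Rough sketch of my suggestion (hypothetical dimensions, not the lab's code):
# an extra hidden layer plus a third output class for "neither fruit nor vegetable".
embed_dim = 64
hidden_dim = 32
num_classes = 3  # fruit, vegetable, mixed/other

classifier_head = nn.Sequential(
    nn.Linear(embed_dim, hidden_dim),  # additional hidden layer
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),
)
```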

My overall point is that when these assignments are designed to explain a particular step or part of the model-training or data-processing process, every aspect of the outcome needs to be considered while creating, designing, or producing them, so that the result is a successful representation of the course and not a mere collection of topics.

The reason is that when learners set out to understand a concept, they are either completely unaware of the topic or only partially aware. When an assignment overlooks the data labelling or processing needed to get the correct model output, as a learner I am disappointed not even to be made aware of such a serious mistake. When models go into production and fail, team members either adjust to the outcome or look for the core issue behind the failure, and in this situation your response tells me to adjust to the assignment's presented outcome rather than address the data mishandling in the assignment.

Regards

DP

Hi @Deepti_Prasad,

I appreciate your engagement, but I need to be very definite about the points you are raising. The issues you are describing, regarding dataset curation, model expectations, and “success”, are all explicitly addressed in the markdown text of the notebooks. While minor human oversight errors may exist, from the perspective of the technical implementation, the learning goals, and the pipeline mechanics, there are no mistakes in these notebooks.

“The Recipe Dataset” section in the notebook (Lab 3) clearly explains how the dataset was curated. It did not detect ingredients based on “smell.” The curation logic is fully transparent in the filter_recipe_dataset function provided in the helper_utils.py file (as mentioned in the markdown).

This function scans for specific fruit_keywords and vegetable_keywords. Because “Chicken” is not in the vegetable keyword list, and “Lemon” is in the fruit keyword list, the script categorizes it as Fruit. This logic is intentional to create an easy-to-use, simple dataset for the purposes of these notebooks, as explained in the text.
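In spirit, the curation logic works roughly like the sketch below. This is an illustrative reconstruction only, not the actual helper_utils.py code; the keyword lists are truncated and the function name label_recipe is my own.

```python
# Illustrative sketch of the keyword-based curation logic
# (see helper_utils.py for the real implementation; names here are assumptions).
fruit_keywords = {"lemon", "apple", "avocado"}
vegetable_keywords = {"carrot", "spinach", "onion"}

def label_recipe(ingredients):
    """Return 'fruit' or 'vegetable' based purely on keyword matches."""
    text = " ".join(ingredients).lower()
    if any(word in text for word in fruit_keywords):
        return "fruit"
    if any(word in text for word in vegetable_keywords):
        return "vegetable"
    return None  # recipe is dropped from the subset

# "Chicken" appears in no keyword list, while "lemon" is a fruit keyword,
# so this recipe ends up labelled as a fruit recipe by construction.
print(label_recipe(["chicken", "lemon juice", "herbs"]))  # -> 'fruit'
```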

Data labeling was not overlooked. By definition, “data mishandling” implies corruption, data leakage, or improper formatting that breaks the pipeline, none of which is happening here. The “poor labeling” you refer to is a documented constraint of the dataset subset we are using.

The notebooks explicitly set the expectation for the model’s success: predicting the correct class out of the two specific labels used (Fruit or Vegetable). The notebook never claimed to use 5, 10, or 15 labels (like “Meat” or “Mixed”). The model was trained on binary labels (0 and 1). Expecting the model to output a category it was never trained on is outside the scope of these specific labs.

Regarding your technical suggestion, adding a hidden layer would not solve this for two reasons:

  1. Input Limitations: The markdown explicitly states: “the model’s predictions are based only on the words in the recipe’s name. It was never shown the ingredients list”. No amount of hidden layers can correlate “Lemon” with “Chicken” if the model cannot see “Chicken” in the input (again, there are only two labels, and chicken is not one of them).
  2. Ground Truth: Since the ground truth label in this dataset is “Fruit”, a more complex model with more layers would simply learn to replicate that label more accurately, not “correct” it.

The notebooks are a “successful implementation” because they successfully demonstrate the mechanics of the Deep Learning pipeline, tokenization, batching, and training loops, which is the stated goal.

We must balance realism with educational resources. If I were to use the entire Food.com dataset with full ingredients lists and multi-class labels, the computational requirements would far exceed what is available for these labs.

I strongly encourage you to read the markdown cells in the notebooks thoroughly. The limitations, scope, and objectives are clearly laid out there.

Best,
Mubsi