Hi @Deepti_Prasad,
I appreciate your engagement, but I need to be very clear about the points you are raising. The issues you describe, regarding dataset curation, model expectations, and “success”, are all explicitly addressed in the markdown text of the notebooks. While minor human oversights may exist, there are no mistakes in these notebooks from the perspective of the technical implementation, the learning goals, or the pipeline mechanics.
“The Recipe Dataset” section in the notebook (Lab 3) clearly explains how the dataset was curated. It did not detect ingredients based on “smell.” The curation logic is fully transparent in the filter_recipe_dataset function provided in the helper_utils.py file (as mentioned in the markdown).
This function scans for specific fruit_keywords and vegetable_keywords. Because “Chicken” is not in the vegetable keyword list, and “Lemon” is in the fruit keyword list, the script categorizes a recipe whose name contains both (such as a lemon chicken dish) as Fruit. This logic is intentional, to create a simple, easy-to-use dataset for the purposes of these notebooks, as explained in the text.
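For context, the curation amounts to a simple keyword scan over recipe names. Here is a minimal, hypothetical sketch of that idea; the actual implementation and keyword lists live in helper_utils.py, and the keywords below are made up purely for illustration:

```python
# Illustrative sketch only -- the real logic is in helper_utils.filter_recipe_dataset.
# These keyword lists are hypothetical stand-ins, not the lab's actual lists.

FRUIT_KEYWORDS = {"apple", "lemon", "mango", "berry"}
VEGETABLE_KEYWORDS = {"carrot", "spinach", "broccoli", "kale"}

def label_recipe_by_name(recipe_name):
    """Assign a Fruit or Vegetable label based purely on words in the recipe name."""
    words = recipe_name.lower().split()
    if any(w in FRUIT_KEYWORDS for w in words):
        return "Fruit"       # "Lemon Chicken" lands here: "lemon" matches, "chicken" is never checked
    if any(w in VEGETABLE_KEYWORDS for w in words):
        return "Vegetable"
    return None              # recipes matching neither list are filtered out of the subset

print(label_recipe_by_name("Lemon Chicken"))    # -> Fruit
print(label_recipe_by_name("Spinach Lasagna"))  # -> Vegetable
```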
Data labeling was not overlooked. By definition, “data mishandling” implies corruption, data leakage, or improper formatting that breaks the pipeline, none of which is happening here. The “poor labeling” you refer to is a documented constraint of the dataset subset we are using.
The notebooks explicitly set the expectation for the model’s success: predicting the correct class out of the two specific labels used (Fruit or Vegetable). The notebook never claimed to use 5, 10, or 15 labels (like “Meat” or “Mixed”). The model was trained on binary labels (0 and 1). Expecting the model to output a category it was never trained on is outside the scope of these specific labs.
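To make that scope concrete, here is a small illustration of what the binary label setup implies; the mapping shown is an assumption for illustration, not necessarily the exact encoding used in the lab code:

```python
# Hypothetical label encoding, mirroring the two-label setup described above.
LABEL_TO_ID = {"Fruit": 0, "Vegetable": 1}
ID_TO_LABEL = {v: k for k, v in LABEL_TO_ID.items()}

predicted_id = 0                   # stand-in for argmax over the model's two output logits
print(ID_TO_LABEL[predicted_id])   # -> "Fruit"; "Meat" or "Mixed" are simply not representable
```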
Regarding your technical suggestion, adding a hidden layer would not solve this, for two reasons (a minimal sketch after the list below illustrates both):
- Input Limitations: The markdown explicitly states: “the model’s predictions are based only on the words in the recipe’s name. It was never shown the ingredients list”. No amount of hidden layers can correlate “Lemon” with “Chicken” if the model cannot see “Chicken” in the input (again, there are only two labels, and chicken is not one of them).
- Ground Truth: Since the ground truth label in this dataset is “Fruit”, a more complex model with more layers would simply learn to replicate that label more accurately, not “correct” it.
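Here is a minimal PyTorch-style sketch of both points. This is an assumption about the setup made for illustration, not the lab’s actual model: however deep we make the network, the forward pass only ever receives token ids from the recipe name, and the output layer only ever has two logits.

```python
import torch
import torch.nn as nn

class NameOnlyClassifier(nn.Module):
    """Toy illustration: depth can be added freely, but the inputs and output space are fixed."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)   # averages the NAME tokens only
        self.hidden = nn.Linear(embed_dim, hidden_dim)             # the "extra" hidden layer
        self.out = nn.Linear(hidden_dim, 2)                        # exactly two classes, by design

    def forward(self, token_ids, offsets):
        x = self.embedding(token_ids, offsets)   # only recipe-name tokens ever reach the model
        x = torch.relu(self.hidden(x))
        return self.out(x)                       # logits over {Fruit, Vegetable} and nothing else

# The ingredients list never enters forward(), so more hidden layers cannot surface
# information ("Chicken") that the input does not contain; and since the ground-truth
# label for such a recipe is "Fruit", extra capacity only learns to fit that label better.
```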
The notebooks are a “successful implementation” because they demonstrate the mechanics of the Deep Learning pipeline (tokenization, batching, and training loops), which is the stated goal.
We must balance realism against the resources available in an educational setting. If I were to use the entire Food.com dataset with full ingredients lists and multi-class labels, the computational requirements would far exceed what is available for these labs.
I strongly encourage you to read the markdown cells in the notebooks thoroughly. The limitations, scope, and objectives are clearly laid out there.
Best,
Mubsi