Evaluation prompt - part I :: v2 responses are more ambiguous than v1, the opposite of what Andrew Ng describes

The automation that runs the evaluation on all test cases gives a full score when I change the function find_category_and_product_v2 to find_category_and_product_v1.

When using find_category_and_product_v2, the score generated is :: Fraction correct out of 10: 0.8333333333333334.

When using find_category_and_product_v1, the score generated is :: Fraction correct out of 10: 1.0.

But according to Andrew, find_category_and_product_v2 is the more refined prompt and should produce results closer to what is expected.

From my findings, the caveat in find_category_and_product_v2 is that the few-shot examples below instruct it to return the entire list of products in a category whenever any one of them matches the user_input:
    few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
    few_shot_assistant_1 = """
    [{'category': 'Computers and Laptops', \
    'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """

    few_shot_user_2 = """I want the most cheapest computer. What do you recommend?"""
    few_shot_assistant_2 = """
    [{'category': 'Computers and Laptops', \
    'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
Meanwhile, msg_ideal_pairs_set holds only the closely matching products in the ideal_answer for each customer_msg, which the response from find_category_and_product_v1 already satisfies without any change.
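To see why the extra products cost points, here is a self-contained sketch of the kind of exact set comparison the evaluation performs; the helper name and scoring rule are my simplification, not necessarily the notebook's exact code. Returning the whole category list, as the v2 few-shots teach, fails any test case whose ideal answer is a subset:

    from ast import literal_eval

    def eval_response_with_ideal(response_str, ideal):
        """Score each returned category by exact set match against the ideal products."""
        parsed = literal_eval(response_str.strip())  # the reply is a Python-literal list
        correct = sum(
            set(item['products']) == set(ideal.get(item['category'], []))
            for item in parsed
        )
        return correct / max(len(parsed), 1)

    # Toy case: the ideal answer lists only the closely matching product.
    ideal = {'Computers and Laptops': ['TechPro Desktop']}
    v1_reply = "[{'category': 'Computers and Laptops', 'products': ['TechPro Desktop']}]"
    v2_reply = "[{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'TechPro Desktop']}]"
    print(eval_response_with_ideal(v1_reply, ideal))  # 1.0
    print(eval_response_with_ideal(v2_reply, ideal))  # 0.0 -- extra products break the set match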

Modifying find_category_and_product_v2 according to the findings above did not get it to a full score, so I removed the line "Do not write any explanatory text after outputting the requested JSON."
And that removal caused find_category_and_product_v2 to produce a full score.
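Separately, one way to make the evaluation robust to any trailing explanatory text, regardless of how that instruction is worded, is to extract just the bracketed list before scoring; a minimal sketch:

    import re

    def extract_list(response_str):
        """Return the first [...] block in the reply, tolerating any
        explanatory text before or after it; None if no list is found."""
        match = re.search(r'\[.*\]', response_str, re.DOTALL)
        return match.group(0) if match else None

    reply = "[{'category': 'Computers and Laptops', 'products': ['TechPro Desktop']}]\nHope that helps!"
    print(extract_list(reply))  # just the list, with the explanation stripped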

This made me wonder: did the word "outputting" cause the prompt to behave ambiguously? If so, I can accept that, since it is not a proper dictionary word, although a Google search understands it and returns relevant results without conflict, so I would expect OpenAI's model to do the same.

If not, then I am totally confused: removing the instruction I suspected of causing the problem did not produce a full score, while removing an instruction that seems entirely unrelated to the ambiguous results did.

It would be very kind if anyone who has experienced the same could tell me whether my thinking is correct, or could explain the cause behind it.

Adding to your post, I have found a mistake in find_category_and_product_v2: its few-shot examples point to the same answer for opposite questions.

    few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    few_shot_user_2 = """I want the most cheapest computer. What do you recommend?"""
    few_shot_assistant_2 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{few_shot_user_2}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_2 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ]
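For completeness, this messages list is then sent through the course's get_completion_from_messages helper; below is a sketch of an equivalent call using the current openai client (the notebook itself uses an older API, so treat the details here as an assumption):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
        """Send the few-shot chat history and return the assistant's reply text."""
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
        )
        return response.choices[0].message.content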