The automation that runs the evaluation on all test cases gives a full score when I change the function from find_category_and_product_v2 to find_category_and_product_v1.
When using find_category_and_product_v2, the score generated is: Fraction correct out of 10: 0.8333333333333334.
When using find_category_and_product_v1, the score generated is: Fraction correct out of 10: 1.0.
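For context, my understanding is that this figure is simply the per-case scores averaged over the 10 test cases, with partial credit possible per case (which is why it is not a clean n/10). A minimal sketch of that averaging, with made-up per-case scores rather than values from my actual run:

# Illustrative only: how a "Fraction correct out of 10" figure of this kind
# is typically produced. The per-case scores below are invented for this example.
per_case_scores = [1.0, 1.0, 1.0, 0.5, 1.0, 1.0, 0.75, 1.0, 1.0, 1.0]  # one score in [0, 1] per test case

fraction_correct = sum(per_case_scores) / len(per_case_scores)
print(f"Fraction correct out of {len(per_case_scores)}: {fraction_correct}")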
But according to Andrew, find_category_and_product_v2 is the more refined prompt and should produce results closer to what is expected.
From my findings, the caveat in find_category_and_product_v2 is that its few-shot examples (below) instruct it to return the entire list of products in a category if any one of them matches the user_input:
few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
few_shot_assistant_1 = """
[{'category': 'Computers and Laptops',
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
"""

few_shot_user_2 = """I want the most cheapest computer. What do you recommend?"""
few_shot_assistant_2 = """
[{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
"""
Meanwhile, the ideal_answer in msg_ideal_pairs_set for each customer_msg contains a closely matching set of products, which the response from the find_category_and_product_v1 prompt satisfies without any change.
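To make my reading of the scoring concrete, here is a rough, self-contained approximation (my own sketch, not the notebook's actual eval code, and the ideal_answer below is hypothetical): if products are compared as exact sets per category, a response that returns every product in a matching category can be marked wrong whenever the ideal answer lists only a subset of them.

# Rough sketch of exact-set scoring per category; NOT the course's eval code.
def score_case(response, ideal_answer):
    """Return 1.0 on an exact match, else the fraction of categories whose product sets match."""
    ideal = {item["category"]: set(item["products"]) for item in ideal_answer}
    got = {item["category"]: set(item["products"]) for item in response}
    if got == ideal:
        return 1.0
    matching = sum(1 for cat, prods in ideal.items() if got.get(cat) == prods)
    return matching / len(ideal)

# Hypothetical ideal answer listing only the two relevant products.
ideal_answer = [{"category": "Computers and Laptops",
                 "products": ["TechPro Ultrabook", "BlueWave Chromebook"]}]

# A v2-style response that returns the whole category, as in the few-shots above.
v2_style_response = [{"category": "Computers and Laptops",
                      "products": ["TechPro Ultrabook", "BlueWave Gaming Laptop",
                                   "PowerLite Convertible", "TechPro Desktop",
                                   "BlueWave Chromebook"]}]

print(score_case(v2_style_response, ideal_answer))  # 0.0 -- the extra products break the exact match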
Modifying find_category_and_product_v2 along the lines of the findings above did not make it produce a full score, so I instead removed the line "Do not write any explanatory text after outputting the requested JSON."
That change did make find_category_and_product_v2 produce a full score.
This made me wonder: is it the word "outputting" that caused the prompt to behave ambiguously? If so, I am okay with that, since it is not a standard dictionary word, although a Google search handles it and returns relevant results without any conflict or error, so I would expect OpenAI's models to do the same.
If not, then I am totally confused: modifying the instruction I believed was responsible (the few-shot examples) did not make the prompt produce a full score, while removing an instruction that seems entirely unrelated to the cause of the ambiguous result did.
It would be very kind if anyone who has experienced the same could tell me whether my thinking is correct, or could explain the cause behind it.