Output Constraints in ChatGPT - Getting Inconsistent Results

Hi all…

I’m wondering whether you can rely on the output constraints that you request in a prompt to an LLM.

In the lesson on Inferring, this prompt requests a list of items separated by commas:

prompt = f"""
Determine five topics that are being discussed in the \
following text, which is delimited by triple backticks.

Make each item one or two words long. 

Format your response as a list of items separated by commas.

Text sample: '''{story}'''
"""
response = get_completion(prompt)
print(response)

Video Output: (screenshot from the lesson video showing a comma-separated list of topics)

My Output: (screenshot from my notebook showing a numbered list with no commas)

It looks like Andrew’s notebook was also running against gpt-3.5-turbo. I’m not sure why my output wouldn’t return a comma-separated list, or how you can reliably code an application around this type of inconsistency. Even if I iterated on mine until it returned a comma-separated list, can I be sure that this wouldn’t break in the future?

LLMs may produce unpredictable behavior. To mitigate that, I would suggest some things to consider:

  • make sure the temperature is set to 0 (see the sketch below)
  • test new ways to phrase the prompt
  • use in-context learning, providing one or few examples
  • fine-tune the model for the desired task
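
On the first point: if your get_completion helper doesn’t already pin the temperature, here is a minimal sketch of one that does. This assumes the pre-1.0 openai Python package used in the course notebooks; adjust if your installed version differs.

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    # temperature=0 makes the sampling (near-)greedy, which reduces,
    # but does not eliminate, run-to-run variation in the output format
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message["content"]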

Large Language Models are just statistical word sequence generators. You really can’t trust their output in any way, or expect them to act deterministically.

Thanks Leonardo… I know they’re unpredictable. I’m wondering about how to best provide constraints that may be a little more predictable. In the lesson, Andrew references specifically changing the prompt so that you get a predictable output to plug into an application.

I know that LLMs can be bad at things like math. Are you also suggesting that they might not be able to follow a request for comma-separated answers? As you can see in my notebook, asking for comma-separated answers results in a numbered list without any commas.

And I understand that you can iterate on this and get the LLM to produce the desired output format. But my question is also why this changed between the time the video was recorded and the time I ran my notebook. I.e., the prompt to return a comma-separated list was reliable when Andrew ran the code, but it’s no longer reliable.

Edit:
@leonardo.pabon - Actually, the in-context learning is a great tip. Thanks. I just updated it to this:

Format your response as a list of items separated by commas.
The format should look like: [answer1, answer2, etc]

Which seems to make it reliable.
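
For anyone else hitting this, the full updated prompt looks roughly like this (same story variable and get_completion helper as in the notebook):

prompt = f"""
Determine five topics that are being discussed in the \
following text, which is delimited by triple backticks.

Make each item one or two words long.

Format your response as a list of items separated by commas.
The format should look like: [answer1, answer2, etc]

Text sample: '''{story}'''
"""
response = get_completion(prompt)
print(response)

The only change from the lesson version is the extra line showing the expected shape of the answer.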

Interesting, but the premise of this course is that you can use LLMs with prompt engineering to include them in a larger application. That’s the reason they are teaching us how to structure the prompts. So they can’t be completely unreliable.

Yes, you can. But you cannot really trust the results.

See it as an exercise in iterative development :slight_smile:. When I run the examples locally, I quite often do not get the same answer as in the videos. This is, however, easily fixed by being more specific in the prompt.

prompt = f"""
Determine whether each item in the following list of \
topics is a topic in the text below, which
is delimited with triple backticks.

Give your answer as a list with 0 or 1 for each listed topic.

List of topics: {", ".join(topic_list)}

Text sample: '''{story}'''
"""
response = get_completion(prompt)
print(response)

The modification to the prompt was “for each listed topic.”
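
If you then want to consume that answer in application code, here is a rough (and admittedly fragile) parsing sketch. It assumes the model returns one "topic: 0 or 1" pair per line, which, as discussed above, you can’t fully guarantee:

# Assumes the model answers with one "topic: 0 or 1" pair per line, e.g.
#   some topic: 1
#   another topic: 0
# This will break if the model changes the format, so check the result
# before trusting it.
topic_dict = {}
for line in response.split("\n"):
    if ":" not in line:
        continue  # skip blank lines or any preamble the model adds
    topic, flag = line.split(":", 1)
    flag = flag.strip()
    if flag in ("0", "1"):
        topic_dict[topic.strip()] = int(flag)

print(topic_dict)

Checking that every topic from topic_list actually made it into the dict is a cheap way to detect when the model has drifted from the requested format.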