Error on ungraded lab M2_UGL_1

Dear all and team,

In the first ungraded lab of Module 2, every building block runs properly on its own, but I get an error message when running the whole workflow (the last code cell).

It looks like there is an error with the key “quarter”, although this key does actually exist in coffee_sales.csv (and again, everything works perfectly when computing the building blocks one after another).

Any idea what I should try? Many thanks!

This is the error I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:3641, in Index.get_loc(self, key)
   3640 try:
-> 3641     return self._engine.get_loc(casted_key)
   3642 except KeyError as err:

File pandas/_libs/index.pyx:168, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:197, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7668, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7676, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'quarter'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[10], line 9
      6 image_basename="drink_sales"
      8 # Run the complete agentic workflow
----> 9 _ = run_workflow(
     10     dataset_path="coffee_sales.csv",
     11     user_instructions=user_instructions,
     12     generation_model=generation_model,
     13     reflection_model=reflection_model,
     14     image_basename=image_basename
     15 )

Cell In[9], line 41, in run_workflow(dataset_path, user_instructions, generation_model, reflection_model, image_basename)
     39     initial_code = match.group(1).strip()
     40     exec_globals = {"df": df}
---> 41     exec(initial_code, exec_globals)
     42 utils.print_html(out_v1, is_image=True, title="Generated Chart (V1)")
     44 # 3) Reflect on V1 (image + original code) to get feedback and refined code (V2)

File <string>:8

File /usr/local/lib/python3.11/site-packages/pandas/core/frame.py:4378, in DataFrame.__getitem__(self, key)
   4376 if self.columns.nlevels > 1:
   4377     return self._getitem_multilevel(key)
-> 4378 indexer = self.columns.get_loc(key)
   4379 if is_integer(indexer):
   4380     indexer = [indexer]

File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:3648, in Index.get_loc(self, key)
   3643     if isinstance(casted_key, slice) or (
   3644         isinstance(casted_key, abc.Iterable)
   3645         and any(isinstance(x, slice) for x in casted_key)
   3646     ):
   3647         raise InvalidIndexError(key) from err
-> 3648     raise KeyError(key) from err
   3649 except TypeError:
   3650     # If we have a listlike key, _check_indexing_error will raise
   3651     #  InvalidIndexError. Otherwise we fall through and re-raise
   3652     #  the TypeError.
   3653     self._check_indexing_error(key)

KeyError: 'quarter'

Hello, @Guillaume_Autier,

If you open the CSV file, you will see that it actually does not have a “quarter” column; if you investigate further, you will find that the column is generated by the function utils.load_and_prepare_data.
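For reference, here is a minimal sketch of what such a preparation function might do. The column names ('date', 'year', 'quarter') are my assumptions based on the generated code in this thread, not the lab's actual implementation:

import pandas as pd

def load_and_prepare_data(csv_file: str) -> pd.DataFrame:
    # Hypothetical sketch, NOT the lab's actual implementation:
    # load the raw CSV and derive the columns the generated code
    # expects ('year' and 'quarter') from a 'date' column.
    df = pd.read_csv(csv_file)
    df["date"] = pd.to_datetime(df["date"])  # assumes a 'date' column exists
    df["year"] = df["date"].dt.year
    df["quarter"] = df["date"].dt.quarter
    return df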

This is an existing problem with the latest version of the lab, so I think we need input from @lesly.zerna.

Cheers,
Raymond

@Guillaume_Autier, for the time being, to be able to use the “quarter” column, we may modify the user_instructions and ask the chatbot to use utils.load_and_prepare_data.

For example, I have added the following to the end of the existing message:

Load the csv file using utils.load_and_prepare_data(csv_file: str).

We need that argument list to help the chatbot understand how to use the function.
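In code, the amended instruction might look like this (just a sketch; the exact wording is up to you):

user_instructions = (
    "Create a plot comparing Q1 coffee sales in 2024 and 2025 "
    "using the data in coffee_sales.csv. "
    "Load the csv file using utils.load_and_prepare_data(csv_file: str)."
)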

However, we will need @lesly.zerna and the course team to decide whether this fix produces the expected result.

Cheers,
Raymond

I’m quite confused/intrigued by this.

generate_chart_code() clearly states at the top of its requirements in the prompt:

“1. Assume the DataFrame is already loaded as ‘df’.”

And every time I’ve called this function from the individual cell, it complies. The generated code goes straight from the imports to the filter operation:
import pandas as pd
import matplotlib.pyplot as plt
# Filter data for Q1 of 2024 and 2025
q1_2024 = df[(df['year'] == 2024) & (df['quarter'] == 1)]

Every time I execute this same function via run_workflow(), which passes the exact same user instructions and the exact same model (gpt-4o-mini), it always gives code that starts by loading df itself (which leads to the key error):

import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('coffee_sales.csv')
# Filter for Q1 sales in 2024 and 2025
q1_sales = df[(df['quarter'] == 1) & (df['year'].isin([2024, 2025]))]

There are also some other minor differences between the two.
I appreciate that LLMs are generative, i.e. non-deterministic, and hence you might expect some small differences in the outputs each time (small because the instructions are precise). But why is it always the same for the independent cell call, and always the same for the run_workflow() call? You’d surely expect that if there are variations, they’d happen in both, each time you call the function. (I’ve called them only about 5-10 times, so maybe variations would appear if done 100s of times?)

Any thoughts?

Hello @shahin,

We pass the following into run_workflow:

user_instructions="Create a plot comparing Q1 coffee sales in 2024 and 2025 using the data in coffee_sales.csv." # write your instruction here

which combines with the prompt you mentioned in generate_chart_code, resulting in something like below:

    prompt = f"""
    ...

    User instruction: Create a plot ... using the data in coffee_sales.csv.

    Requirements for the code:
    1. Assume the DataFrame is already loaded as 'df'.
    ...
    """

Note that I have only shown the parts relevant to the point I am going to make, skipping the rest as "...".

Essentially, when we execute run_workflow, the final prompt mentions two data sources: an on-disk CSV and an in-memory df. It is then up to the chatbot to decide which one to use.

As for the “same response”, it could be due to some caching mechanism on OpenAI’s side. If we execute run_workflow 5 times, only the first one takes some time to get a response. However, if we, for example, add one extra leading space to the prompt each time, so that the 3rd run has 3 spaces, the 5th has 5, and so on, then each and every run takes time to get a response, and the responses are not all the same.

Don’t forget to set a higher temperature if you wish to see more random responses.

As for the details of that caching mechanism, unfortunately I cannot provide you with any official documentation, because I am in Hong Kong and access to OpenAI is restricted.

Cheers,
Raymond

Thanks Raymond, I think my message was confusing because I conflated two separate issues.

Allow me to try again, this time focusing on just one:

In the 4th coding cell, we call the function generate_chart_code() with the following 3 parameters:
instruction="Create a plot comparing Q1 coffee sales in 2024 and 2025 using the data in coffee_sales.csv."

model="gpt-4o-mini"

out_path_v1="chart_v1.png"

The Python code it generates complies with Requirement number 1, which is hard-coded into the function generate_chart_code(), and contains no pd.read_csv(). All good.

Now, on the 11th line (just beneath “1) Generate code (V1)”), in the 9th coding cell, we again call the exact same function generate_chart_code(), albeit via run_workflow() from the 10th and final coding cell, with the following 3 parameters:
instruction="Create a plot comparing Q1 coffee sales in 2024 and 2025 using the data in coffee_sales.csv."

model="gpt-4o-mini"

out_path_v1=f"{image_basename}_v1.png"

This time, however, the Python code it generates ignores Requirement number 1, hard-coded into that same function generate_chart_code(), and includes pd.read_csv() in the generated code. Not good.
# Load the dataset
df = pd.read_csv('coffee_sales.csv')

So:

  • the exact same LLM,
  • receiving exactly the same instructions,
  • outputs two different Python scripts, consistently.

So, when called from one part of the notebook, the LLM always pays attention to Requirement 1 (“Assume the DataFrame is already loaded as ‘df’.”).
But when called from another part of the notebook, the LLM always ignores Requirement 1.

This is what I do not understand.

(Incidentally, if I add the following extra Requirement 1b, then the call to this function via run_workflow() no longer ignores Requirement 1: the generated Python has no pd.read_csv() and there is no key error):
1b. Don't ignore Requirement 1.

And if I remove 1b and also remove Requirement 1 altogether, then even the first call to generate_chart_code() results in generated Python that does include pd.read_csv(). This confirmed to me that the Requirement, hard-coded into the function, is effective, but only, it would seem, in the earlier part of the notebook, not the later part.

And if, after refreshing and clearing caches, I start again and call it only in the 4th and 9th coding cells (running only the 1st, 2nd, 3rd and 8th cells, which are needed), i.e. without calling any of the other functions in the steps between, I still get the same result. So it is not due to some change of state in the LLM prior to the call in the 9th cell.

Hello @shahin,

After reading through your last response, I still think I need to reiterate my earlier hypothesis:

Due to some caching mechanism on OpenAI’s side (not on our computers), the same prompt yields the same response.

The support for this hypothesis is that (reiterating, but with examples this time) if you loop the same prompt 5 times (the upper cell in my screenshots), only the first one takes a longer time to return a response, but if you send 5 different prompts (the lower cell), all of them take longer. A sketch of this timing experiment follows the list below.

  1. See the difference? Second-level latency (for a new prompt) vs. sub-second latency (for a repeated prompt).

  2. For both cells, I replaced “a” with “one” (as circled in the screenshots) to make sure the prompt had never been seen before the experiments.

  3. There should be some timeout for that OpenAI caching, but I am not sure how long.

  4. The hypothetical “OpenAI caching effect” disappears not just when the prompt is different, but also when the parameters are different. For example, if you change the temperature but keep the prompt unchanged, there is no caching effect.
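For anyone who wants to reproduce the timing experiment, here is a rough sketch assuming the standard OpenAI Python client; the haiku prompt is a placeholder, not the exact prompt I used:

import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def timed_call(prompt: str) -> float:
    # Return how long one chat completion takes.
    start = time.time()
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return time.time() - start

# Same prompt 5 times: only the first call should take noticeably long.
for i in range(5):
    print(f"repeated prompt, run {i}: {timed_call('Write one haiku about coffee.'):.2f}s")

# 5 different prompts (extra leading spaces): every call takes full latency.
for i in range(5):
    print(f"varied prompt, run {i}: {timed_call(' ' * i + 'Write one haiku about coffee.'):.2f}s")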

Now, back to the confusing part in your last response: although both generate_chart_code and run_workflow receive the same instructions, as you explicitly quoted, their prompts are still different, because the prompt doesn’t take in only the instructions, but also out_path_v1, which is different in the two cases:

  • the cell for generate_chart_code uses "chart_v1.png" for out_path_v1
  • the cell for run_workflow uses "drink_sales_v1.png" for out_path_v1

Therefore, your statement that the two calls are “receiving exactly the same instructions” is not true, because out_path_v1 is part of the instructions (by “instructions” I mean the prompt here).

Now, because the prompt is different, caching won’t kick in, and the model generates a new response.

I have also done an extra experiment in which the read_csv line came and went as I adjusted the temperature value.
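Roughly, that experiment looks like the sketch below; the prompt variable stands in for the combined generate_chart_code prompt discussed above:

from openai import OpenAI

client = OpenAI()
prompt = "..."  # the combined generate_chart_code prompt discussed above

# Same prompt, different temperature values. Higher temperatures give
# more varied generations, so the pd.read_csv line may appear in some
# runs and not in others.
for temp in (0.0, 0.5, 1.0):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=temp,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"temperature={temp}:\n{response.choices[0].message.content}\n")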

Below is my understanding based on my experiments:

  1. You always get the exact same code when re-running a certain cell because of caching.

  2. You don’t get the same code when running the cell for generate_chart_code and the cell for run_workflow, because their prompts are NOT identical: the values they use for out_path_v1 are different.

  3. The changes in the chatbot’s responses when you added and removed requirements can be attributed to the random nature of the chatbot. Because you changed a requirement, you changed the prompt; and because you changed the prompt, the chatbot would not use the cache but would generate new responses for you.

Cheers,
Raymond


However, I still don’t have full confidence in my hypothesis because, as I said, access to OpenAI documentation is restricted in Hong Kong and I can’t find any official documentation to support it.

Super effort, I greatly appreciate the time you put into this Raymond.

In terms of my learnings, this feels quite important for me, so I think it’s worth the effort we’ve put in (and hopefully other students will learn something too?). I want to be sure that I come away with:

  1. Knowing that such variations in output (which were unexpected for me) can occur. Because if the prompt requirements state something semantically unambiguous like “hey LLM, don’t read in a csv” and it still writes Python that reads in a csv, then it really emphasises that nothing can be taken for granted, which can creep in given how convincingly good LLM results are most of the time. I guess this re-emphasises what Prof Ng says about the importance of evals.
  2. Knowing what potential causes of these are.
  3. Having some idea of how I might address and prevent it (other than just unit tests etc.), i.e. make this impossible to occur (if that’s even possible). (I guess one could make use of what Prof Ng mentions in the lectures about feeding back external messages, i.e. error messages, in another reflection loop; see the sketch after this list.)
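For what it’s worth, here is a hedged sketch of that idea: a wrapper around the lab’s generate_chart_code() (its return value and signature are assumed here, not confirmed) that detects the violation and feeds it back as extra instruction:

import re

def generate_with_guard(generate_fn, instruction, model, out_path, max_tries=3):
    # Hypothetical wrapper, not part of the lab: if the generated code
    # calls pd.read_csv, append the violation to the instruction and
    # retry (a crude external-feedback reflection loop).
    extra = ""
    for _ in range(max_tries):
        code = generate_fn(instruction + extra, model, out_path)
        if not re.search(r"\bpd\.read_csv\s*\(", code):
            return code
        extra = (" Reminder: the DataFrame is already loaded as 'df'; "
                 "do NOT call pd.read_csv.")
    raise RuntimeError("Model kept reading the CSV despite reminders.")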

Following your example, I ran the generate_chart_code() function in a loop 4 times, changing the prompt (by varying the output png filename).
This time, the first run, which is the only one with pd.read_csv() included, includes it as a commented-out line :joy:. (And all 4 scripts are subtly different from each other, as you’d expect with caching not being employed.) Anyway, diminishing returns and all that, so I’ll leave it here. Looking forward to learning more about this ‘temperature’ thingy. Many thanks.

LLM output with first draft code (run 0)

<execute_python>
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
# df = pd.read_csv('coffee_sales.csv') # Uncomment this line to load the data

# Filter the DataFrame for Q1 of 2024 and 2025
q1_2024 = df[(df['year'] == 2024) & (df['quarter'] == 1)]
q1_2025 = df[(df['year'] == 2025) & (df['quarter'] == 1)]

# Group by coffee name and sum the prices
sales_2024 = q1_2024.groupby('coffee_name')['price'].sum()
sales_2025 = q1_2025.groupby('coffee_name')['price'].sum()

# Create a comparison DataFrame
comparison_df = pd.DataFrame({
    '2024': sales_2024,
    '2025': sales_2025
}).fillna(0)

# Plotting
comparison_df.plot(kind='bar', figsize=(10, 6))
plt.title('Q1 Coffee Sales Comparison: 2024 vs 2025')
plt.xlabel('Coffee Name')
plt.ylabel('Total Sales Price')
plt.legend(title='Year')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('0: chart_v1.png', dpi=300)

# Close all plots
plt.close()
</execute_python>

LLM output with first draft code (run 1)

<execute_python>
import pandas as pd
import matplotlib.pyplot as plt

# Assuming df is already loaded
q1_sales = df[(df['quarter'] == 1) & (df['year'].isin([2024, 2025]))]

# Group by year and coffee name to get total sales
sales_data = q1_sales.groupby(['year', 'coffee_name'])['price'].sum().unstack().fillna(0)

# Plotting
plt.figure(figsize=(10, 6))
sales_data.plot(kind='bar', stacked=True)
plt.title('Q1 Coffee Sales Comparison (2024 vs 2025)')
plt.xlabel('Coffee Name')
plt.ylabel('Total Sales ($)')
plt.legend(title='Year')
plt.tight_layout()

# Save figure
plt.savefig('1: chart_v1.png', dpi=300)
plt.close()
</execute_python>

LLM output with first draft code (run 2)

<execute_python>
import pandas as pd
import matplotlib.pyplot as plt

# Filter data for Q1 of 2024 and 2025
df_q1_2024 = df[(df['year'] == 2024) & (df['quarter'] == 1)]
df_q1_2025 = df[(df['year'] == 2025) & (df['quarter'] == 1)]

# Group by coffee_name and sum prices for both years
sales_2024 = df_q1_2024.groupby('coffee_name')['price'].sum()
sales_2025 = df_q1_2025.groupby('coffee_name')['price'].sum()

# Combine sales into a DataFrame for plotting
sales_comparison = pd.DataFrame({'2024': sales_2024, '2025': sales_2025}).fillna(0)

# Plotting
ax = sales_comparison.plot(kind='bar', figsize=(10, 6), width=0.8)
plt.title('Q1 Coffee Sales Comparison: 2024 vs 2025')
plt.xlabel('Coffee Name')
plt.ylabel('Total Sales (Price)')
plt.legend(title='Year')
plt.xticks(rotation=45)
plt.tight_layout()

# Save the figure
plt.savefig('2: chart_v1.png', dpi=300)
plt.close()
</execute_python>

LLM output with first draft code (run 3)

<execute_python>
import pandas as pd
import matplotlib.pyplot as plt

# Filter data for Q1 of 2024 and 2025
q1_2024 = df[(df['year'] == 2024) & (df['quarter'] == 1)]
q1_2025 = df[(df['year'] == 2025) & (df['quarter'] == 1)]

# Group by coffee_name and sum the prices
sales_2024 = q1_2024.groupby('coffee_name')['price'].sum().reset_index()
sales_2025 = q1_2025.groupby('coffee_name')['price'].sum().reset_index()

# Merge the two years' data for comparison
sales_comparison = pd.merge(sales_2024, sales_2025, on='coffee_name', suffixes=('_2024', '_2025'))

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(sales_comparison['coffee_name'], sales_comparison['price_2024'], label='2024', alpha=0.6)
plt.bar(sales_comparison['coffee_name'], sales_comparison['price_2025'], label='2025', alpha=0.6)

plt.title('Q1 Coffee Sales Comparison: 2024 vs 2025')
plt.xlabel('Coffee Name')
plt.ylabel('Total Sales (Price)')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()

# Save the figure
plt.savefig('3: chart_v1.png', dpi=300)
plt.close()
</execute_python>

:joy:

Hello @shahin, you are welcome! I have learned something, too.

Cheers,
Raymond

I read your response again and something came to mind: the prompt actually asks the chatbot to just return the code and nothing else, which might block the chatbot from even asking a clarifying question when something in the prompt is ambiguous.

Of course, even if we didn’t restrict it to only giving us code, I think the chatbot would still present the code instead of first asking a clarifying question, as a human might. Some chatbots would ask additional questions at the end, but it may just be part of the chatbot’s training that it should always respond with something useful for the user to move on with, regardless of any ambiguity. The response has to be worth the tokens, at least.

Many thanks to you both for the explanations and help!