Metadata-extraction-and-chunking - discrepancies between video and notebook

Olivier_De_Timmerman · August 3, 2024, 12:23pm

Hi,

The discrepancies start to appear in the “Run the document through the Unstructured API” part of the notebook.

The output of JSON(json.dumps(resp.elements[0:3], indent=2)) differs from what is shown in the video. For example, the element ids are different, the metadata shows category_depth instead of page_number, …

Results also differ in the "Find elements associated with chapters" part: the "text" values are different (the same kind of differences as above), chapter_ids remains empty instead of being populated with key-value pairs, and the following error appears:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[14], line 2
      1 chapter_to_id = {v: k for k, v in chapter_ids.items()}
----> 2 [x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

Cell In[14], line 2, in <listcomp>(.0)
      1 chapter_to_id = {v: k for k, v in chapter_ids.items()}
----> 2 [x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

KeyError: 'ICE-HOCKEY'

When manually modifying the list of chapters I do get some better results, but, so far, I have not been able to replicate exactly what I see in the video.

I don’t think it is blocking for understanding the concepts though.

Many thanks in advance for your feedback and help.

Cheers,
O.

lxwvictor · September 14, 2024, 3:17am

The values in the chapters list are misaligned with the title text in the elements. Use the code block below.

chapters = [
    "CHAPTER I THE SUN-SEEKER",
    "CHAPTER II RINKS AND SKATERS",
    "CHAPTER III TEES AND CRAMPITS",
    "CHAPTER IV TOBOGGANING",
    "CHAPTER V ICE-HOCKEY",
    "CHAPTER VI SKI-ING",
    "CHAPTER VII NOTES ON WINTER RESORTS",
    "CHAPTER VIII FOR PARENTS AND GUARDIANS",
]

Then change the code that produced the error so it uses the matching key.

chapter_to_id = {v: k for k, v in chapter_ids.items()}
[x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["CHAPTER V ICE-HOCKEY"]][0]
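
To see why chapter_ids stays empty and why the KeyError follows, here is a minimal sketch of the matching step. It is not the notebook's exact code: the sample elements list is a hypothetical stand-in for resp.elements (placeholder ids and text), the helper names are made up for illustration, and it assumes each element is a dict with "type", "element_id", "text" and "metadata" keys and that chapter titles are matched by exact text equality.

```python
# Hypothetical stand-in for resp.elements; real ids and text come from the API.
elements = [
    {"type": "Title", "element_id": "id-toboggan",
     "text": "CHAPTER IV TOBOGGANING", "metadata": {}},
    {"type": "NarrativeText", "element_id": "id-p0",
     "text": "(placeholder paragraph)", "metadata": {"parent_id": "id-toboggan"}},
    {"type": "Title", "element_id": "id-hockey",
     "text": "CHAPTER V ICE-HOCKEY", "metadata": {}},
    {"type": "NarrativeText", "element_id": "id-p1",
     "text": "(placeholder paragraph)", "metadata": {"parent_id": "id-hockey"}},
]


def title_texts(elements):
    """List the text of every Title element, to compare against `chapters`."""
    return [el["text"] for el in elements if el["type"] == "Title"]


def build_chapter_ids(elements, chapters):
    """Map element id -> chapter name for Title elements whose text matches exactly."""
    wanted = set(chapters)
    return {
        el["element_id"]: el["text"]
        for el in elements
        if el["type"] == "Title" and el["text"] in wanted
    }


def elements_in_chapter(elements, chapter_ids, chapter):
    """Return elements whose parent_id points at the chapter; [] if the chapter is unknown."""
    chapter_to_id = {name: el_id for el_id, name in chapter_ids.items()}
    chapter_id = chapter_to_id.get(chapter)  # .get avoids the KeyError
    if chapter_id is None:
        return []
    return [el for el in elements if el["metadata"].get("parent_id") == chapter_id]


# A short name never equals the full title text, so nothing is matched.
empty = build_chapter_ids(elements, ["ICE-HOCKEY"])

# The full title text matches, so the lookup by chapter name works.
chapter_ids = build_chapter_ids(elements, ["CHAPTER IV TOBOGGANING", "CHAPTER V ICE-HOCKEY"])
hockey = elements_in_chapter(elements, chapter_ids, "CHAPTER V ICE-HOCKEY")
```

Printing title_texts(resp.elements) on the real response is a quick way to check which strings the chapters list has to contain. Returning an empty list for an unknown chapter is a design choice for this sketch; raising a clearer error would work just as well.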
