Metadata-extraction-and-chunking - discrepancies between video and notebook

Hi,

The discrepancies start to appear in the “Run the document through the Unstructured API” part of the notebook.

Results from JSON(json.dumps(resp.elements[0:3], indent=2)) are different from those in the video. E.g. element ids are different, metadata shows category_depth instead of page_number, …

Results are also different in the “Find elements associated with chapters” part: “text” is different, the same kinds of differences as above appear, and chapter_ids remains empty instead of being populated with key-value pairs. The following error also appears:

---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[14], line 2
1 chapter_to_id = {v: k for k, v in chapter_ids.items()}
----> 2 [x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

Cell In[14], line 2, in (.0)
1 chapter_to_id = {v: k for k, v in chapter_ids.items()}
----> 2 [x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

KeyError: 'ICE-HOCKEY'

When I manually modify the list of chapters I get somewhat better results, but so far I have not been able to replicate exactly what I see in the video.

I don’t think this blocks understanding of the concepts, though.

Many thanks in advance for your feedback and help.

Cheers,
O.

The values in the chapters list are misaligned with the document's chapter titles. Use the code block below instead.

chapters = [
    "CHAPTER I THE SUN-SEEKER",
    "CHAPTER II RINKS AND SKATERS",
    "CHAPTER III TEES AND CRAMPITS",
    "CHAPTER IV TOBOGGANING",
    "CHAPTER V ICE-HOCKEY",
    "CHAPTER VI SKI-ING",
    "CHAPTER VII NOTES ON WINTER RESORTS",
    "CHAPTER VIII FOR PARENTS AND GUARDIANS",
]
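
To see why chapter_ids might stay empty, it can help to check which chapter strings actually match a Title element. The following is only a rough sketch: it assumes each element is a dict with "type", "text", and "element_id" keys, as the notebook output suggests. find_chapter_ids is a hypothetical helper, and sample_elements is made-up stand-in data for resp.elements.

```python
def find_chapter_ids(elements, chapters):
    """Map element_id -> chapter title for Title elements whose text matches exactly."""
    wanted = set(chapters)
    found = {}
    for element in elements:
        if element.get("type") == "Title" and element.get("text") in wanted:
            found[element["element_id"]] = element["text"]
    return found


# Made-up stand-in for resp.elements (not real API output).
sample_elements = [
    {"type": "Title", "text": "CHAPTER V ICE-HOCKEY", "element_id": "id-5"},
    {"type": "NarrativeText", "text": "Some body text.", "element_id": "id-6"},
]
sample_chapters = ["CHAPTER V ICE-HOCKEY", "CHAPTER VI SKI-ING"]

found_ids = find_chapter_ids(sample_elements, sample_chapters)
# Chapter strings with no exactly matching Title element.
missing = [c for c in sample_chapters if c not in found_ids.values()]

print(found_ids)
print("No matching Title element for:", missing)
```

Any title reported as missing points to a string in the list that differs from the element text, for example in spacing or capitalisation.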

Then change the code that produced the error so that it uses the matching key.

chapter_to_id = {v: k for k, v in chapter_ids.items()}
[x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["CHAPTER V ICE-HOCKEY"]][0]
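
If the key may still be absent, a guarded lookup avoids the KeyError (and the IndexError that `[0]` would raise on an empty list). This is a minimal sketch under the same assumption that elements are dicts with a "metadata" dict. first_child_of_chapter is a hypothetical helper, and the sample data is invented for illustration.

```python
def first_child_of_chapter(elements, chapter_ids, chapter_title):
    """Return the first element whose parent_id is the given chapter, or None."""
    # Invert element_id -> title into title -> element_id.
    chapter_to_id = {title: element_id for element_id, title in chapter_ids.items()}
    parent_id = chapter_to_id.get(chapter_title)
    if parent_id is None:
        return None  # The title was never matched, so there is nothing to look up.
    return next(
        (el for el in elements if el.get("metadata", {}).get("parent_id") == parent_id),
        None,
    )


# Invented stand-ins for chapter_ids and resp.elements.
sample_chapter_ids = {"id-5": "CHAPTER V ICE-HOCKEY"}
sample_elements = [
    {"text": "CHAPTER V ICE-HOCKEY", "element_id": "id-5", "metadata": {}},
    {"text": "First paragraph.", "element_id": "id-6", "metadata": {"parent_id": "id-5"}},
]

hit = first_child_of_chapter(sample_elements, sample_chapter_ids, "CHAPTER V ICE-HOCKEY")
miss = first_child_of_chapter(sample_elements, sample_chapter_ids, "ICE-HOCKEY")
print(hit)
print(miss)
```

Returning None instead of raising makes it easier to spot which chapter titles failed to match.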