C1M3 Assignment - how to run locally?

I’m trying to run the Module 3 assignment locally, but I need to load the BBC data into the vector database, and that code isn’t provided.

I tried this code, adapted from the previous lab…

from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import Filter
from tqdm import tqdm
import weaviate
from weaviate.util import generate_uuid5

vectorizer_config = [Configure.NamedVectors.text2vec_transformers(
    name="vector",  # the name you will use to access the vectors of the objects in your collection
    source_properties=['article_content', 'description', 'guid', 'link', 'pubDate', 'title'],  # properties used to generate a vector; they are appended to each other before vectorizing
    vectorize_collection_name=False,  # if True, the collection name is prepended to the text being vectorized
    inference_url="http://127.0.0.1:5000",  # URL of the API-based vectorizer, which was set up in our Flask application
)]

# LOADING THIS TAKES OVER AN HOUR TO RUN, so think twice before deleting the collection!


# Delete the collection in case it exists
#if client.collections.exists("bbc_collection"):
#    client.collections.delete("bbc_collection")

if not client.collections.exists('bbc_collection'):
    collection = client.collections.create(
        name='bbc_collection',
        vectorizer_config=vectorizer_config,  # the config we defined above
        reranker_config=Configure.Reranker.transformers(),  # the reranker config

        properties=[  # Define properties
            Property(name="article_content",vectorize_property_name=True,data_type= DataType.TEXT),
            Property(name="description",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="guid",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="link",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="title",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="pubDate", data_type=DataType.DATE),
        ]
    )

    # Set up a batch process with specified fixed size and concurrency
    with collection.batch.fixed_size(batch_size=100, concurrent_requests=1) as batch:
        # Iterate over a subset of the dataset
        for document in tqdm(bbc_data): # tqdm is a library to show progress bars
            # Generate a deterministic UUID from the document's contents for unique identification
            uuid = generate_uuid5(document)

            # Add the object to the batch with properties and UUID.
            # properties expects a dictionary with the keys being the properties.
            batch.add_object(
                properties=document,
                uuid=uuid,
            )
else:
    collection = client.collections.get("bbc_collection")

… and it sort of works, as seen here:

print(f"The number of elements in the collection is: {len(collection)}")

The number of elements in the collection is: 9973

But this doesn’t match the lab (which says 75256).
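One thing I noticed while debugging (just a guess, not confirmed): generate_uuid5 is deterministic, so any duplicate documents in bbc_data map to the same UUID and overwrite each other in the batch, which could explain part of the shortfall. A quick stdlib sketch of the idea (uuid.uuid5 over a canonical JSON dump stands in for weaviate's generate_uuid5; the documents are made up):

```python
import json
import uuid

# A UUIDv5 is derived purely from the input content, so identical
# documents always produce identical IDs (stdlib sketch of the idea):
def uuid_for(doc: dict) -> uuid.UUID:
    return uuid.uuid5(uuid.NAMESPACE_DNS, json.dumps(doc, sort_keys=True))

a = {"title": "Same article", "guid": "abc"}
b = {"title": "Same article", "guid": "abc"}  # duplicate feed entry
print(uuid_for(a) == uuid_for(b))  # → True: the second insert overwrites the first
```

So if the raw feed data contains repeats, the collection count would come out lower than the document count.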

And then I get errors downstream:

KeyError                                  Traceback (most recent call last)
Cell In[13], line 4
      2 print("Printing the properties (some will be truncated due to size)")
      3 print_object_properties(object.properties)
----> 4 print("Vector: (truncated)",object.vector['main_vector'][0:15])
      5 print("Vector length: ", len(object.vector['main_vector']))

KeyError: 'main_vector'
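I suspect this particular KeyError is just a naming mismatch: my adapted config registers the named vector as "vector", while the lab notebook reads object.vector['main_vector'], so the lab's cached collection presumably used a different name. A minimal stdlib sketch of the mismatch (the plain dict stands in for object.vector; no Weaviate server needed):

```python
# Named vectors are keyed by the `name=` given in the NamedVectors config.
configured_name = "vector"  # from Configure.NamedVectors.text2vec_transformers(name="vector", ...)
object_vector = {configured_name: [0.01, 0.02, 0.03]}  # stand-in for object.vector

# Reading a name that was never configured raises the KeyError above:
try:
    object_vector["main_vector"]  # what the lab notebook reads
except KeyError as e:
    print("KeyError:", e)

# Reading the configured name works:
print(object_vector[configured_name][:3])
```

If that's right, either renaming the vector to "main_vector" in the config or changing the lookup key should make the downstream cells run.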

Can you share the code you used to create the cached collection in the lab?


Dear @dougdonohoe,

Did you run it on the Coursera platform?


Keep Learning AI with DeepLearning.AI - Girijesh

Yes.


So you want to run it locally?


Keep Learning AI with DeepLearning.AI - Girijesh

Yes, which is why the subject includes “how to run locally?”.


Dear @dougdonohoe,

Great!
Please send me the code via DM (Direct Message).
You can do this by clicking on my name and then clicking Message.

Kindly note that sharing code in the community is against the code of conduct, so please do not post it here.


Keep Learning AI with DeepLearning.AI - Girijesh

I don’t have any code to share other than what I posted in the original message, which was just what I attempted to do in order to run locally. I did not share any solutions to assignments.

I don’t want a private reply - I was hoping this is something that would benefit any community member trying to run these labs locally. We are, after all, allowed to download labs for future use. Some of these labs rely on cached vector databases, which makes the local labs unusable unless we can recreate that data.

That’s what I’m asking for. I shared the code snippet to demonstrate that I attempted to figure this out on my own, as I would expect of any good engineer.

So the question is, as a learning platform, is Coursera willing to share this small detail: how they loaded the BBC data into the vector database, so we can reproduce it locally?

Thanks,

Doug


Hi Doug, how successful were you with your quest? I’m asking because I’m facing the exact same problem.

To be more specific, I’m using Windows, and the Weaviate used in this assignment is the embedded one, which runs only on Linux.

So I managed to install Docker Compose (using WSL2), Weaviate, etc., but when I load the collection it’s always zero-sized. Researching, I found that (pretty obviously) my Weaviate doesn’t have the bbc_collection, and I couldn’t find it anywhere inside the downloaded zip.
The problem is, even if I found it, I would have to learn how to put the files into my Docker Weaviate and, of course, pray that it works (I’m pretty confident it won’t).

So, plan B is to learn how to repopulate the bbc_data into the bbc_collection, solving the vector problem that you’ve shown.

Hi @sandrix - I didn’t pursue this any further. I was able to complete the assignment in the online notebook, so I gave up on running locally.

Yes, there is nothing in the zip file, so we would have to load the data ourselves. When I attempted this with my code above, there were two other things I should note. First, it took over an hour to run through all the data (probably why they cached it for us). Second, I was getting warnings about dates (something about UTC). I’m not sure if this was the reason not all the data got loaded. I could probably figure it out given more time, but it’s just not worth it. Maybe the folks at Coursera will respond to this thread, but it’s not clear anybody from the course itself pays attention here.
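For what it’s worth on the date warnings: Weaviate’s DATE type expects RFC 3339 timestamps with an explicit timezone, while RSS-style pubDate strings are RFC 2822, so they need converting first. A sketch with the stdlib (I’m assuming the pubDate format here; I haven’t re-checked what the BBC data actually contains):

```python
from email.utils import parsedate_to_datetime

# RSS pubDate strings are RFC 2822 ("Mon, 02 Jan 2023 15:04:05 GMT").
# parsedate_to_datetime yields a timezone-aware datetime, and isoformat()
# then produces the RFC 3339 form Weaviate's DATE type expects.
raw = "Mon, 02 Jan 2023 15:04:05 GMT"
dt = parsedate_to_datetime(raw)
print(dt.isoformat())  # → 2023-01-02T15:04:05+00:00
```

Converting pubDate this way before batch.add_object might silence the UTC warnings, though I haven’t verified that it fixes the missing rows.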

Hi @dougdonohoe,

Thank you for bringing up this topic, and apologies for my delayed response—I was out of the office. You raise a great point, and I’m currently working on sharing the exact code used to vectorize the entire database.

Unfortunately, due to the very large size of the database, it isn’t possible to provide it for download within the Coursera environment. I realize this may be frustrating, and I appreciate your understanding.

Once I’ve finalized and updated the code, I’ll be sure to share it here so you’ll have access as soon as possible. Please let me know if you have any other questions in the meantime.

Best,

Lucas

Hi @lucas.coutinho,

Thanks for replying.

How big is it? I’m asking because there were some assignments (or labs, I can’t remember now) that reached 3GB+ in size, so for me it’s not a big deal.

Regards,

Sandro