I’m trying to run the Module 3 assignment locally, but I first need to load the BBC data into the vector database, and that loading code isn’t provided.
I tried this code, adapted from the previous lab…
```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.util import generate_uuid5
from tqdm import tqdm  # progress bars

vectorizer_config = [
    Configure.NamedVectors.text2vec_transformers(
        # The name you will need to access the vectors of the objects in your collection
        name="vector",
        # Properties appended to each other when generating each object's vector
        source_properties=["article_content", "description", "guid", "link", "pubDate", "title"],
        # If True, the collection name is prepended to the text being vectorized
        vectorize_collection_name=False,
        # URL of the API-based vectorizer (the Flask application we set up earlier)
        inference_url="http://127.0.0.1:5000",
    )
]

# LOADING TAKES ABOUT AN HOUR TO RUN, so uncomment the delete with care!
# if client.collections.exists("bbc_collection"):
#     client.collections.delete("bbc_collection")

if not client.collections.exists("bbc_collection"):
    collection = client.collections.create(
        name="bbc_collection",
        vectorizer_config=vectorizer_config,                # the config defined above
        reranker_config=Configure.Reranker.transformers(),  # the reranker config
        properties=[
            Property(name="article_content", data_type=DataType.TEXT, vectorize_property_name=True),
            Property(name="description", data_type=DataType.TEXT, vectorize_property_name=True),
            Property(name="guid", data_type=DataType.TEXT, vectorize_property_name=True),
            Property(name="link", data_type=DataType.TEXT, vectorize_property_name=True),
            Property(name="title", data_type=DataType.TEXT, vectorize_property_name=True),
            Property(name="pubDate", data_type=DataType.DATE),
        ],
    )
    # Batch insert with a fixed batch size and limited concurrency
    with collection.batch.fixed_size(batch_size=100, concurrent_requests=1) as batch:
        for document in tqdm(bbc_data):
            # Deterministic UUID derived from the whole document dict
            uuid = generate_uuid5(document)
            # `properties` expects a dict keyed by property name
            batch.add_object(properties=document, uuid=uuid)
else:
    collection = client.collections.get("bbc_collection")
```
… and it sort of works, as seen here:
```python
print(f"The number of elements in the collection is: {len(collection)}")
# The number of elements in the collection is: 9973
```
But this doesn’t match the lab, which reports 75256 objects.
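One thing I wondered about the count mismatch: `generate_uuid5` is deterministic for equal inputs, so any duplicate documents in `bbc_data` would collapse into a single object on insert. Here's a minimal illustration of the idea using the stdlib `uuid.uuid5` directly (my own sketch, not Weaviate's helper or the lab's code):

```python
# My assumption about the mismatch: identical documents get the same UUID,
# so batch inserts of duplicates overwrite one object rather than adding
# new ones. Stdlib uuid.uuid5 is used here to show the principle;
# weaviate.util.generate_uuid5 is likewise deterministic for equal inputs.
import uuid

doc = {"title": "Same headline", "description": "Same text"}
id_a = uuid.uuid5(uuid.NAMESPACE_DNS, str(doc))
id_b = uuid.uuid5(uuid.NAMESPACE_DNS, str(doc))
print(id_a == id_b)  # True: re-inserting the same document reuses its ID

docs = [doc, dict(doc), {"title": "Different headline"}]
unique_ids = {uuid.uuid5(uuid.NAMESPACE_DNS, str(d)) for d in docs}
print(len(unique_ids))  # 2: the duplicate pair collapsed into one ID
```

Could the cached collection have been built from a deduplicated feed, or does the lab insert each item under a fresh UUID?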
And then I get errors downstream:
```
KeyError                                  Traceback (most recent call last)
Cell In[13], line 4
      2 print("Printing the properties (some will be truncated due to size)")
      3 print_object_properties(object.properties)
----> 4 print("Vector: (truncated)", object.vector['main_vector'][0:15])
      5 print("Vector length: ", len(object.vector['main_vector']))

KeyError: 'main_vector'
```
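I suspect this particular error is just the vector name: my config calls the named vector `"vector"`, while the notebook indexes `object.vector['main_vector']`. As a local stopgap I wrapped the lookup in a fallback (`pick_vector` is my own hypothetical helper, not from the lab):

```python
# Stopgap: fall back to whichever named vector exists when the expected
# key is missing. object.vector in the v4 client is a dict mapping each
# named vector's name to its list of floats.
def pick_vector(vectors: dict, preferred: str = "main_vector"):
    """Return (name, vector) for the preferred named vector,
    falling back to the first available name if it is absent."""
    name = preferred if preferred in vectors else next(iter(vectors))
    return name, vectors[name]

# Simulating my object, where the only named vector is called 'vector':
name, vec = pick_vector({"vector": [0.1, 0.2, 0.3]})
print(name, vec[:15])  # vector [0.1, 0.2, 0.3]
```

But that's a workaround, not a fix, and it doesn't explain the object count.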
Can you share the code you used to create the cached collection in the lab?