I was wondering if we can create a dataset from the transcripts of individual videos to build an LLM application with RAG + vector DB for educational purposes. I would like to practice some of the new skills that we have learned from the short courses.
In summary, would it be okay to create and share such a dataset in the following format?
Course: Advanced Retrieval for AI with Chroma
Video 1: Introduction
Video 2: Overview of embeddings-based retrieval
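To make the proposed format concrete, here is a minimal sketch of how each video could be stored as one record and written out as JSON Lines. The `TranscriptRecord` class, its field names, and the `to_jsonl` helper are assumptions for illustration, and the transcript text is a placeholder rather than real course content.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TranscriptRecord:
    """One video transcript plus the metadata needed to cite it later."""
    course: str
    video_number: int
    video_title: str
    transcript: str


# Placeholder transcripts; real text would come from the free course transcripts.
records = [
    TranscriptRecord(
        "Advanced Retrieval for AI with Chroma", 1, "Introduction",
        "<transcript text goes here>",
    ),
    TranscriptRecord(
        "Advanced Retrieval for AI with Chroma", 2,
        "Overview of embeddings-based retrieval",
        "<transcript text goes here>",
    ),
]


def to_jsonl(items):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in items)


jsonl = to_jsonl(records)
print(jsonl)
```

Keeping the course name, video number, and title next to the text means the source of any retrieved passage can be reported alongside it.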
Yes, that would be nice. I could use a different dataset for the same purpose, but I thought it would be nice to share this one with the community as well, and it would be a good demonstration of the course content.
Thank you so much for your answer! All clear now. My only intention is to be able to retrieve the relevant information and where it is covered in the courses. It’s only for educational purposes, with no monetization. For example, I will ask "What is RAG?" and I will retrieve the answer plus where it is covered (e.g. Course: Advanced Retrieval for AI with Chroma - lecture 1 - Introduction).
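The "answer + where it's told" idea can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a plain bag-of-words cosine similarity from the standard library as a stand-in for real embeddings and a vector DB, the `CHUNKS` texts are made up rather than quoted from the course, and `vectorize`, `cosine`, and `retrieve` are hypothetical helper names.

```python
import math
import re
from collections import Counter

# Made-up chunk texts for illustration; not quoted from the course.
CHUNKS = [
    {
        "course": "Advanced Retrieval for AI with Chroma",
        "lecture": 1,
        "title": "Introduction",
        "text": "RAG stands for retrieval augmented generation: retrieve "
                "relevant documents and pass them to the model.",
    },
    {
        "course": "Advanced Retrieval for AI with Chroma",
        "lecture": 2,
        "title": "Overview of embeddings-based retrieval",
        "text": "Embeddings map text to vectors so that similar passages "
                "end up close together.",
    },
]


def vectorize(text):
    """Bag-of-words term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks=CHUNKS):
    """Return the best-matching chunk text and a formatted source string."""
    q = vectorize(question)
    best = max(chunks, key=lambda c: cosine(q, vectorize(c["text"])))
    source = f'Course: {best["course"] } - lecture {best["lecture"]} - {best["title"]}'
    return best["text"], source


answer, source = retrieve("What is RAG?")
print(answer)
print(source)
```

In a real application the similarity search would be handled by the vector DB, with the course, lecture, and title stored as metadata on each chunk so they come back with the match.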
@snnclsr, this is a great initiative! This is aligned with our code of conduct, as we really appreciate when learners collaborate to share knowledge.
However, just keep in mind that this will be fine as long as you stick with free resources that are available to everyone in this forum. This includes the slides and transcripts. It does not include ungraded labs or assignments, since they are not accessible to everyone in this forum.
Thanks again for this initiative, and for sharing it with the community. We’re excited to learn the different ways that people use it to improve their learning.
Just my 1 cent worth, as a learner… I’m not sure that the transcripts in their raw source form will work well for @snnclsr’s purpose. For myself, I’ve started downloading the transcripts and then I revise, edit, and simplify the text so that I can understand it better. I also get rid of various ‘asides’ and tangential comments that distract from the key points, and add cross-references to external content as needed. In other words, I think it could be useful to clean up the transcripts so that they work best for your purpose. Perhaps you’ve already considered doing this?
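Part of that cleanup could be automated before any manual editing. The following is a rough sketch, assuming the raw transcripts carry `m:ss` style timestamps and spoken fillers; the `clean_transcript` name and the filler list are assumptions, and removing asides or adding cross-references would still need a human pass.

```python
import re

# Timestamps such as "0:05", "[00:12]" or "1:02:33", plus trailing whitespace.
TIMESTAMP = re.compile(r"\[?\b\d{1,2}:\d{2}(?::\d{2})?\]?\s*")

# A small, assumed list of spoken fillers; extend it to fit the material.
FILLERS = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)


def clean_transcript(raw):
    """Strip timestamps and fillers, then collapse whitespace and line breaks."""
    text = TIMESTAMP.sub("", raw)
    text = FILLERS.sub("", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "0:05 So, um, embeddings map text\n0:09 to vectors, you know, for search."
print(clean_transcript(raw))
```

Pattern-based cleaning like this is blunt, so it is worth spot-checking the output against the original to make sure no meaningful phrases were dropped.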