Where is the magic? Doing analysis on a data set

I enrolled in this course because, like many, I want to make our ISO documentation a little more accessible. I’ve finished the course but haven’t yet set up a system to run against my own data due to the time investment required to make an actual MVP (rather than a POC in a notebook).

While reflecting on what I learned in the course and planning the MVP, it occurred to me that the magic happens in two places, and the main one is not the interaction with the LLM. Instead, the main magic is the vector store / search (a system that successfully associates semantic values). The second bit of magic is the LLM’s ability to present the search results in a more human-like manner.
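To make that first bit of magic concrete, here is a minimal sketch of what a vector store does at its core: embed each document as a vector, embed the query the same way, and rank documents by cosine similarity. This is a toy illustration only; the `embed` function below is a hypothetical bag-of-words stand-in, whereas a real system would use learned dense embeddings (e.g. from a sentence-transformer or an embeddings API) and an indexed store rather than a linear scan.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector keyed by word.
    # Real vector stores use dense learned embeddings instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical snippets standing in for ISO 27001 documents.
docs = [
    "Access control policy for user accounts",
    "Backup work instruction for server data",
    "Password policy and rotation rules",
]

def retrieve(query, k=2):
    # Rank all documents by similarity to the query; return the top k.
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

print(retrieve("policy for user passwords"))
```

The retrieved snippets are then placed into the LLM’s context, which is where the second bit of magic (the human-like presentation) happens. It also shows why an "analyze everything" query struggles: only the top-k most similar chunks ever reach the model.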

As I continued to consider the pros, cons, and approaches to the MVP, it also occurred to me that there are certain types of questions the system won’t be much help with. For example, using our ISO 27001 documentation as the data source, if I request: “Provide examples where the ISO 27001 policies contradict the work instructions or vice versa”, it seems unlikely I’ll get the response I’m looking for. I’m requesting an analysis of all the data, whereas this system would first find documents similar to the request (which could be few or none) and then have the LLM present those results.

This makes me wonder if there is any way to “extend” the LLM to include my own data, rather than doing a search and passing the results in via the context. Could I instead train my own tiny LM, run the query above against my own model, and submit the output in the context to GPT for a more accurate response, or is that what a vector store is actually doing?

I would really love to hear what you all think.

You should check out the Generative AI course we just launched; it gives a few examples of how to extend the training of an LLM with a specific dataset, which could be helpful to you.

As for training your own language model from scratch: it’s not a straightforward idea, because the model will be very simplistic and won’t offer an LLM’s capabilities or behaviour unless you have a lot of computing power, data, and experience building a large enough one.

Great tip about the generative AI course. I’ll definitely check that out. I’ve trained computer vision models in the past using fastai, which works more than well enough for my needs. I totally understand what you are saying about training your own model, but I wasn’t sure if there was a way to use a pre-trained model and get something valuable out of it. Thanks for the tips!


Is this the course you are referring to, @gent.spah?

Yes, that’s the one I am suggesting!