Cool development for fast, local, CPU-driven LLM inference

All of Justine’s work is both ‘open’ and pretty impressive.


It is actually an impressive read, especially the parts about CPU performance on the Mac Studio and fast tokenisation.

Thank you for the share @Nevermnd

Discussing this article will probably help us gain more knowledge about CPU performance as it relates to LLaMA, among other factors.


Actually @Deepti_Prasad, though I haven’t gotten to LLMs yet, it does sound like you have, so I had a question for you I’d been wondering about:

Now, you don’t notice this on Copilot for desktop, nor ever on Gemini, and… oh shoot… they must have updated or changed this very recently.

I’m no longer noticing it on GPT-4 via Bing (the free way to use it); otherwise I would have shown you a screenshot. Perhaps you’ve seen it yourself, or else you’ll have to trust me.

When you asked Bing Chat a question, before responding it would display something of a ‘condensed version’ of your question, as if it were being parsed down to the main points on which to perform inference.

Is this actually part of the process, or is/was it just displayed for the user’s benefit? Also, it seems to me the process of reducing the inquiry must be driven by a different model than the one performing the actual inference on the question itself(?).
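To make the two-stage flow I’m imagining concrete, here is a purely hypothetical sketch: a lightweight “condenser” step strips the question down to its main points before a second model ever sees it. Both functions are made-up stubs, not anything Bing actually exposes, and the stopword list is just for illustration:

```python
# Hypothetical two-stage pipeline: condense the user's question first,
# then run "inference" on the condensed form. Both stages are stubs.

STOPWORDS = {"a", "an", "the", "please", "can", "you", "me", "i",
             "how", "to", "of", "is", "do", "what", "would", "like"}

def condense(question: str) -> str:
    """Keep only the content-bearing words (a crude keyword filter)."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    return " ".join(w for w in words if w and w not in STOPWORDS)

def answer(condensed: str) -> str:
    """Stand-in for the main inference model."""
    return f"[model answers about: {condensed}]"

query = "Can you please tell me how to drive a stepper motor?"
short = condense(query)
print(short)          # tell drive stepper motor
print(answer(short))
```

If something like this really is happening, the condenser could be a much cheaper model than the one doing the actual generation, which might be the whole point.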

Would be interested in your thoughts.

Hello @Nevermnd

Response is lengthy, so take your time to read :slight_smile:

I could answer your question better after seeing this, either as an image or in some text form, as I am not sure of your question. You are basically stating that a question, when asked by you, gets parsed down to its main subject of interest?? You could try it again and show me, as I haven’t used ChatGPT or Copilot much.

Here is a linked response on the issue you are mentioning:

All in all, any model tries to give a better response when you feed it data again and again. So if you want a detailed response to your query, you could include a statement that its response was incomplete and post the question again to GPT-4, repeating until you get the kind of detailed response you are looking for.
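That ask-again loop can be sketched roughly like this. The `chat` function here is just a stand-in stub (not a real API), and the “detail level” check is a made-up stand-in for your own judgement of whether the answer is good enough:

```python
# Sketch of the "ask again with feedback" loop: keep re-posting the
# question with a complaint until the reply is detailed enough.

def chat(prompt: str) -> str:
    """Stub model: pretends to give more detail each time it is pushed."""
    chat.calls += 1
    return f"answer with detail level {chat.calls}"
chat.calls = 0

def ask_until_detailed(question: str, min_detail: int = 3) -> str:
    prompt = question
    while True:
        reply = chat(prompt)
        detail = int(reply.rsplit(" ", 1)[-1])  # stand-in quality check
        if detail >= min_detail:
            return reply
        # Feed the question back with the "incomplete" complaint.
        prompt = f"Your response is incomplete. {question}"

reply = ask_until_detailed("Explain stepper motor drivers")
print(reply)  # answer with detail level 3
```

With a real chatbot, you would of course be the judge of “detail level” yourself, but the loop structure is the same.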

You also need to understand (from what I remember of the ChatGPT course I did): the instructors first created a chat console that gave detailed responses, but users found them lengthy and boring to read completely. So they tried to create a better model that gives concise answers, and hence you are getting such responses.

I don’t know if you have done the NLP specialisation, but in a question-answering algorithm, an attention model is created: the text is tokenised and masked, encoder-decoder layers are built, attention is focused on the main words, and the response is then generated from the hidden-unit outputs over the vocabulary. So the response is a better-summarised version of the question being asked.
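As a toy illustration of that attention step, here is scaled dot-product attention in plain Python over a few made-up token vectors (this is the textbook mechanism, not what any particular chatbot runs): keys that resemble the query get most of the weight, which is one way the “main words” end up dominating.

```python
# Toy scaled dot-product attention: score each key against the query,
# softmax the scores into weights, then mix the values by those weights.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return weights, mixed

# Pretend these are vectors for the tokens "stepper", "motor", "the";
# the query resembles the first two, so they attract more weight.
keys   = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
values = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
query  = [1.0, 0.0]

weights, mixed = attention(query, keys, values)
print([round(w, 2) for w in weights])  # heaviest weight on the first key
```

In a real transformer the queries, keys, and values are learned projections of the token embeddings, but the weighting mechanism is exactly this.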

For example, imagine you asked me about this chat prompt, but I responded with ChatGPT’s history and then added some new information about fine-tuning LLMs at the end!!!

I am pretty sure half the readers asking such a prompt would be annoyed by my response. GPT does the same: it gives a concise answer after you ask a question (it gathers all the information related to the question and then gives you a brief response), so you are not annoyed reading it :upside_down_face:

I have tried this using ColabAI: when I want to ask a doubt about some code and am not happy with the response, I feed the question again and again with an added prompt about how I want the answer improved, and in the end it gives me the response I am looking for.

No matter what, I personally still believe reading and searching is the slow-burn way of gaining knowledge, but it is the best medium for a solid foundation, and a chatbot in such a medium should be used only for finding sources rather than for learning or gaining knowledge itself.

But here I wanted to share something with you about fine-tuning LLMs, and how retraining or fine-tuning them is not a simple objective. It is a very small presentation, but quite impressive.

Finetuning Large Language Models - Short course-11.pdf (1.5 MB)


@Deepti_Prasad I will reply more thoroughly once I complete a few things. I noticed this still occurs on Bing Chat (GPT-4), only now the altered prompt shows up so briefly that you hardly notice it. I had to take a video screen capture to be able to catch the frame:

I mean, obviously my question here is super simple, but you can see it still tries to condense it. You could try asking it something much more complicated, I guess. I just sort of wondered if this condensation was considered a necessary part of the process.

It is also interesting to think about (beyond subject, object, verb, etc.) how it determines which parts of the question are ‘important’.

Based on this image, the chatbot is still responding; it hasn’t finished yet. Notice the blue box highlighted at the bottom: “Stop responding”!!! That means the chatbot is still gathering all the information to respond to you :slight_smile:

You need to wait until that “Stop responding” is replaced with:
“I’ve provided a more detailed response this time. If you have further inquiries or need additional information, feel free to ask! :blush:”

I just tried your same prompt question and got a detailed response :joy:
In fact, I asked why it responded to you incompletely, and the chatbot apologised for it :rofl: (sharing an image)


Yes, of course I understand that, but for a ‘next-word prediction engine’ it also seems to be making an upfront decision that some of my words are ‘not important’.

For example, here I’ve modified the prompt ever so slightly, only by adding the word ‘fancy’. Yes, yes, typically you wouldn’t consider something as ‘banal’ as a stepper motor to need to be ‘fancy’. But maybe that is exactly what I want and what I am asking for.

And so I was wrong-- though I may still have many qualms as to whether this is ‘knowledge’, at least it didn’t seem to forget.

And honestly, I don’t use these interfaces much either, less so for code, but when I wish to arrange and then rearrange a concept, I feel it works pretty well.

I also admit, truthfully, that I am personally not wildly interested in the technology itself; I am more interested in its application.

It does not surprise me, because the same s*** happened with SEO, but as I pointed out somewhere else: if ‘prompt engineering’ is really a thing, then basically we have thus far failed at the NLP task.

I mean, it is supposed to understand our language, right? Not theirs.

In any case (I don’t play with ChatGPT all that much either; however, every now and then I do find it useful), on my second query it did take the ‘fancy’ into account. Hence my original question: is statement reduction a standard part of the process? I can only think not; it is something Microsoft implemented, at least for a time.