Inference and NLP Tasks - is there a model comparison?

This is not an issue but a curiosity: I was wondering whether there are any scoreboards for NLP capabilities. For example, I am sure GPT-4 is quite good at advanced NLP tasks, but it also has the highest inference cost, followed by the much cheaper GPT-3.
But is there any estimate of skill versus cost? Or, even better, is this assessed for other LLMs/foundation models that could also be installed locally or run in a container?

Just thinking out loud… but if anybody knows, I’d love to hear…

Large models like GPT-4 offer impressive capabilities, but we all know that they come with significant computational costs. The challenge, and an active area of research, is finding ways to get the best of both worlds: high performance at manageable cost. I’d advise checking out AI research conferences like NeurIPS, ICLR, and ACL, where many of these advancements and findings are presented. You can also check out model hubs like Hugging Face, as they even document things like the carbon footprint of LLMs.
Also, there are several benchmarks for comparison. These are presented in the models’ papers.
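
If you want to poke around the Hub programmatically, the `huggingface_hub` client can list models per task tag, sorted by downloads. A minimal sketch, assuming a recent version of the library (the task tag is just an example):

```python
# Minimal sketch: list popular models for an example task tag on the Hugging Face Hub.
# Requires `pip install huggingface_hub`; "text-classification" is just an example tag.
from huggingface_hub import list_models

for model in list_models(filter="text-classification", sort="downloads", direction=-1, limit=5):
    # `downloads` may be absent depending on the library version, hence the getattr fallback.
    print(model.id, getattr(model, "downloads", "n/a"))
```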


I’ve been trying to get AI to do this research for me, not too successfully :wink: - so thanks for the suggestion, I find the Hugging Face model hub outstanding! Still, I have been trying to find the information you suggest without much success…
Do you have links to those papers that compare and score models? That would be really helpful.
The Hugging Face model pages do not contain comparisons, apart from downloads and likes… Models - Hugging Face
but not an F1 score (or any other kind of metric).
I checked the documentation and can’t find that either.
Thanks anyway, appreciated :slight_smile: - if you have one or two of those links, I will take the time to check them :wink:

Hi @joslat,

When evaluating Large Language Models (LLMs), it’s not common to use metrics like F1 scores in isolation. Instead, they are predominantly compared based on their performance on benchmark datasets designed for specific tasks. The nature of these models means that one might outshine another on a particular task, but the roles could reverse on a different task.
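
To make that concrete: task-level scores such as F1 or accuracy are computed per benchmark, once you have a model’s predictions for it. A minimal sketch with the `evaluate` library, using made-up toy labels rather than a real benchmark:

```python
# Minimal sketch: per-task metrics are computed on benchmark predictions, not in isolation.
# Requires `pip install evaluate`; the predictions/references below are toy values.
import evaluate

f1 = evaluate.load("f1")
accuracy = evaluate.load("accuracy")

predictions = [0, 1, 1, 0, 1]   # hypothetical model outputs on a classification task
references = [0, 1, 0, 0, 1]    # gold labels from the benchmark

print(f1.compute(predictions=predictions, references=references))
print(accuracy.compute(predictions=predictions, references=references))
```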

For instance, take a look at this research paper:

The paper delves into a comparative analysis of different LLMs, focusing specifically on Natural Language Generation (NLG).

If you have a specific NLP task in mind, it would be more insightful to identify which models excel in that domain. Platforms like Hugging Face provide a valuable resource in this regard, showcasing a plethora of models and their performance across various tasks.

I hope this helps clarify things a bit more.


Thanks a ton @lukmanaj !! Greatly appreciate it! I am digging into the Generative AI world like there is no tomorrow these days, first polishing my prompt engineering and understanding - I have been using these models for a while, but now it is my job, so I’d better do it well :slight_smile:
And I understand the best thing to do may be to set up a scenario for the concrete tasks, with a battery of tests specific to each task, and then evaluate the models myself.
Ideally with something that can swap those LLMs/foundation models, like LangChain or Semantic Kernel…
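
As a rough sketch of what I mean, using the `transformers` pipeline API as a stand-in (the model names and prompts are just placeholders; a real setup would use task-specific datasets and proper metrics rather than eyeballing the generations):

```python
# Rough sketch of a swappable-model test battery using transformers pipelines.
# Model names and prompts are placeholders, not recommendations.
from transformers import pipeline

candidate_models = ["gpt2", "distilgpt2"]   # hypothetical candidates to compare
test_battery = [
    "Summarize: The quick brown fox jumps over the lazy dog.",
    "Translate to French: Good morning, how are you?",
]

for model_name in candidate_models:
    generator = pipeline("text-generation", model=model_name)
    print(f"=== {model_name} ===")
    for prompt in test_battery:
        output = generator(prompt, max_new_tokens=30)[0]["generated_text"]
        print(f"- {prompt!r} -> {output!r}")
```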

It was enlightening and eye-opening, so thanks! I thought Hugging Face was just a library that optimized a lot of NLP-related tasks, but it is much more than that.


You’re welcome @joslat. Looking forward to what you come up with. Kindly share when the time comes.