Hi,
So I haven’t gotten to the LLM stage yet, though it at least seems interesting.
I was wondering if anyone could point me to a good open-source reference on using LLMs for ‘non-language’ applications. In the examples I have seen, they are doing protein folding and gene / cell identification.
In particular I am curious how they structure their datasets for the training phase.
But I am not really sure where to look (?)
Just a small bump on this–
And, ‘full disclosure’, I also asked basically the same question on Hugging Face and have found no response or traction there so far either…
Though surely someone here must be working on something similar. And again, my question is not about the model, but rather about prepping the data structure.
I think people are ignoring this question because it doesn’t quite make sense in the context of an LLM.
A large language model will generate text based on input sequences, which can be referred to as ‘prompts’.
Prompt-learning is a recent paradigm for adapting pre-trained language models (PLMs) to downstream NLP tasks.
You would need a prompt-engineering framework, with which you can design prompts and interact with an LLM through an API. I have a little experience in the biomedical context, where we used a prompt framework to classify unlabeled medical data; the framework I used is called OpenPrompt.
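To give a feel for it, here is a minimal classification sketch along the lines of OpenPrompt’s README; the medical snippets and label words below are made-up placeholders, not from our actual project:

```python
# Minimal OpenPrompt classification sketch, patterned on the library's README.
# The example texts and label words are invented placeholders.
import torch
from openprompt.data_utils import InputExample
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer
from openprompt import PromptForClassification, PromptDataLoader

classes = ["negative", "positive"]
dataset = [
    InputExample(guid=0, text_a="Patient reports no adverse symptoms."),
    InputExample(guid=1, text_a="Severe reaction observed after dosage."),
]

# Load a pre-trained language model plus its tokenizer and wrapper class.
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# The template wraps each example's text around a masked slot.
template = ManualTemplate(
    text='{"placeholder":"text_a"} It was {"mask"}',
    tokenizer=tokenizer,
)

# The verbalizer maps words predicted at the mask back onto class labels.
verbalizer = ManualVerbalizer(
    classes=classes,
    label_words={"negative": ["bad"], "positive": ["good"]},
    tokenizer=tokenizer,
)

model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)

data_loader = PromptDataLoader(
    dataset=dataset,
    template=template,
    tokenizer=tokenizer,
    tokenizer_wrapper_class=WrapperClass,
)

model.eval()
with torch.no_grad():
    for batch in data_loader:
        logits = model(batch)
        print(classes[torch.argmax(logits, dim=-1).item()])
```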
From my understanding, those projects involved custom development; they are not available as off-the-shelf open-source software packages.
Hope that helps!
Anything helps @dan_herman.
Quite obviously I’m not looking to ChatGPT for a solution (which is why I specifically mention the Transformer architecture). People have figured out, ‘Hey, we can use this on other data.’ I even have a hunch they are not feeding inputs to the model as natural language; in the protein-folding case it is more like, ‘Okay, here are my amino acids.’ Though even then, bases (or genes, in the other case) don’t have any ‘obvious’ order or organization, in contrast to language, where you can say ‘this is a sentence’. Or if they did, and we understood it, we probably wouldn’t be running ML models on them.
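To make my hunch concrete, here is a hypothetical sketch of what I imagine the dataset prep might look like (purely my own invention, not from any real codebase): treat each amino acid’s one-letter code as a ‘token’, the way a character-level language model would, so a standard Transformer can consume the sequence:

```python
# Hypothetical sketch: tokenizing protein sequences for a Transformer,
# treating each one-letter amino acid code as a character-level token.
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD, BOS, EOS = 0, 1, 2                # special token ids
token_to_id = {aa: i + 3 for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq: str, max_len: int) -> torch.Tensor:
    """Map a protein sequence to a fixed-length tensor of token ids."""
    ids = [BOS] + [token_to_id[aa] for aa in seq] + [EOS]
    ids = ids[:max_len] + [PAD] * (max_len - len(ids))  # truncate or pad
    return torch.tensor(ids)

# Two made-up fragments; a real dataset would pair each with labels
# (e.g. structural annotations) depending on the training objective.
batch = torch.stack([encode("MKTAYIAK", 12), encode("GAVLI", 12)])
print(batch.shape)  # torch.Size([2, 12])
```

Whether the real pipelines do anything like this is exactly what I am trying to find out.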
Yet aside from hearing about it, I haven’t been sure where to search for more details.
I will look into your suggestion more, though.
Thanks,
-A
The other possibility is that this topic is quite a bit more advanced, and the people with experience doing that sort of thing don’t hang out here; they’re already at Google DeepMind. I have not looked into any of this, but AlphaFold is the famous example in the “protein folding” space. I don’t know much about it beyond having listened to some podcasts, but I’m not sure you’d call it an “LLM”. My guess would be that it uses Transformers underneath, which is what LLMs use.
My suggestion would be to try googling AlphaFold and see what they say on their website.
This is very possible.
I know tons of people are jumping on the ‘traditional’ LLM train in terms of applications, but, just as an example, I am more interested in uses of AI like DeepMind’s AlphaTensor, which managed to figure out how to improve on Strassen’s matrix multiplication algorithm.
These are the types of optimization problems that really interest me.
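For context (a textbook illustration, not DeepMind’s code): Strassen’s trick multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8, and AlphaTensor searched automatically for schemes of this kind with even fewer multiplications:

```python
# Strassen's trick: multiply two 2x2 matrices with 7 multiplications
# instead of the naive 8 -- the kind of algorithmic improvement
# AlphaTensor searched for automatically.
def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))  # [[19, 22], [43, 50]]
```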
I will try Googling for papers, though of course, it is often easier to run into someone who says, ‘Oh! I am working on these types of things’, and ask a few questions.
I took a look, and the Google DeepMind website has lots of information; have a look at the AlphaFold section there. They refer to a paper as well, but I did not go far enough to find the link or to read what it says about the input data, if anything. But just reading the high-level info: they are dealing with proteins, which are large and complex molecular structures, so the data representation is not going to be a simple matter and will probably require some knowledge of chemistry and biology beyond the secondary-school level to understand. Just my guess, though. Let us know if you find the type of information you are looking for.
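If it helps as a starting point, here is a very loose illustration of the representation question (a hypothetical sketch only, not AlphaFold’s actual pipeline, which as I understand it also consumes multiple sequence alignments and many other features): turning a protein sequence into a numeric array, with one one-hot vector per residue.

```python
# Loose illustration only -- NOT AlphaFold's actual input pipeline.
# One simple way to turn a protein sequence into a numeric array:
# a one-hot vector per residue over the 20 standard amino acids.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str) -> np.ndarray:
    """Return an array of shape (len(seq), 20)."""
    out = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        out[pos, index[aa]] = 1.0
    return out

features = one_hot("MKTAYIAK")  # made-up fragment
print(features.shape)  # (8, 20)
```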