Proteins are complex biological molecules that play a vital role in a wide range of essential life processes. They carry out a variety of biological tasks in organisms, from enabling human vision to forming the intricate molecular machinery that transforms solar energy into chemical energy in plants. Proteins are built from 20 different types of amino acids and fold into complex 3D structures. That structure largely determines how a protein functions, so knowing it helps scientists understand how the protein works and develop strategies to mimic, change, or inhibit that behavior.
However, researchers cannot determine the final structure directly from the amino acid sequence alone. It can be obtained through simulation or experiment, but both are slow. Recent advances in artificial intelligence may lead to a new understanding of protein structure at an evolutionary scale. Predicting structures for the roughly 200 million cataloged proteins has only recently become possible. Large-scale gene sequencing has revealed billions of protein sequences, and characterizing their structures would require a breakthrough in the speed of structure prediction.
Meta AI recently announced progress in this direction: it used large language models to accelerate structure prediction and build the first comprehensive database at the scale of hundreds of millions of proteins. The database is the largest protein structure database to date and contains more than 600 million predicted structures. Compared to current state-of-the-art protein structure prediction methods, language models can speed up the prediction of an atomic-level three-dimensional structure by up to 60 times.
The team has publicly released the 15-billion-parameter, transformer-based ESM-2 model, the ESM Metagenomic Atlas (a database of predicted protein structures), and an API that allows researchers to use the model. According to the researchers, this advance will make it possible for the first time to understand the structures of the billions of proteins that gene-sequencing technology is cataloging. The database contains protein structures that scientists have not seen before. By studying them, scientists may learn more about the diversity of the natural world and make discoveries that could help treat illnesses, clean up the environment, and produce renewable energy.
Proteins can be compared to the text of an essay. Like written language, they can be expressed as strings of letters, where each character represents one of the 20 amino acids. However, there are significant and fundamental differences between the two. Each protein sequence folds into a three-dimensional shape, which is largely responsible for the protein’s biological activity, and protein sequences have statistical patterns that reveal details about that folded structure.
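To make the "string of letters" picture concrete, here is a minimal sketch of how a protein sequence might be turned into integer tokens before being handed to a language model. The index assignment, the function names, and the example fragment are assumptions made for illustration; they are not the vocabulary or tokenizer that ESM-2 actually uses.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary for illustration: each letter gets an integer index.
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str) -> list[int]:
    """Turn a protein sequence into a list of integer tokens."""
    sequence = sequence.upper()
    unknown = sorted(set(sequence) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"Non-standard residue letters: {unknown}")
    return [AA_TO_INDEX[aa] for aa in sequence]


def decode(tokens: list[int]) -> str:
    """Inverse of encode: map integer tokens back to letters."""
    return "".join(AMINO_ACIDS[t] for t in tokens)


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # made-up 10-residue fragment, not a real protein
    tokens = encode(example)
    print(tokens)
    print(decode(tokens))
```

A real tokenizer would also need special tokens (for example for padding or masking) and a policy for non-standard residues; this sketch simply rejects them.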
Evolutionary scale modeling (ESM) uses AI to learn to interpret these patterns. In 2019, a language model was trained on the sequences of millions of natural proteins using masked language modeling, a self-supervised learning method in which parts of a sequence are hidden and the model learns to fill them in. Through this training, the model learned information about the structure and function of proteins. ESM-2, the next-generation protein language model, builds on this approach. The team observed that, as the model is scaled up from 8 million to 15 billion parameters, information emerges in its internal representations that enables 3D structure prediction at atomic resolution.
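The data-preparation side of masked language modeling can be sketched in a few lines: hide a random subset of residues and keep the hidden letters as the targets a model would be trained to recover. The `<mask>` token name, the 15% masking rate (a common convention, not a figure from this article), and the `mask_sequence` helper are all assumptions for illustration, and no model is trained here.

```python
import random

MASK = "<mask>"  # placeholder token; the name is an assumption for this sketch


def mask_sequence(sequence, mask_rate=0.15, seed=None):
    """Hide a random subset of residues in a protein sequence.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the residue a model would be asked to predict there.
    """
    rng = random.Random(seed)
    # Mask at least one residue, but never more than the sequence holds.
    n_masked = min(len(sequence), max(1, round(len(sequence) * mask_rate)))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # hide it from the model
    return tokens, targets


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFS"  # made-up 20-residue fragment
    masked, targets = mask_sequence(seq, seed=0)
    print(" ".join(masked))
    print(targets)
```

Because the targets come from the sequence itself, no human labeling is needed, which is what makes the method self-supervised.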
Even with the resources of a major research organization, predicting structures for the sequences in a metagenomic database with current state-of-the-art computational tools could take years. A breakthrough in prediction speed is essential to make predictions at metagenomic scale. The researchers found that using a language model of protein sequences could speed up structure prediction by up to 60 times. This is fast enough to make predictions for an entire metagenomic database in a matter of weeks, and it scales to databases considerably larger than Meta’s ESM Metagenomic Atlas.
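A quick back-of-the-envelope calculation shows how a 60-fold speedup turns "years" into "weeks". The two-year baseline below is a hypothetical figure chosen only for illustration; the article gives no exact baseline, and "up to 60 times" is a best case, so this is an optimistic sketch.

```python
def accelerated_days(baseline_days: float, speedup: float) -> float:
    """Wall-clock days after applying a constant speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_days / speedup


if __name__ == "__main__":
    baseline_years = 2.0  # hypothetical baseline, not a number from the article
    baseline_days = baseline_years * 365
    fast_days = accelerated_days(baseline_days, speedup=60)  # best-case factor
    print(f"{baseline_days:.0f} days -> {fast_days:.1f} days "
          f"(about {fast_days / 7:.1f} weeks)")
```

Under these assumptions, a two-year job shrinks to roughly 12 days, which matches the "years to weeks" framing.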
Current state-of-the-art structure prediction methods need to search large protein databases to find related sequences. To extract the patterns associated with structure, they require a set of evolutionarily related sequences as input. The language model instead picks up these evolutionary patterns during its training on protein sequences, which enables high-resolution three-dimensional structure prediction directly from a single protein sequence.
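One way to see what kind of pattern a set of related sequences can carry is to measure how strongly two positions vary together. The sketch below computes mutual information between columns of a tiny, made-up alignment. This is a classic co-variation statistic used here purely as an illustration; it is not the method used by ESMFold or by alignment-based predictors, and the alignment and function names are hypothetical.

```python
from collections import Counter
from math import log2

# Hypothetical toy alignment: 6 related sequences, 5 aligned positions.
ALIGNMENT = [
    "AKLDE",
    "AKLDE",
    "ARLEE",
    "ARLEE",
    "AKIDE",
    "ARIEE",
]


def column(alignment, i):
    """Residues found at aligned position i across all sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i, col_j = column(alignment, i), column(alignment, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


if __name__ == "__main__":
    # Columns 1 and 3 always change together (K with D, R with E).
    print(mutual_information(ALIGNMENT, 1, 3))
    # Columns 1 and 2 vary independently of each other in this toy data.
    print(mutual_information(ALIGNMENT, 1, 2))
```

Positions that change together across related sequences are often in contact in the folded protein, which is why alignment-based methods need many related sequences as input. A language model trained on many sequences can absorb such regularities during training instead of requiring an alignment at prediction time.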
AI can give humans a new perspective on biology and help them comprehend the vast breadth of natural diversity. The language of proteins is beyond human comprehension, and even the most sophisticated computational tools have not fully captured it. AI has the potential to help us understand this language. ESMFold, the structure prediction model built on ESM-2, demonstrates how AI can provide new tools for understanding the natural world and reveals connections between different fields. For example, large language models, which drive advances in machine translation, natural language understanding, speech recognition, and image generation, can also learn deep biological knowledge.
According to Meta, work on metagenomics spans several fields, including biology, chemistry, and artificial intelligence, so it is crucial to collaborate, share findings, and draw on the insights of others. The company anticipates that ESM-2 and the ESM Metagenomic Atlas will support researchers working to understand the evolutionary history of diseases and the effects of climate change. Meta AI is also working on extending language models to design new proteins and help address problems in health, disease, and the environment.
This Article is written as a research summary article by Marktechpost Staff based on the research paper 'Evolutionary-scale prediction of atomic level protein structure with a language model'. All Credit For This Research Goes To Researchers on This Project.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing and Web Development. She enjoys learning more about the technical field by participating in several challenges.