Solving the Protein Folding Problem: A Journey from Experiments to AI Algorithms

Image Credit: Google DeepMind AlphaFold/Isomorphic Labs

The Nobel Prize in Chemistry this year demonstrates the rise of the power of computational and AI tools to assist scientists towards greater inventions.

In the early 20th century, Alois Alzheimer, a psychiatrist and neuropathologist, observed abnormal webs and tangles under the microscope in postmortem brain samples of people who had suffered from early-onset memory loss (dementia). He could not, however, identify what they were made of. Over the years, scientists have identified them as clumps of misfolded proteins, and this form of dementia now bears Alzheimer’s name.

Proteins are chains of amino acids, which are among the building blocks of life, so much so that many scientists believe the formation of amino acids on Earth was a significant step towards the origin of life. There are thousands of proteins in the human body that perform diverse functions: name a function, and there is invariably a protein associated with it.

Intriguingly, proteins built as chains of different combinations of a mere 20 amino acids can show this large diversity of functions. It all boils down to the way a protein is folded, or what scientists call its ‘native structure’: how the string of amino acids is arranged in 3D space.

A protein can perform its assigned function only if it is properly folded. A denatured protein (one that has lost its 3D structure, like an open random coil) or a wrongly folded one cannot. Misfolding of proteins is linked to several debilitating conditions such as Parkinson’s disease, Amyotrophic Lateral Sclerosis (ALS), and Alzheimer’s disease, as Alzheimer observed under the microscope.

There are, however, millions of ways in which a protein could fold. Imagine millions of rugged valleys spread over a vast landscape, where the goal is to throw a stone into the deepest valley. Finding the native structure of a protein is a similar challenge, known as the ‘protein folding problem’.

Christian Anfinsen of the National Institutes of Health (NIH) observed in 1961 that denatured proteins can fold back to their original functional state in a matter of seconds. That proteins manage to do so in such a short time, despite the multitude of possibilities, is a paradox, now famously known as Levinthal’s Paradox after Cyrus Levinthal, a scientist at MIT who proposed it in 1968.

These observations suggested that the 3D structure is coded in the sequence itself and that some important physical forces are in play that direct the protein to be folded a certain way, making its most stable state (the native state) easily accessible, rather than searching for the stable state randomly.
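Levinthal’s Paradox can be made concrete with a back-of-the-envelope calculation. The sketch below is only illustrative: the chain length, the number of conformations per amino acid, and the sampling rate are all assumed round numbers chosen for the example, not measured values.

```python
# Back-of-the-envelope estimate behind Levinthal's Paradox.
# Every number below is an illustrative assumption, not a measurement.

N_RESIDUES = 100             # assumed length of the protein chain
STATES_PER_RESIDUE = 3       # assumed conformations available to each residue
SAMPLES_PER_SECOND = 1e13    # assumed rate at which conformations are tried
SECONDS_PER_YEAR = 3.15e7
AGE_OF_UNIVERSE_YEARS = 1.38e10


def random_search_years(n_residues, states, rate):
    """Years needed to try every conformation once at the given rate."""
    total_conformations = states ** n_residues
    return total_conformations / rate / SECONDS_PER_YEAR


years = random_search_years(N_RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
print(f"Exhaustive random search: about {years:.1e} years")
print(f"Ratio to the age of the universe: {years / AGE_OF_UNIVERSE_YEARS:.1e}")
```

Under these assumptions, a purely random search would take many orders of magnitude longer than the age of the universe, while real proteins fold in seconds. That gap is the paradox, and it is why the folding process must be guided rather than random.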

Protein Structure: The History

Scientists have been working on identifying the structure of proteins since the 1930s. The landmark discovery came when John Kendrew and Max Perutz worked out the structures of myoglobin and hemoglobin, the oxygen-storing and oxygen-carrying proteins respectively, using X-ray crystallography, which is much like photographing the atoms of a protein with X-rays. Identifying a protein structure, though, is not an easy task. It requires weeks or months of painstaking experiments and analysis.

In the early days of crystallography, scientists would spend years trying to crystallize proteins to study their structures, and many proteins simply couldn’t be crystallized. Several more months were required to analyze the experimental outcomes and come up with a sensible structure. Over the years, the field of structural biology evolved, with researchers finding the structures of more and more proteins. More advanced tools, including cryo-electron microscopy and NMR, came to be used routinely to study protein structure.

John Kendrew (left) and Max Perutz with their model. Credit: Medical Research Council Laboratory of Molecular Biology, UK

Dr. Mohd Taher, a postdoctoral researcher working on proteins and enzymes in the Department of Chemistry, University of Illinois, Urbana-Champaign, USA, says, “Seeing is believing”. Although researchers succeeded in identifying protein structures, one question remained largely unanswered: given a sequence of amino acids, is it possible to predict the native structure?

This quest inspired a group of structural biologists to start CASP (Critical Assessment of Protein Structure Prediction), a friendly competition held every two years to speed up progress in the field. Participants use their models to predict the structures of proteins whose structures are not yet publicly available.

Some of the earlier winners based their predictions on the physicochemical properties of amino acids, modeling how the interactions between them direct the formation of the 3D structure. Others looked at several related proteins to find patterns in how similarly coded regions fold.

Yet others looked at amino acids that mutated together during evolution and postulated that if they changed together, they should lie close to, and influence, one another in the folded state. The success of these predictions, however, remained poor, mostly with less than 50 percent accuracy.
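The co-evolution idea can be illustrated with a toy calculation. The sketch below scores every pair of positions in a small, made-up alignment of related sequences using mutual information, a simple statistical measure of how strongly two positions vary together. It is a simplified stand-in for illustration, not the method used by any particular CASP team, and the sequences are invented.

```python
import math
from collections import Counter
from itertools import combinations

# Toy alignment of related sequences, invented for illustration.
# Positions 1 and 3 (zero-based) were written so that they change together.
ALIGNMENT = [
    "AKLDE",
    "AKLDE",
    "ARLEE",
    "ARLEE",
    "AKIDE",
    "ARIEE",
]


def column(alignment, i):
    """Return the amino acids found at position i across all sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def rank_column_pairs(alignment):
    """Score every pair of positions and sort from most to least coupled."""
    length = len(alignment[0])
    scores = {
        (i, j): mutual_information(column(alignment, i), column(alignment, j))
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for (i, j), score in rank_column_pairs(ALIGNMENT)[:3]:
    print(f"positions {i} and {j}: MI = {score:.2f} bits")
```

In this toy alignment, positions 1 and 3 come out on top because a K at one always pairs with a D at the other, and an R always pairs with an E. Following the reasoning above, such a pair would be a candidate for two amino acids that sit close together in the folded protein.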

A game-changer in the protein folding problem

The CASP competition of 2020, however, was a game-changer in the field of structure prediction. John Jumper, Demis Hassabis, and their team at Google DeepMind showcased AlphaFold2, an algorithm built on improved deep learning methods that used “transformers” to learn from hundreds of thousands of known protein structures and applied this learning to predict the structure of a new protein.

The earlier version, AlphaFold1, presented at CASP in 2018 and based on convolutional neural networks, had placed among the top five competitors. In 2020, DeepMind’s new algorithm outshone the other competitors by a large margin. The CASP jury was in for a surprise when they saw the results: AlphaFold2 produced structures that were more than 90 percent accurate on the tested proteins.

AlphaFold 2 performance, experiments, and architecture. Credit: Wikimedia

The team said they had designed novel “training procedures based on the evolutionary, physical and geometric constraints of protein structures.” In the study published in Nature, they describe the structure of the neural network used to train AlphaFold.

“The complex layers of neural networks succeeded in learning the outcomes of the physical processes of protein folding, capturing effects such as the propensity of some amino acids to form certain shapes, like alpha helix and beta-sheets, and the interactions of amino-acids with the surrounding environment (water and other amino acids)”, says Taher.

AlphaFold managed to predict the structure of a protein from its amino acid sequence in mere minutes, compared with experiments that took several months. “AlphaFold however, cannot replace experiments. The final validation requires an experimental structure determination”, says Dr. Natesh Ramanathan, Associate Professor in the School of Biology and Center for High-Performance Computing (CHPC), Indian Institute of Science Education and Research, Thiruvananthapuram, India.

Talking about the significance of prediction tools in protein research, Dr. Natesh said, “These computational tools help in speeding up experimental identification of protein structures, allowing researchers to focus on more advanced problems.”

Is AlphaFold memorizing instead of learning?

The success of AlphaFold in the accurate prediction of protein structures is no doubt one of the best examples of the AI revolution in science. However, there is still scope for improvement. A recent case study by a team at NIH, Bethesda, USA showed that AlphaFold fails to predict the structures of proteins that can switch shapes as part of their function.

They showed evidence that the algorithm at some point had started to memorize the patterns rather than learning them, leading to incorrect predictions for more complicated structures. Dr. Natesh says, “As is the case for any method of bioinformatics, AlphaFold too is only as good as the database it is trained on.”

However, when scientists rely on increasingly sophisticated computational tools to predict protein structure, there is also a downside: it is quite difficult to work out which factors contribute most to the final result. The AI algorithm works as a black box that spits out protein structures, leaving researchers still wondering what led to a given structure.

Digitally rendered image of a protein structure prediction by AlphaFold. Credits: DeepMind

It is also not clear whether the models have learned some new physics that humans have not yet figured out. It is an interesting question, since machine learning algorithms are designed to identify patterns that might be invisible to humans. This might be the case, but it is difficult to tease this information out of the models.

While researchers can now predict more accurate structures, the fundamental questions remain: what is the complete physics underlying protein folding, and how does the process happen so fast? According to Dr. Natesh, “In the A to Z of protein folding problem, steps B to Y are still unsolved”. But for a great many important applications, one can work with the output structure without worrying much about how the algorithms zeroed in on it.

From prediction to design

While many were interested in solving the protein folding problem, David Baker of the Institute for Protein Design, University of Washington, wished to go a step further. A regular participant in CASP, Baker had been working on protein structure prediction, developing an algorithm called Rosetta that predicts structure by modeling the interactions between amino acids.

He had an idea: why not use the existing knowledge of how proteins prefer to fold to design a completely new protein, a string of amino acids that would fold into a shape suited to a specified function? This is essentially the reverse of the problem that AlphaFold addresses.

This problem is considerably different from protein engineering, which has been around for a while and involves modifying existing proteins to improve their efficiency or perform new functions. Other research groups had taken smaller steps in this direction by the end of the 1980s, making short strings of amino acids called peptides, inspired by naturally occurring proteins.

The arrangements were predicted taking into account that some amino acids are hydrophobic (they stay away from water) while others are hydrophilic (they like to interact with water). But it was David Baker’s group that, in 2003, succeeded in the remarkable feat of computationally designing an entirely new protein whose structure and amino acid sequence bore no similarity to those of known proteins.

“This was something that was never achieved before”, said Dr. Natesh. “Not only did they design a protein made of 93 amino acids (now called Top7) ‘de-novo’ (meaning anew) using computational tools, but they validated it using crystallographic techniques.”

This had profound implications in many different fields, including medicine, health, and biotechnology. A group at the Institute for Protein Design used computational protein design to develop a vaccine against the SARS-CoV-2 virus. “It’s exciting”, says David Baker in an interview on the Nobel Prize website.

Way Forward

Both these feats, which jointly won the Nobel Prize in Chemistry this year, demonstrate the rise of the power of computational and AI tools to assist scientists towards greater inventions. Together, they have opened new avenues for innumerable applications.

Timelines have been compressed drastically with AlphaFold. Designing proteins for different functions, ranging from medicines to molecules that catalyze difficult reactions, be it capturing methane or carbon dioxide from the atmosphere or helping break down plastics, could lead to sustainable solutions.

However, protein structure, albeit significant, is not the only piece needed to address real-life problems. To create a new drug, information is required on how the drug interacts with a protein. One also needs to understand the behavior of proteins in the more complex environment of living cells. Scientists are already working on tackling these challenges, one step at a time.
