Introduction
If you need some background information about protein folding, refer to Introduction to Protein Folding.
• There are three pillars in protein science: sequence, structure, and function.
◦ Sequence: NGS (next-generation sequencing) made it possible to acquire the protein sequences of entire genomes across thousands of species.
◦ Structure: recent AI models (such as AlphaFold, RoseTTAFold, …) utilized PDB data to predict 3D protein structures with great accuracy in a reasonable amount of time.
◦ Function: still lacks methods that are both highly accurate and high-throughput.
• Biological "function" of a protein is a highly abstract, qualitative concept, but some aspects of it can be objectively measured.
For example,
i. What conformational states can a certain protein be in?
ii. Which other molecules can a protein bind to in these different conformations?
iii. What is the probability of these conformational and binding states under specific conditions?
Previous methods
• Cryo-EM (Cryo-Electron Microscopy)
◦ Can resolve multiple conformational states with their probabilities
◦ But costly and time-consuming
• MD (Molecular Dynamics) simulation
◦ Can explore dynamics with molecular force fields.
◦ But requires enormous computational cost (even supercomputers can only handle small proteins), and force fields are far from perfect.
→ Available technologies can be accurate, but are not scalable.
About BioEmu
A scalable generative model that can "emulate" protein equilibrium ensembles.
• First posted on December 5, 2024 (bioRxiv)
• Work of the AI for Science group at Microsoft Research
• BioEmu can approximately sample protein conformations within a few GPU-hours per experiment → a high-throughput and accurate biomolecular structure emulator.
What BioEmu can do
1. Predict protein conformational changes
• Large domain motions
• Local unfolding
• Find cryptic binding pockets
2. Emulate equilibrium distributions
3. Predict experimentally measured stabilities of folded states
What BioEmu cannot do
1. BioEmu cannot deal with multimeric proteins
2. BioEmu cannot deal with other biomolecule types (ligands, carbohydrates, nucleotides, lipids, …)
3. BioEmu cannot deal with varying thermodynamic conditions (only fitted at 300 K)
Model Architecture
Looks like a chimera of AF2 & AF3
Fig 1b. ML model architecture consisting of protein sequence encoder and denoising diffusion model
Protein sequence encoder
• For each input protein sequence, single & pair representations are pre-computed once with the pre-trained AF2 Evoformer and stored for fast retrieval.
• These single & pair representations are fed into the diffusion model as conditioning.
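The encoder step is essentially an embed-once, reuse-many-times cache. Below is a minimal caching sketch of that pattern (my illustration, not the authors' code): run_af2_evoformer is a hypothetical stand-in for whatever wrapper exposes the frozen AF2 Evoformer, and the representation sizes (384 / 128) are the AF2 defaults assumed here.

```python
# Minimal embed-once / reuse-many-times cache (illustrative only).
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path("embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def run_af2_evoformer(sequence: str):
    """Hypothetical placeholder for the frozen AF2 Evoformer.

    Returns single (L, c_s) and pair (L, L, c_z) representations."""
    L = len(sequence)
    return np.zeros((L, 384), dtype=np.float32), np.zeros((L, L, 128), dtype=np.float32)

def get_embeddings(sequence: str):
    key = hashlib.sha256(sequence.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.npz"
    if path.exists():                                # fast retrieval on reuse
        data = np.load(path)
        return data["single"], data["pair"]
    single, pair = run_af2_evoformer(sequence)       # expensive, done only once
    np.savez_compressed(path, single=single, pair=pair)
    return single, pair

single, pair = get_embeddings("GYDPETGTWG")          # chignolin-like decapeptide
```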
Coarse-grained protein structure representation
• BioEmu models only five backbone heavy atoms per residue (no side-chain or hydrogen atoms).
• Similarly to AF2, each amino acid residue is represented as a rigid triangular frame with an idealized backbone-atom position matrix (example: alanine).
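To make the frame idea concrete, here is a small numpy sketch (my illustration, not BioEmu code) of the AF2-style construction: a rotation and translation are built from the N, CA, C atoms by Gram–Schmidt, and idealized local atom coordinates are mapped into global space. The local coordinates below are rough illustrative numbers, not the exact ideal-geometry values.

```python
import numpy as np

def frame_from_backbone(n, ca, c):
    """Build a rigid frame (rotation R, translation t) from the N, CA, C atoms."""
    v1 = c - ca
    v2 = n - ca
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - np.dot(v2, e1) * e1       # Gram-Schmidt orthogonalization
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3], axis=-1), ca   # columns of R are the basis vectors

# Rough idealized local coordinates (Angstrom) for the five modeled heavy atoms.
IDEAL_LOCAL = {
    "N":  np.array([-0.52,  1.36,  0.00]),
    "CA": np.array([ 0.00,  0.00,  0.00]),
    "C":  np.array([ 1.52,  0.00,  0.00]),
    "O":  np.array([ 2.15,  1.06,  0.00]),
    "CB": np.array([-0.53, -0.77, -1.21]),
}

def atoms_from_frame(R, t):
    """Map idealized local coordinates into global space: x_global = R @ x_local + t."""
    return {name: R @ x + t for name, x in IDEAL_LOCAL.items()}

R, t = frame_from_backbone(n=np.array([1.46, 0.0, 0.0]),
                           ca=np.array([0.0, 0.0, 0.0]),
                           c=np.array([-0.55, 1.42, 0.0]))
print(atoms_from_frame(R, t)["CB"])
```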
Conditional diffusion model
• BioEmu generates (samples) protein structures conditioned on the sequence representation.
• Forward diffusion process: noise is progressively added to the structure variables, and the model is trained to reverse this process (denoising).
• The diffusion module can be parallelized across a batch of random seeds.
• Authors used 100 denoising steps (see the toy sampling sketch under Algorithm below).
• Score model
Fig 1c. Architecture of the score model used in the denoising diffusion model
◦ The score model is the essential part of the diffusion module; it predicts a translation score and a rotation score for each residue.
◦ Uses the IPA (invariant point attention) operation.
→ The updates to atom positions are equivariant under rotation and translation of the whole structure.
Algorithm
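Putting the pieces above together, here is a toy version of the sampling algorithm (a sketch under stated assumptions, not BioEmu's actual sampler, noise schedule, or SO(3) handling): a batch of structures is initialized from noise and refined over 100 denoising steps. score_model is a trivial placeholder for the IPA-based network conditioned on the single/pair representations; rotation scores are omitted for brevity.

```python
# Illustrative reverse-diffusion sampling loop (not BioEmu's actual sampler).
import numpy as np

N_STEPS = 100     # the authors report 100 denoising steps
BATCH = 8         # independent random seeds sampled in parallel
L = 50            # number of residues

def score_model(x_t, t, single, pair):
    """Placeholder score: pulls positions toward the origin; stands in for the
    learned translation score (rotation scores handled analogously)."""
    return -x_t

def sample(single, pair, n_steps=N_STEPS, batch=BATCH):
    rng = np.random.default_rng(0)
    x = rng.normal(size=(batch, L, 3))             # start from pure noise
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = 1.0 - step * dt                        # time runs from 1 -> 0
        score = score_model(x, t, single, pair)
        noise = rng.normal(size=x.shape) if step < n_steps - 1 else 0.0
        # Simple Euler-Maruyama-style update (illustrative only).
        x = x + score * dt + np.sqrt(dt) * 0.1 * noise
    return x                                       # (batch, L, 3) CA coordinates

samples = sample(single=None, pair=None)
print(samples.shape)
```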
Dataset composition 
Fig 1e. Data processing pipeline for pretraining
1.
AlphaFoldDB
Purpose: To pretrain BioEmu, encouraging protein conformational diversity within each sequence cluster.
AFDB snapshot was downloaded in July 2024.
Authors performed some preprocessing to identify sets of similar sequences with heterogeneous predicted structures.
2.
PDB (Protein Data Bank)
Purpose: To compare the structural diversity performance of BioEmu (comparison with the model trained with AFDB).
PDB snapshot was downloaded on Nov. 23, 2023.
3.
Molecular Dynamics simulation data
Purpose: To fine-tune BioEmu to cover vast conformational diversity.
a.
In-house MD dataset
Authors internally built an in-house MD dataset specifically for BioEmu under certain conditions (temperature, solvent, pressure, …).
Below is the list of in-house MD datasets:
• Octapeptides, CATH1, CATH2, MEGAsim, Complexin
b.
Public MD dataset
Authors also exploited the following public MD datasets:
• DESRES fast-folding proteins, DDR1, SETD8, SARS-CoV-2 exascale, SARS-CoV-2 non-exascale, MHC2 peptide simulations, Barnase-Barstar
4.
Experimental thermodynamics data
Purpose: To fine-tune BioEmu to predict the stability of the folded state.
Authors extracted the ΔG ("dG_ML") and ΔΔG ("ddG_ML") values and the corresponding amino-acid sequences for wild types and mutants from the curated set of the MEGAscale dataset.
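A hedged sketch of pulling these stability labels from a MEGAscale-style table follows. Only the dG_ML / ddG_ML column names come from the notes above; the other column names (aa_seq, is_wildtype) and the tiny inline table are stand-ins for the real curated CSV, which would be loaded with pd.read_csv in practice.

```python
import pandas as pd

# Stand-in table; in practice: df = pd.read_csv("<curated MEGAscale file>").
df = pd.DataFrame({
    "aa_seq":      ["MKTAYIAK", "MKTAYIAR", "GSSGSSGS"],
    "is_wildtype": [True, False, True],
    "dG_ML":       [3.2, 2.1, None],    # free-energy label, kcal/mol
    "ddG_ML":      [None, -1.1, None],  # stability change relative to wild type
})

labels = df.dropna(subset=["dG_ML"])                  # keep rows with usable labels
wildtypes = labels[labels["is_wildtype"]]
mutants = labels[~labels["is_wildtype"]]
print(len(wildtypes), "wild-type and", len(mutants), "mutant sequences with dG_ML")
```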
Training scheme 
Fig 1d. Data integration and model training pipeline
1.
Pretraining on AFDB
• Starting from the pre-trained, weight-frozen sequence encoder of AF2, the structure module (diffusion model) is trained from scratch.
• AFDB was processed to have high sequence diversity and varied conformations for each sequence.
• The pretrained model itself can already predict diverse conformations, but does not quantitatively match the probabilities of the different states → the major reason for fine-tuning!
2.
Fine-tuning on MD & Experimental thermodynamics data 
i.
Fine-tuning with CHARMM22* force-field MD data (DESRES-fastfolders dataset)
• The best pretrained model was further fine-tuned on the DESRES-fastfolders dataset; the performance is visualized in Fast-folding proteins below.
ii.
Fine-tuning with Amber force field MD data (other MD datasets) and experimental folding free energies
• The best pretrained model was fine-tuned on the Amber MD datasets, together with additional experimental thermodynamics data when the system came from the MEGAscale dataset.
• To retain the pretraining performance, 5% of the training set was filled with randomly selected AFDB data.
• Reweighting MD with Markov models and experimental data
Reason for reweighting MD data:
◦ MD simulations are too short to represent the whole conformational space
◦ The data distribution generated by MD is often biased towards the seeding structure
→ Each data point needs to be reweighted to guide the model to generate conformations according to the equilibrium distribution.
1. MSM (Markov state models) reweighting for the small ONE-octapeptide dataset
Too much detail → read suppl. S.3.5.1
2. Reweighting with experimental folding free energies
• A subset of the data has folding free energy (ΔG) annotations; ΔG is related to the probability of the folded state under the Boltzmann distribution.
• Relationship between the probability of being in the folded state, p_fold, and ΔG: ΔG = k_B T · ln((1 − p_fold) / p_fold)
• p_fold can be expressed as the expectation of a foldedness function χ(x) over the ensemble: p_fold = E_x[χ(x)]
Authors took the form of χ(x) as: χ(x) = Θ(FNC(x) − c)
• Θ: Heaviside step function
• FNC(x): the fraction of native contacts
• c: a threshold on the FNC separating folded (FNC near 1) from unfolded (FNC near 0) samples
• Since the distribution of FNC for each system is generally separated into peaks near 1 and 0, authors used a kernel density estimate to smooth the distribution.
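To make the ΔG ↔ p_fold relationship concrete, here is a small numpy sketch (my illustration; the 8 Å contact definition, the minimum sequence separation, and the threshold c = 0.5 are assumed values rather than the paper's exact choices, and the kernel-density smoothing step is omitted).

```python
import numpy as np

KT = 0.593            # k_B*T in kcal/mol at ~300 K
CONTACT_CUTOFF = 8.0  # Angstrom, assumed CA-CA contact definition
THRESHOLD_C = 0.5     # assumed foldedness threshold on FNC

def native_contacts(ca_native, cutoff=CONTACT_CUTOFF, min_seq_sep=3):
    """Residue pairs in contact in the native (reference folded) structure."""
    d = np.linalg.norm(ca_native[:, None] - ca_native[None, :], axis=-1)
    i, j = np.triu_indices(len(ca_native), k=min_seq_sep)
    mask = d[i, j] < cutoff
    return i[mask], j[mask]

def fnc(ca_sample, contacts, cutoff=CONTACT_CUTOFF):
    """Fraction of native contacts preserved in one sampled structure."""
    i, j = contacts
    d = np.linalg.norm(ca_sample[i] - ca_sample[j], axis=-1)
    return float(np.mean(d < cutoff))

def delta_g(ca_samples, ca_native):
    """Estimate the folding free energy from an ensemble of sampled structures."""
    contacts = native_contacts(ca_native)
    chi = np.array([fnc(x, contacts) > THRESHOLD_C for x in ca_samples])  # Heaviside
    p_fold = np.clip(chi.mean(), 1e-3, 1 - 1e-3)        # avoid log(0)
    return KT * np.log((1 - p_fold) / p_fold)            # kcal/mol

# Toy usage: 100 "samples" that are small perturbations of a fake native structure.
rng = np.random.default_rng(0)
ca_native = rng.normal(size=(30, 3)) * 5.0
samples = ca_native + rng.normal(scale=0.3, size=(100, 30, 3))
print(delta_g(samples, ca_native))
```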
• Property prediction fine-tuning (PPFT)
Fig 1f. Experimental property training for finetuning
◦ Although the reweighting guided the model toward the equilibrium distribution, authors also trained the model to predict experimental folding free energies directly, with a novel training scheme.
◦ Purpose:
▪ Aim for faster convergence (especially where unfolded states are rare) without too much computational cost
◦ Loss term: L = (χ(x₁) − p_fold) · (χ(x₂) − p_fold),
where x₁ and x₂ are two i.i.d. samples generated for the same protein sequence, and the target p_fold is computed from the experimental ΔG as p_fold = 1 / (1 + exp(ΔG / k_B T)).
The above loss detours the mode-collapse issue: it is an unbiased estimate of (E[χ] − p_fold)², so the model is trained to match the ensemble probability rather than pushing every individual sample toward the same foldedness.
To enable backpropagation, authors replaced the hard step in the foldedness definition with a smooth, differentiable approximation (a PyTorch sketch follows below).
◦ During PPFT, the diffusion model denoises the structure with fewer (8) denoising steps (similar to the mini-rollouts in AF3).
Authors argue that "foldedness" is a coarse-grained feature of a protein and can be predicted without full diffusion rollouts.
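Here is how that two-sample loss could look in PyTorch (my reading of the scheme above, not the authors' implementation; the sigmoid sharpness beta, the threshold c, and the exact form of the target are assumptions). fnc_1 and fnc_2 are the fractions of native contacts of the two i.i.d. samples x₁ and x₂ from short 8-step rollouts.

```python
import torch

KT = 0.593  # k_B*T in kcal/mol at ~300 K

def soft_foldedness(fnc, c=0.5, beta=20.0):
    """Differentiable replacement for the Heaviside step Theta(FNC - c)."""
    return torch.sigmoid(beta * (fnc - c))

def ppft_loss(fnc_1, fnc_2, dg_exp):
    """fnc_1, fnc_2: FNC of two i.i.d. samples for the same sequence;
    dg_exp: experimental folding free energy (kcal/mol)."""
    p_target = 1.0 / (1.0 + torch.exp(dg_exp / KT))   # Boltzmann-derived target
    chi1 = soft_foldedness(fnc_1)
    chi2 = soft_foldedness(fnc_2)
    # Product of residuals: an unbiased estimate of (E[chi] - p_target)^2, so the
    # gradient pushes the *ensemble* probability toward the target instead of
    # collapsing every sample onto the same intermediate foldedness.
    return (chi1 - p_target) * (chi2 - p_target)

loss = ppft_loss(torch.tensor(0.9, requires_grad=True),
                 torch.tensor(0.2, requires_grad=True),
                 torch.tensor(-1.0))
loss.backward()
```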
BioEmu Results
Sampling conformational changes related to protein function
Domain motions
Fig 2a. Large-scale domain motions (opening/closing, rotation, repacking)
• Left column: coverage (= % of reference structures that are sampled by at least 0.1% of samples within a given distance of the respective metric; a small sketch of this metric follows below)
• i–iii) RMSD to reference PDBs
→ BioEmu predicts 85% of the reference experimental structures with ≤ 3 Å RMSD
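A minimal sketch of the coverage metric used in the left-hand panels (my illustration; it assumes an RMSD matrix between every sample and every reference structure has already been computed; the 0.1% sample fraction and 3 Å cutoff follow the notes above).

```python
import numpy as np

def coverage(rmsd, cutoff=3.0, min_fraction=0.001):
    """rmsd: array of shape (n_samples, n_references).

    A reference counts as "covered" if at least `min_fraction` of samples lie
    within `cutoff` of it; returns the percentage of covered references."""
    frac_close = (rmsd <= cutoff).mean(axis=0)          # per-reference fraction
    return 100.0 * (frac_close >= min_fraction).mean()

# Toy usage with random stand-in RMSD values.
rmsd = np.random.default_rng(0).uniform(0.5, 10.0, size=(10_000, 4))
print(f"coverage = {coverage(rmsd):.1f}%")
```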
Local unfolding
Fig 2b. Local unfolding or unbinding of parts of the protein
• Left column: coverage (as defined above)
• i–iii) Fraction of native contacts and its free energy
Fraction of native contacts = fraction of the contacts present in a reference folded structure
→ BioEmu predicts the local unfolding transitions (overall 72% of locally folded and 74% of locally unfolded states)
Cryptic pockets
Fig 2c. Formation of cryptic binding pockets that are not present in the apo ground state.
• Left column: coverage (as defined above)
• i–iii) RMSD to reference PDBs
→ BioEmu showed a strong preference for holo states and predicted the cryptic pocket in 85% of cases, while it succeeded in predicting only 49% of the apo structures.
Emulating MD equilibrium distributions
Fast-folding proteins
Fig 3a. Fast-folding proteins simulated on the DESRES Anton supercomputer, compared with output from BioEmu fine-tuned on the DESRES fast-folder dataset excluding the test protein.
i.
From left to right,
• Folded and partially unfolded structures predicted by BioEmu (green) and ground-truth MD (gray)
• Free-energy surfaces (in kcal/mol) of ground-truth MD and BioEmu
• Secondary-structure content compared over the whole ensemble of structures
→ BioEmu shows very similar folding patterns and free-energy / secondary-structure landscapes.
ii.
Computational cost (in GPU hours) for MD (magenta: full DESRES dataset; yellow: single folding-unfolding roundtrip) and 10k samples from BioEmu (cyan)
→ BioEmu requires significantly lower cost
iii.
MAE (mean absolute error) of free-energy differences of macrostates, and fraction of unphysical model samples due to clashes
→ Fine-tuning helped to reduce the error
CATH domains
Fig 3b. CATH domains results
i.
From left to right,
• Folded and partially unfolded structures predicted by BioEmu (green) and ground-truth MD (gray); structurally flexible motifs are color-coded (cyan: helical; magenta: sheet)
• Free-energy surfaces (in kcal/mol) of ground-truth MD and BioEmu
• Secondary-structure content compared over the whole ensemble of structures
→ BioEmu covers most regions of the MD simulation space
ii.
MAE (mean absolute error) of free-energy differences of macrostates, and fraction of unphysical model samples due to clashes
iii.
Macrostate free-energy MAE and state coverage as a function of training data size, for a specialized CATH-only model
→ MAE and state coverage improve with more training data.
Predicting protein stabilities
With PPFT (property prediction fine-tuning), BioEmu was trained to predict protein stability in the form of "foldedness".
Prediction errors were measured in terms of folding free energy, classifying sampled structures as folded or unfolded based on their fraction of native contacts.
Fig 4a. Comparison of experimental measurements of folding free energies with model predictions
→ BioEmu achieved a mean absolute error below 0.8 kcal/mol and a Spearman correlation coefficient above 0.65 for proteins in the MEGAscale dataset (a small evaluation sketch follows at the end of this section).
Fig 4b. Validation that very stable proteins are consistently predicted as folded.
Fig 4c. Validation that intrinsically disordered proteins (IDPs) are predicted as unfolded.
Fig 4d. Analysis of the effect of three destabilizing mutants on the folded structures as predicted by the model
→ Certain effects of mutations can be analyzed with BioEmu's predictions
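For reference, the Fig 4a-style evaluation boils down to a mean absolute error and a Spearman rank correlation between predicted and experimental folding free energies. A tiny sketch with stand-in numbers (not the paper's data):

```python
import numpy as np
from scipy.stats import spearmanr

dg_exp = np.array([-2.1, -0.5, 1.3, 3.0, 4.2])    # experimental values, kcal/mol
dg_pred = np.array([-1.6, -0.2, 0.9, 2.4, 4.8])   # model predictions, kcal/mol

mae = np.mean(np.abs(dg_pred - dg_exp))            # mean absolute error
rho, _ = spearmanr(dg_pred, dg_exp)                # rank correlation
print(f"MAE = {mae:.2f} kcal/mol, Spearman rho = {rho:.2f}")
```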
Takeaways
• BioEmu is a generative ML system to approximately sample the equilibrium distributions of proteins.
• BioEmu was trained with AFDB, MD simulation structures, and some experimental properties (folding free energies).
• Authors utilized a novel training scheme called PPFT (property prediction fine-tuning).
• BioEmu and MD simulation are complementary: BioEmu needs, and can be improved with, more MD simulation data.
• The current version of BioEmu can only deal with single protein chains at a fixed thermodynamic condition.