Introduction
If you need some background information about protein folding, refer to Introduction to Protein Folding.
• There are three pillars in protein science: sequence, structure, and function.
◦ Sequence: NGS (next-generation sequencing) made it possible to acquire the protein sequences of entire genomes across thousands of species.
◦ Structure: recent AI models (such as AlphaFold, RoseTTAFold, …) utilized PDB data to predict 3D protein structures with great accuracy in a reasonable amount of time.
◦ Function: still lacks methods that are both highly accurate and high-throughput.
• Biological "function" of a protein is a highly abstract, qualitative concept, but some aspects of it can be objectively measured.
For example,
i. What conformational states can a certain protein be in?
ii. Which other molecules can a protein bind to in these different conformations?
iii. What is the probability of these conformational and binding states under specific conditions?
Previous methods
• Cryo-EM (Cryo-Electron Microscopy)
◦ Can resolve multiple conformational states with their probabilities
◦ But costly and time-consuming
• MD (Molecular Dynamics) simulation
◦ Can explore dynamics with molecular force fields.
◦ But requires enormous computational cost (even supercomputers can only handle small proteins), and force fields are far from perfect.
→ Available technologies can be accurate, but are not scalable.
About BioEmu
A scalable generative model that can "emulate" protein equilibrium ensembles.
• First posted on December 5, 2024 (bioRxiv)
• Work of the AI for Science group at Microsoft Research
• BioEmu can approximately sample protein conformations within a few GPU-hours per experiment → a high-throughput and accurate biomolecular structure emulator.
What BioEmu can do
1. Predict protein conformational changes
• Large domain motions
• Local unfolding
• Find cryptic binding pockets
2. Emulate equilibrium distributions
3. Predict experimentally measured stabilities of folded states
What BioEmu cannot do
1. BioEmu cannot deal with multimeric proteins
2. BioEmu cannot deal with other biomolecule types (ligands, carbohydrates, nucleotides, lipids, …)
3. BioEmu cannot deal with varying thermodynamic conditions (only fitted at 300 K)
Model Architecture
Looks like a chimera of AF2 & AF3
Fig 1b. ML model architecture consisting of protein sequence encoder and denoising diffusion model
Protein sequence encoder
• For each input protein sequence, single & pair representations are pre-computed once with the pre-trained AF2 Evoformer and stored for fast retrieval.
• These single & pair representations are fed into the diffusion model as conditioning.
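The encoder step is essentially an embed-once, reuse-many-times cache. Below is a minimal caching sketch of that pattern (my illustration, not the authors' code): run_af2_evoformer is a hypothetical stand-in for whatever wrapper exposes the frozen AF2 Evoformer, and the representation sizes (384 / 128) are the AF2 defaults assumed here.

```python
# Minimal embed-once / reuse-many-times cache (illustrative only).
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path("embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def run_af2_evoformer(sequence: str):
    """Hypothetical placeholder for the frozen AF2 Evoformer.

    Returns single (L, c_s) and pair (L, L, c_z) representations."""
    L = len(sequence)
    return np.zeros((L, 384), dtype=np.float32), np.zeros((L, L, 128), dtype=np.float32)

def get_embeddings(sequence: str):
    key = hashlib.sha256(sequence.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.npz"
    if path.exists():                                # fast retrieval on reuse
        data = np.load(path)
        return data["single"], data["pair"]
    single, pair = run_af2_evoformer(sequence)       # expensive, done only once
    np.savez_compressed(path, single=single, pair=pair)
    return single, pair

single, pair = get_embeddings("GYDPETGTWG")          # chignolin-like decapeptide
```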
Coarse-grained protein structure representation
• BioEmu models only five backbone heavy atoms per residue (no side-chain or hydrogen atoms).
• Similarly to AF2, each amino acid residue is represented as a rigid triangular frame with an idealized backbone-atom position matrix (example: alanine).
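To make the frame idea concrete, here is a small numpy sketch (my illustration, not BioEmu code) of the AF2-style construction: a rotation and translation are built from the N, CA, C atoms by Gram–Schmidt, and idealized local atom coordinates are mapped into global space. The local coordinates below are rough illustrative numbers, not the exact ideal-geometry values.

```python
import numpy as np

def frame_from_backbone(n, ca, c):
    """Build a rigid frame (rotation R, translation t) from the N, CA, C atoms."""
    v1 = c - ca
    v2 = n - ca
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - np.dot(v2, e1) * e1       # Gram-Schmidt orthogonalization
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3], axis=-1), ca   # columns of R are the basis vectors

# Rough idealized local coordinates (Angstrom) for the five modeled heavy atoms.
IDEAL_LOCAL = {
    "N":  np.array([-0.52,  1.36,  0.00]),
    "CA": np.array([ 0.00,  0.00,  0.00]),
    "C":  np.array([ 1.52,  0.00,  0.00]),
    "O":  np.array([ 2.15,  1.06,  0.00]),
    "CB": np.array([-0.53, -0.77, -1.21]),
}

def atoms_from_frame(R, t):
    """Map idealized local coordinates into global space: x_global = R @ x_local + t."""
    return {name: R @ x + t for name, x in IDEAL_LOCAL.items()}

R, t = frame_from_backbone(n=np.array([1.46, 0.0, 0.0]),
                           ca=np.array([0.0, 0.0, 0.0]),
                           c=np.array([-0.55, 1.42, 0.0]))
print(atoms_from_frame(R, t)["CB"])
```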
Conditional diffusion model
• BioEmu generates (samples) protein structures conditioned on the sequence representation.
• Forward diffusion process: noise is progressively added to the structure variables, and the model is trained to reverse this process (denoising).
• The diffusion module can be parallelized across a batch of random seeds.
• Authors used 100 denoising steps (see the toy sampling sketch under Algorithm below).
• Score model
Fig 1c. Architecture of the score model used in the denoising diffusion model
◦ The score model is the essential part of the diffusion module; it predicts a translation score and a rotation score for each residue.
◦ Uses the IPA (invariant point attention) operation.
→ The updates to atom positions are equivariant under rotation and translation of the whole structure.
Algorithm
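Putting the pieces above together, here is a toy version of the sampling algorithm (a sketch under stated assumptions, not BioEmu's actual sampler, noise schedule, or SO(3) handling): a batch of structures is initialized from noise and refined over 100 denoising steps. score_model is a trivial placeholder for the IPA-based network conditioned on the single/pair representations; rotation scores are omitted for brevity.

```python
# Illustrative reverse-diffusion sampling loop (not BioEmu's actual sampler).
import numpy as np

N_STEPS = 100     # the authors report 100 denoising steps
BATCH = 8         # independent random seeds sampled in parallel
L = 50            # number of residues

def score_model(x_t, t, single, pair):
    """Placeholder score: pulls positions toward the origin; stands in for the
    learned translation score (rotation scores handled analogously)."""
    return -x_t

def sample(single, pair, n_steps=N_STEPS, batch=BATCH):
    rng = np.random.default_rng(0)
    x = rng.normal(size=(batch, L, 3))             # start from pure noise
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = 1.0 - step * dt                        # time runs from 1 -> 0
        score = score_model(x, t, single, pair)
        noise = rng.normal(size=x.shape) if step < n_steps - 1 else 0.0
        # Simple Euler-Maruyama-style update (illustrative only).
        x = x + score * dt + np.sqrt(dt) * 0.1 * noise
    return x                                       # (batch, L, 3) CA coordinates

samples = sample(single=None, pair=None)
print(samples.shape)
```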
Dataset composition 
Fig 1e. Data processing pipeline for pretraining
1.
AlphaFoldDB
Purpose: To pretrain BioEmu, encouraging protein conformational diversity within each sequence cluster.
AFDB snapshot was downloaded in July 2024.
Authors performed some preprocessing to identify sets of similar sequences with heterogeneous predicted structures.
2.
PDB (Protein Data Bank)
Purpose: To compare the structural diversity performance of BioEmu (comparison with the model trained with AFDB).
PDB snapshot was downloaded on Nov. 23, 2023.
3.
Molecular Dynamics simulation data
Purpose: To fine-tune BioEmu to cover vast conformational diversity.
a.
In-house MD dataset
Authors internally built an in-house MD dataset specifically for BioEmu under certain conditions (temperature, solvent, pressure, …).
Below is the list of in-house MD datasets:
• Octapeptides, CATH1, CATH2, MEGAsim, Complexin
b.
Public MD dataset
Authors also exploited the following public MD datasets:
• DESRES fast-folding proteins, DDR1, SETD8, SARS-CoV-2 exascale, SARS-CoV-2 non-exascale, MHC2 peptide simulations, Barnase-Barstar
4.
Experimental thermodynamics data
Purpose: To fine-tune BioEmu to predict the stability of the folded state.
Authors extracted the ΔG ("dG_ML") and ΔΔG ("ddG_ML") values and the corresponding amino-acid sequences for wild types and mutants from the curated set of the MEGAscale dataset.
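A hedged sketch of pulling these stability labels from a MEGAscale-style table follows. Only the dG_ML / ddG_ML column names come from the notes above; the other column names (aa_seq, is_wildtype) and the tiny inline table are stand-ins for the real curated CSV, which would be loaded with pd.read_csv in practice.

```python
import pandas as pd

# Stand-in table; in practice: df = pd.read_csv("<curated MEGAscale file>").
df = pd.DataFrame({
    "aa_seq":      ["MKTAYIAK", "MKTAYIAR", "GSSGSSGS"],
    "is_wildtype": [True, False, True],
    "dG_ML":       [3.2, 2.1, None],    # free-energy label, kcal/mol
    "ddG_ML":      [None, -1.1, None],  # stability change relative to wild type
})

labels = df.dropna(subset=["dG_ML"])                  # keep rows with usable labels
wildtypes = labels[labels["is_wildtype"]]
mutants = labels[~labels["is_wildtype"]]
print(len(wildtypes), "wild-type and", len(mutants), "mutant sequences with dG_ML")
```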
Training scheme 
Fig 1d. Data integration and model training pipeline
1.
Pretraining on AFDB
• Starting from the pre-trained, weight-frozen sequence encoder of AF2, the structure module (diffusion model) is trained from scratch.
• AFDB was processed to have high sequence diversity and varied conformations for each sequence.
• The pretrained model itself can already predict diverse conformations, but does not quantitatively match the probabilities of the different states → the major reason for fine-tuning!
2.
Fine-tuning on MD & Experimental thermodynamics data 
i.
Fine-tuning with CHARMM22* force-field MD data (DESRES-fastfolders dataset)
• The best pretrained model was further fine-tuned on the DESRES-fastfolders dataset; the performance is visualized in Fast-folding proteins below.
ii.
Fine-tuning with Amber force field MD data (other MD datasets) and experimental folding free energies
• The best pretrained model was fine-tuned on the Amber MD datasets, together with additional experimental thermodynamics data when the system came from the MEGAscale dataset.
• To retain the pretraining performance, 5% of the training set was filled with randomly selected AFDB data.
• Reweighting MD with Markov models and experimental data
Reason for reweighting MD data:
◦ MD simulations are too short to represent the whole conformational space
◦ The data distribution generated by MD is often biased towards the seeding structure
→ Each data point needs to be reweighted to guide the model to generate conformations according to the equilibrium distribution.
1. MSM (Markov state models) reweighting for the small ONE-octapeptide dataset
Too much detail → read suppl. S.3.5.1
2. Reweighting with experimental folding free energies
• A subset of the data has folding free energy (ΔG) annotations; ΔG is related to the probability of the folded state under the Boltzmann distribution.
• Relationship between the probability of being in the folded state, p_fold, and ΔG: ΔG = k_B T · ln((1 − p_fold) / p_fold)
• p_fold can be expressed as the expectation of a foldedness function χ(x) over the ensemble: p_fold = E_x[χ(x)]
Authors took the form of χ(x) as: χ(x) = Θ(FNC(x) − c)
• Θ: Heaviside step function
• FNC(x): the fraction of native contacts
• c: a threshold on the FNC separating folded (FNC near 1) from unfolded (FNC near 0) samples
• Since the distribution of FNC for each system is generally separated into peaks near 1 and 0, authors used a kernel density estimate to smooth the distribution.
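To make the ΔG ↔ p_fold relationship concrete, here is a small numpy sketch (my illustration; the 8 Å contact definition, the minimum sequence separation, and the threshold c = 0.5 are assumed values rather than the paper's exact choices, and the kernel-density smoothing step is omitted).

```python
import numpy as np

KT = 0.593            # k_B*T in kcal/mol at ~300 K
CONTACT_CUTOFF = 8.0  # Angstrom, assumed CA-CA contact definition
THRESHOLD_C = 0.5     # assumed foldedness threshold on FNC

def native_contacts(ca_native, cutoff=CONTACT_CUTOFF, min_seq_sep=3):
    """Residue pairs in contact in the native (reference folded) structure."""
    d = np.linalg.norm(ca_native[:, None] - ca_native[None, :], axis=-1)
    i, j = np.triu_indices(len(ca_native), k=min_seq_sep)
    mask = d[i, j] < cutoff
    return i[mask], j[mask]

def fnc(ca_sample, contacts, cutoff=CONTACT_CUTOFF):
    """Fraction of native contacts preserved in one sampled structure."""
    i, j = contacts
    d = np.linalg.norm(ca_sample[i] - ca_sample[j], axis=-1)
    return float(np.mean(d < cutoff))

def delta_g(ca_samples, ca_native):
    """Estimate the folding free energy from an ensemble of sampled structures."""
    contacts = native_contacts(ca_native)
    chi = np.array([fnc(x, contacts) > THRESHOLD_C for x in ca_samples])  # Heaviside
    p_fold = np.clip(chi.mean(), 1e-3, 1 - 1e-3)        # avoid log(0)
    return KT * np.log((1 - p_fold) / p_fold)            # kcal/mol

# Toy usage: 100 "samples" that are small perturbations of a fake native structure.
rng = np.random.default_rng(0)
ca_native = rng.normal(size=(30, 3)) * 5.0
samples = ca_native + rng.normal(scale=0.3, size=(100, 30, 3))
print(delta_g(samples, ca_native))
```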
• Property prediction fine-tuning (PPFT)
Fig 1f. Experimental property training for finetuning
◦ Although the reweighting guided the model toward the equilibrium distribution, authors also trained the model to predict experimental folding free energies directly, with a novel training scheme.
◦ Purpose:
▪ Aim for faster convergence (especially where unfolded states are rare) without too much computational cost
◦ Loss term: L = (χ(x₁) − p_fold) · (χ(x₂) − p_fold),
where x₁ and x₂ are two i.i.d. samples generated for the same protein sequence, and the target p_fold is computed from the experimental ΔG as p_fold = 1 / (1 + exp(ΔG / k_B T)).
The above loss detours the mode-collapse issue: it is an unbiased estimate of (E[χ] − p_fold)², so the model is trained to match the ensemble probability rather than pushing every individual sample toward the same foldedness.
To enable backpropagation, authors replaced the hard step in the foldedness definition with a smooth, differentiable approximation (a PyTorch sketch follows below).
◦ During PPFT, the diffusion model denoises the structure with fewer (8) denoising steps (similar to the mini-rollouts in AF3).
Authors argue that "foldedness" is a coarse-grained feature of a protein and can be predicted without full diffusion rollouts.
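Here is how that two-sample loss could look in PyTorch (my reading of the scheme above, not the authors' implementation; the sigmoid sharpness beta, the threshold c, and the exact form of the target are assumptions). fnc_1 and fnc_2 are the fractions of native contacts of the two i.i.d. samples x₁ and x₂ from short 8-step rollouts.

```python
import torch

KT = 0.593  # k_B*T in kcal/mol at ~300 K

def soft_foldedness(fnc, c=0.5, beta=20.0):
    """Differentiable replacement for the Heaviside step Theta(FNC - c)."""
    return torch.sigmoid(beta * (fnc - c))

def ppft_loss(fnc_1, fnc_2, dg_exp):
    """fnc_1, fnc_2: FNC of two i.i.d. samples for the same sequence;
    dg_exp: experimental folding free energy (kcal/mol)."""
    p_target = 1.0 / (1.0 + torch.exp(dg_exp / KT))   # Boltzmann-derived target
    chi1 = soft_foldedness(fnc_1)
    chi2 = soft_foldedness(fnc_2)
    # Product of residuals: an unbiased estimate of (E[chi] - p_target)^2, so the
    # gradient pushes the *ensemble* probability toward the target instead of
    # collapsing every sample onto the same intermediate foldedness.
    return (chi1 - p_target) * (chi2 - p_target)

loss = ppft_loss(torch.tensor(0.9, requires_grad=True),
                 torch.tensor(0.2, requires_grad=True),
                 torch.tensor(-1.0))
loss.backward()
```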
BioEmu Results
Sampling conformational changes related to protein function
Domain motions
Fig 2a. Large-scale domain motions (opening/closing, rotation, repacking)
• Left column: coverage (= % of reference structures that are sampled by at least 0.1% of samples within a given distance of the respective metric; a small sketch of this metric follows below)
• i–iii) RMSD to reference PDBs
→ BioEmu predicts 85% of the reference experimental structures with ≤ 3 Å RMSD
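A minimal sketch of the coverage metric used in the left-hand panels (my illustration; it assumes an RMSD matrix between every sample and every reference structure has already been computed; the 0.1% sample fraction and 3 Å cutoff follow the notes above).

```python
import numpy as np

def coverage(rmsd, cutoff=3.0, min_fraction=0.001):
    """rmsd: array of shape (n_samples, n_references).

    A reference counts as "covered" if at least `min_fraction` of samples lie
    within `cutoff` of it; returns the percentage of covered references."""
    frac_close = (rmsd <= cutoff).mean(axis=0)          # per-reference fraction
    return 100.0 * (frac_close >= min_fraction).mean()

# Toy usage with random stand-in RMSD values.
rmsd = np.random.default_rng(0).uniform(0.5, 10.0, size=(10_000, 4))
print(f"coverage = {coverage(rmsd):.1f}%")
```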
Local unfolding
Fig 2b. Local unfolding or unbinding of parts of the protein
• Left column: coverage (as defined above)
• i–iii) Fraction of native contacts and its free energy
Fraction of native contacts = fraction of the contacts present in a reference folded structure
→ BioEmu predicts the local unfolding transitions (overall 72% of locally folded and 74% of locally unfolded states)
Cryptic pockets
Fig 2c. Formation of cryptic binding pockets that are not present in the apo ground state.
• Left column: coverage (as defined above)
• i–iii) RMSD to reference PDBs
→ BioEmu showed a strong preference for holo states and predicted the cryptic pocket in 85% of cases, while it succeeded in predicting only 49% of the apo structures.
Emulating MD equilibrium distributions
Fast-folding proteins
Fig 3a. Fast-folding proteins simulated on the DESRES Anton supercomputer, compared with output from BioEmu fine-tuned on the DESRES fast-folder dataset excluding the test protein.
i.
From left to right,
• Folded and partially unfolded structures predicted by BioEmu (green) and ground-truth MD (gray)
• Free-energy surfaces (in kcal/mol) of ground-truth MD and BioEmu
• Secondary-structure content compared over the whole ensemble of structures
→ BioEmu shows very similar folding patterns and free-energy / secondary-structure landscapes.
ii.
Computational cost (in GPU hours) for MD (magenta: full DESRES dataset; yellow: single folding-unfolding roundtrip) and 10k samples from BioEmu (cyan)
→ BioEmu requires significantly lower cost
iii.
MAE (mean absolute error) of free-energy differences of macrostates, and fraction of unphysical model samples due to clashes
→ Fine-tuning helped to reduce the error
CATH domains
Fig 3b. CATH domains results
i.
From left to right,
• Folded and partially unfolded structures predicted by BioEmu (green) and ground-truth MD (gray); structurally flexible motifs are color-coded (cyan: helical; magenta: sheet)
• Free-energy surfaces (in kcal/mol) of ground-truth MD and BioEmu
• Secondary-structure content compared over the whole ensemble of structures
→ BioEmu covers most regions of the MD simulation space
ii.
MAE (mean absolute error) of free-energy differences of macrostates, and fraction of unphysical model samples due to clashes
iii.
Macrostate free-energy MAE and state coverage as a function of training data size, for a specialized CATH-only model
→ MAE and state coverage improve with more training data.
Predicting protein stabilities
With PPFT (property prediction fine-tuning), BioEmu was trained to predict protein stability in the form of "foldedness".
Prediction errors were measured in terms of folding free energy, classifying sampled structures as folded or unfolded based on their fraction of native contacts.
Fig 4a. Comparison of experimental measurements of folding free energies with model predictions
→ BioEmu achieved a mean absolute error below 0.8 kcal/mol and a Spearman correlation coefficient above 0.65 for proteins in the MEGAscale dataset (a small evaluation sketch follows at the end of this section).
Fig 4b. Validation that very stable proteins are consistently predicted as folded.
Fig 4c. Validation that intrinsically disordered proteins (IDPs) are predicted as unfolded.
Fig 4d. Analysis of the effect of three destabilizing mutants on the folded structures as predicted by the model
→ Certain effects of mutations can be analyzed with BioEmu's predictions
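For reference, the Fig 4a-style evaluation boils down to a mean absolute error and a Spearman rank correlation between predicted and experimental folding free energies. A tiny sketch with stand-in numbers (not the paper's data):

```python
import numpy as np
from scipy.stats import spearmanr

dg_exp = np.array([-2.1, -0.5, 1.3, 3.0, 4.2])    # experimental values, kcal/mol
dg_pred = np.array([-1.6, -0.2, 0.9, 2.4, 4.8])   # model predictions, kcal/mol

mae = np.mean(np.abs(dg_pred - dg_exp))            # mean absolute error
rho, _ = spearmanr(dg_pred, dg_exp)                # rank correlation
print(f"MAE = {mae:.2f} kcal/mol, Spearman rho = {rho:.2f}")
```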
Takeaways
• BioEmu is a generative ML system to approximately sample the equilibrium distributions of proteins.
• BioEmu was trained with AFDB, MD simulation structures, and some experimental properties (folding free energies).
• Authors utilized a novel training scheme called PPFT (property prediction fine-tuning).
• BioEmu and MD simulation are complementary: BioEmu needs, and can be improved with, more MD simulation data.
• The current version of BioEmu can only deal with single protein chains at a fixed thermodynamic condition.