
AlphaFold2 & AI Drug Discovery

0. Presenter info

Currently working at Galux

1. Protein?

1.1 Central Dogma: DNA → RNA → Protein

Source: Wikipedia
Source: CK-12 Foundation
The central dogma of molecular biology is an explanation of the flow of genetic information within a biological system. It is often stated as "DNA makes RNA, and RNA makes protein", although this is not its original meaning. It was first stated by Francis Crick in 1957, then published in 1958:
Source: Genome Research Limited
The Central Dogma. This states that once "information" has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information here means the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.
Information-carrying biopolymers: DNA, RNA, Protein
Smallest “tokens” of each biopolymer:
DNA: A, T, G, C
RNA: A, U, G, C
Protein: 20 amino acids
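A toy sketch of this information flow using the token alphabets above; the three-entry codon table is a tiny illustrative subset of the real 64-codon genetic code:

```python
# Toy central dogma: DNA -> RNA (transcription) -> protein (translation).
# CODON_TABLE is a tiny illustrative subset of the real 64-entry genetic code.
CODON_TABLE = {"AUG": "M", "UUU": "F", "GGC": "G"}  # codon -> 1-letter amino acid

def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA: the alphabet change is just T -> U."""
    return dna.replace("T", "U")

def translate(mrna: str) -> str:
    """mRNA -> protein: read non-overlapping 3-letter codons."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    return "".join(CODON_TABLE.get(c, "?") for c in codons)

print(translate(transcribe("ATGTTTGGC")))  # -> "MFG"
```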

2. Protein Folding?

2.1. From 1D to 3D

Source: Wikipedia
Protein folding is the physical process by which a polypeptide (a protein chain), synthesized by a ribosome as a linear chain of amino acids (via translation of messenger RNA), folds from an unstable random coil into the protein's three-dimensional structure. This is typically a 'folded' conformation, by which the protein becomes biologically functional. The folded structure is not fixed. It is dynamic, depending on the local environment (solvent, temperature, pH, interacting proteins, …)
Source: Wikipedia
Source: iGenetics 3rd ed.
Primary structure: Linear amino acid sequence
Secondary structure: First step of folding. Recurring local patterns (e.g., α-helices and β-sheets) stabilized by intramolecular hydrogen bonds
Source: bioninja.com
Tertiary structure: Folded structure of a single polypeptide chain. Usually, hydrophilic side chains face the surrounding aqueous environment, while hydrophobic side chains pack into the protein's hydrophobic core.
Quaternary structure: Assembly of multiple folded chains (tertiary structures) into a complex.

2.2. Structure defines Function

Structure is determined by various methods
X-ray crystallography
Nuclear Magnetic Resonance (NMR)
Cryo-EM
→ These yield ground-truth structure labels, but require extensive trial & error, years of work, and expensive equipment.
Commonly observed structural motifs

2.3. Modeling of protein folding

The search space of protein folding is enormous: brute-force enumeration of conformations ($\approx 10^{50}$) is computationally intractable (see the back-of-the-envelope estimate sketched below). Yet in biological systems (e.g., our bodies), folding completes within a few milliseconds to a few hours.
Several computational methods have been developed.
Molecular dynamics (MD) simulations: high computational cost (limited to peptides and very small proteins)
Anton: a massively parallel supercomputer purpose-built for MD simulations (D. E. Shaw Research)
Rosetta: physics- and fragment-based modeling suite from the Baker group
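A back-of-the-envelope sketch of why brute force fails (the per-torsion state count and sampling rate are illustrative assumptions, not measured values):

```python
# Levinthal-style estimate: conformations grow exponentially with chain length.
n_residues = 100
states_per_torsion = 3               # assumed coarse discretization of each angle
n_torsions = 2 * (n_residues - 1)    # roughly one phi and one psi per residue
n_conformations = states_per_torsion ** n_torsions

samples_per_second = 1e12            # generously fast (illustrative) sampler
seconds_per_year = 3.15e7
years = n_conformations / samples_per_second / seconds_per_year
print(f"{n_conformations:.2e} conformations, ~{years:.2e} years to enumerate")
```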

2.4. CASP: Critical Assessment of protein Structure Prediction

Since the late 1960s, understanding and simulating the protein folding process have been important challenges for computational biology.
A biennial competition held since 1994 (CASP1).

2.5. Baker group

World-leading group in protein structure modeling, led by David Baker at the University of Washington

3. AlphaFold2 (AF2)

3.1. About AlphaFold

AlphaFold1: CASP13 (2018), AlphaFold2: CASP14 (2020)
DeepMind said that they began developing AF in 2016 (after AlphaGo’s victory against Lee Sedol).
The AF2 paper has received over 18,000 citations, placing it among the 100 most-cited papers of the last decade and the 900 most-cited papers of all time.
Source: DeepMind blog
The AF team and EMBL-EBI made AF predictions freely available through the AlphaFold Protein Structure Database.
22 July 2021: over 1 million structures (human, yeast, fruit fly, mouse, …)
28 July 2022: Expanded to 200 million structures
Source: DeepMind blog
The culmination of domain-knowledge-based inductive biases and extreme engineering.

3.2. Keywords and novelties of AF2

1. End-to-end
2. Evoformer
3. Backbone frame, torsion angle
4. Invariant point attention (IPA)
5. Intermediate loss
6. Masked MSA loss
7. Self-distillation
8. Self-estimates of accuracy

3.3. Background

Multiple sequence alignment (MSA)
Residues and side chains
Distogram
Rigid body assumption
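Among these, the distogram is easy to make concrete: it is a binned residue-residue distance map. A minimal numpy sketch, with illustrative bin edges (AF2 uses a similar 64-bin layout):

```python
import numpy as np

def distogram(coords: np.ndarray, bins: np.ndarray) -> np.ndarray:
    """coords: (N_res, 3) C-alpha coordinates -> (N_res, N_res) distance-bin indices."""
    diff = coords[:, None, :] - coords[None, :, :]  # pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)            # (N_res, N_res) distances in Angstrom
    return np.digitize(dist, bins)                  # bin index per residue pair

coords = np.random.rand(5, 3) * 20.0                # 5 residues, random coordinates
bins = np.linspace(2.0, 22.0, 63)                   # 63 edges -> 64 bins (illustrative)
print(distogram(coords, bins).shape)                # (5, 5)
```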

3.4. AF2 Model I/O and Overview

1. Input embedding: obtain sequence-related info and structure-related info
2. Evoformer: efficient & powerful self-attention for information exchange
3. Structure module: explicitly predict the structure from the Evoformer embedding
4. Recycling: refinement of the prediction

3.5. Input embedding

Input: amino acid sequence
Output: MSA representation ($N_{seq} \times N_{res} \times c_m$) & residue pair representation ($N_{res} \times N_{res} \times c_z$)
1. Sequence info
- Input sequence → MSA → MSA representation
- Captures sequential evolutionary covariation info
- Genetic search (MSA): find the evolutionary context of the input sequence via profile HMM-based DB search
2. Structure info
- Template info from similar known structures
- Template search: with the MSA result (especially JackHMMER v3.3 + UniRef90), use HHSearch on PDB70 to find similar structures
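A shape-only sketch of the two embeddings this stage produces; the channel sizes $c_m = 256$ and $c_z = 128$ follow the AF2 paper, and the random tensors merely stand in for real features:

```python
import numpy as np

N_seq, N_res = 512, 350   # sequences in the MSA, residues in the query
c_m, c_z = 256, 128       # channel sizes from the AF2 paper

# MSA representation: one embedding per (aligned sequence, residue) pair.
msa_rep = np.random.randn(N_seq, N_res, c_m)

# Pair representation: one embedding per (residue_i, residue_j) pair,
# initialized from template and relative-position features.
pair_rep = np.random.randn(N_res, N_res, c_z)

print(msa_rep.shape, pair_rep.shape)  # (512, 350, 256) (350, 350, 128)
```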

3.6. Evoformer stack

Input: MSA representation & pair representation (+ recycled input)
Output: (updated) MSA representation & (updated) pair representation
Two major stacks and their communication
MSA stack: update MSA embedding
Axial (row-wise & column-wise) gated self-attention
Pair stack: update residue pair embedding
Triangular operations
Information exchange between two stacks
Attention biasing in row-wise gated self-attention: Pair info → MSA rep
Outer product mean: MSA info → Pair rep
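A minimal numpy sketch of the outer-product-mean step (MSA → pair); the random matrices stand in for learned Linear projections, and the intermediate channel size c = 32 follows the paper:

```python
import numpy as np

def outer_product_mean(msa, W_a, W_b, W_out):
    """msa: (N_seq, N_res, c_m) -> pair update: (N_res, N_res, c_z)."""
    a = msa @ W_a                                 # (N_seq, N_res, c)
    b = msa @ W_b                                 # (N_seq, N_res, c)
    # Outer product over channels, averaged over the sequence dimension.
    o = np.einsum("sic,sjd->ijcd", a, b) / msa.shape[0]
    return o.reshape(*o.shape[:2], -1) @ W_out    # (N_res, N_res, c_z)

N_seq, N_res, c_m, c, c_z = 8, 10, 256, 32, 128
rng = np.random.default_rng(0)
update = outer_product_mean(
    rng.standard_normal((N_seq, N_res, c_m)),
    rng.standard_normal((c_m, c)),
    rng.standard_normal((c_m, c)),
    rng.standard_normal((c * c, c_z)),
)
print(update.shape)  # (10, 10, 128)
```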

3.7. Structure module

Input: MSA representation & pair representation from the Evoformer & (updated) backbone frames
Output: 3D atom coordinates, per-residue confidence (pLDDT), losses
Two-step procedure
1. Residue position prediction
- Residue frame: the triangle of N, Cα, C atoms of each residue
- Orientation and position: $R_i \in \mathbb{R}^{3 \times 3}$ (rotation matrix), $\vec{\mathbf{t}}_i \in \mathbb{R}^3$ (translation vector)
- Black-hole initialization: initially, all residues sit at the origin, $\vec{\mathbf{t}}_i = [0, 0, 0]$, with the same identity orientation $R_i = I_3$
- The global coordinates can be computed as $\vec{\mathbf{x}}_{\text{global}} = T_i \circ \vec{\mathbf{x}}_{\text{local}} = R_i \cdot \vec{\mathbf{x}}_{\text{local}} + \vec{\mathbf{t}}_i$, where $T_i = (R_i, \vec{\mathbf{t}}_i)$ (a numeric sketch follows below)
2. Atom position prediction
- Predict torsion angles (of backbone atoms) and χ angles (of side-chain atoms) to compute all atom positions
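A minimal numpy sketch of applying a residue frame $T_i = (R_i, \vec{\mathbf{t}}_i)$ to local atom coordinates, starting from the black-hole initialization (the idealized backbone geometry is approximate):

```python
import numpy as np

def apply_frame(R: np.ndarray, t: np.ndarray, x_local: np.ndarray) -> np.ndarray:
    """x_global = R @ x_local + t, applied to a batch of local coordinates (..., 3)."""
    return x_local @ R.T + t

# Black-hole initialization: every residue at the origin with identity orientation.
n_res = 4
R = np.tile(np.eye(3), (n_res, 1, 1))  # (N_res, 3, 3) rotation matrices
t = np.zeros((n_res, 3))               # (N_res, 3) translation vectors

# Idealized local backbone geometry (N, CA, C) in Angstrom (approximate values).
backbone_local = np.array([[-0.525, 1.363, 0.0],   # N
                           [ 0.0,   0.0,   0.0],   # CA
                           [ 1.526, 0.0,   0.0]])  # C

# Place each residue's atoms in global coordinates using its own frame.
global_atoms = np.stack([apply_frame(R[i], t[i], backbone_local) for i in range(n_res)])
print(global_atoms.shape)  # (4, 3, 3): residues x atoms x xyz
```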
Structure module architecture
- Attention between two residues in 3D space that is invariant to global transformations
- Input: MSA single representation & pair representation & updated backbone frame positions
- Components
1. Core self-attention for the single MSA representation
2. Pair representation (as a bias term)
3. Invariant point attention (IPA) module
pLDDT: per-residue confidence
Additional head predicts per-residue lDDT-Cα
lDDT: local Distance Difference Test score
Label: binned per-residue lDDT-Cα
Task: classification
Loss: cross entropy loss
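A minimal sketch of how the classification head's binned lDDT-Cα prediction becomes a scalar per-residue pLDDT; 50 bins and the softmax expected-value readout follow the released AF2 code:

```python
import numpy as np

def plddt_from_logits(logits: np.ndarray) -> np.ndarray:
    """logits: (N_res, n_bins) lDDT-Ca logits -> (N_res,) pLDDT in [0, 100]."""
    n_bins = logits.shape[-1]                     # 50 bins in AF2
    bin_width = 100.0 / n_bins
    bin_centers = np.arange(n_bins) * bin_width + bin_width / 2
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)         # softmax over bins
    return probs @ bin_centers                    # expected lDDT-Ca per residue

logits = np.random.randn(10, 50)                  # 10 residues, 50 bins
print(plddt_from_logits(logits).shape)            # (10,)
```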

3.8. Recycling

Repeat the prediction process with Evoformer output & Structure module output
Advantages
Recycling deepens the network
Model can experience various versions of input features for a single input sequence
During training:
$N' \sim \text{Uniform}(1, N_{\text{cycle}})$
Backpropagation is only performed for the last ($N'$-th) cycle.
At inference:
$N' = N_{\text{cycle}}$
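A minimal PyTorch-style sketch of this schedule; `model`, its signature, and the recycled-feature dict are hypothetical stand-ins, but the gradient handling mirrors the rule above:

```python
import random
import torch

def run_with_recycling(model, inputs, n_cycle: int = 4, training: bool = True):
    # Training: sample N' ~ Uniform(1, N_cycle); inference: always run N_cycle cycles.
    n = random.randint(1, n_cycle) if training else n_cycle
    recycled = None
    for cycle in range(n):
        last = cycle == n - 1
        # Gradients flow only through the last cycle; earlier cycles are detached.
        with torch.set_grad_enabled(last and training):
            outputs = model(inputs, recycled)  # hypothetical signature
        recycled = None if last else {k: v.detach() for k, v in outputs.items()}
    return outputs
```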

4. Others

4.1. RoseTTAFold

4.2. AF-multimer

4.3. AF2 latest

4.4. Isomorphic Labs

Founded by Demis Hassabis in 2021, under DeepMind's parent company Alphabet
Deals totaling nearly $3 billion:
Eli Lilly: $1.7 billion (upfront $45 million), excluding royalties
Novartis: $1.2 billion (upfront $37.5 million), excluding royalties
