lDDT

History of scoring protein structures

As computational modeling of protein structures has advanced, there have been continuing efforts to assess model quality by comparing a modeled structure (the prediction) against a reference structure.

In particular, CASP (Critical Assessment of techniques for protein Structure Prediction), held every two years, has used a variety of quantitative measures to evaluate the quality of the modeled structures submitted by teams from around the world.

In the early days, the following two methods were the most widely used.

RMSD (root-mean-square deviation)

\text{RMSD}(\mathbf{v}, \mathbf{w}) = \sqrt{\frac{1}{n} \sum^n_{i=1} \| v_i - w_i \|^2}

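• Description: the simplest method. After superposing the prediction onto the reference, take the coordinate difference for each pair of corresponding atoms, square it, average over all atoms, and take the square root.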

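• Drawbacks:

  ◦ The score is overly sensitive to outliers: because the deviations are squared, a few badly placed atoms can dominate it.

  ◦ The result depends heavily on how the superposition is done.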
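The outlier sensitivity can be seen in a small sketch of the RMSD formula above. It assumes the two coordinate lists are already superposed and list atoms in the same order; the function name `rmsd` and the toy coordinates are made up for illustration.

```python
import math


def rmsd(v, w):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumption: v is already superposed onto w; no fitting is done here.
    """
    if len(v) != len(w) or not v:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(v, w))
    return math.sqrt(squared / len(v))


# Toy data (hypothetical): 100 atoms on a line, every atom off by 0.5 Å.
reference = [(float(i), 0.0, 0.0) for i in range(100)]
model = [(x + 0.5, y, z) for x, y, z in reference]
rmsd_uniform = rmsd(model, reference)

# Same model, but a single atom is now 30 Å away from its reference position.
model_outlier = list(model)
model_outlier[0] = (30.0, 0.0, 0.0)
rmsd_outlier = rmsd(model_outlier, reference)

print(f"uniform error: {rmsd_uniform:.2f} Å, one outlier: {rmsd_outlier:.2f} Å")
```

In this toy case one misplaced atom out of 100 raises the RMSD from 0.5 Å to about 3.04 Å.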

GDT (Global distance test)

An agreement-based score introduced at CASP4 to address some of the drawbacks of RMSD mentioned above.

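The prediction is superposed onto the reference iteratively, and for each superposition the number of CA (alpha carbon) atoms of the model that fall within a given distance threshold of their reference positions is counted. The superposition with the highest count is taken to be the best one for that threshold. This is repeated for several thresholds, and the resulting values, expressed as fractions of the residues, are averaged to give the GDT score.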

The thresholds are usually 1Å, 2Å, 4Å, and 8Å for GDT_TS (total score), and 0.5Å, 1Å, 2Å, and 4Å for GDT_HA (high accuracy).
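The counting and averaging step can be sketched as below. This is a deliberate simplification, not the full GDT procedure: it scores one fixed superposition (the coordinates as given) instead of searching for the best superposition per threshold, and it assumes a CA atom counts when its deviation is less than or equal to the threshold. The function name and the toy coordinates are hypothetical.

```python
import math

GDT_TS_THRESHOLDS = (1.0, 2.0, 4.0, 8.0)
GDT_HA_THRESHOLDS = (0.5, 1.0, 2.0, 4.0)


def gdt_fixed_superposition(model_ca, reference_ca, thresholds):
    """Average percentage of CA atoms within each threshold.

    Assumption: model_ca is already superposed onto reference_ca. A real GDT
    calculation would search over many superpositions for each threshold.
    """
    if len(model_ca) != len(reference_ca) or not reference_ca:
        raise ValueError("CA lists must be non-empty and equally long")
    deviations = [math.dist(m, r) for m, r in zip(model_ca, reference_ca)]
    fractions = [
        sum(d <= t for d in deviations) / len(deviations) for t in thresholds
    ]
    return 100.0 * sum(fractions) / len(fractions)


# Toy data (hypothetical): four CA atoms displaced by 0.3, 1.5, 3.0 and 10.0 Å.
reference_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model_ca = [(0.0, 0.3, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

gdt_ts = gdt_fixed_superposition(model_ca, reference_ca, GDT_TS_THRESHOLDS)
gdt_ha = gdt_fixed_superposition(model_ca, reference_ca, GDT_HA_THRESHOLDS)
print(f"GDT_TS = {gdt_ts:.2f}, GDT_HA = {gdt_ha:.2f}")
```

Because deviations are only compared against thresholds, the atom that is 10 Å off costs no more than one that is just over 8 Å off, which is why this kind of score reacts less strongly to outliers than RMSD.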

• Drawbacks:

  ◦ It still relies on superposition, which has to be performed iteratively.

  ◦ It is more robust than RMSD to small outliers, but it struggles to give an appropriate score for flexible proteins with multiple domains.

A superposition-free score that reflects local atomic detail was therefore needed, and the lDDT score was developed.

lDDT: introduction

Figure: The target structure (shown in gray) consists of two domains. (A) The predicted model is shown in full length, with the first domain superposed onto the target. (B) The two domains of the prediction are separated according to CASP AUs (assessment units) and superposed individually onto the target structure.

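As the name suggests, lDDT (local Distance Difference Test) computes differences in local distances. As panels A and B of the figure above show, when a protein consists of two or more domains and the region connecting them is flexible, GDT is unlikely to give an appropriate score.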

lDDT measures how well the local environment of the reference structure is preserved in the prediction.

In the reference structure, take every pair of atoms that do not belong to the same residue and lie within a given inclusion radius R_0 (usually 15Å) of each other, and compute the distance L of each pair.

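If the distance between the same two atoms in the prediction is still L (or differs from L by no more than a given tolerance threshold), the distance is considered "preserved". The fraction of preserved distances is then computed.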

Typically this fraction is computed for each of four thresholds (0.5Å, 1Å, 2Å, 4Å), and the four values are averaged.
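The steps above can be written compactly as follows. The notation is an assumption introduced here for illustration: P is the set of atom pairs selected in the reference, L_{ij} is the reference distance of pair (i, j), L'_{ij} is the corresponding distance in the prediction, and T is the set of tolerance thresholds. Whether the comparisons are strict or inclusive is not stated above, so the choice of < and ≤ below is also an assumption.

```latex
P = \{\, (i, j) : i < j,\ \text{res}(i) \neq \text{res}(j),\ L_{ij} < R_0 \,\}

\text{lDDT} = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|P|} \sum_{(i, j) \in P}
\mathbf{1}\left[\, \left| L_{ij} - L'_{ij} \right| \le t \,\right],
\qquad T = \{0.5, 1, 2, 4\}\ \text{Å}
```

Only distances within each structure appear in the formula, so the value does not change when either structure is rotated or translated as a whole.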

• Advantages:

  ◦ No superposition is required.

  ◦ It remains a meaningful score even for flexible protein structures.

It is not necessary to use all atoms. The score can just as well be computed from CA atoms only, from backbone atoms only, or only between atoms that belong to different chains.

lDDT: Python code

TBD
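A minimal sketch of the procedure described in the introduction, not a reference implementation. It assumes that the model and the reference list the same atoms in the same order, that pairs are included when their reference distance is below the inclusion radius, and that a distance counts as preserved when the difference is at most the threshold. The function names, parameters, and toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def lddt(model, reference, residue_ids, inclusion_radius=15.0,
         thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Global lDDT-style score in [0, 1] for matching lists of (x, y, z) atoms."""
    if not (len(model) == len(reference) == len(residue_ids)):
        raise ValueError("model, reference and residue_ids must match in length")
    # Step 1: reference pairs from different residues within the radius.
    pairs = []
    for i, j in combinations(range(len(reference)), 2):
        if residue_ids[i] == residue_ids[j]:
            continue
        ref_dist = math.dist(reference[i], reference[j])
        if ref_dist < inclusion_radius:
            pairs.append((i, j, ref_dist))
    if not pairs:
        raise ValueError("no atom pairs within the inclusion radius")
    # Step 2: how much each of those distances changed in the model.
    diffs = [abs(math.dist(model[i], model[j]) - d) for i, j, d in pairs]
    # Step 3: fraction preserved per threshold, then the average.
    fractions = [sum(x <= t for x in diffs) / len(diffs) for t in thresholds]
    return sum(fractions) / len(fractions)


def naive_rmsd(v, w):
    """RMSD of the coordinates as given, without any superposition."""
    return math.sqrt(sum(math.dist(a, b) ** 2 for a, b in zip(v, w)) / len(v))


# Toy two-domain "protein" (hypothetical): one CA atom per residue.
spacing = 3.8
domain_a = [(spacing * k, 0.0, 0.0) for k in range(5)]
domain_b = [(40.0 + spacing * k, 0.0, 0.0) for k in range(5)]
reference = domain_a + domain_b
residue_ids = list(range(len(reference)))

# Model 1: domain B swung by 90 degrees around a hinge at (40, 0, 0).
hinged = domain_a + [(40.0, spacing * k, 0.0) for k in range(5)]
lddt_hinged = lddt(hinged, reference, residue_ids)
rmsd_hinged = naive_rmsd(hinged, reference)

# Model 2: one atom inside domain A displaced by 3 Å.
distorted = list(reference)
distorted[2] = (2 * spacing, 3.0, 0.0)
lddt_distorted = lddt(distorted, reference, residue_ids)

print(f"hinged:    lDDT = {lddt_hinged:.3f}, RMSD = {rmsd_hinged:.2f} Å")
print(f"distorted: lDDT = {lddt_distorted:.3f}")
```

In this toy case the hinge motion leaves every included distance intact, so the score stays at 1.0 even though the unsuperposed RMSD is about 9.3 Å, while the local distortion lowers the score. Restricting the calculation to CA or backbone atoms amounts to passing only those atoms to the function.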

Reference

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3799472/