Capturing protein domain structure and function using self-supervision on domain architectures

Damianos P. Melidis; Wolfgang Nejdl

doi:10.3390/a14010028

Details

Original language	English
Article number	28
Journal	Algorithms
Volume	14
Issue number	1
Publication status	Published - 19 Jan 2021

Abstract

Protein sequence embeddings have been shown to improve the prediction of biological properties of unseen proteins. However, biological metadata do not exist for each amino acid, so the quality of each learned embedding vector cannot be measured separately. Therefore, current sequence embeddings cannot be intrinsically evaluated in a quantitative manner on the degree of biological information they capture. We address this drawback with our approach, dom2vec, which learns vector representations for protein domains rather than for individual amino acids, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain: its structure, enzymatic function, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.
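
The self-supervision idea in the abstract, treating each domain architecture as a "sentence" whose "words" are domains, might be sketched as follows. This is only an illustration under stated assumptions: the skip-gram-style (center, context) pairing, the window size, the function name `skipgram_pairs`, and the domain identifiers `DOM_A`, `DOM_B`, `DOM_C` are all hypothetical and are not taken from this record or confirmed as the authors' training setup.

```python
from collections import Counter
from typing import List, Tuple


def skipgram_pairs(architecture: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) domain pairs found within `window` positions.

    `architecture` is an ordered list of domain identifiers for one protein.
    """
    pairs = []
    for i, center in enumerate(architecture):
        lo = max(0, i - window)
        hi = min(len(architecture), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a domain is not its own context
                pairs.append((center, architecture[j]))
    return pairs


# Hypothetical toy corpus: each inner list is one protein's domain architecture.
architectures = [
    ["DOM_A", "DOM_B", "DOM_C"],
    ["DOM_B", "DOM_C"],
]

# Co-occurrence counts over the corpus; an embedding model would be trained
# to predict context domains from center domains using pairs like these.
pair_counts = Counter(
    pair for arch in architectures for pair in skipgram_pairs(arch, window=1)
)
```

In such a sketch, domains that repeatedly share neighbours across architectures would end up with similar vectors once a word-embedding model is fitted to the pairs, which is the property an intrinsic evaluation against domain metadata could then test.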

ASJC Scopus subject areas

Mathematics (all)
Theoretical Computer Science
Mathematics (all)
Numerical Analysis
Computer Science (all)
Computational Theory and Mathematics
Mathematics (all)
Computational Mathematics

Cite this

Capturing protein domain structure and function using self-supervision on domain architectures. / Melidis, Damianos P.; Nejdl, Wolfgang.
In: Algorithms, Vol. 14, No. 1, 28, 19.01.2021.

Research output: Contribution to journal › Article › Research › Peer-reviewed

Melidis, DP & Nejdl, W 2021, 'Capturing protein domain structure and function using self-supervision on domain architectures', Algorithms, vol. 14, no. 1, 28. https://doi.org/10.3390/a14010028, https://doi.org/10.1101/2020.03.17.995498

Melidis, D. P., & Nejdl, W. (2021). Capturing protein domain structure and function using self-supervision on domain architectures. Algorithms, 14(1), Article 28. https://doi.org/10.3390/a14010028, https://doi.org/10.1101/2020.03.17.995498

Melidis DP, Nejdl W. Capturing protein domain structure and function using self-supervision on domain architectures. Algorithms. 2021 Jan 19;14(1):28. doi: 10.3390/a14010028, 10.1101/2020.03.17.995498

Melidis, Damianos P. ; Nejdl, Wolfgang. / Capturing protein domain structure and function using self-supervision on domain architectures. In: Algorithms. 2021 ; Vol. 14, No. 1.

Download

@article{a05fdd4348144370a068abb3d2e48a5a,

title = "Capturing protein domain structure and function using self-supervision on domain architectures",

abstract = "Predicting biological properties of unseen proteins is shown to be improved by the use of protein sequence embeddings. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, in order to measure the quality of each unique learned embedding vector separately. Therefore, current sequence embedding cannot be intrinsically evaluated on the degree of their captured biological information in a quantitative manner. We address this drawback by our approach, dom2vec, by learning vector representation for protein domains and not for each amino acid base, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biology knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain, which are its structure, enzymatic, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment—therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the dom2vec applicability on protein prediction tasks, by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.",

keywords = "Enzymatic commission class, Protein domain architectures, Quantitative quality assessment, SCOPe secondary structure class, Word embeddings",

author = "Melidis, {Damianos P.} and Wolfgang Nejdl",

note = "Funding Information: Funding: This study was funded by the Ministry for Science and Culture of Lower Saxony Germany (MWK: Ministerium f{\"u}r Wissenschaft und Kultur) within the project “Understanding Cochlear Implant Outcome Variability using Big Data and Machine Learning Approaches”, project id: ZN3429.",

year = "2021",

month = jan,

day = "19",

doi = "10.3390/a14010028",

language = "English",

volume = "14",

number = "1",

}

Download

TY - JOUR

T1 - Capturing protein domain structure and function using self-supervision on domain architectures

AU - Melidis, Damianos P.

AU - Nejdl, Wolfgang

N1 - Funding Information: Funding: This study was funded by the Ministry for Science and Culture of Lower Saxony Germany (MWK: Ministerium für Wissenschaft und Kultur) within the project “Understanding Cochlear Implant Outcome Variability using Big Data and Machine Learning Approaches”, project id: ZN3429.

PY - 2021/1/19

Y1 - 2021/1/19

N2 - Predicting biological properties of unseen proteins is shown to be improved by the use of protein sequence embeddings. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, in order to measure the quality of each unique learned embedding vector separately. Therefore, current sequence embedding cannot be intrinsically evaluated on the degree of their captured biological information in a quantitative manner. We address this drawback by our approach, dom2vec, by learning vector representation for protein domains and not for each amino acid base, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biology knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain, which are its structure, enzymatic, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment—therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the dom2vec applicability on protein prediction tasks, by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.

AB - Predicting biological properties of unseen proteins is shown to be improved by the use of protein sequence embeddings. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, in order to measure the quality of each unique learned embedding vector separately. Therefore, current sequence embedding cannot be intrinsically evaluated on the degree of their captured biological information in a quantitative manner. We address this drawback by our approach, dom2vec, by learning vector representation for protein domains and not for each amino acid base, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biology knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain, which are its structure, enzymatic, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment—therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the dom2vec applicability on protein prediction tasks, by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.

KW - Enzymatic commission class

KW - Protein domain architectures

KW - Quantitative quality assessment

KW - SCOPe secondary structure class

KW - Word embeddings

UR - http://www.scopus.com/inward/record.url?scp=85099789523&partnerID=8YFLogxK

U2 - 10.3390/a14010028

DO - 10.3390/a14010028

M3 - Article

AN - SCOPUS:85099789523

VL - 14

JO - Algorithms

JF - Algorithms

IS - 1

M1 - 28

ER -

Research@Leibniz University

By the same authors

Enhancing quality inspection of highly variant geared motors

A Systematic Evaluation of Single-Cell Foundation Models on Cell-Type Classification Task

WSDM 2025 General Chairs' Welcome

Retrieval-Augmented Generation of Event Collections from Web Archives and the Live Web

Processing UK Biobank High Resolution Accelerometry Data for Unsupervised Identification of Activity Profiles and Their Differences in Clinically Relevant Outcome Parameters: The ATLAS Index Revisited
