Joint learning from multiple information sources for biological problems

Thi Ngan Dong

doi:10.15488/14147

Details

Original language	English
Qualification	Doctor rerum naturalium
Awarding institution	Leibniz Universität Hannover
Supervised by	Nejdl, W., Supervisor
Date of award	13 July 2023
Place of publication	Hannover
Publication status	Published - 2023

Abstract

Thanks to technological advancements, more and more biological data have been generated in recent years. This data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem, one that takes into account the intricate interplay between the involved molecules and entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically degrade the performance of a predictive model on unseen data and thus limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through knowledge accumulated over time, our cognitive ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports learning to speak correct and meaningful paragraphs. Generally, knowledge acquired from related learning tasks helps boost our learning capability in the current task. Motivated by this phenomenon, in this thesis we study supervised machine learning models for bioinformatics problems that improve their performance by exploiting multiple related knowledge sources. More specifically, we are concerned with ways to enrich the supervised models’ knowledge base with publicly available related data to enhance the computational models’ prediction performance. Our work shares commonalities with existing work in multimodal learning, multi-task learning, and transfer learning, although it differs from them in some cases. Besides the proposed architectures, we present large-scale experimental setups with consensus evaluation metrics, along with the creation and release of large datasets, to showcase our approaches’ superiority.
Moreover, we add case studies with detailed analyses, in which we make no simplifying assumptions, to demonstrate the systems’ utility in realistic application scenarios. Finally, we develop and make available an easy-to-use website where non-expert users can query the model’s generated prediction results, to facilitate field experts’ assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between “Computer Science” and “Biology” and that it will open a new era of fruitful collaboration between computer scientists and biological field experts.

Cite

Joint learning from multiple information sources for biological problems. / Dong, Thi Ngan.
Hannover, 2023. 148 p.

Publication: Thesis › Doctoral thesis

Dong, TN 2023, 'Joint learning from multiple information sources for biological problems', Doctor rerum naturalium, Gottfried Wilhelm Leibniz Universität Hannover, Hannover. https://doi.org/10.15488/14147

Dong, T. N. (2023). Joint learning from multiple information sources for biological problems. [Dissertation, Gottfried Wilhelm Leibniz Universität Hannover]. https://doi.org/10.15488/14147

Dong TN. Joint learning from multiple information sources for biological problems. Hannover, 2023. 148 p. doi: 10.15488/14147

Dong, Thi Ngan. / Joint learning from multiple information sources for biological problems. Hannover, 2023. 148 p.

@phdthesis{7251fd9bb81541c68d18f1da94960e75,

title = "Joint learning from multiple information sources for biological problems",

author = "Dong, {Thi Ngan}",

year = "2023",

doi = "10.15488/14147",

language = "English",

school = "Leibniz University Hannover",

}

TY - BOOK

T1 - Joint learning from multiple information sources for biological problems

AU - Dong, Thi Ngan

PY - 2023

Y1 - 2023

U2 - 10.15488/14147

DO - 10.15488/14147

M3 - Doctoral thesis

CY - Hannover

ER -
