Details
Original language | English |
---|---|
Article number | 18 |
Number of pages | 19 |
Journal | Journal of biomedical semantics |
Volume | 14 |
Publication status | Published - 28 Nov 2023 |
Abstract
Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we go beyond bibliometric metadata to predict influential scholarly documents, and we also examine the task over categorized scholarly documents. We introduce a new approach that enhances the document representation with a domain-independent knowledge graph to find influential scholarly documents using categorized scholarly content. As the input collection, we use the WHO corpus of scholarly documents on COVID-19. The study examines different document representation methods for machine learning, including TF-IDF, BOW, and embedding-based language models (BERT); the TF-IDF representation works better than the others. Of the machine learning methods tested, logistic regression outperformed the others for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction. We further use a domain-independent knowledge graph, specifically DBpedia, to enhance the document representation for predicting influential scholarly documents with categorical scholarly content. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation, which we enhance with the direct type (RDF type) and unqualified relation from DBpedia. In this experiment, the enhanced document representation had no impact on scholarly document category classification, but it did have an effect on influential scholarly document prediction with categorical data.
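The idea of enhancing a BOW representation with knowledge-graph types can be illustrated with a minimal sketch. It is not the authors' implementation: the `ENTITY_TYPES` lookup, the `type=` feature prefix, and the helper names `tokenize`, `bow`, and `enriched_bow` are all hypothetical, and a real pipeline would presumably obtain types by linking entities against DBpedia rather than from a hard-coded dictionary.

```python
from collections import Counter
import re

# Hypothetical token -> DBpedia type lookup (assumption for illustration).
# A real system would derive this from entity linking against DBpedia.
ENTITY_TYPES = {
    "covid-19": ["dbo:Disease"],
    "who": ["dbo:Organisation"],
    "vaccine": ["dbo:Drug"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bow(text):
    """Plain bag-of-words: maps each token to its count in the text."""
    return Counter(tokenize(text))


def enriched_bow(text, entity_types=ENTITY_TYPES):
    """Bag-of-words plus one extra feature per type of each known token."""
    features = bow(text)
    # Iterate over a snapshot so new type features can be added safely.
    for token, count in list(features.items()):
        for rdf_type in entity_types.get(token, []):
            features["type=" + rdf_type] += count
    return features


doc = "WHO guidance on COVID-19 vaccine trials and COVID-19 testing"
features = enriched_bow(doc)
```

Under these assumptions, the enriched vector keeps every original token count and adds type-level features, so two documents that mention different diseases would still share a `type=dbo:Disease` feature that a downstream classifier could use.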
ASJC Scopus subject areas
- Computer Science (all)
  - Information Systems
  - Computer Science Applications
  - Computer Networks and Communications
- Medicine (all)
  - Health Informatics
Cite this
In: Journal of biomedical semantics, Vol. 14, 18, 28.11.2023.
Publication: Contribution to journal › Article › Research › Peer-reviewed
TY - JOUR
T1 - Impact of COVID-19 research
T2 - a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph
AU - Rabby, Gollam
AU - D’Souza, Jennifer
AU - Oelen, Allard
AU - Dvorackova, Lucie
AU - Svátek, Vojtěch
AU - Auer, Sören
N1 - Funding Information: Open Access funding enabled and organized by Projekt DEAL. Gollam Rabby was partly supported by grant IGA 16/2022 “PRECOG: Predicting REsearch COncepts of siGnificance” and CIMPLE project (CHIST-ERA-19-XAI-003). Sören Auer, Jennifer D’Souza, and Allard Oelen were partially supported by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and the GWK/DFG grant for NFDI4DataScience (460234259).
PY - 2023/11/28
Y1 - 2023/11/28
AB - Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we describe our work that attempts to go beyond bibliometric metadata to predict influential scholarly documents. Furthermore, this work also examines the influential scholarly document prediction task over categorized scholarly documents. We also introduce a new approach to enhance the document representation method with a domain-independent knowledge graph to find the influential scholarly document using categorized scholarly content. As the input collection, we use the WHO corpus with scholarly documents on the theme of COVID-19. This study examines different document representation methods for machine learning, including TF-IDF, BOW, and embedding-based language models (BERT). The TF-IDF document representation method works better than others. From various machine learning methods tested, logistic regression outperformed the other for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction, with the help of a domain-independent knowledge graph, specifically DBpedia, to enhance the document representation method for predicting influential scholarly documents with categorical scholarly content. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation method. We also enhance the BOW document representation with the direct type (RDF type) and unqualified relation from DBpedia. From this experiment, we did not find any impact of the enhanced document representation for the scholarly document category classification. We found an effect in the influential scholarly document prediction with categorical data.
KW - COVID-19
KW - Domain-independent knowledge graph
KW - Influential scholarly document prediction
KW - Machine learning algorithms
KW - Text mining
KW - World Health Organization
UR - http://www.scopus.com/inward/record.url?scp=85177870094&partnerID=8YFLogxK
U2 - 10.1186/s13326-023-00298-4
DO - 10.1186/s13326-023-00298-4
M3 - Article
C2 - 38017587
AN - SCOPUS:85177870094
VL - 14
JO - Journal of biomedical semantics
JF - Journal of biomedical semantics
SN - 2041-1480
M1 - 18
ER -