PALADIN: A process-based constraint language for data validation

Publication: Contribution to journal › Article › Research › Peer-reviewed

Authors

  • Antonio Jesus Diaz-Honrubia
  • Philipp D. Rohde
  • Emetis Niazmand
  • Ernestina Menasalvas
  • Maria Esther Vidal

External organisations

  • Centro de Tecnología Biomédica (CTB)
  • Technische Informationsbibliothek (TIB) Leibniz-Informationszentrum Technik und Naturwissenschaften und Universitätsbibliothek

Details

Original language: English
Article number: 102557
Number of pages: 23
Journal: Information Fusion
Volume: 112
Early online date: 4 July 2024
Publication status: Electronically published (E-pub) - 4 July 2024

Abstract

In many processes, ranging from medical treatments to supply chains and employee management, there is a growing need to gather information with the objective of enhancing the efficiency of the process in question. Often, the information gathered from different stages of a process resides in disparate storage systems, necessitating an information fusion process. Post-fusion, it is common to encounter data inconsistencies that hinder an accurate analysis. Unfortunately, existing data validation languages lack the capability to model constraints across stages, making it challenging to identify inconsistencies without introducing artificial elements. This paper introduces PALADIN, a language which has been specifically designed to allow the formulation of constraints in the realm of process-based data, i.e., data points that evolve through various stages of a process with constraints that change according to the stage at which a data point is. PALADIN is data model-agnostic, which means it is not specific to any particular data model or format. This paper provides a formalization, together with implementation details of PALADIN validators, and their validation through a use case. Furthermore, PALADIN is subjected to an empirical evaluation across 20 datasets, including 18 synthetically generated ones that are openly shared with the scientific community. The experimentation involves 53 testbeds, and shows that PALADIN reduces the data validation time compared with other languages that are not tailored for process-based data—achieving a speed-up of up to five times. The results also highlight the impact of parameters such as the type of data integration system, the number of integrity constraints, and the dataset size on the validation time of PALADIN shape schemas.
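The idea of constraints that depend on the stage a data point has reached can be illustrated with a small Python sketch. This is not PALADIN syntax, which the abstract does not show, and it is not the authors' validator. The stage names, field names, and the `validate` helper are hypothetical and chosen only to mirror the medical-treatment example mentioned above.

```python
from datetime import date
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
# A constraint is a label plus a predicate that returns True when satisfied.
Constraint = Tuple[str, Callable[[Record], bool]]


def _ordered(record: Record, earlier: str, later: str) -> bool:
    """True if both dates are present and `earlier` is not after `later`."""
    a, b = record.get(earlier), record.get(later)
    return a is not None and b is not None and a <= b


# Hypothetical process: each stage carries its own set of constraints.
STAGE_CONSTRAINTS: Dict[str, List[Constraint]] = {
    "diagnosis": [
        ("has_diagnosis_date", lambda r: r.get("diagnosis_date") is not None),
    ],
    "treatment": [
        ("has_drug", lambda r: bool(r.get("drug"))),
        # Cross-stage constraint: relates treatment data to diagnosis data.
        ("treatment_not_before_diagnosis",
         lambda r: _ordered(r, "diagnosis_date", "treatment_date")),
    ],
}


def validate(record: Record) -> List[str]:
    """Return the labels of the constraints violated at the record's stage."""
    stage = record.get("stage")
    if stage not in STAGE_CONSTRAINTS:
        return ["unknown_stage"]
    return [label for label, holds in STAGE_CONSTRAINTS[stage] if not holds(record)]


consistent = {
    "stage": "treatment",
    "diagnosis_date": date(2024, 1, 10),
    "treatment_date": date(2024, 2, 1),
    "drug": "drug-A",
}
inconsistent = {
    "stage": "treatment",
    "diagnosis_date": date(2024, 3, 1),
    "treatment_date": date(2024, 2, 1),
    "drug": "",
}

print(validate(consistent))
print(validate(inconsistent))
```

Here a record is checked only against the constraints of the stage it is currently in, and one of those constraints compares values gathered at two different stages, which is the kind of inconsistency the abstract says is hard to express in languages not designed for process-based data.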

Cite this

PALADIN: A process-based constraint language for data validation. / Diaz-Honrubia, Antonio Jesus; Rohde, Philipp D.; Niazmand, Emetis et al.
In: Information Fusion, Vol. 112, 102557, 12.2024.

Diaz-Honrubia, AJ, Rohde, PD, Niazmand, E, Menasalvas, E & Vidal, ME 2024, 'PALADIN: A process-based constraint language for data validation', Information Fusion, vol. 112, 102557. https://doi.org/10.1016/j.inffus.2024.102557
Diaz-Honrubia, A. J., Rohde, P. D., Niazmand, E., Menasalvas, E., & Vidal, M. E. (2024). PALADIN: A process-based constraint language for data validation. Information Fusion, 112, Article 102557. Advance online publication. https://doi.org/10.1016/j.inffus.2024.102557
Diaz-Honrubia AJ, Rohde PD, Niazmand E, Menasalvas E, Vidal ME. PALADIN: A process-based constraint language for data validation. Information Fusion. 2024 Dec;112:102557. Epub 2024 Jul 4. doi: 10.1016/j.inffus.2024.102557
Diaz-Honrubia, Antonio Jesus ; Rohde, Philipp D. ; Niazmand, Emetis et al. / PALADIN : A process-based constraint language for data validation. In: Information Fusion. 2024 ; Vol. 112.
@article{650a5e5ebe8f4daa998806a9e03935d7,
title = "PALADIN: A process-based constraint language for data validation",
keywords = "Information fusion, Process-based data, Shape-based validation languages",
author = "Diaz-Honrubia, {Antonio Jesus} and Rohde, {Philipp D.} and Niazmand, Emetis and Menasalvas, Ernestina and Vidal, {Maria Esther}",
note = "Publisher Copyright: {\textcopyright} 2024 The Author(s)",
year = "2024",
month = jul,
day = "4",
doi = "10.1016/j.inffus.2024.102557",
language = "English",
volume = "112",
journal = "Information Fusion",
issn = "1566-2535",
publisher = "Elsevier",

}

TY  - JOUR
T1  - PALADIN
T2  - A process-based constraint language for data validation
AU  - Diaz-Honrubia, Antonio Jesus
AU  - Rohde, Philipp D.
AU  - Niazmand, Emetis
AU  - Menasalvas, Ernestina
AU  - Vidal, Maria Esther
N1  - Publisher Copyright: © 2024 The Author(s)
PY  - 2024/7/4
Y1  - 2024/7/4
KW  - Information fusion
KW  - Process-based data
KW  - Shape-based validation languages
UR  - http://www.scopus.com/inward/record.url?scp=85198036386&partnerID=8YFLogxK
U2  - 10.1016/j.inffus.2024.102557
DO  - 10.1016/j.inffus.2024.102557
M3  - Article
AN  - SCOPUS:85198036386
VL  - 112
JO  - Information Fusion
JF  - Information Fusion
SN  - 1566-2535
M1  - 102557
ER  -