Open benchmark for filtering techniques in entity resolution

Franziska Neuhof; Marco Fisichella; George Papadakis; Konstantinos Nikoletos; Nikolaus Augsten; Wolfgang Nejdl; Manolis Koubarakis

doi:10.1007/s00778-024-00868-7

Details

Original language	English
Pages (from - to)	1671-1696
Number of pages	26
Journal	VLDB Journal
Volume	33
Issue number	5
Early online date	9 July 2024
Publication status	Published - Sept. 2024

Abstract

Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
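The two filtering families named in the abstract can be illustrated with a minimal sketch. This is not the benchmark's code: the toy profiles, the function names (`token_blocking`, `nearest_neighbors`, `recall_precision`), the choice of whitespace tokens as signatures, and the bag-of-words cosine similarity are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy profiles and ground truth (not taken from the paper's datasets).
profiles = {
    0: "apple iphone 12 black",
    1: "iphone 12 apple 64gb",
    2: "samsung galaxy s21",
    3: "galaxy s21 samsung phone",
    4: "nokia 3310",
}
true_matches = {(0, 1), (2, 3)}


def token_blocking(profiles):
    """Blocking sketch: each token is a signature; profiles sharing a token share a block."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for token in text.split():
            blocks[token].add(pid)
    candidates = set()
    for members in blocks.values():
        # Every pair of profiles inside a block becomes a candidate match.
        candidates.update(combinations(sorted(members), 2))
    return candidates


def vectorize(text):
    """Turn a profile into a sparse bag-of-words vector (token -> count)."""
    vec = defaultdict(int)
    for token in text.split():
        vec[token] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_neighbors(profiles, k=1):
    """Nearest-neighbor sketch: pair every query profile with its k most similar profiles."""
    vectors = {pid: vectorize(text) for pid, text in profiles.items()}
    candidates = set()
    for q, qv in vectors.items():
        # Rank all other profiles by descending similarity; ties are broken by id.
        ranked = sorted((p for p in vectors if p != q),
                        key=lambda p: (-cosine(qv, vectors[p]), p))
        for p in ranked[:k]:
            candidates.add((min(q, p), max(q, p)))
    return candidates


def recall_precision(candidates, matches):
    """Recall = matches retained / all matches; precision = matches retained / candidates."""
    hits = len(candidates & matches)
    recall = hits / len(matches)
    precision = hits / len(candidates) if candidates else 0.0
    return recall, precision


if __name__ == "__main__":
    for name, cands in [("blocking", token_blocking(profiles)),
                        ("nearest-neighbor", nearest_neighbors(profiles, k=1))]:
        print(name, sorted(cands), recall_precision(cands, true_matches))
```

With five profiles, the brute-force approach would compare all 10 pairs; both sketches return far fewer candidates. The nearest-neighbor variant always returns k candidates per query, so a profile without a true match (profile 4 here) still contributes a candidate pair, which illustrates why a parameter such as `k` affects precision.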

ASJC Scopus subject areas

Computer Science (all)
Information Systems
Computer Science (all)
Hardware and Architecture

Cite this

Open benchmark for filtering techniques in entity resolution. / Neuhof, Franziska; Fisichella, Marco; Papadakis, George et al.
In: VLDB Journal, Vol. 33, No. 5, 09.2024, pp. 1671-1696.

Research output: Contribution to journal › Article › Research › Peer-reviewed

Neuhof, F, Fisichella, M, Papadakis, G, Nikoletos, K, Augsten, N, Nejdl, W & Koubarakis, M 2024, 'Open benchmark for filtering techniques in entity resolution', VLDB Journal, vol. 33, no. 5, pp. 1671-1696. https://doi.org/10.1007/s00778-024-00868-7

Neuhof, F., Fisichella, M., Papadakis, G., Nikoletos, K., Augsten, N., Nejdl, W., & Koubarakis, M. (2024). Open benchmark for filtering techniques in entity resolution. VLDB Journal, 33(5), 1671-1696. https://doi.org/10.1007/s00778-024-00868-7

Neuhof F, Fisichella M, Papadakis G, Nikoletos K, Augsten N, Nejdl W et al. Open benchmark for filtering techniques in entity resolution. VLDB Journal. 2024 Sep;33(5):1671-1696. Epub 2024 Jul 9. doi: 10.1007/s00778-024-00868-7

Neuhof, Franziska ; Fisichella, Marco ; Papadakis, George et al. / Open benchmark for filtering techniques in entity resolution. In: VLDB Journal. 2024 ; Vol. 33, No. 5. pp. 1671-1696.

Download

@article{985086de737545ed965fbeeb8d96bd51,

title = "Open benchmark for filtering techniques in entity resolution",

abstract = "Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.",

keywords = "Blocking, Entity resolution, Filtering, Nearest neighbors",

author = "Franziska Neuhof and Marco Fisichella and George Papadakis and Konstantinos Nikoletos and Nikolaus Augsten and Wolfgang Nejdl and Manolis Koubarakis",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.",

year = "2024",

month = sep,

doi = "10.1007/s00778-024-00868-7",

language = "English",

volume = "33",

pages = "1671--1696",

journal = "VLDB Journal",

issn = "1066-8888",

publisher = "Springer New York",

number = "5",

}

Download

TY - JOUR

T1 - Open benchmark for filtering techniques in entity resolution

AU - Neuhof, Franziska

AU - Fisichella, Marco

AU - Papadakis, George

AU - Nikoletos, Konstantinos

AU - Augsten, Nikolaus

AU - Nejdl, Wolfgang

AU - Koubarakis, Manolis

N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.

PY - 2024/9

Y1 - 2024/9

N2 - Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.

AB - Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.

KW - Blocking

KW - Entity resolution

KW - Filtering

KW - Nearest neighbors

UR - http://www.scopus.com/inward/record.url?scp=85198121701&partnerID=8YFLogxK

U2 - 10.1007/s00778-024-00868-7

DO - 10.1007/s00778-024-00868-7

M3 - Article

AN - SCOPUS:85198121701

VL - 33

SP - 1671

EP - 1696

JO - VLDB Journal

JF - VLDB Journal

SN - 1066-8888

IS - 5

ER -

