Details
Original language | English |
---|---|
Pages (from - to) | 1671-1696 |
Number of pages | 26 |
Journal | VLDB Journal |
Volume | 33 |
Issue number | 5 |
Early online date | 9 July 2024 |
Publication status | Published - Sept. 2024 |
Abstract
Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
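The two filtering families named in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's code: the toy profiles, the ground truth, and every function name below are hypothetical, token blocking stands in for "identical signatures", and bag-of-words vectors with cosine similarity stand in for the vectorization step of a nearest-neighbor workflow.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math

# Hypothetical toy profiles and ground truth (not from the paper's datasets).
profiles = {
    1: "apple iphone 12 black phone",
    2: "iphone 12 by apple",
    3: "samsung galaxy s21",
    4: "galaxy s21 samsung phone",
    5: "nokia 3310",
}
true_matches = {(1, 2), (3, 4)}


def brute_force_pairs(profiles):
    """All n*(n-1)/2 pairs: the quadratic baseline that filtering avoids."""
    return set(combinations(sorted(profiles), 2))


def token_blocking(profiles):
    """Blocking: every token is a signature; profiles sharing one form a block."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for token in set(text.split()):
            blocks[token].add(pid)
    candidates = set()
    for members in blocks.values():
        candidates |= set(combinations(sorted(members), 2))
    return candidates


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbor_pairs(profiles, k=1):
    """Nearest neighbors: vectorize each profile, keep its k closest profiles."""
    vectors = {pid: Counter(text.split()) for pid, text in profiles.items()}
    candidates = set()
    for query, qvec in vectors.items():
        scored = sorted(
            ((cosine(qvec, vec), pid) for pid, vec in vectors.items() if pid != query),
            key=lambda item: (-item[0], item[1]),
        )
        for score, pid in scored[:k]:
            if score > 0:  # ignore neighbors that share no token at all
                candidates.add(tuple(sorted((query, pid))))
    return candidates


def recall_precision(candidates, truth):
    """Recall: share of true matches retained; precision: share of candidates that match."""
    hits = len(candidates & truth)
    recall = hits / len(truth)
    precision = hits / len(candidates) if candidates else 0.0
    return recall, precision


for name, pairs in [
    ("brute force", brute_force_pairs(profiles)),
    ("token blocking", token_blocking(profiles)),
    ("nearest neighbor (k=1)", nearest_neighbor_pairs(profiles)),
]:
    print(name, len(pairs), recall_precision(pairs, true_matches))
```

On this toy input both filters retain every true match while emitting far fewer candidates than the brute-force baseline; the parameter `k` and the choice of signature are the kind of configuration knobs whose tuning the abstract describes as hard.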
ASJC Scopus subject areas
- Computer Science (all)
- Information Systems
- Hardware and Architecture
Cite this
In: VLDB Journal, Vol. 33, No. 5, 09.2024, pp. 1671-1696.
Publication: Contribution to journal › Article › Research › Peer-reviewed
TY - JOUR
T1 - Open benchmark for filtering techniques in entity resolution
AU - Neuhof, Franziska
AU - Fisichella, Marco
AU - Papadakis, George
AU - Nikoletos, Konstantinos
AU - Augsten, Nikolaus
AU - Nejdl, Wolfgang
AU - Koubarakis, Manolis
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2024/9
Y1 - 2024/9
N2 - Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
KW - Blocking
KW - Entity resolution
KW - Filtering
KW - Nearest neighbors
UR - http://www.scopus.com/inward/record.url?scp=85198121701&partnerID=8YFLogxK
U2 - 10.1007/s00778-024-00868-7
DO - 10.1007/s00778-024-00868-7
M3 - Article
AN - SCOPUS:85198121701
VL - 33
SP - 1671
EP - 1696
JO - VLDB Journal
JF - VLDB Journal
SN - 1066-8888
IS - 5
ER -