In-Memory Indexed Caching for Distributed Data Processing

Alexandru Uta; Bogdan Ghit; Ankur Dave; Jan Rellermeyer; Peter Boncz

doi:10.48550/arXiv.2112.06280

Details

Original language	English
Title of host publication	Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	104-114
Number of pages	11
ISBN (electronic)	9781665481069
Publication status	Published - 2022
Externally published	Yes
Event	36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France
Duration	30 May 2022 → 3 Jun 2022

Publication series

Name	Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Abstract

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.

Keywords

cs.DC

ASJC Scopus subject areas

Computer Science(all) › Computer Networks and Communications
Computer Science(all) › Hardware and Architecture
Computer Science(all) › Computer Science Applications

Cite this

In-Memory Indexed Caching for Distributed Data Processing. / Uta, Alexandru; Ghit, Bogdan; Dave, Ankur et al.
Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. p. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Uta, A, Ghit, B, Dave, A, Rellermeyer, J & Boncz, P 2022, In-Memory Indexed Caching for Distributed Data Processing. in Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022, Institute of Electrical and Electronics Engineers Inc., pp. 104-114, 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022, Virtual, Online, France, 30 May 2022. https://doi.org/10.48550/arXiv.2112.06280, https://doi.org/10.1109/IPDPS53621.2022.00019

Uta, A., Ghit, B., Dave, A., Rellermeyer, J., & Boncz, P. (2022). In-Memory Indexed Caching for Distributed Data Processing. In Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 (pp. 104-114). (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.48550/arXiv.2112.06280, https://doi.org/10.1109/IPDPS53621.2022.00019

Uta A, Ghit B, Dave A, Rellermeyer J, Boncz P. In-Memory Indexed Caching for Distributed Data Processing. In Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc. 2022. p. 104-114. (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022). doi: 10.48550/arXiv.2112.06280, 10.1109/IPDPS53621.2022.00019

Uta, Alexandru ; Ghit, Bogdan ; Dave, Ankur et al. / In-Memory Indexed Caching for Distributed Data Processing. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).

BibTeX

@inproceedings{773e3387a89043588cccad8ef5cec87e,

title = "In-Memory Indexed Caching for Distributed Data Processing",

abstract = "Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.",

keywords = "cs.DC",

author = "Alexandru Uta and Bogdan Ghit and Ankur Dave and Jan Rellermeyer and Peter Boncz",

note = "Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was in part supported by The Dutch National Science Foundation NWO Veni grant VI.202.195.; 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022; Conference date: 30-05-2022 through 03-06-2022",

year = "2022",

doi = "10.48550/arXiv.2112.06280",

language = "English",

series = "Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "104--114",

booktitle = "Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022",

address = "United States",

}

RIS

TY - GEN

T1 - In-Memory Indexed Caching for Distributed Data Processing

AU - Uta, Alexandru

AU - Ghit, Bogdan

AU - Dave, Ankur

AU - Rellermeyer, Jan

AU - Boncz, Peter

N1 - Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was in part supported by The Dutch National Science Foundation NWO Veni grant VI.202.195.

PY - 2022

Y1 - 2022

N2 - Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.

AB - Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.

KW - cs.DC

UR - http://www.scopus.com/inward/record.url?scp=85136337448&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2112.06280

DO - 10.48550/arXiv.2112.06280

M3 - Conference contribution

T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

SP - 104

EP - 114

BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022

Y2 - 30 May 2022 through 3 June 2022

ER -

By the same author(s)

Toward Competitive Serverless Deep Learning

The Performance of Distributed Applications: A Traffic Shaping Perspective

Log Parsing Evaluation in the Era of Modern Software Systems

Brug: An Adaptive Memory (Re-)Allocator

Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real World
