Details
Original language | English |
---|---|
Title of host publication | Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 104-114 |
Number of pages | 11 |
ISBN (electronic) | 9781665481069 |
Publication status | Published - 2022 |
Externally published | Yes |
Event | 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France. Duration: 30 May 2022 → 3 Jun 2022 |
Publication series
Name | Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 |
---|
Abstract
Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with indexing capabilities for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
Keywords
- cs.DC
ASJC Scopus subject areas
- Computer Science(all)
- Computer Networks and Communications
- Hardware and Architecture
- Computer Science Applications
Cite this
In-Memory Indexed Caching for Distributed Data Processing. / Uta, Alexandru; Ghit, Bogdan; Dave, Ankur; Rellermeyer, Jan; Boncz, Peter. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. p. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - In-Memory Indexed Caching for Distributed Data Processing
AU - Uta, Alexandru
AU - Ghit, Bogdan
AU - Dave, Ankur
AU - Rellermeyer, Jan
AU - Boncz, Peter
N1 - Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was supported in part by the Dutch National Science Foundation NWO Veni grant VI.202.195.
PY - 2022
Y1 - 2022
N2 - Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with indexing capabilities for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
AB - Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with indexing capabilities for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
KW - cs.DC
UR - http://www.scopus.com/inward/record.url?scp=85136337448&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2112.06280
DO - 10.48550/arXiv.2112.06280
M3 - Conference contribution
T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
SP - 104
EP - 114
BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022
Y2 - 30 May 2022 through 3 June 2022
ER -