Details
Original language | English |
---|---|
Title of host publication | INFOCOM 2022 - IEEE Conference on Computer Communications |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 460-469 |
Number of pages | 10 |
ISBN (electronic) | 9781665458221 |
ISBN (print) | 978-1-6654-5823-8 |
Publication status | Published - 2022 |
Event | 41st IEEE Conference on Computer Communications, INFOCOM 2022 - Virtual, Online, United Kingdom (UK). Duration: 2 May 2022 → 5 May 2022 |
Publication series
Name | Proceedings - IEEE INFOCOM |
---|---|
Volume | 2022-May |
ISSN (Print) | 0743-166X |
ISSN (electronic) | 2641-9874 |
Abstract
Parallel systems divide jobs into smaller tasks that can be serviced by many workers at the same time. Some parallel systems have blocking barriers that require all of their tasks to start and/or depart in unison. This is true of many parallelized machine learning workloads, and the popular Apache Spark processing engine has recently added support for Barrier Execution Mode, which allows users to add such barriers to their jobs. The drawback of these barriers is reduced performance and stability compared to equivalent non-blocking systems. We derive analytical expressions for the stability regions for parallel systems with blocking start and/or departure barriers. We extend results from queueing theory to derive waiting and sojourn time bounds for systems with blocking start barriers. Our results show that for a given system utilization and number of servers, there is an optimal degree of parallelism that balances waiting time and job execution time. This observation leads us to propose and implement a class of self-adaptive schedulers, which we call "Take-Half", that modulate the allowed degree of parallelism based on the instantaneous system load, improving mean performance and eliminating stability issues.
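The stability penalty mentioned in the abstract can be illustrated with a toy model. The sketch below is not the paper's model or derivation; it assumes (purely for illustration) k servers, jobs that each split into exactly k tasks with independent exponentially distributed service times of rate mu, and both a start and a departure barrier. Under those assumptions only one job is in service at a time and its service time is the maximum of k exponentials, with mean H_k / mu, where H_k is the k-th harmonic number. The system is then stable only while the utilization rho = lambda / mu stays below 1 / H_k, whereas the equivalent non-blocking system is stable for any rho < 1. The function names `harmonic` and `max_stable_utilization` are hypothetical.

```python
from fractions import Fraction


def harmonic(k: int) -> Fraction:
    """Return the k-th harmonic number H_k = 1 + 1/2 + ... + 1/k exactly."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def max_stable_utilization(k: int) -> float:
    """Upper limit on rho = lambda / mu in the toy model described above.

    Assumptions (illustrative, not taken from the paper): k servers,
    k tasks per job, i.i.d. exponential task times, start and departure
    barriers. Mean job service time is H_k / mu, so stability requires
    lambda * H_k / mu < 1, i.e. rho < 1 / H_k.
    """
    return float(1 / harmonic(k))


if __name__ == "__main__":
    # Show how the stable utilization range shrinks as parallelism grows.
    for k in (1, 2, 4, 8, 16, 32):
        print(f"k={k:2d}  rho must stay below {max_stable_utilization(k):.3f}")
```

Because H_k grows roughly like ln k, the admissible utilization in this toy model shrinks slowly but without bound as the degree of parallelism increases, which is consistent with the abstract's point that limiting parallelism under high load can restore stability.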
ASJC Scopus subject areas
- Computer Science(all)
- Engineering(all)
- Electrical and Electronic Engineering
Cite this
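Performance and Scaling of Parallel Systems with Blocking Start and/or Departure Barriers. / Walker, Brenton; Bora, Stefan; Fidler, Markus. INFOCOM 2022 - IEEE Conference on Computer Communications. Institute of Electrical and Electronics Engineers Inc., 2022. p. 460-469 (Proceedings - IEEE INFOCOM; Vol. 2022-May).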
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Performance and Scaling of Parallel Systems with Blocking Start and/or Departure Barriers
AU - Walker, Brenton
AU - Bora, Stefan
AU - Fidler, Markus
N1 - Funding Information: This work was supported in part by the German Research Council (DFG) under Grant VaMoS (FI 1236/7-1).
PY - 2022
Y1 - 2022
UR - http://www.scopus.com/inward/record.url?scp=85126371308&partnerID=8YFLogxK
U2 - 10.1109/INFOCOM48880.2022.9796754
DO - 10.1109/INFOCOM48880.2022.9796754
M3 - Conference contribution
AN - SCOPUS:85126371308
SN - 978-1-6654-5823-8
T3 - Proceedings - IEEE INFOCOM
SP - 460
EP - 469
BT - INFOCOM 2022 - IEEE Conference on Computer Communications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 41st IEEE Conference on Computer Communications, INFOCOM 2022
Y2 - 2 May 2022 through 5 May 2022
ER -