MapReduce Online
Berkeley College · University of California, Berkeley
Abstract
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see early returns from a job as it is being computed. Our Hadoop Online Prototype (HOP) also…
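The contrast the abstract draws, between materializing a task's full output and pipelining it to the next operator, can be illustrated with a minimal single-process sketch. This is not HOP's implementation or the Hadoop API: the function names (`map_words`, `reduce_with_snapshots`, `run_job`) and the `snapshot_every` parameter are hypothetical, and Python generators stand in for the network pipeline between map and reduce tasks. The reducer consumes pairs as they arrive and periodically emits a partial result, which is roughly the idea behind online aggregation.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Hypothetical map step: emit (word, 1) pairs one at a time
    instead of materializing the whole output first."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_with_snapshots(
    pairs: Iterable[Tuple[str, int]], snapshot_every: int
) -> Iterator[Dict[str, int]]:
    """Hypothetical reduce step: consume pairs as they arrive, yielding a
    partial result every `snapshot_every` pairs and the final result last."""
    counts: Counter = Counter()
    seen = 0
    for key, value in pairs:
        counts[key] += value
        seen += 1
        if seen % snapshot_every == 0:
            yield dict(counts)  # early, incomplete answer
    yield dict(counts)  # final answer once the input is exhausted


def run_job(lines: List[str], snapshot_every: int = 3) -> List[Dict[str, int]]:
    """Wire map output directly into reduce and collect every snapshot."""
    return list(reduce_with_snapshots(map_words(lines), snapshot_every))


if __name__ == "__main__":
    for snapshot in run_job(["a b a", "b a c"]):
        print(snapshot)
```

Each snapshot is available before the job finishes, so a user can inspect an early return and decide whether to wait for the final one. A real system would also need the fault-tolerance machinery that materialization otherwise provides, which this sketch omits.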
Citation impact
- FWCI: 204.27
- Percentile: 100%
- References: 30
Authors (6)
- Tyson Condie (corresponding author)
  Berkeley College, University of California, Berkeley
- Neil Conway
  Berkeley College, University of California, Berkeley
- Peter Alvaro
  Berkeley College, University of California, Berkeley
- Joseph M. Hellerstein
  Berkeley College, University of California, Berkeley
- Khaled Elmeleegy
Topics & keywords
- Computer science
- Implementation
- Fault tolerance
- Big data
- Task (project management)
- Distributed computing
- Stream processing
- Programming paradigm
- Industry, innovation and infrastructure