Abstract

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see early returns from a job as it is being computed. Our Hadoop Online Prototype (HOP) also…

Citation impact

Total citations: 683
FWCI: 204.27
Percentile: 100%
References: 30
Authors: 6

Topics & keywords

Keywords
  • Computer science
  • Implementation
  • Fault tolerance
  • Big data
  • Task (project management)
  • Distributed computing
  • Stream processing
  • Programming paradigm
UN Sustainable Development Goals
  • Industry, innovation and infrastructure