
Spark is the New Workhorse of Data Processing on Hadoop

August 4, 2015

If you are a big data practitioner, let me confirm something you have strongly suspected: Apache Spark will replace MapReduce as the general-purpose data processing engine for Apache Hadoop.

Spark’s success is due to the combination of dramatic speed improvements with an API that is significantly more intuitive, expressive and flexible. 
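To make the expressiveness point concrete, here is a minimal sketch in plain Python. It is not Spark code: `LocalPipeline` is a hypothetical in-memory class invented for this illustration, and its method names only loosely mirror Spark's RDD operations `flatMap`, `map` and `reduceByKey`. The sample input is made up as well. The sketch counts words twice, first with explicit map, shuffle and reduce phases in the MapReduce style, then as a single chain of transformations.

```python
from collections import defaultdict
from itertools import chain

# Made-up sample input for the illustration.
lines = ["spark on hadoop", "hadoop and spark", "spark streaming"]


# Style 1: explicit map, shuffle and reduce phases (MapReduce-like).
def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


mapped = list(chain.from_iterable(map_phase(line) for line in lines))
counts_phased = reduce_phase(shuffle_phase(mapped))


# Style 2: one chained pipeline of transformations.
class LocalPipeline:
    """Hypothetical in-memory stand-in for a distributed collection."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the results.
        return LocalPipeline(out for item in self.items for out in fn(item))

    def map(self, fn):
        # Apply fn to each item, one output per input.
        return LocalPipeline(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        # Merge the values of (key, value) pairs that share a key.
        merged = {}
        for key, value in self.items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return LocalPipeline(merged.items())

    def collect(self):
        return list(self.items)


counts_chained = dict(
    LocalPipeline(lines)
    .flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)

print(counts_phased == counts_chained)
```

Both styles produce the same counts. The difference is in how the job reads: the chained version states the whole computation in one expression, which is the kind of API ergonomics the paragraph above refers to.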

Spark was originally written to speed up iterative data processing algorithms such as those used to train machine-learning models. But as Spark matured and stabilized, it established itself as a fast general-purpose compute engine for processing large volumes of data. This is exemplified by the special-purpose frameworks written on top of Spark for stream and graph processing. Spark Streaming, the framework for processing continuous streams of data, saw phenomenal adoption over the past year and has established itself as the leading framework for stream processing.

We at Cloudera have been ardent supporters of Spark for over two years. However, this article is not about the many virtues of Spark. Spark’s amazing features have been covered ad nauseam in blog posts, media articles and conference presentations. In this article, let us instead look at how the Hadoop ecosystem will evolve with Spark as its core compute engine.

Future of data processing on Hadoop 

What will big data processing on Hadoop look like over the next few years? Most batch data processing jobs will be written in Spark. Jobs that perform ETL processing, predictive model training, large-scale search indexing and exploratory data analytics will all be written in Spark.

Spark Streaming will become the de facto standard for writing jobs that process continuous streams of data in real time. However, despite its speed and flexibility, Spark will not be able to cover the entire range of workload types on Hadoop. BI workloads, where low-latency SQL access with high concurrency is critical, still require massively parallel processing (MPP) systems like Impala.

Use cases that involve indexing for fast search and retrieval of data, particularly textual data, will also continue to be a good chunk of big data workloads. For these workloads, a massively parallel distributed search framework like Apache Solr will remain essential. 

Is MapReduce dead then? Is it time to start authoring its eulogy? Not quite. MapReduce jobs that crunch through petabytes of data are run on a daily basis at organizations across the world. Spark has not been validated at petabyte scale. It will inevitably get there, but until then, MapReduce will be the tool of choice to reliably run petabyte-scale, extremely disk I/O-intensive workloads.

More to come … stay tuned 

Spark had a meteoric rise in popularity and adoption. It became a top-level Apache project in early 2014 and, in less than a year and a half, it established itself as the data processing engine of choice for Hadoop. 

Cloudera is proud to be one of the critical drivers of the success of Spark. We were the first large vendor to ship Spark and, in the past year and a half, we have led hundreds of customers to success with Spark. Together with our close partner Intel, we contributed almost 800 patches to Spark. Some of our significant contribution areas are Spark on YARN, dynamic resource allocation, Spark Streaming resiliency and Kerberos integration, as well as features and bug fixes for improved stability and debuggability. We have also improved ecosystem integration via projects like Hive on Spark, SparkOnHBase, Crunch on Spark, and Pig on Spark.

Plenty of work has been done, but there is plenty more to do. Our engineering investments in Spark continue to grow. We will continue to drive improvements in areas like security, governance, performance at scale, ecosystem integration, stream processing, machine learning and usability. Be on the lookout for more on these in a subsequent post. 

What makes a comprehensive big data platform? 

Let’s look at the components that are fundamental to a comprehensive big data platform:

  • Data processing engines. We covered these in detail in the sections above.
  • Data storage layer. A storage layer that is reliable, scalable and cost-effective.
  • Data catalog. With multiple ways of accessing and processing the data, it is essential to have a central catalog that has metadata about the organization and layout of the data.
  • Resource management layer. A layer that enables multiple diverse processing engines and workloads on shared big data infrastructure.
  • Streaming data channel. A reliable, low-latency, high-throughput channel for continuous data streams.
  • Unified administration. For easy management and troubleshooting.
  • Comprehensive security and governance. For compliance-ready protection and visibility.

In certain cases, especially for a small organization with narrow data processing needs, you can get away with a platform that provides only a small subset of the above components, for example when running batch Spark jobs on data in S3. However, any organization that has massive volumes of data and wishes to maximize the value derived from its data will need to invest in a comprehensive Hadoop-based big data platform like Cloudera Enterprise. Cloudera Enterprise provides components that satisfy all of the aforementioned requirements, and these components are tightly integrated with each other.

The best part of Hadoop is that it is modular with different pluggable implementations for different components of the platform (as evidenced by the availability of different processing engines). This modularity continues to enable constant innovation in the ecosystem, and lets it evolve as big data needs evolve. 

Spark recently received immense media coverage, rightfully touting it as the next-generation big data engine. This has led some folks to believe that it is a replacement for Hadoop. However, a data processing engine, no matter how powerful or flexible, provides only a subset of the functionality that a comprehensive, enterprise-grade big data platform needs to provide. For success with big data, enterprises will invariably run Spark as an integrated part of their overall Hadoop deployments.

Anand Iyer is a senior product manager at Cloudera, the leading vendor of open source Apache Hadoop. His primary areas of focus are platforms for real-time streaming, Apache Spark and tools to ingest data into the Hadoop platform. Before joining Cloudera, he worked as an engineer at LinkedIn, where he applied machine-learning techniques to improve the relevance and personalization of LinkedIn’s Feed. Anand has extensive experience in leveraging big data platforms to deliver products that delight customers.