Spark: Whistling Past the Data Platform Graveyard

We are in the Golden Age of Data. For those of us on the front lines, it doesn't feel that way. For every step forward the technology takes, the need for deeper analytics takes two. We're constantly catching up.

Necessity is the mother of invention, and mum's been very busy. In the past decade we have seen one ground-breaking advancement in data processing technology after another, and in 2015 we have arrived at the age of Spark.

Let's take a look at the needs (and innovative responses) that brought us to where we are today.

Before 2006: ETL, the Data Warehouse, and OLAP Cubes

These were the dark days of inflexible batch processes that produced inflexible canned reports. It felt like you were in a full body cast. This was probably for the best, because otherwise you might just hit your head against the wall.

The most common approach to data platforms was to use batch Extract-Transform-Load (ETL) processes to transform incoming data into ready-made chunks that would be bulk-loaded into a data warehouse.

For low-latency queries, data warehouses were complemented with OLAP Cubes. The inflexibility was immense: most data platforms ran on a daily schedule, and simple changes in business logic resulted in weeks, if not months, of cascading technical work.

An OLAP cube is a multidimensional database that is optimized for data warehouse and online analytical processing (OLAP) applications.
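
To make the shape of that era concrete, here is a minimal, hypothetical sketch of the pattern in Scala (used here purely for consistency with the rest of this post): a nightly batch job extracts a raw export, transforms it, pre-aggregates measures along a few fixed dimensions the way a cube build would, and writes a bulk-load file for the warehouse. The file paths and field names are invented for illustration.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source
import scala.util.Try

object NightlyEtl {
  // Hypothetical raw record: timestamp,country,product,revenue
  case class Sale(day: String, country: String, product: String, revenue: Double)

  def main(args: Array[String]): Unit = {
    // Extract: read yesterday's raw export (path is illustrative)
    val raw = Source.fromFile("/data/raw/sales_20150614.csv").getLines()

    // Transform: parse rows, silently dropping anything malformed
    val sales = raw.flatMap { line =>
      line.split(",") match {
        case Array(ts, country, product, revenue) =>
          Try(Sale(ts.take(10), country.trim, product.trim, revenue.toDouble)).toOption
        case _ => None
      }
    }.toList

    // "Cube" step: pre-aggregate revenue along fixed dimensions (day, country, product)
    val cube = sales
      .groupBy(s => (s.day, s.country, s.product))
      .map { case (key, rows) => key -> rows.map(_.revenue).sum }

    // Load: write a file for the warehouse's bulk loader to ingest
    val out = new PrintWriter(new File("/data/warehouse/load/sales_cube.csv"))
    cube.foreach { case ((day, country, product), revenue) =>
      out.println(s"$day,$country,$product,$revenue")
    }
    out.close()
  }
}
```

Every new question meant another pre-aggregation like this one, recomputed on the nightly schedule, which is exactly the inflexibility described above.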

This was a golden age for commercial vendors, yet a difficult time for enterprises managing even small amounts of data. Businesses needed to move quicker, and these bulky processes were slowing them down. Businesses had ad-hoc questions that were beyond the reach of canned reports, and this was the main catalyst for change.

2006-2009: MPP to the rescue

From 2006 to 2009, Massively Parallel Processing (MPP) databases brought scalability and ridiculous speed to the Data Warehouse and made OLAP cubes obsolete, resulting in a consolidation of the stack. MPP vendors such as Greenplum, Netezza, and Vertica rose to dominance, and former industry leaders responded with their own solutions, such as Oracle's Exadata; Teradata had been playing in this space already. Even in the nascent days of MPPs, change was occurring: business needs were evolving and the ugly beast of semi-structured data was rearing its head.

Semi-structured data started flooding data platforms with the rise of NoSQL databases like MongoDB and the increased need to analyze RESTful and SOAP API log and response data. The liberation of developers from strict schemas conflicted directly with the foundation of relational databases.

Companies wanted to analyze these new data sources and the pressure to massage semi-structured and unstructured data into a strict schema put a massive strain on ETL processes.

Beyond this, there was another fundamental problem: companies were accumulating data that they could not fit into a relational data model because they did not yet know how they were going to use it. The restrictiveness of needing a data model a priori meant that truly exploratory analytics to unlock hidden value in data was still nascent.

2010-2012: The Elephant in the Room

Hadoop came onto the scene and gave businesses a place to dump ANY type of data and let proto-data-scientists poke sticks at it, relieving the pressure on MPPs to be everything to everyone.

In 2009, Cloudera released the first version of their Hadoop distribution, and by 2010 Hadoop was starting to become a household name in mainstream enterprises. ETL tools quickly became the losers in this shift, displaced in droves by Hadoop, which could do all of that heavy lifting as well.

The best-practice architecture quickly became Hadoop + MPP, with Hadoop becoming the de facto ETL platform, transforming data to be loaded into MPP databases. Whatever could not be shoved into the MPP database was analyzed in Hadoop, albeit much more slowly, through tools like Hive and Pig.

This was a good point of stability, but business needs were changing again: increased data volumes put massive pressure on the MPPs to load data quickly, and the data with the most value to extract shifted from being the structured data to the semi-structured data that was sitting in Hadoop. This meant that executing analytics directly on Hadoop became critical.

MPP vendors put out "Hadoop connectors" that could pull data from Hadoop into the MPP for processing, but this hurt performance badly because compute needs to be close to the storage it reads from. At the same time there was another shift: the need to analyze streams of data in near real-time.

2012-2014: the rise of Lambda

The solution was starting to become clear: the world needed a system that could take in vast amounts of data and perform batch and streaming operations without flinching. Nathan Marz created the concept of the Lambda architecture based on his work at Twitter.

In Lambda, there is a "batch layer" in Hadoop which does safe, reliable processing of data, and a "speed layer" which does real-time stream processing and does not need to be super-reliable. The Lambda stack is traditionally built with Kafka + Storm as the speed layer and Hadoop as the batch layer.

The stack would process the same data in both layers, the speed layer reacting as soon as the data was created and the batch layer following up with more reliable, hardened processing. The Lambda architecture’s main issues come from its complexity. Jay Kreps did a great job of exploring this in his blog post.
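
To make the complexity concrete, here is a stripped-down, hypothetical sketch of the pattern in Scala. The types, the events-per-app metric, and the in-memory "storage" are invented for illustration; in a real Lambda stack the layers live in separate batch and streaming frameworks rather than one file.

```scala
object LambdaSketch {
  type AppId = String
  case class Event(appId: AppId, payload: String)

  // The business logic. In a real Lambda stack this ends up written twice:
  // once for the batch framework and once for the streaming framework.
  def countPerApp(events: Seq[Event]): Map[AppId, Long] =
    events.groupBy(_.appId).map { case (app, evts) => app -> evts.size.toLong }

  // Batch layer: periodically recompute the view from the full, immutable log.
  def batchView(masterDataset: Seq[Event]): Map[AppId, Long] =
    countPerApp(masterDataset)

  // Speed layer: maintain a view over events seen since the last batch run.
  def speedView(recentEvents: Seq[Event]): Map[AppId, Long] =
    countPerApp(recentEvents)

  // Serving layer: answer queries by merging the reliable batch view
  // with the fresher, less hardened real-time view.
  def query(batch: Map[AppId, Long], speed: Map[AppId, Long], app: AppId): Long =
    batch.getOrElse(app, 0L) + speed.getOrElse(app, 0L)
}
```

Even in this toy, single-file version the seams between the three layers are visible.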

Data engineers needed to implement business logic in two places, with two frameworks, managing two infrastructures, and composing a single view of the data from two sources. The market and community reacted to these shortcomings: Summingbird provided a common API for the speed and batch layers, and Hortonworks included Storm in their Hadoop distribution, unifying infrastructure and management to some degree.

Lambda is very relevant today and has a lot of great properties, but many data engineers kept their eyes fixed on the horizon, looking for changes in technology that would make building a stack feel less like rocket science.

2015: Apache Spark

Today, Spark is dominating mindshare in the data engineering community. Even as an emerging technology, Spark addresses many of the issues discussed in the previous sections: it handles both batch and stream processing on a single engine, and it runs directly on the data already sitting in Hadoop, keeping compute close to storage.
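
To show what that unification looks like, here is a small, hypothetical Spark example in Scala using 1.x-era APIs: the transformation is written once as a plain function over RDDs and reused for a batch run over historical files and a streaming run over live data. The HDFS paths, the socket source, and the comma-separated event format are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnifiedCounts {
  // Business logic written once, as a plain function over RDDs.
  // Assumes one event per line, with the app id as the first comma-separated field.
  def eventsPerApp(lines: RDD[String]): RDD[(String, Long)] =
    lines.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unified-counts"))

    // Batch: run the logic over historical data sitting in HDFS (paths are illustrative).
    eventsPerApp(sc.textFile("hdfs:///events/2015/06/*"))
      .saveAsTextFile("hdfs:///reports/events-per-app/2015-06")

    // Streaming: reuse the exact same function on 30-second micro-batches of live data.
    val ssc  = new StreamingContext(sc, Seconds(30))
    val live = ssc.socketTextStream("event-relay.internal", 9999) // hypothetical source
    live.transform(eventsPerApp _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

One engine, one codebase, one cluster to operate, compared with the parallel Storm and Hadoop deployments the Lambda architecture required.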

Honorable mentions

Apache Tez deserves a mention because it is a framework that overlaps with Spark in its ability to construct a Directed Acyclic Graph (DAG) of processing that it distributes and executes across tiered storage. Tez was developed to plug into existing frameworks that have data engineer-friendly APIs, such as Pig, Hive, and Cascading.

It was not meant to be used directly by data engineers because its APIs are so low-level. As a result, it has not received the same amount of attention in the community, but Hortonworks is responding with the Spark-on-Tez project, which should be exciting to watch.

Last but not least, even though Spark can live outside of Hadoop, the two are intertwined since most of the data Spark will be processing lives in HDFS. The gravity of HDFS is immense: it forms a “data fabric” onto which analytic applications are built, and it cannot be ignored. Spark will need to continue to build upon and improve its Hadoop ecosystem support.
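
As a short sketch of what that gravity looks like in practice (again Scala with Spark 1.x-era APIs; the paths and the events Hive table are hypothetical), a single Spark job can scan raw HDFS files directly and query Hive tables through the shared metastore, slotting into the existing data fabric rather than replacing it:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HdfsGravity {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-gravity"))

    // Read the same HDFS files the rest of the Hadoop stack writes (path is illustrative).
    val rawEvents = sc.textFile("hdfs:///events/2015/06/14/*")
    println(s"raw lines scanned: ${rawEvents.count()}")

    // Query the same Hive tables that existing Hive/Pig jobs and BI tools already use,
    // sharing one metastore and one copy of the data.
    val hive = new HiveContext(sc)
    hive.sql("SELECT app_id, COUNT(*) AS events FROM events GROUP BY app_id LIMIT 10").show()
  }
}
```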

Spark @ Localytics

At Localytics we are continuously innovating how we approach our data platform. We are ingesting 2B data points a day into our stack built with SQS, Scala+Akka, DynamoDB, EMR and MPP databases. As our business has evolved, we needed to introduce another cornerstone piece, and we chose Spark.

Last week we deployed our first Spark-based product, Auto-Profiles, into production; it feeds off of our firehose of data. We’ve been really pleased with Spark thus far, and we are going to build more.

If you are looking for a fun, exciting challenge, come join our team!