AWS Lambda and Event-Driven Computing: The Next Big Thing for Data Pipelines
Two weeks ago, several of us from Localytics attended AWS re:Invent, Amazon’s annual conference. Jon Bass, Mohit Dilawari, and I spent the entire week there representing Localytics.
We learned a ton. We heard about new features, met with product teams, and even participated in the Internet of Things HackDay. We originally wanted to write a blog post recapping everything, but found we had so much to say about AWS Lambda and Event-Driven Computing that it deserved its own post.
Lambda is an important development for AWS. Let’s dive a bit into Lambda and see what it means for event-driven computing and data pipelines.
Event All The Things!
Consider a common problem in AWS: what do you do if you want to listen to change events from any part of your infrastructure and react to those events in a timely manner?
Until recently, your best bet was CloudTrail. CloudTrail dumps AWS API request logs to S3 and posts notifications to SNS. This is pretty good, but CloudTrail isn’t necessarily timely (up to 15 minutes of lag) because you are essentially waiting for logs to roll over and post. Ultimately, if you were doing something like processing images uploaded to S3, you’d be better off having your application code enqueue a message to SQS when it receives the image rather than waiting for CloudTrail.
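For illustration, here is roughly what that do-it-yourself approach looks like in Node.js with the AWS SDK. The queue URL and message shape are placeholders, not anything prescribed by AWS:

```javascript
// A minimal sketch of the do-it-yourself approach: after your app handles
// an image upload, enqueue a message to SQS for downstream workers.
// The queue URL and message fields below are hypothetical.
var AWS = require('aws-sdk');
var sqs = new AWS.SQS({region: 'us-east-1'});

function notifyImageUploaded(bucket, key, callback) {
  sqs.sendMessage({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/image-uploads',
    MessageBody: JSON.stringify({bucket: bucket, key: key})
  }, callback);
}
```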
Wouldn’t it be better if services posted directly to SQS, SNS, or some type of real-time stream like Kinesis? Amazon realized this and announced event notifications for two services during re:Invent: S3 Notifications and DynamoDB Streams.
S3 Notifications post to SQS and SNS upon certain S3 operations, and DynamoDB Streams are effectively operation logs on your DynamoDB tables, exposed in a similar fashion to Kinesis streams. To be clear, DynamoDB Streams are not Kinesis - for example, they provide exactly-once semantics instead of the at-least-once semantics found in Kinesis - but the Kinesis Client Library works against both Kinesis and DynamoDB Streams. Neat. We now have real-time event streams coming from two services, and more services are sure to follow.
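As a concrete sketch, wiring a bucket to post object-created events to an SQS queue looks something like the following with the AWS SDK for Node.js; the bucket name and queue ARN are placeholders:

```javascript
// A minimal sketch: subscribe an SQS queue to object-created events on a
// bucket. The bucket name and queue ARN below are hypothetical.
var AWS = require('aws-sdk');
var s3 = new AWS.S3({region: 'us-east-1'});

s3.putBucketNotificationConfiguration({
  Bucket: 'my-datapoints-bucket',
  NotificationConfiguration: {
    QueueConfigurations: [{
      QueueArn: 'arn:aws:sqs:us-east-1:123456789012:datapoint-uploads',
      Events: ['s3:ObjectCreated:*']
    }]
  }
}, function(err) {
  if (err) console.error('failed to configure notifications:', err);
});
```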
However, with the proliferation of events across various services, new problems arise. Fragmentation is a problem, because events are exposed through different mediums (SQS/SNS vs. Kinesis-compatible streams). Orchestration is another problem, because we still need to build out infrastructure to listen for and process the events.
Or do we? Enter the big reveal: AWS Lambda.
AWS Lambda
In a nutshell, AWS is building the primitives to deploy simple workers that respond to events and trigger other events without having to manage servers or event queues. Here’s Amazon’s description:
“With Lambda, you simply create a Lambda function, give it permission to access specific AWS resources, and then connect the function to your AWS resources. Lambda will automatically run code in response to modifications to objects uploaded to Amazon Simple Storage Service (S3) buckets, messages arriving in Amazon Kinesis streams, or table updates in Amazon DynamoDB.” (Lambda announcement)
A Lambda function is a Node.js application. When an event happens, AWS launches your application and passes in the event information. This happens on a farm of servers managed by AWS, and you get charged for slices of time. Fragmentation and orchestration are none of your concern - Amazon handles all of the plumbing. Official support for more frameworks is on the horizon, but people have already hacked in languages such as Go.
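To make that concrete, here is a minimal handler sketch for an S3-triggered function. The record fields follow the S3 event format; everything else is illustrative:

```javascript
// A minimal Lambda handler sketch. AWS invokes the exported handler with
// the triggering event; for S3, each record carries the bucket and key.
exports.handler = function(event, context) {
  event.Records.forEach(function(record) {
    console.log('object created:',
                record.s3.bucket.name, record.s3.object.key);
  });
  context.succeed('processed ' + event.Records.length + ' records');
};
```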
Lambda isn’t a particularly new concept - databases have had triggers for a long time - but doing this at the cloud level is game-changing. Building multi-service, event-based triggers on your own would have been a massive undertaking - you’d have to assemble SQS, Kinesis, DynamoDB, and EC2 clusters just to do it properly - but now you get it as part of AWS.
As an aside, it’s interesting to see Lambda announced right alongside the EC2 Container Service, because you could view Lambda as an assault on containerization. Containers are often sold as a lightweight way to distribute disposable tasks, but Lambda is even lighter.
Event-Driven Computing and Data Pipelines
With Lambda, the possibilities for data pipelines are endless, and the beauty of it is that you get to focus on your business logic instead of passing around data. You define a directed acyclic graph of small, composable functions, and the rest happens on its own. This is the promise of Event-Driven Computing and Lambda.
For example, imagine the beginnings of a mobile app analytics pipeline like Localytics. First, datapoints are uploaded to S3. That triggers two lambdas: one writes aggregates to DynamoDB, and another sends a push notification back to the app under certain conditions (suppose you want to send promotions to users who take certain actions). Writing the aggregate to DynamoDB can trigger yet another lambda, perhaps one that monitors for thresholds and sends email alerts.
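A hedged sketch of the first lambda in that pipeline - the one bumping aggregates in DynamoDB on each upload - might look like this. The table name, key scheme, and attribute names are all made up for illustration:

```javascript
// A hypothetical sketch: on each S3 upload event, bump a per-app upload
// counter in DynamoDB. Assumes keys look like "appId/datapoints.json";
// the table and attribute names are invented for this example.
var AWS = require('aws-sdk');
var dynamo = new AWS.DynamoDB({region: 'us-east-1'});

exports.handler = function(event, context) {
  var record = event.Records[0];
  var appId = record.s3.object.key.split('/')[0];
  dynamo.updateItem({
    TableName: 'aggregates',
    Key: {AppId: {S: appId}},
    UpdateExpression: 'ADD Uploads :one',
    ExpressionAttributeValues: {':one': {N: '1'}}
  }, function(err) {
    if (err) context.fail(err);
    else context.succeed('aggregated upload for ' + appId);
  });
};
```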
All of this works with S3 Notifications and DynamoDB Streams behind the scenes, but you don’t need to concern yourself with that or with managing clusters of workers.
Another example is managing infrastructure. Suppose you need to manage membership in a load balancer. Today you have to wire up your own discovery service and notification mechanism for when instances come up or go down, which likely means polling EC2 APIs. However, EC2 will have notifications at some point, too, and at that point this scenario becomes trivial with Lambda.
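For contrast, here is roughly what today’s polling approach entails; the tag, load balancer name, and polling interval are hypothetical:

```javascript
// A sketch of the polling approach you'd need today: periodically list
// running instances tagged for a service and register them with a classic
// ELB. The tag and load balancer names below are hypothetical.
var AWS = require('aws-sdk');
var ec2 = new AWS.EC2({region: 'us-east-1'});
var elb = new AWS.ELB({region: 'us-east-1'});

function syncLoadBalancer() {
  ec2.describeInstances({
    Filters: [
      {Name: 'tag:Service', Values: ['web']},
      {Name: 'instance-state-name', Values: ['running']}
    ]
  }, function(err, data) {
    if (err) return console.error(err);
    var instances = [];
    data.Reservations.forEach(function(r) {
      r.Instances.forEach(function(i) {
        instances.push({InstanceId: i.InstanceId});
      });
    });
    elb.registerInstancesWithLoadBalancer({
      LoadBalancerName: 'web-lb',
      Instances: instances
    }, function(err) {
      if (err) console.error(err);
    });
  });
}

setInterval(syncLoadBalancer, 60 * 1000); // poll every minute
```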
What’s Next
We’re at the beginning of something very interesting happening in AWS. If you want to learn more, there’s a Getting Started with AWS Lambda presentation, as well as a presentation titled Event-Driven Computing on Change Logs that describes Amazon’s vision.
Localytics is exploring advanced use cases. We’ll provide follow-up blog posts as we make progress. Want to help? We're hiring.