Running large-scale ETL Jobs without an army of developers behind you


ETL — or Extract, Transform, Load — is a common pattern for processing incoming data. It makes efficient use of resources by batching the “transform” into a single bulk operation, often making it far easier to develop and maintain than its stream-processing counterpart. It also lends itself well to one-off investigations into datasets, where the user writes custom code to perform some analysis over a dataset and exports the results for later use. This pattern underpins many data science explorations, but in my experience, implementing it is often clunky and inefficient. In this article, I’m going to outline a new technology that I believe could be a valuable tool in the data scientist’s (and developer’s) arsenal.

First, a bit of background on me. I’m a software developer specializing in AWS data science infrastructure, with a Master’s degree in Software Engineering. I’d like to think this makes me no slouch when it comes to writing and deploying code to run remotely, but I have still found running large-scale ML/data jobs to be challenging at best. Most of the big technologies in this space (Spark, for example) are so complex that whole teams of developers are required to maintain the infrastructure, leaving them totally out of reach for even mid-sized companies.

Let’s explore some single-user solutions which don’t require a small army of developers to maintain, focusing on these evaluation points:

  • Speed of deployment for iterative development
  • Cost
  • Infrastructure requirements
  • Ease of integrating with software development practices (code linting, version control, build systems)

One such technology is AWS Glue ETL Jobs, which aims to make it easier to run transform functions by hiding the underlying infrastructure, with the user just providing the codebase. While Glue works well once configured, it can be intimidating and frustrating for even the most experienced developers. Its unintuitive UIs and poor documentation make it difficult to learn and even harder to debug; I’ve found that significant knowledge of Spark (the technology underlying Glue ETL) is a must for effective usage.

Figure 1: An example of an error from AWS Glue, giving almost no clues!

It’s not all bad: once you get to grips with Glue, it can become very powerful. The DynamicFrame library (which wraps Spark DataFrames) is awesome for reading and writing data to Glue Tables; with one line of code (a few more if you insist on PEP 8 compliance…) you can write a nearly infinite amount of data to a partitioned table — something which still boggles my mind! This makes Glue a great tool if you have large-scale datasets that can be imported using DynamicFrame, but inefficient at best for anything less.
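
To give a flavour of that one-liner, here’s a rough sketch of reading from and writing to catalog tables inside a Glue ETL script. The database, table, bucket, and partition key names are made up for illustration:

# Inside a Glue ETL script: read a catalog table into a DynamicFrame
# and write it back out to a partitioned table backed by S3.
# Database/table/bucket names are hypothetical.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read straight from a Glue catalog table...
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_data", table_name="events"
)

# ...and write it back out, partitioned, in a single call.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/transformed/", "partitionKeys": ["date"]},
    format="parquet",
)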

As a quick example of a common pattern: I spend a lot of my time creating ETL jobs that generate new datasets from raw data stored in S3, with the transformed data written back into S3 for later analysis. Sounds simple, right?

Assuming I have my code locally on my machine and I want to create a Glue Job to run it, I would have to do the following (steps 1, 4, and 6 are sketched in the boto3 snippet after this list):

  1. Upload my script to S3 via the AWS CLI (Let’s hope it works the first time…)
  2. Go to the AWS Console
  3. Go to the Glue page
  4. Create a new Job
  5. Configure an IAM role for the job to run with the relevant permissions
  6. Input the location of your script in S3
  7. Set up an ENI if required to ensure data access across VPCs
  8. Include the zip dependencies for any required libraries — (No C extensions though, so good luck using pandas!)
  9. Add a schedule
  10. Run the Job
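
For what it’s worth, steps 1, 4, and 6 can at least be scripted. Here’s a minimal sketch with boto3, where the bucket, script, and role names are placeholders and the IAM role (step 5) is assumed to already exist:

import boto3

# Step 1: upload the script to S3 (bucket and key are hypothetical)
s3 = boto3.client("s3")
s3.upload_file("etl_script.py", "my-glue-bucket", "scripts/etl_script.py")

# Steps 4 and 6: create the Job pointing at that script, running as a
# pre-existing IAM role (role name is hypothetical)
glue = boto3.client("glue")
glue.create_job(
    Name="my-etl-job",
    Role="MyGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-bucket/scripts/etl_script.py",
        "PythonVersion": "3",
    },
)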

And this doesn’t even go into how to develop the script — I mean, just look at the instructions for attaching a development endpoint to Pycharm! (https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-pycharm.html)

None of this sets Glue up as a good tool for day-to-day data science/machine learning exploration. To quickly summarise against the evaluation points defined above, Glue ETL is:

  • Difficult to quickly iterate on new script versions due to a lack of good local testing
  • Very hard to debug effectively due to Spark logging
  • Challenging to integrate with Software Engineering processes, due to remote code editing and clunky deployment
  • Easy to integrate with S3 and Glue Tables

So what other options are there? Let’s introduce our contender: Metaflow!

Metaflow is a fairly new tool, making its entrance in late 2019. It’s designed and built by Netflix as an open-source tool that gives power back to data scientists, reducing complexity and increasing iteration speed. It comprises a Python/R client library that orchestrates the user’s workflows, and a set of AWS resources (in the form of a CloudFormation stack) that need to be deployed by the user. So, enough teasing: let’s explore why I’m now using Metaflow for my Python ETL jobs.

Developing new code on Metaflow is a breeze

Metaflow adds some new decorators to provide its functionality, but the base of the orchestration uses established Python features: classes and methods (as a “classically trained” software engineer, this fills me with great happiness). A Flow is a class, and the steps within that Flow are methods, each of which links to the next step to execute. A Flow or step can then have decorators applied to it to add further configuration.

# An example of a Metaflow Flow running locally
from metaflow import FlowSpec, step

class TestFlow(FlowSpec):

    @step
    def start(self):
        print("This is the start step!")
        self.next(self.process)  # Runs the process step next

    @step
    def process(self):
        print("This is the process step!")
        self.next(self.end)  # Then the end step

    @step
    def end(self):
        print("This is the end step!")

if __name__ == '__main__':
    TestFlow()  # This initialises the Flow, then runs the start step

I don’t think I’ve ever found a tool quite as seamless as Metaflow for switching up the remote deployment of code. With one decorator on my steps, I can go from running my Flow on my laptop to a 64-core behemoth on EC2; it’s that simple.

# Same Flow as before, but it will run in the cloud!
from metaflow import FlowSpec, batch, step

class TestFlow(FlowSpec):

    @batch(cpu=64, memory=2000)  # Each step gets these resources
    @step
    def start(self):
        print("This is the start step!")
        self.next(self.process)

    @batch(cpu=64, memory=2000)
    @step
    def process(self):
        print("This is the process step!")
        self.next(self.end)

    @batch(cpu=64, memory=2000)
    @step
    def end(self):
        print("This is the end step!")

if __name__ == '__main__':
    TestFlow()
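
Incidentally, if you’d rather not touch the code at all, the same effect can be had from the command line: running the unmodified Flow with python test_flow.py run --with batch:cpu=64,memory=2000 applies the decorator to every step for that run (assuming you saved the Flow as test_flow.py).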

Metaflow is easy to manage

This one may be a little hard to swallow if you’re not familiar with AWS, but bear with me. Metaflow provides a CloudFormation stack to deploy its infrastructure. This is the slimmest form of the infrastructure, providing all the parts required to use the platform. The single stack is easy to deploy and utilize yourself, and costs only around $40 a month to maintain. This is also Metaflow’s biggest downside, however: it requires some knowledge of AWS to maintain at scale. Still, bringing up and tearing down a single stack is trivial and something I’d expect most people to be able to do.
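
For the curious, deploying the stack is a single API call once you have the template. Here’s a sketch assuming you’ve downloaded it to a local file (the filename is hypothetical; Netflix publishes the template in its metaflow-tools repository):

import boto3

# Deploy the Metaflow CloudFormation stack from a local copy of the
# template. CAPABILITY_IAM is needed because the stack creates IAM roles.
cfn = boto3.client("cloudformation")
with open("metaflow-cfn-template.yml") as f:
    cfn.create_stack(
        StackName="metaflow",
        TemplateBody=f.read(),
        Capabilities=["CAPABILITY_IAM"],
    )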

Data can be stored directly into Metaflow

Another great feature of Metaflow is its metadata storage, allowing different steps in a Flow to share data and persisting that data to S3 once the execution finishes. In my earlier example of creating an ETL pipeline, I wouldn’t need to store the data in S3 myself, as Metaflow would do this for me. When I come back to perform my analysis, I can just fetch the latest version of my Flow and extract the dataset. This makes Metaflow very powerful for both ad-hoc and scheduled data “rollups” without having to manage complex databases.

# A portion of a Flow showing the metadata storage and retrieval
@step
def start(self):
    print("This is the start step!")
    self.message = "Metaflow metadata is cool!"  # Saved to S3
    self.next(self.process)

@step
def process(self):  # self.message is loaded from S3 automatically
    print(f"Let's print the previous step's message: {self.message}")
    self.next(self.end)
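
And getting the data back out later — from a notebook, say, or another script — is just as short. A sketch using Metaflow’s client API:

# Retrieve the stored artifact from the latest successful run of the
# Flow, using the Metaflow client API.
from metaflow import Flow

run = Flow("TestFlow").latest_successful_run
print(run.data.message)  # "Metaflow metadata is cool!"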

Excluding the infrastructure cost, running jobs on Metaflow is cheap

As Metaflow uses AWS Batch under the hood, you can configure it to use Spot pricing. This makes instance cost virtually negligible, and as Metaflow can handle retries automatically, there is little risk posed by Spot instances going offline. With this, I ran a 64-core Flow for just under an hour for $1.10 in compute, a stark contrast to a 16-DPU Glue job, which at Glue’s roughly $0.44 per DPU-hour comes to around $7 for the same hour.

Figure 2: A snapshot of the spot fleet pricing for running a 64 core job.
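
You don’t normally set this up by hand, since the Metaflow stack creates the Batch compute environment for you, but here is a sketch of the one setting that matters (boto3, with hypothetical ARNs and subnet IDs):

import boto3

# A minimal Batch compute environment configured for Spot capacity.
# All ARNs and the subnet ID below are hypothetical placeholders.
batch = boto3.client("batch")
batch.create_compute_environment(
    computeEnvironmentName="metaflow-spot",
    type="MANAGED",
    serviceRole="arn:aws:iam::123456789012:role/BatchServiceRole",
    computeResources={
        "type": "SPOT",        # the key line: bid on Spot capacity
        "bidPercentage": 100,  # pay at most the On-Demand price
        "minvCpus": 0,
        "maxvCpus": 64,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-aaaa1111"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
    },
)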

Metaflow Flows can be easily scheduled

While Metaflow is great for developing workflows, how is it for managing production builds? Unsurprisingly, great. Adding the @schedule decorator to your Flow class allows a representation of it to be pushed to Step Functions, where it can be triggered by any AWS EventBridge schedule or rule. This opens up many possibilities. Want to trigger it when a file gets written to S3? Easy: just attach the Step Functions start event to the rule and you’re away.

# Will run every hour once pushed to Step Functions
from metaflow import FlowSpec, schedule, step

@schedule(hourly=True)
class TestFlow(FlowSpec):

    @step
    def start(self):
        print("This is the start step!")
        self.next(self.process)

    @step
    def process(self):
        print("This is the process step!")
        self.next(self.end)

    @step
    def end(self):
        print("This is the end step!")

if __name__ == '__main__':
    TestFlow()
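
One caveat worth knowing: the schedule only takes effect once the Flow has been exported, which is done with Metaflow’s Step Functions integration; python test_flow.py step-functions create pushes the state machine and its EventBridge schedule up to AWS (again assuming the Flow lives in test_flow.py).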

In conclusion, Metaflow has proven to be a great, robust tool for me and my workflows. It has some downsides, such as having to manage infrastructure — but its wealth of features (many of which I have not covered here) makes it an excellent tool for those who cannot afford large-scale data platforms. In future blog posts, I will explore options for mitigating these downsides and take a deeper dive into utilizing Metaflow’s features.

