How to add an EMR step from AWS Lambda

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (50/100)

This short tutorial shows how to configure and add a new EMR step using Python running in AWS Lambda. Because the code runs in AWS Lambda, we don’t have to configure the AWS client: boto3 picks up the credentials and the region from the function's execution role and environment. We can just import boto3 and use it to get the EMR client:

import boto3

emr = boto3.client("emr")

After that, we have to define the EMR step. For example, if I want to run a Scala Spark job, I have to call the spark-submit script:

step_args = 'spark-submit --master yarn --deploy-mode client --class class_name --executor-memory 32G --driver-memory 8G'
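The EMR API takes a step's arguments as a list of strings, not as a single command line. Whether the helper used later in this article does that conversion itself is not shown here, so the following is only a minimal sketch of splitting the string by hand with Python's standard shlex module. The variable name step_args_list is my own choice for illustration.

```python
import shlex

step_args = (
    "spark-submit --master yarn --deploy-mode client "
    "--class class_name --executor-memory 32G --driver-memory 8G"
)

# shlex.split follows shell quoting rules, so a quoted argument
# containing spaces would stay a single list element.
step_args_list = shlex.split(step_args)
```

Splitting with shlex instead of str.split matters only when an argument contains spaces, but it costs nothing to use it from the start.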

Now we have to prepare the step definition using JobDefinition and add it to the EMR cluster:

step = JobDefinition._prepare_step_dict("step_name", step_args=step_args)
return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
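The add_job_flow_steps call returns a response that contains the ids of the newly added steps under the StepIds key. If we want to log or return only the step id, we can wrap the call in a small function. This is a sketch, not the code I run in production: the name submit_step is hypothetical, and the client is passed as a parameter so the function is easy to test without AWS.

```python
def submit_step(emr_client, cluster_id, step):
    """Add one step to a running cluster and return the new step's id."""
    response = emr_client.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[step],
    )
    # The response lists one step id per submitted step, in the same order.
    return response["StepIds"][0]
```

Inside the Lambda function, we would call it as submit_step(boto3.client("emr"), cluster_id, step).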

What is a JobDefinition?

When you open the AWS EMR web interface, you will see EMR clusters with Steps. The JobDefinition API object defines a single step executed by EMR.

The JobFlow is the entire queue of steps running on an EMR cluster. In the EMR API, "job flow" is the older name for a cluster, which is why we pass the cluster id as the JobFlowId. It is the same thing.
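We can see that both names point at the same thing by reading the steps back: the boto3 list_steps call takes the very same id, this time in a parameter called ClusterId. A minimal sketch, assuming a hypothetical helper name step_states and looking only at the first page of results:

```python
def step_states(emr_client, cluster_id):
    """Return (name, state) pairs for the steps of one cluster."""
    # list_steps expects ClusterId; it is the same id that
    # add_job_flow_steps accepts as JobFlowId.
    response = emr_client.list_steps(ClusterId=cluster_id)
    return [
        (step["Name"], step["Status"]["State"])
        for step in response["Steps"]
    ]
```

On a cluster with many steps, the response is paginated, so this sketch would have to follow the Marker value to read them all.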

The _prepare_step_dict function creates the JSON object describing a single step. In this article, it has two arguments: the name and the command to run on the cluster (spark-submit).
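If you don't have such a helper at hand, the step description can be built as a plain dictionary. The function below is my own hypothetical stand-in, not the real _prepare_step_dict. It assumes the command should be run through command-runner.jar and that a failed step should not stop the cluster (ActionOnFailure set to CONTINUE). Adjust both to your needs.

```python
import shlex


def prepare_step_dict(name, step_args, action_on_failure="CONTINUE"):
    """Build a step description accepted by add_job_flow_steps (a sketch)."""
    # Accept either a command string or a ready-made list of arguments.
    if isinstance(step_args, str):
        args = shlex.split(step_args)
    else:
        args = list(step_args)

    return {
        "Name": name,
        # Other allowed values include CANCEL_AND_WAIT and TERMINATE_CLUSTER.
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar runs the given command on the cluster.
            "Jar": "command-runner.jar",
            "Args": args,
        },
    }
```

The returned dictionary can be passed in the Steps list of add_job_flow_steps in the same way as the step variable above.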

Bartosz Mikulski


  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz