How to add an EMR step from AWS Lambda

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (50/100)

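This short tutorial shows how to configure and add a new EMR step using Python running in AWS Lambda. Because the code runs in AWS Lambda, we don’t have to configure the AWS client: boto3 takes the credentials and the region from the function's execution role and environment. The execution role must be allowed to call elasticmapreduce:AddJobFlowSteps. We can just import boto3 and use it to get the EMR client: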

import boto3

emr = boto3.client("emr")

After that, we have to define the EMR step. For example, if I want to run a Scala Spark job, I have to call the spark-submit script:

step_args = 'spark-submit --master yarn --deploy-mode client --class class_name --executor-memory 32G --driver-memory 8G'
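The EMR API takes the arguments of a step as a list of strings, not as a single command line, so the string has to be split at some point. A minimal sketch of one way to do it, assuming that shell-style splitting with the standard shlex module is good enough for the command (the args_list name is only illustrative):

```python
import shlex

step_args = (
    "spark-submit --master yarn --deploy-mode client "
    "--class class_name --executor-memory 32G --driver-memory 8G"
)

# shlex.split follows shell quoting rules, so a quoted argument
# that contains spaces stays in one piece
args_list = shlex.split(step_args)
print(args_list)
```

A plain str.split() gives the same result here. The difference shows up only when an argument contains quoted spaces.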

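Now we have to turn the arguments into a step definition and add it to the EMR cluster. JobDefinition is a helper class whose code is not shown here (it is not part of boto3), cluster_id holds the ID of the target cluster, and the return statement means that the code runs inside the Lambda handler function:

step = JobDefinition._prepare_step_dict("step_name", step_args=step_args)
return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])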
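If the helper class is not available, the step can be written as a plain dictionary in the format that add_job_flow_steps accepts. The sketch below is not the code of the helper used above. It assumes that the cluster runs EMR 4.x or later (where command-runner.jar can launch spark-submit), that the cluster ID arrives in a hypothetical "cluster_id" field of the Lambda event, and that a failed step should not terminate the cluster. The build_spark_step function and the optional emr parameter are illustrative names added to make the handler easy to test.

```python
import shlex

STEP_ARGS = (
    "spark-submit --master yarn --deploy-mode client "
    "--class class_name --executor-memory 32G --driver-memory 8G"
)


def build_spark_step(name, step_args):
    """Build one entry for the Steps parameter of add_job_flow_steps."""
    return {
        "Name": name,
        # Assumption: keep the cluster running when the step fails
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar executes commands such as spark-submit
            "Jar": "command-runner.jar",
            # The API expects a list of strings, not one command line
            "Args": shlex.split(step_args),
        },
    }


def lambda_handler(event, context, emr=None):
    # Hypothetical event shape: {"cluster_id": "j-..."}
    if emr is None:
        import boto3  # included in the AWS Lambda Python runtime

        emr = boto3.client("emr")
    step = build_spark_step("step_name", STEP_ARGS)
    response = emr.add_job_flow_steps(JobFlowId=event["cluster_id"], Steps=[step])
    # The response contains the IDs of the newly added steps
    return response["StepIds"]
```

Lambda calls the handler with the event and the context only, so the real EMR client gets created in production. A test can pass a stub as the third argument instead of calling AWS.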

Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group