How to use AWS Batch to run a Python script

How do you run a Python script as a part of your data pipeline? We can use the PythonOperator in Airflow to run the script. It is a decent solution, but it blocks an Airflow worker for the whole duration of the script. What if we want to run the Python script in an external system?
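For comparison, a minimal PythonOperator task could look like the sketch below (the function body and the dag variable are placeholders, not part of the original pipeline). Whatever the function does runs inside the Airflow worker process:

from airflow.operators.python_operator import PythonOperator

def run_my_script():
    # the script logic would run here, inside the Airflow worker process
    pass

run_script = PythonOperator(
    task_id='run_script',
    python_callable=run_my_script,
    dag=dag
)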

I have seen projects where such scripts ran on the master node of an EMR cluster. That is not a terrible idea either. The coordinating node in a cluster usually has little to do anyway, so we can use it to run additional code.

However, the best solution is to use a specialized environment to run the Python scripts, for example, AWS Batch.

Packing the script as a Docker image

To run a Python script in AWS Batch, we have to generate a Docker image that contains the script and the entire runtime environment.

Let’s assume that I have my script in the main.py file inside a separate directory, which also contains the requirements.txt file. To generate a Docker image, I have to add a Dockerfile:

FROM python:3.8

WORKDIR /script

COPY requirements.txt .

RUN pip install -r requirements.txt

COPY main.py .

ENTRYPOINT ["python", "main.py"]

In a deployment script, I have to build the Docker image:

docker build -t docker_image_name .
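Before pushing the image anywhere, it may be worth running the container locally to check that the entrypoint works. A quick smoke test could look like this (the argument is hypothetical and depends on what main.py expects):

docker run --rm docker_image_name --some-parameter some-value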

Now, I can use the Amazon Elastic Container Registry (ECR) to store the generated Docker image. Before I push the image, I have to log in to the container registry and tag the image. Note that you must replace the <container registry id> placeholder with the identifier of your registry (it is your AWS account id):

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <container registry id>.dkr.ecr.us-east-1.amazonaws.com

docker tag docker_image_name:latest <container registry id>.dkr.ecr.us-east-1.amazonaws.com/docker_image_name:latest

In the end, I can push the image to the repository:

docker push <container registry id>.dkr.ecr.us-east-1.amazonaws.com/docker_image_name:latest
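Note that the push succeeds only if a repository with that name already exists in ECR. If it does not, I can create it with a single AWS CLI call:

aws ecr create-repository --repository-name docker_image_name --region us-east-1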


Configuring an AWS Batch job

To configure the AWS Batch job, I will use Terraform. In the first step, I have to define the compute environment:

resource "aws_batch_compute_environment" "python_comp_env" {
  compute_environment_name = "python_comp_env"

  compute_resources {
    instance_role = "arn_of_an_instance_role"
    allocation_strategy = "BEST_FIT"
    instance_type = [
      "optimal"
    ]

    max_vcpus     = 4
    min_vcpus     = 0

    security_group_ids = [
      aws_security_group.some_security_group.id,
    ]

    subnets = [
      aws_subnet.some_subnet.id,
    ]

    type = "EC2"
  }

  service_role = "arn_of_aws_batch_service_role"
  type         = "MANAGED"
}
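The instance_role placeholder above expects the ARN of an EC2 instance profile, and service_role expects a role that the AWS Batch service can assume. If you want Terraform to manage them as well, a minimal sketch could look like this (the resource names are arbitrary):

resource "aws_iam_role" "ecs_instance_role" {
  name = "batch_ecs_instance_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Action": "sts:AssumeRole",
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" }
  }]
}
EOF
}

resource "aws_iam_role_policy_attachment" "ecs_instance_role" {
  role       = aws_iam_role.ecs_instance_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs_instance_profile" {
  name = "batch_ecs_instance_profile"
  role = aws_iam_role.ecs_instance_role.name
}

resource "aws_iam_role" "aws_batch_service_role" {
  name = "aws_batch_service_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Action": "sts:AssumeRole",
    "Effect": "Allow",
    "Principal": { "Service": "batch.amazonaws.com" }
  }]
}
EOF
}

resource "aws_iam_role_policy_attachment" "aws_batch_service_role" {
  role       = aws_iam_role.aws_batch_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"
}

With these resources in place, instance_role becomes aws_iam_instance_profile.ecs_instance_profile.arn and service_role becomes aws_iam_role.aws_batch_service_role.arn.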

To schedule jobs, we’ll add them to a job queue, so in the next step, we must define the queue:

resource "aws_batch_job_queue" "python_scripts_queue" {
  name                 = "python_scripts_queue"
  state                = "ENABLED"
  priority             = 1
  compute_environments = [aws_batch_compute_environment.python_comp_env.arn]
}

Finally, we create a job definition to tell AWS Batch which Docker image to run in the compute environment:

resource "aws_batch_job_definition" "the_script_to_run" {
  name = "the_script_to_run"
  type = "container"

  container_properties = <<CONTAINER_PROPERTIES
{
    "command": [],
    "image": "<container registry id>.dkr.ecr.us-east-1.amazonaws.com/docker_image_name:latest",
    "memory": 2048,
    "vcpus": 2,
    "jobRoleArn": "arn_role_with_required_permissions",
    "volumes": [],
    "environment": [],
    "mountPoints": [],
    "ulimits": []
}
CONTAINER_PROPERTIES
}
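With the queue and the job definition in place, I can submit a test job from the command line before wiring it into Airflow. The names below match the Terraform resources; the command override is optional, and its values are passed as arguments to the container entrypoint (the example arguments are hypothetical):

aws batch submit-job \
    --job-name test_run \
    --job-queue python_scripts_queue \
    --job-definition the_script_to_run \
    --container-overrides '{"command": ["--some-parameter", "some-value"]}'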

Running an AWS Batch job in Airflow

When the Docker image, the compute environment, the job queue, and the job definition are ready, I can add the AWS Batch job to an Airflow pipeline using a built-in operator:

from airflow.contrib.operators.awsbatch_operator import AWSBatchOperator

batch_params = ["",""]

run_aws_batch = AWSBatchOperator(
    task_id='run_aws_batch',
    aws_conn_id='aws_conn_id',
    job_name='aws_batch_job_name',
    job_definition='aws_batch_job_definition',
    job_queue='batch_job_queue',
    overrides={'command': batch_params},
    dag=dag
)
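The command override replaces the arguments of the container entrypoint, so the values in batch_params end up as command-line arguments of main.py. Note that the operator only submits the job and monitors its status; the script itself runs in the Batch compute environment, not on the Airflow worker.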
