How to configure both core and spot instances in EMR using Terraform

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (52/100)

In the Terraform configuration, we can use the core_instance_group to define either on-demand or spot instances for the core nodes. When we set the bid_price, we get spot instances. When there is no bid_price, we get on-demand core instances.
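
As a minimal sketch of that difference, the only thing that changes is the bid_price attribute inside the core_instance_group block. The fragment below belongs inside an aws_emr_cluster resource, and the price is a placeholder chosen for illustration, not a recommendation.

```hcl
# Fragment: place this block inside a resource "aws_emr_cluster" definition.
core_instance_group {
  instance_type  = "m5.xlarge"
  instance_count = 2

  # Placeholder maximum price in USD per instance (an assumption for illustration).
  # Removing this line makes the group use On-Demand instances.
  bid_price = "0.10"
}
```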

What do we do when we want both on-demand core instances and spot instances in the same cluster? We cannot have two core_instance_group blocks in the same aws_emr_cluster.

We can solve that problem by defining the core instances in the core_instance_group:

resource "aws_emr_cluster" "emr_name" {
  name = "emr_name"
  release_label = "emr-5.29.0"
  applications = ["Spark"]
  service_role = "EMR_ROLE"
  termination_protection = false
  keep_job_flow_alive_when_no_steps = true

  log_uri = "s3n://logs_bucket/"

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type = "m5.xlarge"
    instance_count = 2

    ebs_config {
      size = 64
      type = "gp2"
      volumes_per_instance = 1
    }
  }

  ec2_attributes {
    instance_profile = "EC2_ROLE"
    key_name = "ssh_key_name"
    subnet_id = aws_subnet.subnet.id
  }

  configurations_json = <<EOF
[
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    }
]
EOF
}

This configuration gives us an EMR cluster with two core instances.



Now, we can add spot instances using a separate aws_emr_instance_group resource, which adds a task instance group to the cluster:

resource "aws_emr_instance_group" "emr_name_spot" {
  cluster_id = aws_emr_cluster.emr_name.id
  instance_type = "m5.2xlarge"
  instance_count = 3

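  # Maximum spot price per instance in USD, passed as a string.
  # According to the provider documentation, a blank value requests On-Demand instances.
  bid_price = ""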

  ebs_config {
    size = 128
    type = "gp2"
    volumes_per_instance = 1
  }
} 

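If we put a value in the bid_price, it is used as the maximum price (in USD) we are willing to pay for each spot instance. According to the Terraform AWS provider documentation, leaving bid_price blank requests On-Demand instances instead, so set a price when you want spot instances.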
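
To make the price explicit, the bid_price can come from a variable. The sketch below is a variant of the aws_emr_instance_group resource above, not an addition to it. The variable name spot_bid_price and its default value are assumptions for illustration.

```hcl
# Hypothetical variable; the name and the default price are assumptions.
variable "spot_bid_price" {
  description = "Maximum price in USD per instance for the spot instance group"
  type        = string
  default     = "0.20"
}

resource "aws_emr_instance_group" "emr_name_spot" {
  cluster_id     = aws_emr_cluster.emr_name.id
  instance_type  = "m5.2xlarge"
  instance_count = 3

  # A non-empty value declares the group as spot instances.
  bid_price = var.spot_bid_price

  ebs_config {
    size                 = 128
    type                 = "gp2"
    volumes_per_instance = 1
  }
}
```

Keeping the price in a variable lets us change it per environment without touching the resource definition.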




Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

