How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (16/100)

Today, I am going to show you how to import data from Amazon S3 into a PostgreSQL database running on Amazon RDS.

The first thing we have to do is install the aws_s3 extension in PostgreSQL. To do this, we have to log in as an administrator and run the following statement:

CREATE EXTENSION aws_s3 CASCADE;
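
To confirm that the extension was installed (the CASCADE clause also pulls in the aws_commons helper extension that aws_s3 depends on), we can query the pg_extension catalog:

```sql
-- Both extensions should appear after CREATE EXTENSION aws_s3 CASCADE
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('aws_s3', 'aws_commons');
```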

The RDS service needs read-only access to the S3 bucket we want to import from, and the ability to use the s3Import feature. Let’s use Terraform to define the policies and roles:

resource "aws_iam_policy" "s3-access" {
    name        = "s3-access"
    description = "A policy for S3 access."
    policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "s3import",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::bucket-name",
                "arn:aws:s3:::bucket-name/*"
            ]
        }
    ]
}
EOF
}

Now we have to allow the RDS service to assume the role associated with the S3 access policy:

resource "aws_iam_role" "rds-assume-role" {
    name = "rds"
    assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "rds.amazonaws.com"
      },
      "Effect": "Allow"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "rds-to-s3" {
    role       = aws_iam_role.rds-assume-role.name
    policy_arn = aws_iam_policy.s3-access.arn
}

There is one more thing we need. We have to allow our RDS instance to import data from S3:

resource "aws_db_instance_role_association" "rds_s3import_role" {
  db_instance_identifier = "${aws_db_instance.rds-database.id}"
  feature_name = "s3Import"
  role_arn = "${aws_iam_role.rds-assume-role.arn}"
} 

Finally, we have to add two rules to the security group used by the RDS database. The database needs outbound HTTP and HTTPS access to communicate with S3:

# Add those to the security group used by the RDS database
  egress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    cidr_blocks     = ["0.0.0.0/0"]
  }
  
  egress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    cidr_blocks     = ["0.0.0.0/0"]
  }


Importing the data

Finally, we can use the aws_s3 extension to import the CSV files into the database tables:

SELECT aws_s3.table_import_from_s3(
    'schema.table_name',
    '',
    '(FORMAT CSV)',
    aws_commons.create_s3_uri('bucket-name', 'file_key', 'aws-region-id')
);

Of course, before running that command, we have to create a table whose schema matches the columns of the CSV file.
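
For example, assuming a CSV file with a header row and two columns (the table name, column names, bucket, key, and region below are placeholders), the whole import could look like this:

```sql
-- Hypothetical target table matching the columns of the CSV file
CREATE TABLE import_schema.users (
    user_id   integer,
    user_name text
);

SELECT aws_s3.table_import_from_s3(
    'import_schema.users',                -- target table
    '',                                   -- empty string = import all columns
    '(FORMAT CSV, HEADER true)',          -- COPY options; HEADER skips the first row
    aws_commons.create_s3_uri('bucket-name', 'file_key', 'aws-region-id')
);
```

The second argument is an optional column list; leaving it empty tells the extension to map the CSV columns to the table columns in order.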




Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group