Copy directories in S3 using s3-dist-cp
S3 has no concept of directories, but that does not stop us from putting / as a delimiter in object keys and thinking of files with the same key prefix as files in the same directory.
That causes a problem when we want to copy one directory's content into another, because we cannot just copy the files to a different location. We have to preserve parts of the object keys.
In a regular file system, when we have these files:
/home/user/the_directory/file_A
/home/user/the_directory/file_B
/home/user/the_directory/file_C
/home/user/the_directory/file_D
and we want to copy them to the home directory of user another_user, the expected result looks like this:
/home/another_user/the_directory/file_A
/home/another_user/the_directory/file_B
/home/another_user/the_directory/file_C
/home/another_user/the_directory/file_D
How do we achieve the same outcome in S3?
We need two things:
- a running EMR cluster
- the s3-dist-cp script, which is available on all EMR clusters
Let’s pretend that the above directory structure is also the structure of our S3 keys. For example, we have a file in this location:
First, we have to SSH into the master node of the cluster.
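As a rough sketch of this step, the connection could look like the commands below. The key file path and the host name are hypothetical placeholders, not values from this article. The `hadoop` login is the default user on EMR nodes.

```shell
# Hypothetical values: replace them with your own key pair file
# and the public DNS name of your cluster's master node.
KEY_FILE="$HOME/.ssh/my-emr-key.pem"
MASTER_DNS="ec2-203-0-113-10.compute-1.amazonaws.com"

# Log in as the default EMR user.
ssh -i "$KEY_FILE" "hadoop@$MASTER_DNS"
```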
After that, we run the s3-dist-cp command using the source prefix as the source and the target prefix as the destination. The script will automatically preserve the rest of the object keys:
s3-dist-cp --src=s3://home/user --dest=s3://home/another_user
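To make the "preserve the rest of the object key" idea concrete, here is a small sketch that only manipulates strings and never talks to S3. The `rewrite_key` helper is a hypothetical name made up for this illustration, not part of s3-dist-cp. It swaps the source prefix for the destination prefix and keeps whatever follows it, which is the mapping we expect from the copy.

```shell
# Illustration only: prints the destination key we expect for each source key.
# rewrite_key SRC_PREFIX DEST_PREFIX KEY  (hypothetical helper)
rewrite_key() {
  local src_prefix="$1" dest_prefix="$2" key="$3"
  # Strip the source prefix from the front of the key and keep the remainder.
  local rest="${key#"$src_prefix"}"
  printf '%s%s\n' "$dest_prefix" "$rest"
}

for name in file_A file_B file_C file_D; do
  rewrite_key "s3://home/user" "s3://home/another_user" \
    "s3://home/user/the_directory/$name"
done
```

The loop prints the four keys under `s3://home/another_user/the_directory/`, mirroring the file system example above. Note that the helper does a plain textual prefix swap, so a key that does not start with the source prefix is simply appended to the destination prefix unchanged.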