Copy directories in S3 using s3-dist-cp

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (60/100)

S3 has no concept of directories, but that does not stop us from using / as a delimiter in object keys and thinking of objects with the same key prefix as files in the same directory.

That causes a problem when we want to copy the content of one directory into another because we cannot just copy files to a different location. We have to replace the source prefix with the target prefix and preserve the rest of every object key.

On a regular file system, when we have these files:

/home/user/the_directory/file_A
/home/user/the_directory/file_B
/home/user/the_directory/file_C
/home/user/the_directory/file_D

and we want to copy them to the home directory of user another_user, the expected result looks like this:

/home/another_user/the_directory/file_A
/home/another_user/the_directory/file_B
/home/another_user/the_directory/file_C
/home/another_user/the_directory/file_D
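For comparison, here is a minimal sketch of that local copy. It assumes a POSIX shell with `mktemp`, `cp`, and `find` available, and it recreates the layout inside a temporary directory (the `root` variable is only for illustration) instead of touching the real /home:

```shell
# Recreate the example layout under a temporary root, not the real /home.
root="$(mktemp -d)"
mkdir -p "$root/home/user/the_directory" "$root/home/another_user"
for name in file_A file_B file_C file_D; do
  : > "$root/home/user/the_directory/$name"   # create an empty file
done

# Copy the directory recursively into the other user's home directory.
# The directory name and the file names are preserved automatically.
cp -r "$root/home/user/the_directory" "$root/home/another_user/"

# List the copied files relative to the temporary root.
(cd "$root" && find home/another_user -type f | sort)
```

On a file system, `cp -r` takes care of the directory name for us. In S3, the same effect requires rewriting the key prefix of every object.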

How do we achieve the same outcome in S3?

We need two things:

  • a running EMR cluster
  • the s3-dist-cp script, which is available on all EMR clusters

Let’s pretend that the above directory structure is also the structure of our S3 keys. For example, we have a file in this location: s3://home/user/the_directory/file_A. In this URI, home is the bucket name, and user/the_directory/file_A is the object key.

First, we have to SSH into the master node of the cluster.
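As a rough sketch, the connection command could be built like this. The key file path and the master node's DNS name below are hypothetical placeholders, and `emr_ssh_command` is a helper name made up for this example. The `hadoop` login is the default user on EMR nodes. The function only prints the command, so the values can be checked before running it:

```shell
# Print the SSH command for the EMR master node instead of running it.
# Arguments: path to the private key file, public DNS name of the master node.
emr_ssh_command() {
  local key_file master_dns
  key_file="$1"
  master_dns="$2"
  printf 'ssh -i %s hadoop@%s\n' "$key_file" "$master_dns"
}

# Hypothetical values; replace them with your own key file and cluster address.
emr_ssh_command "$HOME/.ssh/emr-key.pem" "ec2-203-0-113-10.compute-1.amazonaws.com"
```

This assumes that the cluster was created with an EC2 key pair and that its security group allows SSH access from your machine.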

After that, we run the s3-dist-cp command, passing the source prefix as --src and the target prefix as --dest. The script automatically preserves the part of every object key that follows the source prefix:

s3-dist-cp --src=s3://home/user --dest=s3://home/another_user
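To illustrate which keys we should expect after the copy, here is a small sketch of the prefix replacement. It is not how s3-dist-cp is implemented, and it does not touch S3 at all. The `rewrite_key` function is a made-up helper that only prints the destination URI for a given source URI:

```shell
# Replace the source prefix with the destination prefix and keep the rest
# of the key unchanged. Returns 1 when the key is not under the source prefix.
rewrite_key() {
  local key src dest
  key="$1"
  src="${2%/}"    # drop a trailing slash, if any
  dest="${3%/}"
  case "$key" in
    "$src"/*) printf '%s/%s\n' "$dest" "${key#"$src"/}" ;;
    *) return 1 ;;
  esac
}

# Print the expected destination of every file from the example.
for name in file_A file_B file_C file_D; do
  rewrite_key "s3://home/user/the_directory/$name" \
    "s3://home/user" "s3://home/another_user"
done
```

Comparing this list with the actual content of the destination prefix is a quick way to check that the copy produced the layout we wanted.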




Bartosz Mikulski: data/machine learning engineer, conference speaker, co-founder of Software Craft Poznan & Poznan Scala User Group