Copy directories in S3 using s3-dist-cp

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (60/100)

S3 has no concept of directories, but that does not stop us from using / as a delimiter in object keys and thinking of files with the same key prefix as files in the same directory.

That causes a problem when we want to copy one directory's content into another, because we cannot just copy the files to a different location. We have to replace the leading part of each object key and preserve the rest.

In a regular file system, when we have these files:


and we want to copy them to the home directory of user another_user, the expected result looks like this:


How do we achieve the same outcome in S3?

We need two things:

  • a running EMR cluster
  • the s3-dist-cp script, which is available on all EMR clusters

Let’s pretend that the above directory structure is also the structure of our S3 keys. For example, we have a file in this location: s3://home/user/the_directory/file_A.
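
In that location, home is the bucket name and user/the_directory/file_A is the object key. The copy we want boils down to swapping the key prefix while keeping everything after it. The sketch below only illustrates that mapping: rewrite_prefix is a hypothetical helper written for this example, not part of any AWS tool, and it says nothing about how s3-dist-cp is implemented internally.

```python
def rewrite_prefix(key: str, src_prefix: str, dest_prefix: str) -> str:
    """Swap src_prefix for dest_prefix and keep the rest of the key unchanged."""
    # Normalize to a trailing slash so that "user" does not match "users/...".
    src = src_prefix.rstrip("/") + "/"
    if not key.startswith(src):
        raise ValueError(f"{key!r} is not under the prefix {src_prefix!r}")
    remainder = key[len(src):]
    return dest_prefix.rstrip("/") + "/" + remainder


# The object from the example above, copied to the other user's "directory".
print(rewrite_prefix("user/the_directory/file_A", "user", "another_user"))
# prints: another_user/the_directory/file_A
```

Doing this by hand for every object is exactly the bookkeeping we would like a tool to handle for us.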

First, we have to SSH into the master node of the cluster.

After that, we run the s3-dist-cp command using the source prefix as the source and the target prefix as the destination. The script will automatically preserve the rest of the object keys:

s3-dist-cp --src=s3://home/user --dest=s3://home/another_user
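
If logging in over SSH is not convenient, the same command can also be submitted to the cluster as an EMR step that runs through command-runner.jar. The sketch below uses boto3 and rests on a few assumptions: the step name and the ActionOnFailure value are arbitrary choices for illustration, the cluster id is a placeholder, and AWS credentials and a region are expected to be configured already.

```python
def build_s3distcp_step(src: str, dest: str, name: str = "copy-directory") -> dict:
    """Describe an EMR step that runs s3-dist-cp with the given prefixes."""
    return {
        "Name": name,                    # hypothetical step name
        "ActionOnFailure": "CONTINUE",   # assumption: keep the cluster running on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }


def submit_step(cluster_id: str, step: dict) -> str:
    """Send the step to a running cluster and return the new step id."""
    import boto3  # imported here so the builder above works without boto3 installed

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = build_s3distcp_step("s3://home/user", "s3://home/another_user")
print(step["HadoopJarStep"]["Args"])

# Uncomment and replace the placeholder with a real cluster id to submit the step:
# submit_step("j-XXXXXXXXXXXXX", step)
```

The arguments are the same ones used in the command above, so the result in S3 should be identical; only the way the command reaches the cluster changes.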

Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
