How to index data in Redshift

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (83/100)

There are no indexes in Redshift. At least not the ones we use in relational databases. It is so because Redshift uses columnar storage, standard row indexing techniques are not applicable.

Instead of that, we can define a sort key. There are, however, a few issues related to sort keys. First of all, we can have only one sort key per table. Such a limitation exists because Redshift will use that sort key to order the data stored in the underlying files, and, of course, there can be only one way to order it.

The second issue is the fact that we have to define the sort key at the time when we create the table. Because the sort key is a technical concept that influences even the order of data on the hard drives, we provide it as a part of the table definition.

To determine which column(s) will be the best to use as the sort key, take a look at the Redshift best practices:

Note that it is possible to change the SORT KEY later. Of course, that requires reorganizing the underlying data files, but that technical detail is handled automatically by AWS. For us, the essential information is that:

  • The sort key can change.
  • There is always only one sort key (however, we can use a compound key with multiple columns).
  • It is used to order the data when storing it on the hard drives, so it has a tremendous influence on query performance.

Subscribe to the newsletter and join the free email course.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.