How to retrieve the table descriptions from Glue Data Catalog using boto3
It is not a common use-case, but occasionally we need to create a page or a document that contains the description of the Athena tables we have. It is relatively easy to do if we have written comments in the create external table
statements while creating them because those comments can be retrieved using the boto3 client.
In this article, I am going to show you how to do it.
First, we have to create a glue client using the following statement:
1
2
3
4
5
6
import boto3
glue_client = boto3.client('glue',
region_name=region_name,
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key)
To retrieve the tables, we need to know the database name:
1
glue_tables = glue_client.get_tables(DatabaseName=db_name, MaxResults=1000)
Would you like to help fight youth unemployment while getting mentoring experience?
Develhope is looking for tutors (part-time, freelancers) for their upcoming Data Engineer Courses.
The role of a tutor is to be the point of contact for students, guiding them throughout the 6-month learning program. The mentor supports learners through 1:1 meetings, giving feedback on assignments, and responding to messages in Discord channels—no live teaching sessions.
Expected availability: 15h/week. You can schedule the 1:1 sessions whenever you want, but the sessions must happen between 9 - 18 (9 am - 6 pm) CEST Monday-Friday.
Check out their job description.
(free advertisement, no affiliate links)
Now, we can iterate over the tables and retrieve the data such as the column names, types, and the comments added when the table was created:
1
2
3
4
5
for table in glue_tables['TableList']:
for column in table['StorageDescriptor']['Columns']:
column_name = column['Name']
comment = column.get('Comment', '')
column_type = column['Type']
We have to remember that the code above does not return the columns used for data partitioning. To get the partition keys, we need the following code:
1
2
3
4
5
for table in glue_tables['TableList']:
for partition_key in table.get('PartitionKeys', []):
column_name = partition_key['Name']
comment = partition_key.get('Comment', '')
column_type = partition_key['Type']
You may also like
- How to select a random sample of rows using Athena
- Multimodel deployment in Sagemaker Endpoints
- How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3
- What is s3:TestEvent, and why does it break my event processing?
- How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML