Career Coaching for Data Professionals
Building trustworthy data pipelines because AI cannot learn from dirty data
How to find and terminate an idle Redshift session
How to find the idle session that is blocking the connection pool in Redshift
14 Oct 2020
How to configure Spark to maximize resource usage while using AWS EMR
How to configure EMR to use all available resources when running a Spark cluster
13 Oct 2020
How to use AWSAthenaOperator in Airflow to verify that a DAG finished successfully
How to check that an AWS Athena table contains data after running an Airflow DAG
12 Oct 2020
How to start an AWS Glue Crawler to refresh Athena tables using boto3
How to create and start an AWS Glue Crawler from Python code using boto3
11 Oct 2020
How to retrieve the table descriptions from Glue Data Catalog using boto3
How to get the comments from the create table statements when the metadata is stored in the Glue Data Catalog
10 Oct 2020
How to perform a batch write to DynamoDB using boto3
How to write multiple DynamoDB objects at once using boto3
09 Oct 2020
How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3
How to upload S3 data into RDS tables
08 Oct 2020
How to concatenate multiple MySQL rows into a single field?
How to concatenate multiple rows into a string in MySQL
07 Oct 2020
How to get an array/bag of elements from the Hive group by operator?
How to get an array of elements from one column when grouping by another column in Hive
06 Oct 2020
Working with dates and time in Apache Spark
How to get relative dates (yesterday, tomorrow) in Apache Spark, and how to calculate the difference between two dates
05 Oct 2020
How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive
How to use the saveAsTable function to create a partitioned table
04 Oct 2020
When to cache an Apache Spark DataFrame?
Should we cache everything in Apache Spark, or are there rules to follow?
03 Oct 2020