Building trustworthy data pipelines because AI cannot learn from dirty data
How to retrieve the table descriptions from Glue Data Catalog using boto3
How to get the comments from the create table statements when the metadata is stored in the Glue Data Catalog
10 Oct 2020
How to perform a batch write to DynamoDB using boto3
How to write multiple DynamoDB objects at once using boto3
09 Oct 2020
How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3
How to upload S3 data into RDS tables
08 Oct 2020
How to concatenate multiple MySQL rows into a single field?
How to concatenate multiple rows into a string in MySQL
07 Oct 2020
How to get an array/bag of elements from the Hive group by operator?
How to get an array of elements from one column when grouping by another column in Hive
06 Oct 2020
Working with dates and time in Apache Spark
How to get relative dates (yesterday, tomorrow) in Apache Spark, and how to calculate the difference between two dates
05 Oct 2020
How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive
How to use the saveAsTable function to create a partitioned table
04 Oct 2020
When to cache an Apache Spark DataFrame?
Should we cache everything in Apache Spark or are there any rules?
03 Oct 2020
How to flatten a struct in a Spark DataFrame?
How to convert struct fields into separate columns
02 Oct 2020
What is the difference between CUBE and ROLLUP, and how to use them in Apache Spark?
How to use the cube and rollup functions in Apache Spark or PySpark, and how a cube differs from a rollup
01 Oct 2020
How to concatenate columns in a PySpark DataFrame
How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
30 Sep 2020
How to derive multiple columns from a single column in a PySpark DataFrame
Extract multiple columns from a single column using the withColumn function and a PySpark UDF
29 Sep 2020