Building trustworthy data pipelines because AI cannot learn from dirty data
PySpark-Check - data quality validation for PySpark 3.0.0
Last week, I was testing whether we can use AWS Deequ for data quality validation. I ran into a few problems. First of all, it...
06 Jul 2020
The problem with software testing in data engineering
What if we found a bug in our data pipelines? What if that bug were easy to fix, but it would require a lot of...
15 Jun 2020
Data flow - what functional programming and Unix philosophy can teach us about data streaming
What does stream processing have in common with functional programming and Unix?
04 May 2020
Four books to boost your programmer career
I quit my dream job because of a book
06 Jan 2020
Definition of done for data engineers
When can data engineers be sure that they have done the task?
14 Jan 2021
Don't learn another programming language
Should you learn a new programming language this year?
07 Jan 2021
How to read from a SQL table in PySpark using a query instead of specifying a table
Fetching data using a SQL query in PySpark
01 Jan 2021
How to restart a stuck Airflow DAG
What to do when an Airflow DAG gets stuck and does not want to run
31 Dec 2020
Why does the DayOfWeekSensor exist in Airflow?
How to make an Airflow DAG wait until a specified day of the week
30 Dec 2020
Send SMS from an Airflow DAG using AWS SNS
How to configure an SNS subscription to send SMS messages and use Airflow to send them
29 Dec 2020
How to emulate temporary tables in Athena
Use CTAS to create a temporary table in Athena
28 Dec 2020
How to enable S3 bucket versioning using Terraform
How to configure S3 bucket versioning in Terraform
27 Dec 2020
How to get a notification when a new file is uploaded to an S3 bucket
Get a Slack notification when a file is uploaded to an S3 bucket
26 Dec 2020
Get an XCom value in the Airflow on_failure_callback function
How to get the task instance in the on_failure_callback to get access to XCom
25 Dec 2020
Add the row insertion time to a MySQL table
Automatically add the insertion and update time in MySQL
24 Dec 2020
Best practices for partitioning data in S3 by date
How to partition data in S3 by date in a way that makes your life easier
23 Dec 2020