Data Craft - making data engineering trustworthy because AI cannot learn from dirty data
How to train a Reinforcement Learning Agent using TensorFlow Agents
The reinforcement learning loop with TensorFlow Agents
31 Jul 2019
How to deal with underfitting and overfitting in deep learning
The lessons learned from Andrew Ng’s online course
17 Apr 2019
How to split a list inside a DataFrame cell into rows in Pandas
Step by step instructions to "explode" a list into DataFrame rows.
27 Aug 2018
Monte Carlo simulation in Python
How to make business decisions using Monte Carlo simulation
17 Aug 2018
How to use one SparkSession to run all Pytest tests
How to speed up Pytest tests by reusing the same SparkSession in all of them
20 Jul 2020
How to send AWS CloudWatch Alerts to a Slack channel using Terraform
How to use Terraform to configure a CloudWatch alert and send the message to a Slack channel.
13 Jul 2020
PySpark-Check - data quality validation for PySpark 3.0.0
Last week, I was testing whether we can use AWS Deequ for data quality validation. I ran into a few problems. First of all, it was using an outdated version...
06 Jul 2020
Measuring data quality using AWS Deequ
How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
29 Jun 2020
How to conditionally skip tasks in an Airflow DAG
How to use XCom and PythonSensor to skip remaining tasks in an Airflow DAG.
22 Jun 2020
The problem with software testing in data engineering
What if we found a bug in our data pipelines? What if that bug were easy to fix, but it would require a lot of time spent backfilling the data?...
15 Jun 2020
How does Kafka Connect work?
In this article, I am going to describe the internals of Kafka Connect, explain how it uses the Sink and Source Connectors, and how it tracks the offsets of the...
08 Jun 2020
Why my Airflow tasks got stuck in "no_status" and how I fixed it
A story about debugging an Airflow DAG that was not starting tasks
01 Jun 2020
What is Kafka log compaction, and how does it work?
How log compaction is implemented in Apache Kafka and how to configure it properly
22 May 2020
How does a Kafka Cluster work?
What is the difference between a leader and a replica broker? What is the cluster controller? How is the controller elected?
18 May 2020
Athena performance tips explained
How to use query execution plans to speed up Athena queries
11 May 2020
Data flow - what functional programming and Unix philosophy can teach us about data streaming
What does stream processing have in common with functional programming and Unix?
04 May 2020