Data Craft - making data pipelines trustworthy
How to train a Reinforcement Learning Agent using TensorFlow Agents
The reinforcement learning loop with TensorFlow Agents
31 Jul 2019
How to deal with underfitting and overfitting in deep learning
The lessons learned from Andrew Ng’s online course
17 Apr 2019
How to split a list inside a DataFrame cell into rows in Pandas
Step by step instructions to "explode" a list into DataFrame rows.
27 Aug 2018
Monte Carlo simulation in Python
How to make business decisions using a Monte Carlo simulation?
17 Aug 2018
Why my Airflow tasks got stuck in "no_status" and how I fixed it
A story about debugging an Airflow DAG that was not starting tasks
01 Jun 2020
What is Kafka log compaction, and how does it work?
How log compaction is implemented in Apache Kafka and how to configure it properly
22 May 2020
How does a Kafka Cluster work?
What is the difference between a leader and a replica broker? What is the cluster controller? How is the controller elected?
18 May 2020
Athena performance tips explained
How to use query execution plans to speed up Athena queries
11 May 2020
Data flow - what functional programming and Unix philosophy can teach us about data streaming
What does stream processing have in common with functional programming and Unix?
04 May 2020
AWS IAM roles and policies explained
In this article, I am going to explain the essential parts of IAM and describe how to grant permissions to your users or to the AWS Lambda functions you write.
06 Apr 2020
How to be happy at work - lessons learned from the "Career Superpowers" book
In this article, I share the lessons I learned from James Whittaker’s book “Career Superpowers: Succeeding on Purpose.”
30 Mar 2020
How to send metrics to AWS CloudWatch from custom Python code
23 Mar 2020
How to unit test PySpark
Recently, I came across an interesting problem: how to speed up the feedback loop while maintaining a PySpark DAG. Of course, I could just run the Spark Job and look...
24 Feb 2020
How to speed up a PySpark job
I had a Spark job that occasionally ran extremely slowly. On a typical day, Spark needed around one hour to finish it, but sometimes it required over four hours....
17 Feb 2020
How does MapReduce work, and how is it similar to Apache Spark?
In this article, I am going to explain the original MapReduce paper “MapReduce: Simplified Data Processing on Large Clusters,” published in 2004 by Jeffrey Dean and Sanjay Ghemawat.
10 Feb 2020
#Papers We Love
Data streaming with Apache Kafka - guide for data engineers
Are you preparing for a data engineer job interview? Here are my answers to job interview questions about data streaming.
03 Feb 2020