Building trustworthy data pipelines because AI cannot learn from dirty data
Data streaming: what is the difference between tumbling and sliding windows?
When you start processing streams of events, there always comes a time to decide on how to group them. We have a few kinds of window functions that we can...
27 Jan 2020
I put a carnivorous plant on the Internet of Things to save its life, and it did not survive
This article is a text version of my talk, "I put a carnivorous plant on the Internet of Things," which I presented during the DataNatives conference (November 25-26, 2019 in...
23 Jan 2020
What are the 4 V's of big data, and which one is the most important?
One of the first models to describe big data was the four Vs model. That definition divides big data into four categories (sometimes called dimensions) of problems: volume, velocity,...
20 Jan 2020
10x software architecture: high cohesion
A few months ago, it was fashionable to complain about the 10x developer myth. I agree that such people don’t exist, but, in my opinion, proper software architecture can transform...
12 Jan 2020
How to add dependencies to AWS Lambda
The process of adding dependencies to an AWS Lambda consists of two steps. First, we have to install the dependencies in the source code directory. Later, we have to package...
08 Jan 2020
Four books to boost your programmer career
I quit my dream job because of a book
06 Jan 2020
What is the difference between a data lake, a data warehouse, and a data mart?
We can easily distinguish between them by focusing on three qualities: data structure (schema), data quality, and ownership.
18 Dec 2019
Three biggest traps to avoid while setting Spark executor memory
What happens when you set the executor memory of a Spark worker that uses YARN as the cluster resource manager? Does it get exactly the amount of memory you requested?...
16 Dec 2019
How to run Airflow DAGs for a specified date in the past?
Have you created a new Airflow DAG, but now you have to run it using every data snapshot created during the last six months? Don’t worry. You don’t need to...
11 Dec 2019
What do you need to know about storing passwords in AWS?
How to use the AWS Secrets Manager
09 Dec 2019
Apache Spark: should we use RDD, Dataset, or DataFrame?
Is there a difference between Dataset and DataFrame? Why do we even have both?
04 Dec 2019
What can a data engineer learn from The Unicorn Project?
Have you ever seen a novel about developers? Reading such a book seems to be a massive waste of time, doesn’t it? After all, the internet is full of stories...
02 Dec 2019