mikulskibartosz.name
Bartosz Mikulski
Leveraging AI to drive growth and innovation
Category
Software engineering
Query string validation in Fastify
How to validate query parameters using Fastify
Assert object pattern
The easiest way to make your tests more readable and easier to maintain
A "known bug" is still a bug
What does a "known bug" or an update say to our users?
Language is all about nouns
Programmers are afraid of nouns. We often replace them with poorly written descriptions of things.
The cake pattern is a lie
Cake pattern was a terrible idea.
Can we make it more generic?
What can we learn from a horrible mistake made by a programmer who wanted to make the code more generic?
[JUG Thüringen] Effortless Domain-Driven Design - The real Power of Scala
How to use some parts of Domain Driven Design to create maintainable code in Scala?
Prevent accidental deployments on Friday
You feel you should not deploy your code on Fridays but nothing stops you. Can you prevent accidental deployments?
Developers just wanna have fun
Software maintenance is painful because of hype driven development.
Support for old browsers — is it necessary?
Do you think that every web page should support all existing browsers? How about all versions of those browsers?
The beauty of properly used statically typed languages
The real power of programming in Scala is not in mimicking Haskell and overusing monads, but in taking advantage of its type system.
4 reasons why TDD slows you down
It is easy to announce that TDD slows you down, but have you ever wondered why it happens? Is there anything you can do better?
Reversing a binary tree and other great interview questions
We do not like being asked to write an algorithm on a whiteboard during job interviews, but is there a better way?
The importance of documenting things
What happens when someone asks you about your code and you cannot answer because you have no idea how it works? That happened to me… again.
One thing can improve LambdaDays
One thing that can significantly improve LambdaDays.
Can we write cheaper software?
Can we write our web application in a way which saves our company money? Can we make our software cheaper?
Rage against unprofessionalism in software engineering
Today software engineers disappointed another person…
Category
Software craft
Turning greenfield projects into brownfield projects
What happens when the team lacks software craft skills?
Is programming art or science?
On a quest to find the right metaphor
Does a tester break the product?
How does a name influence our attitude?
Programmers love new toys but hate new habits
We talk about toys. We love new buzzwords. We adore things that sound cool. Yes, we do.
Buzzwords, buzzwords everywhere
Do we behave like a child in a toy store?
Extreme ownership and software engineering
What can a software engineer learn from the “Extreme Ownership” book? How can it influence your daily work?
Perpetually dysfunctional software
What happens when we release another beta version? Are users happy or angry? What if the reality is different than we think?
The importance of documenting things
What happens when someone asks you about your code and you cannot answer because you have no idea how it works? That happened to me… again.
Rage against unprofessionalism in software engineering
Today software engineers disappointed another person…
Category
Scala
How to run a single test in SBT
Why does testOnly not work?
Live unit testing with sbt
Can I have the coolest Visual Studio feature in IntelliJ?
Scala structural types with generics
A short example of defining a structural type which matches a generic class
The cake pattern is a lie
Cake pattern was a terrible idea.
[JUG Thüringen] Effortless Domain-Driven Design - The real Power of Scala
How to use some parts of Domain Driven Design to create maintainable code in Scala?
The beauty of properly used statically typed languages
The real power of programming in Scala is not in mimicking Haskell and overusing monads, but in taking advantage of its type system.
Always stop unused Akka actors
Akka actors do not magically disappear when you no longer need them.
A year of Poznan Scala User Group
How can a group created because of a tweet exist for over a year? Have we learned anything in a year? What are we going to do now?
Category
Meetup
JUG Thüringen meetup - retrospective
My opinion about my presentation at a meetup in Erfurt, Germany.
A year of Poznan Scala User Group
How can a group created because of a tweet exist for over a year? Have we learned anything in a year? What are we going to do now?
Category
Conference
Buzzwords, buzzwords everywhere
Do we behave like a child in a toy store?
Scalar 2017
Scalar Conference 2017 — everything I liked
One thing can improve LambdaDays
One thing that can significantly improve LambdaDays.
Category
Recruiting
Reversing a binary tree and other great interview questions
We do not like being asked to write an algorithm on a whiteboard during job interviews, but is there a better way?
Category
Akka
Always stop unused Akka actors
Akka actors do not magically disappear when you no longer need them.
Category
TDD
Test-Driven Development in Python with Pytest
How to set up and use Pytest to test Python code
Why should you practice TDD?
What are the benefits of TDD for programmers and companies that hire them?
Testing legacy data pipelines
Do you struggle with maintaining your legacy data pipelines? Check out our article on how to add tests and refactor your code while working with legacy data pipelines.
How to teach your team to write automated tests?
How to teach writing automated tests: TDD, BDD, and other techniques
How to learn TDD
Learning Test-Driven Development is hard and there is nothing we can do about it
Test-driven development in Jupyter Notebook
TDD for data scientists working with Jupyter Notebook
4 reasons why TDD slows you down
It is easy to announce that TDD slows you down, but have you ever wondered why it happens? Is there anything you can do better?
Category
Project management
Theory of constraints in data engineering
Are you busy, but nothing ever gets done? Perhaps the theory of constraints will help you.
User story mapping for developers
A natural way of splitting work into small, but useful parts
Developers just wanna have fun
Software maintenance is painful because of hype driven development.
Support for old browsers — is it necessary?
Do you think that every web page should support all existing browsers? How about all versions of those browsers?
Extreme ownership and software engineering
What can a software engineer learn from the “Extreme Ownership” book? How can it influence your daily work?
Category
Domain Driven Design
[JUG Thüringen] Effortless Domain-Driven Design - The real Power of Scala
How to use some parts of Domain Driven Design to create maintainable code in Scala?
The beauty of properly used statically typed languages
The real power of programming in Scala is not in mimicking Haskell and overusing monads, but in taking advantage of its type system.
Category
Git
Git fixup explained
How to change the commit history
Prevent accidental deployments on Friday
You feel you should not deploy your code on Fridays but nothing stops you. Can you prevent accidental deployments?
Category
Book review
What can a data engineer learn from The Unicorn Project?
Have you ever seen a novel about developers? Reading such a book seems to be a massive waste of time, doesn’t it? After all, the internet is full of stories...
[book review] James Whittaker's Little Book of the Future
Read this book if you believe we can use A.I. and IoT to build a bright future.
Discipline Equals Freedom — Jocko Willink
A review of Jocko Willink’s book: “Discipline Equals Freedom.” Should you read it even if you don’t want to run a marathon?
Category
Software architecture
10x software architecture: high cohesion
A few months ago, it was fashionable to complain about the 10x developer myth. I agree that such people don’t exist, but, in my opinion, proper software architecture can transform...
Re: “I Don’t Want To Maintain Their Code”
How can we facilitate knowledge sharing? Will easily accessible documentation foster cooperation?
Category
Learning
Don't learn another programming language
Should you learn a new programming language this year?
How to be happy at work - lessons learned from "Career superpowers" book
In this article, I share the lessons I learned from James Whittaker’s book “Career Superpowers: Succeeding on Purpose.”
Can’t learn anything? You’re doing it wrong
Have you been trying to learn something for a few months? What to do when you keep learning but still don’t understand anything?
Category
Data science
Minkowski distance explained
Precision vs. recall - explanation
How to understand the difference between precision and recall?
Category
Machine learning
A.I. in production: your next stylist is going to be a neural network
What is the difference between training, validation, and test sets in machine learning
Training a machine learning model is like learning before an exam.
How to plot the decision trees from XGBoost classifier
Using machine learning for software testing
How to sample production data to get representative testing dataset?
Using a surrogate model to interpret a machine learning model
How to explain a machine learning model?
Generalized Linear Models — Using linear regression when the dependent variable does not follow Gaussian distribution
Understanding the GLM from the statsmodels package
PCA — how to choose the number of components?
How many principal components do we need when using Principal Component Analysis?
How to avoid bias against underrepresented target classes while training a machine learning model
The difference between KFold and StratifiedKFold in Scikit-learn
The problem of large categorical variables in machine learning
How to use FeatureHasher in Scikit-learn
Encoding categorical variables in machine learning
One-hot encoding, dummy coding, and effect coding in Scikit-learn and Pandas
How To Avoid Data Leakage While Building A Machine Learning Model
What to do when your model works perfectly during testing but fails in production
Using scikit-automl for building a classification model
My first attempt to use scikit-automl and how I got it working
Preprocessing the input Pandas DataFrame using ColumnTransformer in Scikit-learn
How to encode text/categorical variables and scale numerical values using only one Scikit-learn class
How to install scikit-automl in a Kaggle notebook
error: command ‘swig’ failed with exit status 1 while installing scikit-automl
Nested cross-validation in time series forecasting using Scikit-learn and Statsmodels
Tweaking the parameters of Statsmodels
A few useful things to know about machine learning
Pedro Domingos’s observations about feature engineering
How to interpret ROC curve and AUC metrics
In my opinion, AUC is a metric that is both easy to use and easy to misuse. Do you want to know why? Keep reading ;)
F1 score explained
The mathematics behind F1 score.
A comprehensive guide to putting a machine learning model in production using Flask, Docker, and Kubernetes
How to use Docker and Flask to put a Scikit model in production as a microservice.
How to save a machine learning model into a file
Saving a Scikit-learn model using the joblib library in Python
Understanding uncertainty intervals generated by Prophet
How to tweak uncertainty intervals in Prophet.
Prophet plot explained
How to read the Prophet forecast plot
Machine learning cheat sheets
A collection of machine learning cheat sheets I find useful and google repeatedly.
Forward feature selection in Scikit-Learn
Two workarounds to get an equivalent of forward feature selection in Scikit-Learn
How to set the global random_state in Scikit Learn
What to do if you keep forgetting to set the random_state?
How to load data from Google Drive to Pandas running in Google Colaboratory
Precision vs. recall - explanation
How to understand the difference between precision and recall?
Category
Data Science
I worked as a data scientist and that was the worst job I have ever had.
I believed in the “Sexiest Job of the 21st Century” hype. I was wrong.
How to remove outliers from Seaborn boxplot charts
Hide outliers when displaying boxplot in Seaborn
Pandas stack and unstack explained
How to use the stack and unstack functions in Pandas
Numpy reshape explained
How to use the reshape function in Numpy
Human bias in A/B testing
Underpowered tests, true negative, and ignored test results
Smoothing time series in Python using Savitzky–Golay filter
In this article, I will show you how to use the Savitzky-Golay filter in Python and show you how it works. To understand the Savitzky–Golay filter, you should be familiar...
XGBoost hyperparameter tuning in Python using grid search
Using GridSearchCV from Scikit-Learn to tune XGBoost classifier
Forecasting time series: using lag features
How to turn Pandas data frame into time-series input for RNN
From Pandas dataframe to RNN input
How to measure the similarity of sequence values
Levenshtein distance and Kendall tau distance
Measuring document similarity in machine learning
How to measure the similarity of two datasets?
Why do most data science projects fail?
Product/market fit - building a data-driven product
How to test a product idea?
Notetaking for data science
How to document a project?
Wilson score in Python - example
How to get the value by rank from a grouped Pandas dataframe
How to rank a grouped data frame in Pandas
The difference between the expanding and rolling window in Pandas
How to use rolling window with datetime (and other types) in Pandas
Write everything down
Lessons learnt from "Practical Data Cleaning" by Lee Baker
How to display all columns of a Pandas DataFrame in Jupyter Notebook
The silly mistakes in exploratory data analysis
Smoothing time series in Pandas
How to use the exponentially weighted window functions in Pandas
How to reduce memory usage in Pandas
Fit more data in the same amount of memory
Guidelines for data science teams — a summary of Daniel Molnar’s talks
Avoiding over-engineering in machine learning
How to return rows with missing values in Pandas DataFrame
How does it work and why the most popular solution is wrong
Predicting customer lifetime value using the Pareto/NBD model and Gamma-Gamma model
How to estimate the CLV from a list of customer transactions using the lifetimes library in Python
Predicting customer churn using the Pareto/NBD model
How to use the Python lifetimes library to build a Pareto/NBD model.
Business metrics that make no sense
There are three kinds of metrics that won’t destroy your business.
How to perform an A/B test correctly in Python
What can we expect from a correctly performed A/B test?
Recommendations vs. raw data — what is better?
Should we suggest an action when we visualize data?
How to display mathematical equations in Jupyter Notebook
LaTeX support in Jupyter Notebook
Apriori algorithm explained
How to change plot size in Jupyter Notebook
Pyplot parameter that configures the chart size
Looking for structure in data — Andrews curves plot explained
How to read Andrews curves chart
Finding seasonality in time series using autocorrelation plot
How to interpret autocorrelation plot?
My favourite data science podcasts
I was asked for some podcast recommendations, so here is my very short list ;)
A podcast that changed my perspective on exploratory data analysis
How to avoid bad science
How to read a confusion matrix
Predicted labels are in columns, right? Or maybe in rows? Do you remember? ;)
F1 score explained
The mathematics behind F1 score.
How to display a progress bar in Jupyter Notebook
Display a progress bar with no additional dependencies, just Python + Jupyter Notebook
How to save a machine learning model into a file
Saving a Scikit-learn model using the joblib library in Python
Bootstrapping vs. bagging
The difference explained
Understanding uncertainty intervals generated by Prophet
How to tweak uncertainty intervals in Prophet.
Prophet plot explained
How to read the Prophet forecast plot
How to visualise prediction errors
How to explain the errors of a linear regression model
Test-driven development in Jupyter Notebook
TDD for data scientists working with Jupyter Notebook
Dealing with dates and time in Pandas
How to use Pandas to parse dates or calculate time in a different timezone.
Fill missing values using Random Forest
How to predict the missing values using Scikit-Learn
Box and whiskers plot
How to plot and interpret the box and whiskers plot
How I failed to plot parallel coordinates in Matplotlib
Built-in matplotlib functions are not enough in this case
Import Jupyter Notebook from GitHub
The easiest way to access someone else’s code in your own notebook
Fill missing values in Pandas
Use the next or previous value to fill the missing values in Pandas
Heat map with Matplotlib
A short tutorial about generating a heat map of the values stored in a Pandas dataframe
Outlier detection with Scikit Learn
Z-score and Density-Based Spatial Clustering of Applications with Noise
How to split a list inside a Dataframe cell into rows in Pandas
Step by step instructions to "explode" a list into DataFrame rows.
Interactive plots in Jupyter Notebook
How to create a plot that supports zooming
Probability plot - visually compare probability distributions
How to visually check whether your sample is normally distributed?
Monte Carlo simulation in Python
How to make business decisions using the Monte Carlo simulation?
Word cloud from a Pandas data frame
Create a nice visualization of the most popular words in your data frame
Visualize common elements of two datasets using NetworkX
How to use undirected graph to visualize common elements of two Pandas data frames
How to load data from Google Drive to Pandas running in Google Colaboratory
Category
NLP
Word cloud from a Pandas data frame
Create a nice visualization of the most popular words in your data frame
Category
Algorithms
Count unique elements of an infinite stream of objects
HyperLogLog - probabilistic counting algorithm
Category
Teamwork
Re: DataOps Principles: How Startups Do Data The Right Way
Team vs. a bunch of individuals reporting work time in the same spreadsheet
The one important thing I learned from "Beyond Developer" by Dan North
How to motivate software engineers?
[book review] Team Geek
This book deserves a 3-star review on Amazon for many reasons.
Category
Book
How to be happy at work - lessons learned from "Career superpowers" book
In this article, I share the lessons I learned from James Whittaker’s book “Career Superpowers: Succeeding on Purpose.”
Four books to boost your programmer career
I quit my dream job because of a book
Review of “Conversations On Data Science” by Roger D. Peng and Hilary Parker
[book review] The hundred-page machine learning book
I have mixed feelings about this book.
[book review] You had me at Hello, World
Did you ever want to have a mentor?
[book review] Deep work by Cal Newport
How to focus on the high outcome tasks and avoid being distracted
[book review] So good they can’t ignore you
A polarizing book
[book review] The effective engineer
What is the best investment of your time?
5 best books I read in 2018
A list that surprised even me…
"The war of art" and other books I did not finish reading
You can read more good books if you skip the lousy ones.
Smart creative — the new role model
It may look like a unicorn, but it is real
"The Box: How the Shipping Container Made the World Smaller and the World Economy Bigger" by Marc Levinson
What happens when one invention makes the whole industry obsolete?
[book review] Dichotomy of leadership
The follow-up to “Extreme ownership”
Category
Docker
How to remove all Docker images and containers
An explanation of removing Docker images and containers.
What is inside a Docker image?
How to unpack a Docker image
How to build a project inside a Docker container
How to safely run code downloaded from the Internet
Category
Productivity
Music and other distractions
Why is it difficult to work in the office?
Brain dump — programmer productivity experiment #2
How to generate new ideas instead of thinking about the same thing over and over again
Programmer diary — programmer productivity experiment #1
One of the most intriguing ideas described in the book "How Google works" is writing "snippets."
Smart creative — the new role model
It may look like a unicorn, but it is real
Category
JVM
Java performance testing — Epsilon garbage collector
How to make sure that GC does not stop the JVM during a test?
Category
Public Speaking
What is wrong with tech conferences?
Why are tech conferences boring?
Category
Data engineering
Theory of constraints in data engineering
Are you busy, but nothing ever gets done? Perhaps the theory of constraints will help you.
Testing data products: BDD for data engineers
How to use BDD to test PySpark code
Definition of done for data engineers
When can data engineers be sure that they have done the task?
How does a Kafka Cluster work?
What is the difference between a leader and a replica broker? What is the cluster controller? How is the controller elected?
Data streaming with Apache Kafka - guide for data engineers
Are you preparing for a data engineer job interview? Here are my answers to job interview questions about data streaming.
What are the 4 V's of big data, and which one is the most important?
One of the first models describing big data was the four Vs model. That definition divides big data into four categories (sometimes called dimensions) of problems: volume, velocity,...
What is the difference between data lake, data warehouse, and data mart
We can easily distinguish between them by focusing on three qualities: data structure (schema), data quality, and ownership.
Three biggest traps to avoid while setting Spark executor memory
What happens when you set the executor memory of a Spark worker which uses YARN as the cluster resource manager? Does it get exactly the amount of memory you requested?...
Apache Spark: should we use RDD, Dataset, or DataFrame?
Is there a difference between Dataset and DataFrame? Why do we even have both?
AI in production: make data as easy as using your phone
Dependencies between DAGs: How to wait until another DAG finishes in Airflow?
In this article, I am going to show how to set up dependencies between two DAGs. Imagine that I have a DAG that dumps data from production databases and another...
How to run Airflow in Docker (with a persistent database)
In this blog post, I am going to show you how to prepare the minimalist setup of puckel/docker-airflow Docker image that will run a single DAG and store logs persistently...
Calculating the cumulative sum of a group using Apache Spark
How to use the window function to calculate a cumulative sum
How to write to a Parquet file in Scala without using Apache Spark
Row number in Apache Spark window — row_number, rank, and dense_rank
This article is mostly a “note to self” because I don’t want to google that anymore ;)
How Airflow scheduler works
Explanation of the Airflow interval and start_date parameters
Making your Scrapy spider undetectable by applying basic statistics
How to delay scraper requests to make it look like a human visiting the website
How to use Scrapy to follow links on the scraped pages
A web spider that does not follow links is not very useful, let’s fix that.
How to scrape a single web page using Scrapy in Jupyter Notebook?
Scrapy Spiders and processing pipelines 101
Category
Python
Test-Driven Development in Python with Pytest
How to set up and use Pytest to test Python code
Functional programming in Python
Does functional programming in Python make sense?
Selecting rows in Pandas
How to use loc, iloc, slice, and row filtering in Pandas
Python decorators explained
How can we define a Python decorator, and when should we use one?
Pattern matching in Python vs Scala
What is the difference between pattern matching in Python and Scala?
Using AWS Deequ in Python with Python-Deequ
How to use Python-Deequ to validate Spark Dataframes
How to retry a Python function call
How to retry a Python function call in case of an error
From Scala to Python - Python dataclasses
Domain model in Python
A Python HTTP server for serving static content
How to easily serve static content on localhost or in the local network
Category
Problem solving
Mental models: inversion
Solve the opposite problem to avoid stupidity.
Category
TensorFlow
Using Boltzmann distribution as the exploration policy in TensorFlow-agent reinforcement learning models
There is a whole spectrum of exploration strategies between random and greedy policies.
How to turn Pandas data frame into time-series input for RNN
From Pandas dataframe to RNN input
How to automatically select the hyperparameters of a ResNet neural network
Training ResNet network for multiclass image classification using keras-tuner
Using Hyperband for TensorFlow hyperparameter tuning with keras-tuner
Tuning TensorFlow with Hyperband
Using keras-tuner to tune hyperparameters of a TensorFlow model
Tuning Keras hyperparameters with keras-tuner
Understanding the Keras layer input shapes
What is the input_shape in Keras/TensorFlow?
How to train a model in TensorFlow 2.0
Using the built-in Keras in TensorFlow 2.0
How to train a Reinforcement Learning Agent using Tensorflow Agents
The reinforcement learning loop with Tensorflow Agents
How to use a custom metric with Tensorflow Agents
How to define a new Tensorflow Agents metric and add it to the driver
How to use a behavior policy with Tensorflow Agents
Random and scripted behavior policies
How to create an environment for a Tensorflow Agent?
Save and restore a Tensorflow model using Keras for continuous model training
How to run the fit function multiple times and improve the model?
Category
Deep learning
Why do we use dropout in artificial neural networks?
How does dropout work in artificial neural networks?
How to automatically select the hyperparameters of a ResNet neural network
Training ResNet network for multiclass image classification using keras-tuner
Using Hyperband for TensorFlow hyperparameter tuning with keras-tuner
Tuning TensorFlow with Hyperband
Using keras-tuner to tune hyperparameters of a TensorFlow model
Tuning Keras hyperparameters with keras-tuner
Understanding the Keras layer input shapes
What is the input_shape in Keras/TensorFlow?
How to train a model in TensorFlow 2.0
Using the built-in Keras in TensorFlow 2.0
Understanding layer size in Convolutional Neural Networks
Filter size, padding, and stride explained
Understanding the softmax activation function
Softmax function explained
How to increase accuracy of a deep learning model
Debugging a machine learning model
Which hyperparameters of deep learning model are important and how to find them
How to speed up finding the right hyperparameters of a machine learning model
How to choose the right mini-batch size in deep learning
How to deal with underfitting and overfitting in deep learning
The lessons learned from Andrew Ng’s online course
Ludwig machine learning model in Kaggle
My first attempt to use Ludwig
The optimal learning rate during fine-tuning of an artificial neural network
How to set the learning rate after you unfreeze the network layers in fast.ai
Save and restore a Tensorflow model using Keras for continuous model training
How to run the fit function multiple times and improve the model?
Category
fast.ai
The optimal learning rate during fine-tuning of an artificial neural network
How to set the learning rate after you unfreeze the network layers in fast.ai
Category
Statistics
Wilson score in Python - example
Category
Startup
Product/market fit - building a data-driven product
How to test a product idea?
Re: DataOps Principles: How Startups Do Data The Right Way
Team vs. a bunch of individuals reporting work time in the same spreadsheet
Category
Genetic algorithms
How to assign people to groups in a fair way using genetic algorithms
Using Helisa and Jenetics in Scala
Genetic algorithms in Scala - solving optimization problems
Using Helisa and Jenetics to help Fallout players
Category
YouTube
Product/market fit - building a data-driven product
How to test a product idea?
Category
Math
Bellman equation explained
The fundamental equation of reinforcement learning
How to measure the similarity of sequence values
Levenshtein distance and Kendall tau distance
Measuring document similarity in machine learning
How to measure the similarity of two datasets?
Minkowski distance explained
Category
Airflow
Anomaly detection in Airflow DAG using Prophet library
How to detect problems in Airflow pipeline using Prophet for time series anomaly detection
How to restart a stuck Airflow DAG
What to do when an Airflow DAG gets stuck and does not want to run
Why does the DayOfWeekSensor exist in Airflow?
How to make an Airflow DAG wait until a specified day of the week
Send SMS from an Airflow DAG using AWS SNS
How to configure SNS subscription to send SMS messages and use Airflow to send them
Get an XCom value in the Airflow on_failure_callback function
How to get the task instance in the on_failure_callback to get access to XCom
How to define an AWS Athena view using Airflow
How to use the AWSAthenaOperator
How to check whether a YARN application has finished
How to use Airflow PythonSensor to check whether a YARN application finished running
How to set a different retry delay for every task in an Airflow DAG
How to use a different retry delay in every Airflow task
How to find the Hive partition closest to a given date
How to use Airflow to find the Hive partition closest to a given date
Get the date of the previous successful DAG run in Airflow.
Get the start time or the execution date of the previous successful DAG run in Airflow
How to prevent Airflow from backfilling old DAG runs
How to disable backfilling of an Airflow DAG or skip a part of the DAG during a backfill
How to set Airflow variables while creating a dev environment
How to use command-line to set Airflow variables
How to run an Airflow DAG in a loop
How to keep running a DAG indefinitely
How to use xcom_pull to get a variable from another DAG
Get an XCOM variable from another DAG
What to do when Airflow BashOperator fails with TemplateNotFound error
How to fix TemplateNotFound error when using Airflow BashOperator
Use HttpSensor to pause an Airflow DAG until a website is available
Pause an Airflow DAG until an HTTP endpoint returns 200 OK
How to add an EMR step in Airflow and wait until it finishes running
How to use AwsHook and EmrStepSensor to add an EMR step and wait until it finishes running
How to use Virtualenv to prepare a separate environment for Python function running in Airflow
How to use the PythonVirtualenvOperator in Airflow
Remove a directory from S3 using Airflow S3Hook
How to remove files with a common prefix from S3
Run a command on a remote server using SSH in Airflow
How to use the SSHHook in a PythonOperator to connect to a remote server from Airflow using SSH and execute a command.
Use a custom function in Airflow templates
How to add a custom function to Airflow and use it in a template
Pass parameters to SQL query when using PostgresOperator in Airflow
How to pass parameters to SQL template when using PostgresOperator in Airflow
Send a Slack message from an Airflow DAG
How to use the SlackAPIPostOperator to send a templated message to a Slack channel
How to delay an Airflow DAG until a given hour using the DateTimeSensor
How to use the DateTimeSensor in Airflow
How to run PySpark code using the Airflow SSHOperator
How to submit a PySpark job using SSHOperator in Airflow
How to add a manual step to an Airflow DAG using the JiraOperator
How can you add a human action to an Airflow DAG?
Conditionally pick an Airflow DAG branch using an SQL query
How to use the BranchSQLOperator to choose a DAG branch to execute
How to trigger an Airflow DAG from another DAG
How to trigger another DAG from an Airflow DAG
Why does the ExternalTaskSensor get stuck?
How to fix the stuck ExternalTaskSensor
How to render an Airflow template for testing
How to generate the code of an Airflow task from a template and a given execution date
How to check the next execution date of an Airflow DAG
How to use Airflow CLI to get the next execution date of a DAG
Doing data quality checks using the SQLCheckOperator
How to use SQLCheckOperator to verify that the database contains an expected number of rows
How to deal with the jinja2 TemplateNotFound error in Airflow
How to fix the TemplateNotFound error while using a custom Airflow operator
How to postpone Airflow DAG until files get uploaded into an S3 bucket
How to use Airflow sensors to detect that files have been uploaded into an S3 bucket
Use LatestOnlyOperator to skip some tasks while running a backfill in Airflow
How to skip some tasks when backfilling a DAG in the past
How to retrieve the statuses of the recent DAG executions from Airflow database
How to make a dashboard that displays Airflow DAG statuses
How to use AWSAthenaOperator in Airflow to verify that a DAG finished successfully
How to check that an AWS Athena table contains data after running an Airflow DAG.
How to conditionally skip tasks in an Airflow DAG
How to use XCom and PythonSensor to skip remaining tasks in an Airflow DAG.
Why my Airflow tasks got stuck in "no_status" and how I fixed it
A story about debugging an Airflow DAG that was not starting tasks
How to use Airflow backfill to run DAGs for a specified date in the past?
Have you created a new Airflow DAG, but now you have to run it using every data snapshot created during the last six months? Don’t worry. You don’t need to...
Dependencies between DAGs: How to wait until another DAG finishes in Airflow?
In this article, I am going to show how to set up dependencies between two DAGs. Imagine that I have a DAG that dumps data from production databases and another...
How to run Airflow in Docker (with a persistent database)
In this blog post, I am going to show you how to prepare a minimalist setup of the puckel/docker-airflow Docker image that will run a single DAG and store logs persistently...
Category
Reinforcement learning
How to train a Reinforcement Learning Agent using Tensorflow Agents
The reinforcement learning loop with Tensorflow Agents
How to use a custom metric with Tensorflow Agents
How to define a new Tensorflow Agents metric and add it to the driver
How to use a behavior policy with Tensorflow Agents
Random and scripted behavior policies
How to create an environment for a Tensorflow Agent?
Deep Q-network terminology in plain English
The terminology used in the paper "Human-level control through deep reinforcement learning"
Bellman equation explained
The fundamental equation of reinforcement learning
Category
Machine Learning
How to speed up Pandas?
Is the Pandas library too slow? Here are two methods to speed it up!
How to add custom preprocessing code to a Sagemaker Endpoint running a Tensorflow model
How to customize input/output of a Sagemaker Endpoint running a Tensorflow model
How to A/B test Tensorflow models using Sagemaker Endpoints
How to deploy multiple model versions as one Sagemaker Endpoint
How to predict the value of time series using Tensorflow and RNN
How to train the RNN model in Tensorflow to predict time series?
How to deal with days of the week in machine learning
How to encode week days as features for machine learning models
XGBoost hyperparameter tuning in Python using grid search
Using GridSearchCV from Scikit-Learn to tune XGBoost classifier
Forecasting time series: using lag features
Category
Numpy
Numpy reshape explained
How to use the reshape function in Numpy
Category
Tensorflow
How to add custom preprocessing code to a Sagemaker Endpoint running a Tensorflow model
How to customize input/output of a Sagemaker Endpoint running a Tensorflow model
How to A/B test Tensorflow models using Sagemaker Endpoints
How to deploy multiple model versions as one Sagemaker Endpoint
How to predict the value of time series using Tensorflow and RNN
How to train the RNN model in Tensorflow to predict time series?
How to split a data frame into time-series for LSTM deep neural network
Category
Deep Learning
How to split a data frame into time-series for LSTM deep neural network
Category
AI in production
How does the Atlan data platform help you ensure data quality?
Atlan - a tool for facilitating a collaborative data culture
Building and deploying ML models using Qwak ML platform
What is Qwak ML platform and how does it work?
AI in production: Roobits Events360
What would you do if you were writing an application which had to process one billion events per day?
AI in production: Carta Healthcare
AI in production: make data as easy as using your phone
A.I. in production: your next stylist is going to be a neural network
Category
Reinforcement Learning
Using Boltzmann distribution as the exploration policy in TensorFlow-agent reinforcement learning models
There is a whole spectrum of exploration strategies between random and greedy policies.
Category
AI
How to fine-tune an OpenAI model using custom data
How to prepare the training data for an OpenAI model and how to fine-tune OpenAI's GPT model in Python
Deploy LLMs with Confidence: A Comprehensive Guide to Software Architecture for Production-Ready AI
Learn the essentials of deploying large language models in production with our comprehensive guide on software architecture for AI
Should you use machine learning in your product?
How to put AI in production without overengineering your system
AI in production: Carta Healthcare
Category
Data Engineering
CUPID properties in data engineering
SOLID principles vs. CUPID properties in data engineering
How to add tests to existing code in data transformation pipelines
How data engineers can write tests for legacy code in their ETL pipelines without breaking the existing implementation
Software engineering practices in data engineering and data science
How to produce high-quality software in data teams
A comprehensive guide to Kappa Architecture
What is Kappa Architecture? When should we use Kappa Architecture? What's the difference between Kappa Architecture and Lambda Architecture? And way, way more!
ETL vs ELT - what's the difference? Which one should you choose?
Should you use a data warehouse or build a data lake? When is a data warehouse a better choice? When is it better to build a data lake?
What is shuffling in Apache Spark, and when does it happen?
When does an Apache Spark cluster perform the shuffle operation?
Data engineers are data librarians or how to upgrade your data lake to 2500 BCE technology.
What can data engineers learn from (ancient) librarians?
Data pipeline documentation without wasting your time
How to document an ETL pipeline or ML inference pipeline without doing useless work
Testing legacy data pipelines
Do you struggle with maintaining your legacy data pipelines? Check out our article on how to add tests and refactor your code while working with legacy data pipelines.
What does your data pipeline need in production?
When you're debugging a failing production pipeline at 2 am, what do you need?
Is it overengineered?
What's the difference between reasonable future-proof architecture and overengineering? Is there a difference?
What should you learn as a data engineer?
Should you spend time learning data engineering tools and libraries?
Data Engineering - the first principles
What is true in every data engineering project?
Building trustworthy data pipelines
How to build a trustworthy data pipeline?
How to deploy a Tensorflow model using Sagemaker Endpoints and AWS Code Pipeline
How to build a Docker image using AWS Code Pipeline and deploy it as a Sagemaker Endpoint
Why your company should use PrestoSQL
Should your team use PrestoSQL?
Is counting rows all we can do?
How to detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
Check-Engine - data quality validation for PySpark 3.0.0
Last week, I was testing whether we can use AWS Deequ for data quality validation. I ran into a few problems. First of all, it was using an outdated version...
Measuring data quality using AWS Deequ
How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
How to conditionally skip tasks in an Airflow DAG
How to use XCom and PythonSensor to skip remaining tasks in an Airflow DAG.
The problem with software testing in data engineering
What if we found a bug in our data pipelines? What if that bug were easy to fix, but it would require a lot of time spent backfilling the data?...
How does Kafka Connect work?
In this article, I am going to describe the internals of Kafka Connect, explain how it uses the Sink and Source Connectors, and how it tracks the offsets of the...
Why my Airflow tasks got stuck in "no_status" and how I fixed it
A story about debugging an Airflow DAG that was not starting tasks
What is Kafka log compaction, and how does it work?
How the log compaction is implemented in Apache Kafka and how to configure it properly
Athena performance tips explained
How to use query execution plans to speed up Athena queries
Data flow - what functional programming and Unix philosophy can teach us about data streaming
What does stream processing have in common with functional programming and Unix?
How to send metrics to AWS CloudWatch from custom Python code
How to unit test PySpark
Recently, I came across an interesting problem: how to speed up the feedback loop while maintaining a PySpark DAG. Of course, I could just run the Spark Job and look...
How to speed up a PySpark job
I had a Spark job that occasionally was running extremely slow. On a typical day, Spark needed around one hour to finish it, but sometimes it required over four hours....
How does MapReduce work, and how is it similar to Apache Spark?
In this article, I am going to explain the original MapReduce paper “MapReduce: Simplified Data Processing on Large Clusters,” published in 2004 by Jeffrey Dean and Sanjay Ghemawat.
AI in production: Roobits Events360
What would you do if you were writing an application which had to process one billion events per day?
Category
Apache Spark
What is shuffling in Apache Spark, and when does it happen?
When does an Apache Spark cluster perform the shuffle operation?
How to measure Spark performance and gather metrics about written data
How to track Spark metrics in AWS CloudWatch
How to combine two DataFrames with no common columns in Apache Spark
Use full outer join to combine two Apache Spark DataFrames with no common columns
How to get names of columns with missing values in PySpark
How to get the names of missing properties for every row in a PySpark Dataframe
How to read multiple Parquet files with different schemas in Apache Spark
What to do when Apache Spark skips Parquet files with incompatible schemas
How to determine the partition size in Apache Spark
How to choose the proper partition size and the number of partitions to run an Apache Spark job
How to run PySpark code using the Airflow SSHOperator
How to submit a PySpark job using SSHOperator in Airflow
How Data Mechanics can reduce your Apache Spark costs by 70%
Stop wasting time and money tuning Apache Spark parameters
What is the difference between a transformation and an action in Apache Spark?
What is an action in Apache Spark? What do you understand as transformations in Apache Spark?
How to configure Spark to maximize resource usage while using AWS EMR
How to configure EMR to use all available resources when running a Spark cluster
Working with dates and time in Apache Spark
How to get relative dates (yesterday, tomorrow) in Apache Spark, and how to calculate the difference between two dates
How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive
How to use the saveAsTable function to create a partitioned table
When to cache an Apache Spark DataFrame?
Should we cache everything in Apache Spark or are there any rules?
How to flatten a struct in a Spark DataFrame?
How to convert struct fields into separate columns.
What is the difference between CUBE and ROLLUP and how to use them in Apache Spark?
How to use the cube and rollup functions in Apache Spark or PySpark. What is the difference between a cube and a rollup?
How to concatenate columns in a PySpark DataFrame
How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
How to derive multiple columns from a single column in a PySpark DataFrame
Extract multiple columns from a single column using the withColumn function and a PySpark UDF
Broadcast variables and broadcast joins in Apache Spark
How to speed up joins of small DataFrames by using the broadcast join
How to use the window function to get a single row from each group in Apache Spark
How to group values by a key and extract a single row from each group in Apache Spark
What is the difference between repartition and coalesce in Apache Spark?
When should you use coalesce instead of repartition in Apache Spark
How to pivot an Apache Spark DataFrame
How to turn an Apache Spark or PySpark DataFrame into a pivot table.
What is the difference between cache and persist in Apache Spark?
When should you use the cache function, and when should you use the persist function
How to use one SparkSession to run all Pytest tests
How to speed up Pytest tests by reusing the same SparkSession in all of them
Check-Engine - data quality validation for PySpark 3.0.0
Last week, I was testing whether we can use AWS Deequ for data quality validation. I ran into a few problems. First of all, it was using an outdated version...
How to unit test PySpark
Recently, I came across an interesting problem: how to speed up the feedback loop while maintaining a PySpark DAG. Of course, I could just run the Spark Job and look...
How to speed up a PySpark job
I had a Spark job that occasionally was running extremely slow. On a typical day, Spark needed around one hour to finish it, but sometimes it required over four hours....
Apache Spark: should we use RDD, Dataset, or DataFrame?
Is there a difference between Dataset and DataFrame? Why do we even have both?
Category
AWS
How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML
Deploy a machine learning model with custom inference code to a Sagemaker Endpoint using BentoML
Multimodel deployment in Sagemaker Endpoints
How to deploy multiple models in a single Sagemaker Endpoint?
How to deploy a REST API AWS Lambda using Chalice and AWS Code Pipeline
How to create a REST API Endpoint using AWS Lambda, Chalice, and AWS Code Pipeline
How to use AWS Batch to run a Python script
How to build a Docker image, define an AWS Batch job using Terraform, and run the AWS Batch job using Airflow
Send SMS from an Airflow DAG using AWS SNS
How to configure an SNS subscription to send SMS messages and use Airflow to send them
How to get a notification when a new file is uploaded to an S3 bucket
Get a Slack notification when a file is uploaded to an S3 bucket
Best practices about partitioning data in S3 by date
How to partition data in S3 by date in a way that makes your life easier
How to assign rows to ranked groups in AWS Athena
How to use the NTILE function in Athena
How to use WHEN CASE queries in AWS Athena
Using conditions in AWS Athena queries
How to decode base64 to text in AWS Athena
How to use from_base64 in AWS Athena
What is s3:TestEvent, and why does it break my event processing?
S3 sends s3:TestEvent to SQS after setting up the bucket notifications
Making OFFSET LIMIT queries in AWS Athena
How to use OFFSET in AWS Athena queries
How to get an alert if an AWS Lambda does not get invoked during the last 24 hours
How to get a notification when AWS Lambda stops being used
How to check when an Athena table was updated
How to track the time when an Athena table was updated
Copy directories in S3 using s3-dist-cp
How to copy files in S3 and preserve the directory structure
How to select a random sample of rows using Athena
How to use a window function to select random rows from Athena
Remove a directory from S3 using Airflow S3Hook
How to remove files with a common prefix from S3
How to temporarily disable an AWS Lambda function using AWS CLI without removing the function
Disable an AWS Lambda using AWS CLI
How to add an EMR step from AWS Lambda
How to configure a new EMR step using AWS Lambda in Python
Send event to AWS Lambda when a file is added to an S3 bucket
Trigger AWS Lambda when a file is created in an S3 bucket
How to postpone Airflow DAG until files get uploaded into an S3 bucket
How to use Airflow sensors to detect that files have been uploaded into an S3 bucket
How to retrieve the table descriptions from Glue Data Catalog using boto3
How to get the comments from the create table statements when the metadata is stored in the Glue Data Catalog
How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3
How to upload S3 data into RDS tables
How to Speed Up AWS Athena Queries Using Partition Projection
How to define partition projection while creating an Athena table
How to send AWS CloudWatch Alerts to a Slack channel using Terraform
How to use Terraform to configure a CloudWatch alert and send the message to a Slack channel.
Athena performance tips explained
How to use query execution plans to speed up Athena queries
AWS IAM roles and policies explained
In this article, I am going to explain the essential parts of IAM and describe how to grant permissions to your users or AWS Lambda functions you wrote.
How to send metrics to AWS CloudWatch from custom Python code
How to add dependencies to AWS Lambda
The process of adding dependencies to an AWS Lambda consists of two steps. First, we have to install the dependencies in the source code directory. Later, we have to package...
What do you need to know about storing passwords in AWS?
How to use the AWS Secrets Manager
Category
Spark
Three biggest traps to avoid while setting Spark executor memory
What happens when you set the executor memory of a Spark worker which uses YARN as the cluster resource manager? Does it get exactly the amount of memory you requested?...
Category
Serverless
Select Serverless configuration variables using the stage parameter
How to add dependencies to AWS Lambda
The process of adding dependencies to an AWS Lambda consists of two steps. First, we have to install the dependencies in the source code directory. Later, we have to package...
Category
Big data
Data streaming: what is the difference between the tumbling and sliding window?
When you start processing streams of events, there always comes a time to decide on how to group them. We have a few kinds of window functions that we can...
What are the 4 V's of big data, and which one is the most important?
One of the first models that describe what big data is was the four Vs-model. That definition divides big data into four categories (sometimes called dimensions) of problems: volume, velocity,...
Category
Arduino
I put a carnivorous plant on the Internet of Things to save its life, and it did not survive
This article is a text version of my talk, "I put a carnivorous plant on the Internet of Things," which I presented during the DataNatives conference (November 25-26, 2019 in...
Category
IoT
I put a carnivorous plant on the Internet of Things to save its life, and it did not survive
This article is a text version of my talk, "I put a carnivorous plant on the Internet of Things," which I presented during the DataNatives conference (November 25-26, 2019 in...
Category
Event Streaming
Data streaming: what is the difference between the tumbling and sliding window?
When you start processing streams of events, there always comes a time to decide on how to group them. We have a few kinds of window functions that we can...
Category
Data Streaming
Data streaming with Apache Kafka - guide for data engineers
Are you preparing for a data engineer job interview? Here are my answers to job interview questions about data streaming.
Category
Papers We Love
How does MapReduce work, and how is it similar to Apache Spark?
In this article, I am going to explain the original MapReduce paper “MapReduce: Simplified Data Processing on Large Clusters,” published in 2004 by Jeffrey Dean and Sanjay Ghemawat.
Category
Stream processing
Data flow - what functional programming and Unix philosophy can teach us about data streaming
What does stream processing have in common with functional programming and Unix?
Category
Apache Kafka
How to reset the consumer offset in Apache Kafka topic
How to use kafka-consumer-groups.sh to reset topic offsets
How to purge a Kafka topic
How to remove all messages from a Kafka topic
How does Kafka Connect work?
In this article, I am going to describe the internals of Kafka Connect, explain how it uses the Sink and Source Connectors, and how it tracks the offsets of the...
What is Kafka log compaction, and how does it work?
How the log compaction is implemented in Apache Kafka and how to configure it properly
How does a Kafka Cluster work?
What is the difference between a leader and a replica broker? What is the cluster controller? How is the controller elected?
Category
Software Craft
Don't use AI to generate tests for your code or how to do test-driven development with AI
How to use AI to generate test cases for your code
Using Abstraction Layers to Tackle Common Problems with Legacy Code
Are you struggling to manage and update your legacy codebase? In this article, I'll show you how to leverage the power of abstraction layers to overcome common challenges with legacy...
Why should you practice TDD?
What are the benefits of TDD for programmers and companies that hire them?
How to debug code
How to debug code and solve problems as fast as possible
CUPID properties in data engineering
SOLID principles vs. CUPID properties in data engineering
The secret of working with legacy code on a software team
How to work with code written by other people? What to do when you join a new team?
How to write technical documentation
How to document a software project?
What is the root cause of problems in software engineering?
What is the primary, unrepairable cause of almost all bugs, data leaks, human problems, etc.?
How to become a better programmer
What's stopping us from getting better at coding
How to throw useful exceptions
How to make debugging easier by paying attention to the errors you report
Why are programmers slow, and what to do about it?
The one practice that makes every team faster (in the long run)
How to build maintainable software by abstracting the business rules in data engineering
Are we building the right abstractions?
How to learn TDD
Learning Test-Driven Development is hard and there is nothing we can do about it
The problem with software testing in data engineering
What if we found a bug in our data pipelines? What if that bug were easy to fix, but it would require a lot of time spent backfilling the data?...
Category
Data quality
Measuring data quality using AWS Deequ
How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
Category
AWS Deequ
Measuring data quality using AWS Deequ
How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
Category
Check-Engine
Check-Engine - data quality validation for PySpark 3.0.0
Last week, I was testing whether we can use AWS Deequ for data quality validation. I ran into a few problems. First of all, it was using an outdated version...
Category
Monitoring
How to send AWS CloudWatch Alerts to a Slack channel using Terraform
How to use Terraform to configure a CloudWatch alert and send the message to a Slack channel.
Category
Pytest
How to use one SparkSession to run all Pytest tests
How to speed up Pytest tests by reusing the same SparkSession in all of them
Category
Apache Airflow
How to send a customized Slack notification when an Airflow task fails
How to customize a Slack notification before sending it to the Slack incoming webhook.
Category
AWS Athena
How to make a pivot table in AWS Athena or PrestoSQL
How to make a pivot table in AWS Athena, and why the pivot function does not exist
How to Speed Up AWS Athena Queries Using Partition Projection
How to define partition projection while creating an Athena table
Category
Data Quality
Is counting rows all we can do?
How to detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
Category
Presto
Why your company should use PrestoSQL
Should your team use PrestoSQL?
Category
PrestoSQL
How to make a pivot table in AWS Athena or PrestoSQL
How to make a pivot table in AWS Athena, and why the pivot function does not exist
Category
PySpark
Testing data products: BDD for data engineers
How to use BDD to test PySpark code
How to read from SQL table in PySpark using a query instead of specifying a table
Fetching data using a SQL query in PySpark
How to write to a SQL database using JDBC in PySpark
How to use JDBC driver in PySpark to write a DataFrame to a SQL database
How to add dependencies as jar files or Python scripts to PySpark
How to add a jar file or a Python file as a Pyspark dependency
Speed up counting the distinct elements in a Spark DataFrame
Use HyperLogLog to calculate the approximate number of distinct elements in Apache Spark
Use regexp_replace to replace a matched string with a value of another column in PySpark
Use regex to replace the matched string with the content of another column in PySpark
Working with dates and time in Apache Spark
How to get relative dates (yesterday, tomorrow) in Apache Spark, and how to calculate the difference between two dates
How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive
How to use the saveAsTable function to create a partitioned table
How to flatten a struct in a Spark DataFrame?
How to convert struct fields into separate columns.
What is the difference between CUBE and ROLLUP and how to use them in Apache Spark?
How to use the cube and rollup functions in Apache Spark or PySpark. What is the difference between a cube and a rollup?
How to concatenate columns in a PySpark DataFrame
How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
How to derive multiple columns from a single column in a PySpark DataFrame
Extract multiple columns from a single column using the withColumn function and a PySpark UDF
Category
Hive
How to write Hive queries with column position number in the GROUP BY or ORDER BY clauses
How to enable column position support in Hive GROUP BY or ORDER BY
How to check whether a regular expression matches a string in Hive
What is the equivalent of Athena/Presto regexp_like in Hive
How to find the Hive partition closest to a given date
How to use Airflow to find the Hive partition closest to a given date
Use the ROW_NUMBER() function to get top rows by partition in Hive
How to calculate row number by partition in Hive and use it to filter rows
How to get an array/bag of elements from the Hive group by operator?
How to get an array of elements from one column when grouping by another column in Hive
Category
SQL
Add the row insertion time to a MySQL table
Automatically add the insertion and update time in MySQL
How to count the number of rows that match a condition in Redshift
How to count the rows by multiple conditions at the same time in SQL
How to concatenate multiple MySQL rows into a single field?
How to concatenate multiple rows into a string in MySQL
Category
RDS
How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3
How to upload S3 data into RDS tables
Category
DynamoDB
How to download all available values from DynamoDB using pagination
How to use pagination to retrieve all DynamoDB values
How to perform a batch write to DynamoDB using boto3
How to write multiple DynamoDB objects at once using boto3
Category
AWS Glue
How to start an AWS Glue Crawler to refresh Athena tables using boto3
How to create and start an AWS Glue Crawler from Python code using boto3
Category
Athena
How to emulate temporary tables in Athena
Use CTAS to create a temporary table in Athena
How to use AWSAthenaOperator in Airflow to verify that a DAG finished successfully
How to check that an AWS Athena table contains data after running an Airflow DAG.
Category
AWS EMR
How to configure Spark to maximize resource usage while using AWS EMR
How to configure EMR to use all available resources when running a Spark cluster
Category
Redshift
Get the last day of the month in Redshift
How to use the last_day function in Redshift
How to make an unconditional join in Redshift
LEFT OUTER JOIN ON 1=1 in Redshift
How to count the number of rows that match a condition in Redshift
How to count the rows by multiple conditions at the same time in SQL
How to index data in Redshift
How to create an equivalent of an index in Redshift
How to generate a sequence of dates in Redshift
How to use the generate_series function to generate a sequence of dates
How to find and terminate an idle Redshift session
How to find the idle session that is blocking the connection pool in Redshift
Category
Interviews
Christopher Bergh - How the DataOps principles help data engineers make data pipelines trustworthy
An interview with Christopher Bergh who explains how the DataOps principles help data engineers make data pipelines trustworthy
Category
DataOps
Christopher Bergh - How the DataOps principles help data engineers make data pipelines trustworthy
An interview with Christopher Bergh who explains how the DataOps principles help data engineers make data pipelines trustworthy
Category
S3
How to enable S3 bucket versioning using Terraform
How to configure S3 bucket versioning in Terraform
How to automatically remove files from S3 using lifecycle rules defined in Terraform
How to define S3 lifecycle rules using Terraform
Category
Terraform
How to enable S3 bucket versioning using Terraform
How to configure S3 bucket versioning in Terraform
How to configure both core and spot instances in EMR using Terraform
Use EMR instance group to add spot instances to an EMR cluster
How to automatically remove files from S3 using lifecycle rules defined in Terraform
How to define S3 lifecycle rules using Terraform
Category
EMR
How to make sure that you did not leave an EMR cluster running
How to get notifications about a running EMR cluster
Category
BDD
How to test REST API contract using BDD
Testing a REST API using Behave in Python
Testing data products: BDD for data engineers
How to use BDD to test PySpark code
Category
Prophet
Anomaly detection in Airflow DAG using Prophet library
How to detect problems in Airflow pipeline using Prophet for time series anomaly detection
Category
Blogging
On technical blogging
How to start blogging as a programmer
Category
MLOps
Why do you need a text summarization service, and how to deploy a text summarization model in 15 minutes using HuggingFace and Qwak?
Why do you need text summarization services in your business? How can you deploy a model downloaded from HuggingFace using the Qwak ML platform in 15 minutes?
MLOps at small companies
How to do MLOps while working on a small data engineering team
MLOps engineer, you will need those three books every day!
Don't reinvent the wheel as an MLOps engineer. The 3 books you must read in 2022
Why should you use a feature store
Benefits of having a feature store and what happens when you don't have one
How to run batch inference using Sagemaker Batch Transform Jobs
Running a batch machine learning job using Sagemaker and data stored in S3.
What is the essential KPI of an MLOps team?
What KPI to measure in an MLOps team
Deploying your first ML model in production
The minimal setup for ML deployment without the things you DON'T need yet
Shadow deployment vs. canary release of machine learning models
What is shadow deployment in machine learning? What is a canary release? What is the difference?
How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML
Deploy a machine learning model with custom inference code to a Sagemaker Endpoint using BentoML
Building and deploying ML models using Qwak ML platform
What is Qwak ML platform and how does it work?
How to deploy MLFlow on Heroku
How to deploy MLFlow on Heroku using PostgreSQL as the database, S3 as the artifact storage, and BasicAuth authentication
What is MLOps? Do we need MLOps?
MLOps is not just DevOps applied to machine learning!
How to add a new dataset to the Feast feature store
How to use Feast feature store in a local environment
Multimodel deployment in Sagemaker Endpoints
How to deploy multiple models in a single Sagemaker Endpoint?
How to deploy a Tensorflow model using Sagemaker Endpoints and AWS Code Pipeline
How to build a Docker image using AWS Code Pipeline and deploy it as a Sagemaker Endpoint
Category
Data Lake
Data versioning with LakeFS
Why you should use LakeFS to build a data lake that supports data versioning
Category
LakeFS
Data versioning with LakeFS
Why you should use LakeFS to build a data lake that supports data versioning
Category
Storytelling
The ugly truth about product demo storytelling in data teams
How to make product demos more engaging and persuade people to care about the data
Category
Career
How to write a growth plan as a programmer?
How to write a growth plan that helps you get promoted and doesn't get in the way when you want to focus on your hobbies
How to become a data engineer for free
What do you need to know to become a data engineer? Does a data engineer need a degree? How can you get your first data engineering job?
What does a bad interview look like in data engineering
What you should avoid when you interview programmers for a data engineer position
I worked as a data scientist and that was the worst job I have ever had.
I believed in the "Sexiest Job of the 21st Century" hype. I was wrong.
Secrets of mentoring junior software engineers
How to quickly train junior engineers to make them as productive as the rest of the team
How to pass a machine learning engineer interview
Trivial (and easily fixable) mistakes that will make you fail a job interview
Why do data engineers quit?
Why do data engineers quit their jobs?
How writing can improve your programming skills
How writing texts for people makes you a better programmer
Category
Writing
How writing can improve your programming skills
How writing texts for people makes you a better programmer
Category
Software Engineering
How to add tests to existing code in data transformation pipelines
How data engineers can write tests for legacy code in their ETL pipelines without breaking the existing implementation
How to advertise to software engineers, or how do we make terrible tech choices
Why do programmers make wrong decisions when they choose the tools they use?
Category
Dev Rel
How to teach programming workshops to adults
How to prepare an enjoyable programming workshop that teaches people the skills they need without overwhelming them with new knowledge.
Category
Functional programming
Functional programming in Python
Does functional programming in Python make sense?
Category
Stream Processing
A comprehensive guide to Kappa Architecture
What is Kappa Architecture? When should we use Kappa Architecture? What's the difference between Kappa Architecture and Lambda Architecture? And way, way more!
Category
Pandas
How to sort a Pandas DataFrame by month name
How to use an ordered categorical variable to sort a Pandas DataFrame by months while displaying their names
Category
Marketing
Generate a landing page for a newsletter in 17 minutes using ChatGPT or GPT-3
Are you looking for a way to generate high-quality content quickly and effectively? This article outlines how AI and the AIDA marketing model create a landing page.
Marketing for SaaS startups: how to describe your product?
How to use the "benefits over features" technique to advertise your SaaS product and get more clients than your competition
How to pitch your idea
What a co-founder of DeepMind teaches us about pitching our ideas to investors
Category
Copywriting
Marketing for SaaS startups: how to describe your product?
How to use the "benefits over features" technique to advertise your SaaS product and get more clients than your competition
How to pitch your idea
What a co-founder of DeepMind teaches us about pitching our ideas to investors
Category
Management
What kills IT projects?
What kills IT projects? What you should avoid at all costs to ensure the success of your startup or software project
Category
Software Architecture
Deploy LLMs with Confidence: A Comprehensive Guide to Software Architecture for Production-Ready AI
Learn the essentials of deploying large language models in production with our comprehensive guide on software architecture for AI
What does modern software architecture look like in 2022?
Do architecture diagrams still matter? How do we deal with constant changes? How to design software architecture?
Category
AI in business
Improve AI Output Using the Guardrails Library with Custom Validators
Use AI to validate another AI's output. Learn how to create custom validators and corrections using the Guardrails library.
How to Build a ChatGPT Plugin in Python?
A step-by-step guide to building a ChatGPT plugin in Python to access a vector database
Alternatives to OpenAI GPT model: using an open-source Cerebras model with LangChain
Discover how to leverage the powerful open-source Cerebras model with LangChain in this comprehensive guide, featuring step-by-step instructions for loading the model with HuggingFace Transformers, creating prompt templates, and integrating...
AI-Powered Pair Programming: Enhance Your Web Development Skills with GPT-4 Assistance
Improve your coding skills and elevate your writing with GPT-4 as your AI-driven pair programming partner, guiding you through the process of building a web application that functions as a...
Build an AI-powered Newsletter Generator with dust.tt and OpenAI
How to create an AI-powered newsletter generator using the dust.tt and OpenAI API. You'll learn how to use the few-shot in-context learning technique to train the AI model, deploy the...
Maximize Customer Support Efficiency: Build an AI Chatbot to Answer Common Client Questions
How to build an AI-powered Facebook chatbot using GPT-3 from OpenAI and vector databases to answer client questions using your documentation.
Detection of Text Duplicates and Text Search with Word Embeddings and Vector Databases
Discover how word embeddings and vector databases can revolutionize text search and duplicate detection. Learn how to implement it with OpenAI GPT-3 and Milvus vector database.
Connect GPT-3 to the Internet: Create a Slack Bot and Perform Web Search, Calculations, and More
Unleash the potential of GPT-3 and make it access the Internet. Learn how to use Langchain and build a Slack bot that can do a web search, extract text from...
Create an AI Data Analyst bot for Slack that can look up data in your database
Build an AI-powered Slack bot that reads data from your production database and answers simple analytics questions
Generate a landing page for a newsletter in 17 minutes using ChatGPT or GPT-3
Are you looking for a way to generate high-quality content quickly and effectively? This article outlines how AI and the AIDA marketing model create a landing page.
Why do you need a text summarization service, and how to deploy a text summarization model in 15 minutes using HuggingFace and Qwak?
Why do you need text summarization services in your business? How can you deploy a model downloaded from HuggingFace using the Qwak ML platform in 15 minutes?
Category
GPT
Don't use AI to generate tests for your code or how to do test-driven development with AI
How to use AI to generate test cases for your code
Alternatives to OpenAI GPT model: using an open-source Cerebras model with LangChain
Discover how to leverage the powerful open-source Cerebras model with LangChain in this comprehensive guide, featuring step-by-step instructions for loading the model with HuggingFace Transformers, creating prompt templates, and integrating...
AI-Powered Pair Programming: Enhance Your Web Development Skills with GPT-4 Assistance
Improve your coding skills and elevate your writing with GPT-4 as your AI-driven pair programming partner, guiding you through the process of building a web application that functions as a...
Maximize Customer Support Efficiency: Build an AI Chatbot to Answer Common Client Questions
How to build an AI-powered Facebook chatbot using GPT-3 from OpenAI and vector databases to answer client questions using your documentation.
Automating Git Commit Messages with GPT-3 for Faster Software Development Workflows
Learn how to use GPT-3 to automate Git commit message generation and speed up your development workflows.
Connect GPT-3 to the Internet: Create a Slack Bot and Perform Web Search, Calculations, and More
Unleash the potential of GPT-3 and make it access the Internet. Learn how to use Langchain and build a Slack bot that can do a web search, extract text from...
Unlocking the Power of In-Context Learning With Zero-Shot, One-Shot, and Few-Shot Prompt Engineering for GPT
How does the in-context learning prompt engineering technique improve GPT-3 results, and why does it work? What's the difference between zero-shot, one-shot, and few-shot prompting?
Create an AI Data Analyst bot for Slack that can look up data in your database
Build an AI-powered Slack bot that reads data from your production database and answers simple analytics questions
Category
Prompt Engineering
Unlocking the Power of In-Context Learning With Zero-Shot, One-Shot, and Few-Shot Prompt Engineering for GPT
How does the in-context learning prompt engineering technique improve GPT-3 results, and why does it work? What's the difference between zero-shot, one-shot, and few-shot prompting?
Category
Embeddings
Detection of Text Duplicates and Text Search with Word Embeddings and Vector Databases
Discover how word embeddings and vector databases can revolutionize text search and duplicate detection. Learn how to implement it with OpenAI GPT-3 and Milvus vector database.
Category
ChatGPT
How to Build a ChatGPT Plugin in Python?
A step-by-step guide to building a ChatGPT plugin in Python to access a vector database
Get Started with ChatGPT API: A Step-by-Step Guide for Python Programmers
A step-by-step tutorial on ChatGPT API in Python. You'll also learn about prompt engineering, interactivity, optimizing API calls, and using parameters to get better results.
Category
dust.tt
Build an AI-powered Newsletter Generator with dust.tt and OpenAI
How to create an AI-powered newsletter generator using the dust.tt and OpenAI API. You'll learn how to use the few-shot in-context learning technique to train the AI model, deploy the...
Category
Guardrails
Improve AI Output Using the Guardrails Library with Custom Validators
Use AI to validate another AI's output. Learn how to create custom validators and corrections using the Guardrails library.
Category
OpenAI
How to fine-tune an OpenAI model using custom data
How to prepare the training data for an OpenAI model and how to fine-tune OpenAI's GPT model in Python