mikulskibartosz.name
Bartosz Mikulski
Building trustworthy data pipelines because AI cannot learn from dirty data
All Stories
What is the difference between CUBE and ROLLUP and how to use it in Apache Spark?
How to use the cube and rollup functions in Apache Spark or PySpark, and how a cube differs from a rollup
How to concatenate columns in a PySpark DataFrame
How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
How to derive multiple columns from a single column in a PySpark DataFrame
Extract multiple columns from a single column using the withColumn function and a PySpark UDF
Broadcast variables and broadcast joins in Apache Spark
How to speed up joins of small DataFrames by using the broadcast join
How to use the window function to get a single row from each group in Apache Spark
How to group values by a key and extract a single row from each group in Apache Spark
How to make a pivot table in AWS Athena or PrestoSQL
How to make a pivot table in AWS Athena, and why the pivot function does not exist
What is the difference between repartition and coalesce in Apache Spark?
When should you use coalesce instead of repartition in Apache Spark?
How to pivot an Apache Spark DataFrame
How to turn an Apache Spark or PySpark DataFrame into a pivot table
What is the difference between cache and persist in Apache Spark?
When you should use the cache function, and when you should use the persist function
Why your company should use PrestoSQL
Should your team use PrestoSQL?
Is counting rows all we can do?
How to detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
How to Speed Up AWS Athena Queries Using Partition Projection
How to define partition projection while creating an Athena table