mikulskibartosz.name
Bartosz Mikulski
Building trustworthy data pipelines because AI cannot learn from dirty data
All Stories
Use the ROW_NUMBER() function to get top rows by partition in Hive
How to calculate row number by partition in Hive and use it to filter rows
How to configure both core and spot instances in EMR using Terraform
Use EMR instance group to add spot instances to an EMR cluster
How to temporarily disable an AWS Lambda function using AWS CLI without removing the function
Disable an AWS Lambda using AWS CLI
How to add an EMR step from AWS Lambda
How to configure a new EMR step using AWS Lambda in Python
Send event to AWS Lambda when a file is added to an S3 bucket
Trigger AWS Lambda when a file is created in an S3 bucket
Select Serverless configuration variables using the stage parameter
Use a custom function in Airflow templates
How to add a custom function to Airflow and use it in a template
Speed up counting the distinct elements in a Spark DataFrame
Use HyperLogLog to calculate the approximate number of distinct elements in Apache Spark
Pass parameters to SQL query when using PostgresOperator in Airflow
How to pass parameters to SQL template when using PostgresOperator in Airflow
Use regexp_replace to replace a matched string with a value of another column in PySpark
Use regex to replace the matched string with the content of another column in PySpark
How to read multiple Parquet files with different schemas in Apache Spark
What to do when Apache Spark skips Parquet files with incompatible schemas
How to determine the partition size in Apache Spark
How to choose the proper partition size and the number of partitions to run an Apache Spark job