How to read multiple Parquet files with different schemas in Apache Spark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (43/100)

When we read multiple Parquet files using Apache Spark, schema differences between the files may cause problems. When Spark gets a list of files to read, it picks the schema from either the Parquet summary file or a randomly chosen input file:

    spark.read.parquet(
        List(
            "file_a",
            "file_b",
            "file_c"): _*
    )

Most likely, you don't have a Parquet summary file because few pipelines write one. In this case, Spark will apply the schema of a randomly chosen file to every file in the list.

It is an annoying problem. If some files have additional columns and Spark happens to read the schema from a file without them, the resulting dataset will not contain those extra columns.


How to merge Parquet schemas in Apache Spark?

To solve the issue, we must instruct Apache Spark to merge the schemas from all given files into one common schema. We can do that by setting the mergeSchema option on the reader:

    spark.read.option("mergeSchema", "true").parquet(...)
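To see the effect end to end, here is a minimal sketch that writes two small Parquet datasets with different columns and reads them back together. The object name, the sample rows, the column names, and the temporary directory are all hypothetical and exist only for illustration; the sketch also assumes a local Spark session with the Spark SQL dependency on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeSchemaSketch {

  // Writes two tiny Parquet datasets with different columns,
  // then reads both paths in a single call.
  def readBoth(spark: SparkSession, mergeSchema: Boolean): DataFrame = {
    import spark.implicits._

    // Hypothetical sample data: only the second dataset has a "country" column.
    val baseDir = Files.createTempDirectory("merge_schema_sketch").toString
    val pathA = s"$baseDir/file_a"
    val pathB = s"$baseDir/file_b"

    Seq((1, "Alice")).toDF("id", "name").write.parquet(pathA)
    Seq((2, "Bob", "PL")).toDF("id", "name", "country").write.parquet(pathB)

    spark.read
      .option("mergeSchema", mergeSchema.toString)
      .parquet(Seq(pathA, pathB): _*)
  }

  // Returns the column names Spark reports when schema merging is enabled.
  def mergedColumns(spark: SparkSession): Seq[String] =
    readBoth(spark, mergeSchema = true).columns.toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-schema-sketch")
      .master("local[*]")
      .getOrCreate()

    // With merging enabled, the schema is the union of the columns of both
    // datasets; rows from the dataset without "country" get null there.
    val merged = readBoth(spark, mergeSchema = true)
    merged.printSchema()
    merged.show()

    spark.stop()
  }
}
```

Schema merging makes Spark inspect the files it reads, so it is a relatively expensive operation and is turned off by default. If every Parquet read in a job should merge schemas, the session-wide setting spark.sql.parquet.mergeSchema can be set to true instead of passing the option on each read.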


Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
