How to read multiple Parquet files with different schemas in Apache Spark

What to do when Apache Spark skips Parquet files with incompatible schemas

Bartosz Mikulski 04 Nov 2020 – 1 min read

When we read multiple Parquet files using Apache Spark, differences between the file schemas can cause problems. When Spark gets a list of files to read, it picks the schema from either the Parquet summary file or a randomly chosen input file:

spark.read.parquet(
  List(
    "file_a",
    "file_b",
    "file_c"
  ): _*
)

Most likely, you don’t have a Parquet summary file because summary files are rarely used. In this case, Spark applies the schema of a randomly chosen file to every file in the list.

This is an annoying problem: if some files contain additional columns, the resulting dataset may not contain them, because Spark read the schema from a file that lacks those columns.
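
Before changing how the files are read, it can help to confirm which files actually differ. The sketch below is one possible diagnostic, not something taken from Spark itself: the object name `SchemaDiffSketch` and the helper `missingColumns` are hypothetical. It reads each path on its own, which forces Spark to use that file's schema, and reports the columns each path lacks compared with the union of all columns.

```scala
import org.apache.spark.sql.SparkSession

object SchemaDiffSketch {
  // Hypothetical helper: for every path, list the columns that exist
  // in at least one other path but are missing from this one.
  def missingColumns(spark: SparkSession, paths: Seq[String]): Map[String, Set[String]] = {
    // Reading one path at a time makes Spark use that path's own schema.
    val columnsByPath: Map[String, Set[String]] =
      paths.map(path => path -> spark.read.parquet(path).columns.toSet).toMap

    // Union of the column names seen across all paths.
    val allColumns: Set[String] = columnsByPath.values.flatten.toSet

    columnsByPath.map { case (path, columns) => path -> (allColumns -- columns) }
  }
}
```

A path with an empty set has every column found in the inputs. A non-empty set names the columns that would be lost if Spark happened to pick that file's schema. The helper compares top-level column names only, so it does not detect differing types or nested fields.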

How to merge Parquet schemas in Apache Spark?

To solve the issue, we must instruct Apache Spark to merge the schemas from all given files into one common schema. We can do that by setting the mergeSchema option on the Parquet reader:

spark.read.option("mergeSchema", "true").parquet(...)
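
To see the option in action, here is a minimal end-to-end sketch. It assumes a local Spark session and a temporary scratch directory; the object name `MergeSchemaSketch`, the helper `readMerged`, and the column names `id`, `name`, and `score` are made up for illustration. It writes two small Parquet datasets, one of which has an extra column, and reads them back with schema merging enabled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeSchemaSketch {
  // Hypothetical helper: read all given paths with schema merging enabled.
  def readMerged(spark: SparkSession, paths: Seq[String]): DataFrame =
    spark.read.option("mergeSchema", "true").parquet(paths: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-schema-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()
    import spark.implicits._

    // Assumption: a throwaway directory for the example data.
    val base = java.nio.file.Files.createTempDirectory("merge_schema").toString

    // file_a has two columns, file_b has an additional "score" column.
    Seq((1, "a")).toDF("id", "name").write.parquet(s"$base/file_a")
    Seq((2, "b", 3.5)).toDF("id", "name", "score").write.parquet(s"$base/file_b")

    val merged = readMerged(spark, Seq(s"$base/file_a", s"$base/file_b"))

    // The merged schema contains id, name, and score.
    merged.printSchema()
    merged.show()

    spark.stop()
  }
}
```

Rows that come from the file without the extra column get null in that column. Merging requires Spark to inspect the schema of every input file, which is relatively expensive and is the reason the option is turned off by default.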
