How to read from a SQL table in PySpark using a query instead of specifying a table

This article is part of my "100 data engineering tutorials in 100 days" challenge. (100/100)

When we know precisely which query returns the data we want from a SQL database, we don’t need to load multiple tables in PySpark and emulate the joins and selects in Python code. Instead, we can pass the SQL query as the source of the DataFrame when we retrieve it from the database.

If my code to retrieve the data looks like this:

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:port") \
    .option("dbtable", "schema.tablename") \
    ...
    .load()

I can replace the dbtable parameter with a SQL query (wrapped in parentheses and given an alias), and PySpark will load the query result as if it were a table:

.option("dbtable", "(SELECT column_A, column_B FROM some_table) AS tbl")

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
