How to read from a SQL table in PySpark using a query instead of specifying a table

This article is part of my "100 data engineering tutorials in 100 days" challenge. (100/100)

When we know precisely which query we should run to get the data we want from a SQL database, we don’t need to load multiple tables in PySpark and emulate the joins and selects in Python code. Instead, we can pass the SQL query as the source of the DataFrame while retrieving it from the database.

If my code to retrieve the data looks like this:

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:port") \
    .option("dbtable", "schema.tablename") \
    ...  # remaining connection options (user, password, driver, etc.)
    .load()

I can replace the dbtable parameter with a SQL query and use the result as the table loaded by PySpark:

.option("dbtable", "(SELECT column_A, column_B FROM some_table) AS tbl")
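Note that the subquery must have an alias (the AS tbl part); MySQL and most other databases reject an unnamed derived table. Putting the two snippets together, a minimal sketch of the whole read could look like this. The hostname, port, database name, credentials, driver class, and table/column names are placeholders I'm assuming for illustration, not values from the original example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-query-example").getOrCreate()

df = (
    spark.read.format("jdbc")
    # Placeholder connection string: host, port, and database name are assumptions
    .option("url", "jdbc:mysql://localhost:3306/database_name")
    # MySQL Connector/J driver class; the driver JAR must be on the Spark classpath
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("user", "username")      # placeholder credentials
    .option("password", "password")
    # The subquery runs inside the database; Spark loads only its result set
    .option("dbtable", "(SELECT column_A, column_B FROM some_table) AS tbl")
    .load()
)

In Spark 2.4 and newer, the JDBC source also accepts a query option that takes the statement directly, so the alias is no longer needed. Use either dbtable or query, not both:

.option("query", "SELECT column_A, column_B FROM some_table")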
