How to write to a SQL database using JDBC in PySpark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (90/100)

To write a PySpark DataFrame to a table in a SQL database using JDBC, we need a few things.

First, we have to add the JDBC driver to the driver node and the worker nodes. We can do that using the --jars option of spark-submit when submitting a new PySpark job:

spark-submit --deploy-mode cluster \
    --jars s3://some_bucket/jdbc_driver.jar \
    s3://some_bucket/pyspark_job.py 
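
If we do not want to keep the driver jar in S3, spark-submit can also download it from Maven Central with the --packages parameter. This is only a sketch; the version number below is an example, so use whatever matches your database:

spark-submit --deploy-mode cluster \
    --packages org.postgresql:postgresql:42.2.18 \
    s3://some_bucket/pyspark_job.py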

After that, we have to prepare the JDBC connection URL. The URL consists of three parts: the database type (the JDBC subprotocol, postgresql in this example), the host with an optional port, and the database name. If I want to connect to Postgres running on the local machine, the URL should look like this:

url = "jdbc:postgresql://localhost/database_name"
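
If the database runs on another machine or listens on a non-default port, the port goes right after the host name. Here is a sketch with a made-up host and the default Postgres port spelled out:

url = "jdbc:postgresql://db.example.com:5432/database_name"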

In addition to that, I have to prepare a dictionary of properties, which contains the username and password used to connect to the database:

properties = {
    "user": "the_username",
    "password": "the_password"
}
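
In some setups, Spark cannot infer the driver class from the URL, so it may be necessary to add the driver class name to the same dictionary. For Postgres, that could look like this:

properties = {
    "user": "the_username",
    "password": "the_password",
    "driver": "org.postgresql.Driver"
}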

Please do not store the credentials in the code. It is better to use the AWS SecretsManager (if you run your code on EMR) or any other method of passing the credentials securely from an external source.
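
For example, if the credentials live in Secrets Manager, we can fetch them with boto3 before building the properties dictionary. This is only a sketch: the secret name and the keys inside the secret ("username" and "password") are made up for the example:

import json
import boto3

# hypothetical secret that stores a JSON document with "username" and "password" keys
secrets_client = boto3.client("secretsmanager")
secret_value = secrets_client.get_secret_value(SecretId="prod/postgres/credentials")
secret = json.loads(secret_value["SecretString"])

properties = {
    "user": secret["username"],
    "password": secret["password"]
}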

I have to decide how Spark should behave when there is already some data in the table. Spark supports the usual save modes: 'append', 'overwrite', 'ignore', and 'error' (the default, which fails when the table already contains data). Let’s assume that I want to overwrite the existing data with the content of the DataFrame df. In this case, I have to set the write mode to ‘overwrite’.

The last information I need is the table name that will be populated with the DataFrame. When I have all of the required information, I can call the write.jdbc function:

df.write.jdbc(url=url, table="the_table_name", mode='overwrite', properties=properties)
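
Putting everything together, a minimal end-to-end sketch could look like the snippet below. It assumes a Postgres instance on localhost, the placeholder credentials from above, and a tiny DataFrame created in the script instead of real data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write_to_postgres").getOrCreate()

# a tiny DataFrame standing in for the real data
df = spark.createDataFrame(
    [(1, "first row"), (2, "second row")],
    ["id", "description"]
)

url = "jdbc:postgresql://localhost/database_name"
properties = {"user": "the_username", "password": "the_password"}

# overwrite whatever is already in the table with the DataFrame content
df.write.jdbc(url=url, table="the_table_name", mode="overwrite", properties=properties)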

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
