How to use one SparkSession to run all Pytest tests

When we test a PySpark application, we need a way to pass a SparkSession into the tests. We could instantiate a separate session in every test function, but that slows the tests down significantly. Such a solution may be acceptable when we have only one or two PySpark tests inside a larger application, but what if we want to run a few hundred tests and every one of them uses PySpark?

In this situation, we should instantiate the SparkSession once and pass it to every test as a parameter. In pytest, we can do this with fixtures. Fixtures configure the test environment and clean up after the tests.

To configure a fixture shared by all tests, we create a conftest.py file in the tests directory and implement a function that returns the value of that fixture. pytest discovers fixtures defined in conftest.py automatically, so the test modules do not need to import the file. The function must be decorated with the pytest.fixture decorator. Inside the function, we can also register a finalizer that releases the resources allocated by the fixture.

By default, a fixture has the "function" scope, so pytest creates it again for every test. To reuse the same SparkSession in all of the tests, we must set the scope parameter of the fixture to "session".

The following example shows a complete definition of a SparkSession fixture. Note that the name of the function becomes the name of the fixture.

import pytest
from pyspark.sql import SparkSession


# scope="session" makes pytest create the fixture once per test run.
@pytest.fixture(scope="session")
def spark_session(request):
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("some-app-name")
        .getOrCreate()
    )

    # Stop the session (and its SparkContext) when the test run ends.
    request.addfinalizer(spark.stop)

    return spark
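
As an alternative to request.addfinalizer, pytest also accepts fixtures that yield their value and treat the code after the yield as teardown. The sketch below is one possible variant of the fixture above, not a required change. It is meant to replace the original definition rather than sit next to it, because both use the name spark_session. Importing pyspark inside the function is an assumption made for this sketch, so that the module loads before any test requests the session.

```python
import pytest


@pytest.fixture(scope="session")
def spark_session():
    # Imported here (an illustrative choice) so that pyspark is loaded only
    # when a test actually requests the session.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("some-app-name")
        .getOrCreate()
    )

    # Everything before the yield is setup; tests receive the yielded value.
    yield spark

    # Teardown: runs once, when the pytest session ends.
    spark.stop()
```

With this form, the request parameter is no longer needed, and the setup and cleanup code read top to bottom in one place.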

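In a test file, we can declare the fixture for the whole module with pytest.mark.usefixtures. The function that creates a SparkSession is called spark_session, so we use the same name to declare the fixture. This declaration is optional for test functions that take the fixture as a parameter, because requesting it by name is enough.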

import pytest

pytestmark = pytest.mark.usefixtures("spark_session")

Now, we can add the spark_session parameter to every test function that needs a SparkSession.

def test_name(spark_session):
    ...


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group