How to use Virtualenv to prepare a separate environment for a Python function running in Airflow

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (56/100)

It is not difficult to turn your Python environment into a mess. Soon, the libraries become incompatible with one another, start producing weird results or suddenly crash in the middle of a computation.

Fortunately, we can create separate environments using Virtualenv or Conda. This feature is also available in Airflow, but in this case, we have access only to Virtualenv (unless you add a custom operator).

First, we have to define the Python function we want to run. Note that we must put ALL imports inside the function, and the function cannot reference anything defined outside of it, not even a global variable. We must pass all such values as arguments through the PythonVirtualenvOperator (its op_args and op_kwargs parameters).

def some_python_function():
    import pandas as pd

    # do something with Pandas

    return "some value"
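The rule about outside variables is easier to see in a small sketch. Everything named here (THRESHOLD, count_large_values, the task id) is made up for illustration, and I use the standard library instead of Pandas so the snippet runs anywhere. I keep the operator arguments in a plain dict so the example does not need Airflow installed; in a DAG file, I would unpack them into the PythonVirtualenvOperator.

```python
# A module-level value. The function below must NOT read it directly,
# because only the function itself is shipped to the virtualenv.
THRESHOLD = 10


def count_large_values(values, threshold):
    # Imports live inside the function body.
    import statistics  # stdlib stand-in for a third-party library

    large = [v for v in values if v > threshold]
    return {
        "count": len(large),
        "mean": statistics.mean(large) if large else None,
    }


# Hypothetical operator arguments. In a DAG file they would be used as
# PythonVirtualenvOperator(dag=dag, **task_kwargs).
task_kwargs = {
    "task_id": "virtualenv_count_large_values",
    "python_callable": count_large_values,
    # The outside value travels as an explicit argument.
    "op_kwargs": {"values": [3, 12, 25], "threshold": THRESHOLD},
    "system_site_packages": False,
}

# Local check: call the function the same way the operator would.
result = task_kwargs["python_callable"](**task_kwargs["op_kwargs"])
```

Calling the function with the contents of op_kwargs is a quick way to check that it really is self-contained before Airflow runs it in a fresh environment.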

The returned value is stored in Airflow XCom, so we can reference it in subsequent tasks.
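Here is a minimal sketch of a downstream callable that reads that value. I assume the upstream task id is "virtualenv_pandas" (the one used in this article) and that the function runs in a regular PythonOperator. In Airflow 2, the operator passes ti when the callable names it as a parameter; in Airflow 1.10, I would need provide_context=True. FakeTaskInstance is only a stand-in, so the function can be tried without a running scheduler.

```python
def report_upstream_value(ti):
    # Pull the value returned by the task with task_id "virtualenv_pandas".
    value = ti.xcom_pull(task_ids="virtualenv_pandas")
    return f"virtualenv task returned: {value}"


class FakeTaskInstance:
    """Stand-in for Airflow's TaskInstance, used only for a local check."""

    def xcom_pull(self, task_ids):
        # Pretend the upstream task pushed this value to XCom.
        return "some value"


message = report_upstream_value(FakeTaskInstance())
```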

There is one issue concerning returned values (and input parameters). If the major Python version used in the Virtualenv environment differs from the one used by Airflow (for example, Python 2 inside the Virtualenv and Python 3 in Airflow), we cannot pass op_args or op_kwargs, and we cannot return values. In this case, we can use only the string_args parameter.

Use PythonVirtualenvOperator

Now, I can configure the Airflow operator. I pass the required libraries as the requirements parameter. It supports the same syntax as the requirements.txt file, so I can also define a version:

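# Import path for Airflow 2.x; in Airflow 1.10, the class lives in airflow.operators.python_operator
from airflow.operators.python import PythonVirtualenvOperator

virtualenv_task = PythonVirtualenvOperator(
    task_id="virtualenv_pandas",
    python_callable=some_python_function,
    requirements=["pandas"],  # same syntax as a line in requirements.txt
    system_site_packages=False,  # do not expose the system site-packages inside the virtualenv
    dag=dag,  # the DAG object is assumed to be defined earlier in the file
)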
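If I want to pin versions, the list could look like the sketch below. The version numbers are chosen only for illustration and are not a recommendation.

```python
# Requirement specifiers use the same syntax as lines in requirements.txt.
pinned_requirements = [
    "pandas==1.1.5",      # exact version (illustrative)
    "numpy>=1.19,<1.20",  # version range (illustrative)
]

# In the DAG file, the list replaces the unpinned one:
# PythonVirtualenvOperator(..., requirements=pinned_requirements)
```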

Bartosz Mikulski