Don't use AI to generate tests for your code, or how to do test-driven development with AI

One day, we may write code by explaining in English (or any other natural language) what we want, and the code will be generated automatically. I doubt it will ever happen. How likely is it that you will get exactly what you want after writing 20, 50, or 300 pages of application description? I’m not saying the code generation part will have any flaws. No. It will be perfect. I just think writing a detailed description without any internal contradictions or missing parts is impossible. There is a reason why we don’t do waterfall anymore.

However, let’s assume we can somehow write a sufficiently good description. Let’s imagine a world where programmers turn into lawyers. Instead of writing code, we will write a long document, making sure we cover all the cases and that “the other party” (the code-generating AI model) cannot screw us over.

It may seem that everything about programming will be different from what we do now. It’s not true. In this new world, one thing will be even more crucial than it is right now: automated testing. We will need a way to make sure the application works fine. We may use the BDD approach to describe it in natural language. Most likely, AI will use those test cases to generate the code. But automated testing won’t go away.

Why you shouldn’t use AI to generate tests for your code

People who say you can write code and use AI to generate tests are misguided. It’s the worst thing you can do.

You wrote code that may or may not work correctly. You have no clue because there are no tests, and your manual testing most likely covered only a fraction of all possible cases. Now, you generate tests. What will happen? The current behavior of the code gets carved in stone. The tests preserve the current behavior regardless of its correctness. If you made a mistake, you have just ensured you won’t catch and fix it. If you want to generate test cases from the implementation code, do yourself a favor and don’t write any tests at all.

How to do test-driven development with AI

On the other hand, there is a test-generating technique that may work if you use it correctly:

  1. First, you have to write a description of the code you want to create.
  2. Now, instead of telling AI to generate the code, ask it to generate test cases for the code. You get test cases made from the description of the expected behavior, not the code. So, at least in theory, they should test whether you got what you want. Don’t trust them yet, though.
  3. Review every test case carefully. Ask AI to criticize its own work. Ask which test cases are redundant or incorrect. Then check them yourself.
  4. When you have a test suite vetted by humans and AI, you can generate the code. This, and only this, is the correct way to do it. Even if you generate the test code, do it before generating any implementation code.

How can you do it? Before we start, let’s manage your expectations. All AI models have a limited context size, and they always will (unless you believe it’s possible to have an infinite amount of RAM, and I hope you don’t believe anything that absurd). We can’t fit all of the application code into the model input at once, so we won’t. Instead, we will generate the functions one by one. Of course, when we do this, we risk introducing bugs related to passing the right arguments between functions. Basically, when you test the individual parts properly, the remaining bugs will happen in the interactions between them (which are much harder to debug and fix). We can extend the test scope as much as we want, but every test has its boundary. In this article, I will keep the example as simple as possible.

An example of TDD with AI

Generating test cases

What’s the example? I want to send a text to a machine learning model deployed to a SageMaker endpoint and return True if the model’s response is above 0.75.

It seems trivial. However, a proper test suite would consist of at least five test cases. We should use a stub implementation of the boto3 API and write tests for three valid responses: below the threshold, exactly at the threshold, and above the threshold, as well as a test case for getting an exception from the API. If I wanted to handle different errors in different ways, I would need a test case for every supported error. In this example, an error is an error, and we don’t care what happened. The fifth test should be an integration test using the actual endpoint.
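
To make that list concrete, here is a placeholder outline of such a suite in pytest (all names are mine, nothing here was generated by AI; the bodies are intentionally left empty):

# test_my_app.py (an outline only; every test name is a placeholder)

import pytest


def test_returns_false_when_the_response_is_below_the_threshold():
    ...


def test_returns_false_when_the_response_is_exactly_the_threshold():
    ...


def test_returns_true_when_the_response_is_above_the_threshold():
    ...


def test_handles_an_error_returned_by_the_api():
    ...


@pytest.mark.skip(reason="requires a real SageMaker endpoint")
def test_the_real_endpoint_returns_a_sensible_answer():
    ...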

Let’s see what tests GPT-4 will suggest. We will skip the integration test for now.

I'm writing an application in Python using boto3. I want to send a text to a machine learning model running in a Sagemaker endpoint and return True if the model's response is above 0.75. Write unit tests using a stub instead of the boto3 API. For now, instead of writing the tests, return a list of all of the test cases that should be tested.

GPT-4 generated this response:

Test if the function sends text to the SageMaker endpoint correctly.
Test if the function handles a successful response from the SageMaker endpoint.
Test if the function returns True when the model's response is greater than or equal to 0.75.
Test if the function returns False when the model's response is less than 0.75.
Test if the function handles an empty text input.
Test if the function handles an invalid SageMaker endpoint (e.g., a non-existent or misconfigured endpoint).
Test if the function handles an invalid model response (e.g., non-numeric values or values outside the range of 0 to 1).
Test if the function handles network errors or timeouts when communicating with the SageMaker endpoint.
Test if the function handles AWS service limit errors (e.g., too many requests within a specific time window).
Test if the function gracefully handles any other exceptions that might be raised by the boto3 API.

What do we see here? First, the parts done correctly. GPT-4 wants to handle more error cases than most programmers would even envision. Of course, I would prefer to write tests for the model code and make sure it can never return an invalid value instead of testing whether I got a correct response in the application code, but adding one test case with an invalid response won’t hurt us.

Now, the mistakes. AI made the same mistake many programmers would make. I wrote “return True when the response is above 0.75,” and GPT-4 checks whether the response is greater than or equal to 0.75. That’s not what I wrote.

We have reviewed the test cases, and now let’s use AI to check them:

Compare the list of test cases with the description of the function. Which of the test cases are redundant or incorrect?

Well, it tried, but I disagree with the results:

I removed the following test cases as they are either redundant or not applicable to the function's description:

Test if the function handles an invalid model response (e.g., non-numeric values or values outside the range of 0 to 1). (This test case is redundant because handling successful responses from the SageMaker endpoint should cover handling any possible model response format.)
Test if the function handles AWS service limit errors (e.g., too many requests within a specific time window). (This test case is not applicable because it is an infrastructure concern rather than a function-specific concern.)

It didn’t catch its own mistake, and it wants to remove test cases for the wrong reasons. The first case may be redundant if we test the model code. We can remove the second one if we decide to handle all errors the same way. However, if we distinguish between errors, a service limit error must be a separate test case because it’s not really an error; it’s an unusual but normal situation that we should handle adequately.

Generating the first test

Let’s move on to writing a test case. We will test the “greater than 0.75” case, but we will do it properly, making sure 0.75 itself doesn’t count as above the threshold:

Write a test case using pytest and botocore.stub. Check if the function returns True when the model's response is above 0.75 (note that, in the case of 0.75, the function should return False). Assume that the Sagemaker endpoint is called test-endpoint.

In addition to a (decent) test case, it produced the implementation code. I don’t like that because I didn’t ask for the implementation yet. Let’s ignore it for now and take a look at the test case:

# test_my_app.py

import pytest
from botocore.stub import Stubber
from my_app import send_text_to_sagemaker
import boto3

@pytest.fixture(scope="function")
def sagemaker_client():
    return boto3.client("sagemaker-runtime")

def test_send_text_to_sagemaker_above_threshold(sagemaker_client):
    stubber = Stubber(sagemaker_client)
    test_text = "Sample text"
    test_endpoint = "test-endpoint"
    expected_response = {
        "ContentType": "text/plain",
        "InvokedProductionVariant": "variant-name",
        "Body": b"0.76"
    }

    stubber.add_response("invoke_endpoint", expected_response, {
        "EndpointName": test_endpoint,
        "ContentType": "text/plain",
        "Body": test_text
    })
    stubber.activate()

    result = send_text_to_sagemaker(test_text)

    assert result is True

    stubber.deactivate()

Fixing the code generated by AI

It’s OK. Almost correct. The test should pass the stubbed boto3 client to the tested function. Also, the deactivation step makes no sense: it won’t get triggered when the assertion fails, and since the client is function-scoped, it doesn’t matter whether we clean up or not. Personally, I would wrap the test in a `with Stubber(sagemaker_client) as stubber:` block and not worry about activation and deactivation at all. Finally, I didn’t tell it anything about the model’s response format, so it assumed something. If the format differed, we would have to include the correct instructions about the output format in the prompt too.

If you want to continue generating code with AI, you must tell GPT-4 what changes you made in the test code it generated. Of course, this gets a little easier when you use GitHub Copilot or Replit; those AI systems can read the code directly. On the other hand, they won’t generate as much code in one step as GPT-4. I think that’s better because the step-by-step approach produces better code, and you don’t need to modify as much incorrect code as you do when AI creates 50 lines at once.

I could continue the process with the rest of the test cases, but, let’s be honest, you would scroll down to the bottom of the article. So let’s tell GPT-4 what I changed and generate only two more tests.

I decided to write the test case like this:

# test_my_app.py

import pytest
from botocore.stub import Stubber
from my_app import send_text_to_sagemaker
import boto3

@pytest.fixture(scope="function")
def sagemaker_client():
    return boto3.client("sagemaker-runtime")

def test_send_text_to_sagemaker_above_threshold(sagemaker_client):
    with Stubber(sagemaker_client) as stubber:
        test_text = "Sample text"
        test_endpoint = "test-endpoint"
        expected_response = {
            "ContentType": "text/plain",
            "InvokedProductionVariant": "variant-name",
            "Body": b"0.76"
        }

        stubber.add_response("invoke_endpoint", expected_response, {
            "EndpointName": test_endpoint,
            "ContentType": "text/plain",
            "Body": test_text
        })

        result = send_text_to_sagemaker(sagemaker_client, test_text)

        assert result is True

Please generate the correct `send_text_to_sagemaker` code using the client passed as an argument.

Now, I have the implementation code looking like this:

def send_text_to_sagemaker(client, text):
    response = client.invoke_endpoint(
        EndpointName="test-endpoint",
        ContentType="text/plain",
        Body=text
    )
    prediction = float(response["Body"].read().decode("utf-8"))

    return True if prediction > 0.75 else False
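
While we are at it, the exactly-at-the-threshold case from the original list is worth writing down too. Here is a sketch of mine (not GPT-4 output) that lives in the same test_my_app.py and reuses the `sagemaker_client` fixture; note that the stubbed `Body` is a file-like object so the tested code can call `.read()` on it:

# test_my_app.py

import io


def test_send_text_to_sagemaker_at_threshold(sagemaker_client):
    with Stubber(sagemaker_client) as stubber:
        test_text = "Sample text"
        test_endpoint = "test-endpoint"
        expected_response = {
            "ContentType": "text/plain",
            "InvokedProductionVariant": "variant-name",
            # wrap the payload in a file-like object because the tested code calls .read() on the Body
            "Body": io.BytesIO(b"0.75")
        }

        stubber.add_response("invoke_endpoint", expected_response, {
            "EndpointName": test_endpoint,
            "ContentType": "text/plain",
            "Body": test_text
        })

        result = send_text_to_sagemaker(sagemaker_client, test_text)

        # exactly 0.75 is not "above 0.75", so the function must return False
        assert result is False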

Handling the incorrect input

Let’s move on to an error case. When the input is empty, I want to return False without invoking the model.

Good. It follows the convention I established by telling it what changes I made in the previous test case, and it made sure that the model wasn’t called:

# test_my_app.py

def test_send_text_to_sagemaker_empty_input(sagemaker_client):
    with Stubber(sagemaker_client) as stubber:
        test_text = ""

        result = send_text_to_sagemaker(sagemaker_client, test_text)

        assert result is False

        # No response was stubbed, so the active stubber would raise an error
        # if invoke_endpoint were called; nothing should be left pending either.
        stubber.assert_no_pending_responses()

Obviously, the implementation code also looks fine:

# my_app.py

def send_text_to_sagemaker(client, text):
    if not text:
        return False

    response = client.invoke_endpoint(
        EndpointName="test-endpoint",
        ContentType="text/plain",
        Body=text
    )
    prediction = float(response["Body"].read().decode("utf-8"))

    return True if prediction > 0.75 else False

Handling the error case

We will finish with the error case:

If the Sagemaker call fails for any reason, wrap the exception in an instance of `MLException` and raise the `MLException`. Write the test case and modify the implementation code.

I’m satisfied with the result. It told me to create the MLException class:

# my_app.py

class MLException(Exception):
    pass

It also created a correct test. I didn’t care about the kind of error, so a ThrottlingException is as good as any other type of error.

# test_my_app.py

from my_app import MLException

def test_send_text_to_sagemaker_sagemaker_call_failure(sagemaker_client):
    with Stubber(sagemaker_client) as stubber:
        test_text = "Sample text"
        test_endpoint = "test-endpoint"

        stubber.add_client_error("invoke_endpoint", service_error_code="ThrottlingException")

        with pytest.raises(MLException, match="An error occurred while invoking the SageMaker endpoint."):
            send_text_to_sagemaker(sagemaker_client, test_text)

        # Check that there are no pending responses after the exception
        stubber.assert_no_pending_responses()

Finally, AI added the error handling code I wanted. What surprised me is that it didn’t learn to do it incorrectly from the tons of bad examples on the Internet and didn’t use the log-and-throw antipattern. (Seriously, if you can handle an error, handle it and don’t propagate the exception up the call stack. If you can add some information, add it to the exception class as fields instead of scattering it across multiple places in the application log. If you can do neither, why do you even catch the exception?)

# my_app.py

def send_text_to_sagemaker(client, text):
    if not text:
        return False

    try:
        response = client.invoke_endpoint(
            EndpointName="test-endpoint",
            ContentType="text/plain",
            Body=text
        )
        prediction = float(response["Body"].read().decode("utf-8"))
    except Exception as e:
        raise MLException("An error occurred while invoking the SageMaker endpoint: test-endpoint.") from e

    return True if prediction > 0.75 else False
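
If you wanted to follow the “add the information as fields” advice from above instead of baking the endpoint name into the message, a possible variation (my sketch, not generated code) could look like this:

# my_app.py (a hypothetical variation)

class MLException(Exception):
    def __init__(self, message, endpoint_name=None):
        super().__init__(message)
        # keep the context as a field instead of scattering it across log messages
        self.endpoint_name = endpoint_name


def send_text_to_sagemaker(client, text):
    if not text:
        return False

    try:
        response = client.invoke_endpoint(
            EndpointName="test-endpoint",
            ContentType="text/plain",
            Body=text
        )
        prediction = float(response["Body"].read().decode("utf-8"))
    except Exception as e:
        raise MLException(
            "An error occurred while invoking the SageMaker endpoint.",
            endpoint_name="test-endpoint",
        ) from e

    return prediction > 0.75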

Now, you would need to repeat the process for all the test cases you want to support.
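
One of the remaining cases is the integration test we skipped at the beginning. A rough sketch of how it could look (mine, with an assumed custom `integration` marker and the same test-endpoint name) is below; it requires a deployed endpoint and valid AWS credentials:

# test_my_app_integration.py

import boto3
import pytest

from my_app import send_text_to_sagemaker


@pytest.mark.integration
def test_send_text_to_sagemaker_against_the_real_endpoint():
    # uses the real SageMaker runtime client, so AWS credentials are required
    client = boto3.client("sagemaker-runtime")

    result = send_text_to_sagemaker(client, "Sample text")

    # we cannot predict the model's score, only that the function returns a boolean
    assert isinstance(result, bool)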

Takeaways on doing TDD with AI

Remember three things:

  • Don’t believe anything AI produces. Review the test cases carefully. Try to simplify them.
  • If you make any changes, tell AI about them so it incorporates them in the next generated snippet of code.
  • “Test-Driven Development (TDD) turns the painful and exhausting hours of debugging into pleasant and joyful minutes of writing tests.” - Daniel Moka

