Testing Singer REST taps

Dec 8, 2023 · 783 words · 4 minute read

I probably need to deal with this subject not only separately, but in installments.

To set the scene, my tap-prefect comes with a test-suite from the SDK, which yields

5 warnings, 94 errors in 5.38s

The test suite bundled with singer_sdk is… convoluted. By default it also seems to run without attempting to authenticate.

I tried to go along with the test suite but without success. And after I started suspecting there wasn’t a good way to actually test the output I decided to go my own way.

Where to begin?

Because this is basically a CLI utility, it would be ideal to actually run the CLI commands and check the output. This is easier said than done though, because we want to mock some API responses. You can run CLI commands as part of tests and you can mock an API response as part of a test, but as far as I know you can’t do both, at least not when the command runs as a separate process, since a mock set up inside the test never reaches it.

After a little digging, it turns out there are quite accessible objects and methods that can be invoked, and provide for an almost end-to-end test. That way, I can mock the API the way I need, and still get to test 90% of my code in one go.

requests and responses

Mocking a response isn’t necessarily hard. You have to do it inside the test, but with the responses library there isn’t much overhead.

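import requests
import responses

@responses.activate
def test_events():
    # Register a canned reply for POST requests to this URL.
    responses.add(responses.POST, "https://example.com/data", json={"value": [1, 2, 3]}, status=200)

    r = requests.post("https://example.com/data", json={"limit": 3})

    assert r.json()["value"] == [1, 2, 3]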

Of course, when testing our code it is probably the code we want to test that does the requesting, so it will probably look something like this:

import responses
from my_lib import DataGetter

@responses.activate
def test_events():
    responses.add(responses.POST, "https://example.com/data", json={"value": [1, 2, 3]}, status=200)

    datagetter = DataGetter(limit=3)

    assert datagetter.value == [1, 2, 3]

Naturally, or at least hopefully, you know from your code what URL the DataGetter class reaches out to.

In our case, there are two more things we want to test that complicate our code: pagination, and the number of requests made.

Testing pagination isn’t actually hard: Because we do an almost end-to-end test, the native pagination of our Singer tap will run, and all we need to do is make sure that we have a mock that will catch the request for the second page when it comes around.

Depending on the API, that second page might be a new POST request to the same URL, or it might go to a new URL.

If the next page is a new URL, simply mock that URL too. If the next page is the same URL, responses has a very cool feature: It works somewhat like a FIFO queue. You can add several responses for the same URL, and they will be returned in the order they were added.

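import responses
from my_lib import DataGetter

@responses.activate
def test_events():
    # Two replies for the same URL: the first page, then an empty page.
    responses.add(responses.POST, "https://example.com/data", json={"value": [1, 2, 3]}, status=200)
    responses.add(responses.POST, "https://example.com/data", json={"value": []}, status=200)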
    datagetter = DataGetter(limit=1000)

    assert datagetter.value == [1, 2, 3]

Imagine the DataGetter pagination calls the API until it receives an empty response. The first time it queries the API it gets 3 values in return, so the pagination comes back for more, probably with some pagination parameters in the body of the POST request that we don’t have to deal with. This time the response value is an empty array, and DataGetter’s pagination quits.

Capturing stdout

Because this is a command line utility, the output of the function is printed to stdout. This is what we need to capture to verify that the function does what it is supposed to.

The built-in pytest fixture capsys makes this straightforward. To expand on our example, imagine DataGetter has a method called print_values() that does exactly that: it simply prints the values to the terminal.

Adding this test to the code we have already gives us a much more comprehensive test:

import responses
from my_lib import DataGetter

@responses.activate
def test_events(capsys):
    responses.add(responses.POST, "https://example.com/data", json={"value": [1, 2, 3]}, status=200)
    responses.add(responses.POST, "https://example.com/data", json={"value": []}, status=200)
    datagetter = DataGetter(limit=1000)

    assert datagetter.value == [1, 2, 3]

    datagetter.print_values()

    all_outs = capsys.readouterr()
    all_stdout = all_outs.out.strip()

    assert all_stdout == "1, 2, 3"

That’s it. capsys is not a reserved keyword but a built-in pytest fixture: pytest injects it whenever a test function names it as an argument, and readouterr() returns what has been printed so far.
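For a Singer tap, the captured stdout is not a single line but a stream of JSON messages, one per line, so comparing raw strings gets brittle quickly. A small helper can parse the captured text and keep only the RECORD messages of one stream. This is a sketch under assumptions: the helper name records_from_stdout and the events stream are made up for illustration, and the print calls stand in for actually running the tap's sync.

```python
import json


def records_from_stdout(stdout_text, stream):
    """Return the `record` payloads of all RECORD messages for one stream."""
    records = []
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between messages
        message = json.loads(line)
        if message.get("type") == "RECORD" and message.get("stream") == stream:
            records.append(message["record"])
    return records


def test_events_output(capsys):
    # Stand-in for running the tap: print two Singer-style messages.
    print(json.dumps({"type": "SCHEMA", "stream": "events", "schema": {}, "key_properties": []}))
    print(json.dumps({"type": "RECORD", "stream": "events", "record": {"id": 1}}))

    captured = capsys.readouterr()

    assert records_from_stdout(captured.out, "events") == [{"id": 1}]
```

Filtering by message type keeps the assertion focused on the records and makes the test indifferent to SCHEMA or STATE messages printed in between.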

Callbacks and wildcards

I also had an edge case relating to pagination that is a little too weird to get into, but in addition to the responses.add method there is also responses.add_callback, which lets you run a callback function with the request as input before returning a response. This way, you can examine the URL or body of the request in detail and construct a response.