tap-pxwebapi

· 672 words · 4 minute read

A Singer Tap for Official Statistics 🔗

In the world of data engineering, Singer is popular standard, with tools like Airbyte and Meltano providing a flexible framework for data loading. One source that is often overlooked for data loading needs however, is official statistics. Different statistical offices around the world have different APIs (and in some cases no API at all), but one place to start is PxWeb.

PxWeb is a common thread connecting Norway, Sweden, and Finland in the realm of official statistics. This web application, developed by Statistics Sweden and used by their Norwegian and Finnish counterparts, is all about disseminating statistical data and metadata. It is even open-source software, potentially beneficial to a range of other organizations.

Yet, extracting data from PxWeb’s API can feel like a bit of a manual chore. This API stands out with its unique json-stat2 format and doesn’t quite follow the usual API patterns. The tap-pxwebapi Singer tap is created to make it possible to load official statistics in the same way as any other data source. It’s designed to streamline the process of extracting data from this quirky API, focusing on the features most relevant for recurrent data loads, typical in Singer’s use cases. Hopefully this provides the needed flexibility without unnecessary complexity.

Getting started requires a basic grasp of the API though. The first step is simple: input a the table number you need in the config. But for most tables and use cases, a bit more detail is be needed.

For each column, you have the option to specify which values to return. If you leave it blank, the API decides for you, providing either all values or a default top-level value for ‘collapsible’ columns. Take, for instance, the scenario of analyzing median and average incomes by region. Here, regions might be set as collapsible because it’s logical to get a total figure for each. If you don’t specify the region, you’ll end up with data for the whole country. However, for something like statistical measures (average and median), the API will give you all variants, as combining them doesn’t quite make sense.

But how do you know which filters are optional and which are necessary?

The API endpoint provides this information. For example, if you’re looking for data in Norwegian from table 07594, your URL would be https://data.ssb.no/api/v0/no/table/07594. A simple GET request to this URL fetches a JSON outlining the available dimensions (or columns) and their respective values.

In our shortened example JSON:

{
  "title": "07594: Average income and tax...",
  "variables": [
    {
      "code": "STN",
      "text": "area",
      "values": ["LANDET", "STN", "OVR"],
      "valueTexts": ["The whole country", "STN area north of Saltfjellet", "Other areas north of Saltfjellet"],
      "elimination": true
    },
    {
      "code": "ContentsCode",
      "text": "statistikkvariabel",
      "values": [
        "Inntekt"
      ],
      "valueTexts": [
        "Gjennomsnittlig beløp"
      ]
    },
    {
      "code": "Tid",
      "text": "år",
      "values": ["2006", "2007", "2013", "2015"],
      "valueTexts": [ "2006",  "2007", "2015"],
      "time": true
    }
  ]
}

Notice how the variable STN, includes an elimination key set to true. This suggests that these dimensions have a default “top” or aggregate value that the API will default to. Variables without the “elimination” key will by default return all values unless a selection is specified. While the API offers various filtering methods, tap-pxwebapi simplifies this by accepting a list of chosen values.

For dimensions without the elimination option, like ContentsCode, all values will be returned.

Note that the time column (denoted by the "time": true variable, in our case Tid), you’ll need to specify your desired filters in the POST request. tap-pxwebapi assumes you’re interested in all time periods, so you don’t need to specify the ’time’ variable.

In sum, the config for extracting our example table might look something like this, for returning all areas:

      - table_name: befolkning_sysselsetting
        table_id: '07594'
        select:
        - code: 'STN'
          values: ["LANDET", "STN", "OVR"]

That is the basics of not only using but also understanding tap-pxwebapi and the API. Hopefully, this brings official statistics into the world of normal extract/load batch jobs.

You can find the tap on the Meltano hub: https://hub.meltano.com/extractors/tap-pxwebapi