Px Files

· 1429 words · 7 minute read

This is amazingly nerdy and you should probably go find something better to do with your life. But I got challenged to write a PX file the other day. PX files are used by the PXWeb application, which is some now fairly old software for making statistical data available.

So the statistical agencies in Norway, Sweden, Finland and Estonia (and maybe others) have to create these files somehow. But because there isn’t a wide install base, resources are scarce. Still, there are resources.

The resources

Firstly, the page for the software itself: https://www.scb.se/en/services/statistical-programs-for-px-files/px-web/

Secondly, the technical specification of the format: https://www.scb.se/globalassets/vara-tjanster/px-programmen/pc-axis_fileformat-2008.pdf

There are also some libraries to help with PX files. For Python, there is pyaxis, which says it can read and write PX files. But neither the documentation nor the code seems to cover writing them: https://pypi.org/project/pyaxis/

For R, however, there is pxR, which is able to both read and write PX files: https://github.com/cjgb/pxR. The documentation is a little sparse on the writing part though.

First, though, some theory about the file format.

A PX file

First off, it is important to understand that PX files contain a lot of metadata that a normal CSV file does not. After all, the files represent statistical tables for use by the general public, and these tables need things like titles, descriptions, an ID, etc. Stuff that you can’t fit into a CSV.

Secondly, and perhaps slightly perplexing for today’s audience, these files do weird things to minimize size.

The basic anatomy of a PX file is a newline-delimited sequence of keywords and values. There is a large list of keywords to choose from, and most of them are optional, but as a minimum the file needs a table ID, a heading, a stub, a subject area and code, a title, units, values, and the data itself. Because I don’t have a running instance of the PXWeb application to test with, some of these explanations are qualified guesses.

  • MATRIX: This is the ID of the table, basically any unique alphanumeric string should work. Try to keep it short, don’t use hashes or anything, it will be exposed to users.
  • HEADING: Typically, this will be the “statistical measure”, but really any variable that it is natural to display as columns (as opposed to rows) when the statistical table is displayed.
  • STUB: Typically these are the dimensions, but really any variable that it is natural to display as rows.
  • CODES: For each dimension, a list of the category codes it can assume.
  • VALUES: The label for each of the category codes.
  • TITLE: The title of the statistical table.
  • UNITS: The unit of measurement. Often this is people, sometimes it is money such as NOK, sometimes it is something entirely different.
  • SUBJECT-CODE: Statistical tables are divided into hierarchical subject areas. This is the code for that subject area. This code appears in some URLs etc.
  • SUBJECT-AREA: The human-readable name of the subject code.
  • DATA: An ordered array of the actual data. Naturally, the ordering of this list is important and must follow the ordering of the dimensions and values they can assume.
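
To make this concrete, below is a hypothetical minimal PX file for a tiny two-by-two table. The keyword names come from the specification, but the table itself is made up, and without a PXWeb instance to test against this is an illustration of the structure rather than a validated file:

```
CHARSET="ANSI";
MATRIX="EX01";
SUBJECT-CODE="XX";
SUBJECT-AREA="Example";
TITLE="Example table";
UNITS="persons";
STUB="gender";
HEADING="year";
CODES("gender")="M","F";
VALUES("gender")="Male","Female";
VALUES("year")="2012","2013";
DATA=
1 2
3 4
;
```

Each line of DATA corresponds to one STUB category (here, a gender), and each number within a line corresponds to one HEADING category (here, a year).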

The last point about ordering makes it difficult to create these PX files manually, and a library for writing them is useful. This is where pxR is handy.

pxR

As mentioned, most of the documentation of pxR is centered around reading PX files, not writing them. But pxR can write them correctly too, based on a data frame and some correctly formatted metadata.

Apart from DATA, the items above must be defined manually (or semi-manually), separately from the data frame you want to write to a PX file.

As for the data frame, there is one important limitation: All the values, the numeric fields, have to be in a single column. This can mean a pretty big transpose job before you start writing.

Let’s start with the data. I have imagined a teeny tiny dataset of entirely hypothetical earnings data by gender, in a format that seems intuitive:

library(pxR)
library(tidyverse)

df <- data.frame(
    gender = c("M", "M", "M", "F", "F", "F", "T", "T", "T"),
    mean_earnings = c(5.34, 33.3, 553, 4.3, 3.3, 83.3, 12.3, 34.2, 22.3),
    median_earnings = c(100, 102, 303, 323, 423, 434, 847, 324, 214),
    year = c("2012", "2013", "2014", "2012", "2013", "2014", "2012", "2013", "2014"),
    stringsAsFactors = TRUE  # needed in R >= 4.0 to get factor columns
)

We have two categorical variables: gender and year, each with 3 values. We also have two numerical columns with median and mean values. In order to prepare for writing it to a PX file, we need to transpose it so that we only have one column with numerical values.

df_long <- df %>% 
    tidyr::pivot_longer(
        cols = ends_with("_earnings"),
        names_to = "contentscode",
        values_to = "value"
    )

The tidyr::pivot_longer() function is a fairly intuitive way of transposing datasets, and we won’t explain it in detail here. The result is:

> head(df_long)
# A tibble: 6 × 4
  gender year  contentscode     value
  <fct>  <fct> <fct>            <dbl>
1 M      2012  mean_earnings     5.34
2 M      2012  median_earnings 100   
3 M      2013  mean_earnings    33.3 
4 M      2013  median_earnings 102   
5 M      2014  mean_earnings   553   
6 M      2014  median_earnings 303

Now, we are ready for the metadata. Let’s start figuring out some of the fields:

  • HEADING, the value that should go across columns by default, should probably be the one that we now call contentscode.
  • STUB should be the other categorical columns: gender and year
  • CODES should be two different code lists: One for gender containing the values M, F and T, and one for year containing the values 2012, 2013, 2014.
  • VALUES: The labels for the codes. In the dataset, we only used codes, not explanatory text. For gender, the codes M, F and T represent Male, Female and Total. For year, the labels should probably be the same as the codes: 2012, 2013, 2014.
  • UNITS: Since this is earnings and I am in Norway, our pretend data will have the unit NOK.

As for the rest of the values, it doesn’t matter as much for the example, but we’ll fill in:

  • for MATRIX, we’ll put 00001 - our first of many tables
  • for TITLE, we’ll put “Average and median earnings by gender”
  • We’ll set SUBJECT-CODE to LM, short for “Labor Market”
  • Hence, SUBJECT-AREA will be “Labor Market”

pxR refers to this metadata as the skeleton of the PX object. The documentation doesn’t say anything about how to define such a skeleton from scratch, but it is constructed as a list of lists in R.

px_skeleton <- list(
    STUB = list(
        value = c("gender", "year")
    ),
    HEADING = list(
        value = c("contentscode")
    ),
    # the names under CODES and VALUES must match the column
    # names in the data frame: gender, year and contentscode
    CODES = list(
        gender = factor(c("M", "F", "T")),
        year = factor(c("2012", "2013", "2014")),
        contentscode = factor(c("mean_earnings", "median_earnings"))
    ),
    VALUES = list(
        gender = factor(c("Male", "Female", "Total")),
        year = factor(c("2012", "2013", "2014")),
        contentscode = factor(c("Mean earnings", "Median earnings"))
    ),
    UNITS = list(
        value = "nok"
    ),
    TITLE = list(
        value = "Average and median earnings by gender"
    ),
    MATRIX = list(
        value = "00001"
    ),
    SUBJECT.AREA = list(
        value = "Labor Market"
    ),
    SUBJECT.CODE = list(
        value = "LM"
    ),
    DECIMALS = list(
        value = 2
    )
)

Now, we have both a data frame (with values in a single column), and what pxR calls a skeleton, so we can create a new PX data object in R.

px_data <- as.px(df_long, px_skeleton)

Now, let’s write it to disk with

write.px(px_data, filename = "data.px")

If we take a look at the file, we can see the different header info and the values:

CHARSET="ANSI";
CREATION-DATE="20240926 23:09";
DECIMALS=2;
MATRIX="00001";
SUBJECT-CODE="LM";
SUBJECT-AREA="Labor Market";
TITLE="Average and median earnings by gender";
UNITS="nok";
STUB="contentscode","year";
HEADING="gender";
VALUES("gender")="F","M","T";
VALUES("year")="2012","2013","2014";
VALUES("contentscode")="mean_earnings","median_earnings";
CODES("gender")="M","F","T";
CODES("contentscode")="mean_earnings","median_earnings";
LAST-UPDATED="20240926 23:08:32";
INFO="File generated using R and package pxR (http://pxr.r-forge.r-project.org/)";
DATA=
4.3 5.34 12.3
3.3 33.3 34.2
83.3 553 22.3
323 100 847
423 102 324
434 303 214
;

We can also confirm that the file was written correctly by reading it back and comparing it to the original data frame.

reread_px <- read.px("data.px")

reread_df <- as.data.frame(reread_px)

dplyr::all_equal(df_long, reread_df)

This should output TRUE, possibly along with a warning that the dplyr::all_equal function is deprecated.
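
If the deprecation warning bothers you, a rough base-R stand-in (my own sketch, not something from the pxR or dplyr docs) is to align the column and row order of both data frames and compare them with all.equal():

```r
# compare_frames: a hypothetical replacement for the deprecated
# dplyr::all_equal(). It ignores row and column order and compares
# the values with base R's all.equal().
compare_frames <- function(a, b) {
    a <- a[sort(names(a))]                      # align column order
    b <- b[sort(names(b))]
    a <- a[do.call(order, a), , drop = FALSE]   # align row order
    b <- b[do.call(order, b), , drop = FALSE]
    rownames(a) <- rownames(b) <- NULL
    isTRUE(all.equal(a, b, check.attributes = FALSE))
}
```

One caveat: read.px returns factor columns, so you may need to convert factors to characters on both sides before comparing.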

Ending notes

This example was created without a PXWeb installation to work with. We can verify that the file is written and re-read without data loss, but there is no guarantee that the PXWeb application will accept the file as we created it. The pxR library notes that it does not have a comprehensive feature set for validating PX files. We also don’t know for certain what the table would look like in the PXWeb application.

But with those caveats, we are now able to understand and use the PX data format. In case you might need it.