In one of my first ECON 101 classes, the lecturer talked about economic models and likened them to maps. For reasons I would only understand later, he argued that maps are a miniaturized, simplified version of the landscape. Many might wish for a more detailed map, but if you keep adding detail you eventually end up with a 1:1 map draped over the landscape. Needless to say, such a map would serve no purpose. The simplification is the point.
From time to time, I see a similar point made about data. Whatever you have in your spreadsheet or database is a lossy, downsampled version of reality. If what you work with is accounting data or network traffic or something similar, you might think the data is reality, and you wouldn't be entirely wrong. The accounts are the accounts; if someone forgot to file an expense report, well, then it wasn't an expense. Not for the company, anyway. But accounting involves judgement. What even counts as an expense is, at the margins at least, debated and adjusted by professional bodies and politicians alike. For many other types of data, the questions are even more numerous. The most obvious examples come from surveys. Surveys are an entire field of study, and I'm not talking about sampling - I'm talking about survey design. How the questions are formulated, the order they appear in, what the response options are - all of these choices affect how respondents understand and answer the questions. And depending on the subject of your survey, you might not get honest answers even if everything else is well thought out.
For a number of years, I worked with register data. Registers don't suffer from the same issues as surveys, but they have their own. The theory of register data isn't as mature as that of survey data. But the issue I would start with is this: with register data, you are not the intended audience.
Register or administrative data is (primarily, at least) collected for administrative purposes. Tax collection, welfare payments, school administration: all register data comes from systems like these. Indeed, even some surveys are answered by dumping data from HR or similar systems.
The key to understanding register data is identifying the gap between the purpose the register serves, the goal of the statistical analysis, and the goals and motivations of whoever submits data to the register.
As a trivial example, people are incentivized to understate their income for tax purposes, and you can easily imagine a well-executed income survey yielding a higher average income than the tax data does. A more complex story can be told, and is being told, about school results. GPAs reflect ability, but also something else. Some schools might be strict graders, others might grade on a curve, some students might be good at cheating, some might impress their teachers, and some students might have very involved parents who put pressure on the teacher.
If you work with register data, I encourage you to study how data is entered into these systems. The rules, handbooks, interpretations, constraints and incentives of the civil servants and others who maintain the systems are important and tell you a lot about which quirks to expect. Equal attention should be paid to changes in these systems. The recently reported rise in maternal mortality in the U.S. is likely due to a change in the form for registering deaths, coupled with a change in how the number is calculated.
A less morbid example, closer to my heart, has to do with reporting normal working hours in Norway. A lot of employment contracts don't specify any real working hours. The contract might simply say something like "up to 100% position" or "when called upon". When reporting this data to the tax authorities and the social services administration, many companies simply reported 0%. That was obviously not the case, but the advantage is that it is obviously not the case. As some analysts got frustrated with all the missing working-time data, they communicated that 0% contracts weren't well received. And so, many employers started reporting these as 100% jobs. An almost equally untrue claim for most of these jobs, but with the disadvantage of not being obvious. Most jobs are full time, and so for the jobs that used to report 0%, 100% becomes a masked missing value.
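To make that concrete, here is a minimal, purely hypothetical sketch of what the cleanup ends up looking like for the analyst: the column names, contract types and cutoff date are all assumptions for illustration, not the actual register layout. The point is simply that once the placeholder moves from 0% to 100%, someone has to actively decide which full-time values not to believe.

```python
import pandas as pd
import numpy as np

# Hypothetical extract of reported employment contracts.
# Column names and the cutoff date are made up for illustration.
contracts = pd.DataFrame({
    "contract_id": [1, 2, 3, 4],
    "reported_pct": [100.0, 0.0, 100.0, 50.0],
    "reporting_period": pd.to_datetime(
        ["2020-03-01", "2018-06-01", "2017-01-01", "2020-03-01"]
    ),
    "contract_type": ["on_call", "on_call", "permanent", "permanent"],
})

# Before the (assumed) change in reporting practice, employers signalled
# "no fixed hours" with 0%, an obvious placeholder. Afterwards, some
# switched to 100%, which blends in with genuine full-time positions.
PRACTICE_CHANGE = pd.Timestamp("2019-01-01")

suspect = (
    (contracts["reported_pct"] == 100.0)
    & (contracts["contract_type"] == "on_call")
    & (contracts["reporting_period"] >= PRACTICE_CHANGE)
)

# Treat the suspect 100% values as missing rather than as data.
contracts["pct_cleaned"] = contracts["reported_pct"].where(~suspect, np.nan)
print(contracts)
```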
Paying attention to these details can be vital to understanding the data. A fun and illustrative example is the hundreds of pages that make up the manual for the population register. Questions like what even counts as a place of residence, when a person has moved out, and how to enter a name that contains non-Norwegian characters are answered in manuals like these. These are questions that, for very real and often legal reasons, need to be determined - categorized, if you will - and the long-winded guidelines used by the people who work with this every day are the primary tool for determining them. Hopefully in a manner that is consistent across time and people.
One example of people who get into these details is the team behind Moody's Inside Economics podcast. Accompanying the commentary about recession risk and interest rates are discussions about how survey questions are interpreted, problems with double counting, and the intricacies of seasonal adjustment. Knowing these details is necessary when Moody's wants to understand economic data; they even hire out of the BLS, which produces this data.
Data isn’t reality. Not all of it, anyways. But it is one of many prisms through which to see the world. It gives a zoomed-out, pixelated view, but as long as you keep an eye out for potential distortions there is a lot to learn.