Wrangling interval data using lubridate
One common issue we encounter in helping researchers work with the housing register data of Statistics Netherlands is its transactional nature: each row in the housing register table contains data on when someone registered and deregistered at an address (more info in Dutch here).
In this post, we show how to use this transactional data to perform one of the most common transformations we see: what part of a certain time interval (e.g, the entire year 2021 or January 1999) did the people I’m interested in live in the Netherlands? To solve this issue, we will use time interval objects, as implemented in the package {lubridate}
which is part of the {tidyverse}
since version 2.0.0.
library(tidyverse)
The data
Obviously, we cannot share actual Statistics Netherlands microdata here, so we first generate some tables that capture the gist of the data structure. First, let’s generate some basic person identifiers and some info about each person:
Code
(person_df <- tibble(
person_id = factor(c("A10232", "A39211", "A28183", "A10124")),
firstname = c("Aron", "Beth", "Carol", "Dave"),
income_avg = c(14001, 45304, 110123, 43078)
))
# A tibble: 4 × 3
person_id firstname income_avg
<fct> <chr> <dbl>
1 A10232 Aron 14001
2 A39211 Beth 45304
3 A28183 Carol 110123
4 A10124 Dave 43078
Then, we create a small example of housing transaction register data. In this data, for any period where a person is not registered to a house, they are assumed to live abroad (because everyone in the Netherlands is required to be registered at an address).
Code
(house_df <- tibble(
person_id = factor(c("A10232", "A10232", "A10232", "A39211", "A39211", "A28183", "A28183", "A10124")),
house_id = factor(c("H1200E", "H1243D", "H3432B", "HA7382", "H53621", "HC39EF", "HA3A01", "H222BA")),
start_date = ymd(c("20200101", "20200112", "20211120", "19800101", "19900101", "20170303", "20190202", "19931023")),
end_date = ymd(c("20200112", "20211120", "20230720", "19891231", "20170102", "20180720", "20230720", "20230720"))
))
# A tibble: 8 × 4
person_id house_id start_date end_date
<fct> <fct> <date> <date>
1 A10232 H1200E 2020-01-01 2020-01-12
2 A10232 H1243D 2020-01-12 2021-11-20
3 A10232 H3432B 2021-11-20 2023-07-20
4 A39211 HA7382 1980-01-01 1989-12-31
5 A39211 H53621 1990-01-01 2017-01-02
6 A28183 HC39EF 2017-03-03 2018-07-20
7 A28183 HA3A01 2019-02-02 2023-07-20
8 A10124 H222BA 1993-10-23 2023-07-20
Interval objects!
Notice how each transaction in the housing data has a start and end date, indicating when someone registered and deregistered at an address. A natural representation of this information is as a single object: a time interval. The package {lubridate}
has support for specific interval objects, and several operations on intervals:
- computing the length of an interval with
int_length()
- computing whether two intervals overlap with
int_overlap()
- and much more… as you can see here
So let’s transform these start and end columns into a single interval column!
house_df <-
house_df |>
mutate(
# create the interval
int = interval(start_date, end_date),
# drop the start/end columns
.keep = "unused"
)
house_df
# A tibble: 8 × 3
person_id house_id int
<fct> <fct> <Interval>
1 A10232 H1200E 2020-01-01 UTC--2020-01-12 UTC
2 A10232 H1243D 2020-01-12 UTC--2021-11-20 UTC
3 A10232 H3432B 2021-11-20 UTC--2023-07-20 UTC
4 A39211 HA7382 1980-01-01 UTC--1989-12-31 UTC
5 A39211 H53621 1990-01-01 UTC--2017-01-02 UTC
6 A28183 HC39EF 2017-03-03 UTC--2018-07-20 UTC
7 A28183 HA3A01 2019-02-02 UTC--2023-07-20 UTC
8 A10124 H222BA 1993-10-23 UTC--2023-07-20 UTC
We will want to compare this interval with a reference interval to compute the proportion of time that a person lived in the Netherlands within the reference interval. Therefore, we quickly define a new interval operation which truncates an interval to a reference interval.
Don’t worry too much about it for now, we will use it later. Do notice that we’re always using the int_*()
functions defined by {lubridate}
to interact with the interval objects.
# utility function to truncate an interval object to limits (also vectorized so it works in mutate())
int_truncate <- function(int, int_limits) {
int_start(int) <- pmax(int_start(int), int_start(int_limits))
int_end(int) <- pmin(int_end(int), int_end(int_limits))
return(int)
}
Computing the proportion in the Netherlands
The next step is to define a function that computes for each person a proportion overlap for a reference interval. By creating a function, it will be easy later to do the same operation for different intervals (e.g., different reference years) to work with the rich nature of the Statistics Netherlands microdata. To compute this table, we make
extensive use of the {tidyverse}
, with verbs like filter()
, mutate()
, and summarize()
. If you want to know more about these, take a look at the {dplyr}
documentation (but of course you can also use your own flavour of data processing, such as {data.table}
or base R
).
# function to compute overlap proportion per person
proportion_tab <- function(housing_data, reference_interval) {
# start with the housing data
housing_data |>
# only retain overlapping rows, this makes the following
# operations more efficient by only computing what we need
filter(int_overlaps(int, reference_interval)) |>
# then, actually compute the overlap of the intervals
mutate(
# use our earlier truncate function
int_tr = int_truncate(int, reference_interval),
# then, it's simple to compute the overlap proportion
prop = int_length(int_tr) / int_length(reference_interval)
) |>
# combine different intervals per person
summarize(prop_in_nl = sum(prop), .by = person_id)
}
Now we’ve defined this function, let’s try it out for a specific year such as 2017!
int_2017 <- interval(ymd("20170101"), ymd("20171231"))
prop_2017 <- proportion_tab(house_df, int_2017)
prop_2017
# A tibble: 3 × 2
person_id prop_in_nl
<fct> <dbl>
1 A39211 0.00275
2 A28183 0.832
3 A10124 1
Now we’ve computed this proportion, notice that we only have three people. This means that the other person was living abroad in that time,
with a proportion in the Netherlands of 0. To nicely display this information, we can join the proportion table with the original person
dataset and replace the NA
values in the proportion column with 0.
left_join(person_df, prop_2017, by = "person_id") |>
mutate(prop_in_nl = replace_na(prop_in_nl, 0))
# A tibble: 4 × 4
person_id firstname income_avg prop_in_nl
<fct> <chr> <dbl> <dbl>
1 A10232 Aron 14001 0
2 A39211 Beth 45304 0.00275
3 A28183 Carol 110123 0.832
4 A10124 Dave 43078 1
Success! We now have a dataset for each person with the proportion of time they lived in the Netherlands in 2017. If you look at the original housing dataset, you may see the following patterns reflected in the proportion:
- Aron indeed did not live in the Netherlands at this time.
- Beth moved away on January 2nd, 2017.
- Carol moved into the Netherlands on March 3rd, 2017 and remained there until 2018
- Dave lived in the Netherlands this entire time.
Conclusion
In this post, we used interval objects and operations from the {lubridate}
package to wrangle transactional housing data into a
proportion of time spent living in the Netherlands. The advantage of using this package and its functions is that any particularities with
timezones, date comparison, and leap years are automatically dealt with so that we could focus on the end result rather than the details.
If you are doing similar work and you have a different method, let us know! In addition, if you have further questions about working with Statistics Netherlands microdata or other complex or large social science datasets, do not hesitate to contact us on our website: https://odissei-soda.nl.
Bonus appendix: multiple time intervals
Because we created a function that takes in the transaction data and a reference interval, we can do the same thing for multiple time intervals (e.g., years) and combine the data together in one wide or long dataset. This is one way to do this:
library(glue) # for easy string manipulation
# initialize an empty dataframe with all our columns
nl_prop <- tibble(person_id = factor(), prop_in_nl = double(), yr = integer())
# then loop over the years of interest
for (yr in 2017L:2022L) {
# construct reference interval for this year
ref_int <- interval(ymd(glue("{yr}0101")), ymd(glue("{yr}1231")))
# compute the proportion table for this year
nl_prop_yr <- proportion_tab(house_df, ref_int) |> mutate(yr = yr)
# append this year to the dataframe
nl_prop <- bind_rows(nl_prop, nl_prop_yr)
}
# we can pivot it to a wide format
nl_prop_wide <-
nl_prop |>
pivot_wider(
names_from = yr,
names_prefix = "nl_prop_",
values_from = prop_in_nl
)
# and join it with the original person data, replacing NAs with 0 again
person_df |>
left_join(nl_prop_wide, by = "person_id") |>
mutate(across(starts_with("nl_prop_"), \(p) replace_na(p, 0)))
# A tibble: 4 × 9
person_id firstname income_avg nl_prop_2017 nl_prop_2018 nl_prop_2019
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 A10232 Aron 14001 0 0 0
2 A39211 Beth 45304 0.00275 0 0
3 A28183 Carol 110123 0.832 0.549 0.912
4 A10124 Dave 43078 1 1 1
# ℹ 3 more variables: nl_prop_2020 <dbl>, nl_prop_2021 <dbl>,
# nl_prop_2022 <dbl>