Wrangling interval data using lubridate

September 29, 2023 | 7

Wrangling interval data using lubridate

One common issue we encounter in helping researchers work with the housing register data of Statistics Netherlands is its transactional nature: each row in the housing register table contains data on when someone registered and deregistered at an address (more info in Dutch here).

In this post, we show how to use this transactional data to perform one of the most common transformations we see: what part of a certain time interval (e.g, the entire year 2021 or January 1999) did the people I’m interested in live in the Netherlands? To solve this issue, we will use time interval objects, as implemented in the package {lubridate} which is part of the {tidyverse} since version 2.0.0.

library(tidyverse)

The data

Obviously, we cannot share actual Statistics Netherlands microdata here, so we first generate some tables that capture the gist of the data structure. First, let’s generate some basic person identifiers and some info about each person:

Code

(person_df <- tibble(
  person_id = factor(c("A10232", "A39211", "A28183", "A10124")), 
  firstname = c("Aron", "Beth", "Carol", "Dave"),
  income_avg = c(14001, 45304, 110123, 43078)
))

# A tibble: 4 × 3
  person_id firstname income_avg
  <fct>     <chr>          <dbl>
1 A10232    Aron           14001
2 A39211    Beth           45304
3 A28183    Carol         110123
4 A10124    Dave           43078

Then, we create a small example of housing transaction register data. In this data, for any period where a person is not registered to a house, they are assumed to live abroad (because everyone in the Netherlands is required to be registered at an address).

Code

(house_df <- tibble(
  person_id  = factor(c("A10232", "A10232", "A10232", "A39211", "A39211", "A28183", "A28183", "A10124")),
  house_id   = factor(c("H1200E", "H1243D", "H3432B", "HA7382", "H53621", "HC39EF", "HA3A01", "H222BA")),
  start_date = ymd(c("20200101", "20200112", "20211120", "19800101", "19900101", "20170303", "20190202", "19931023")),
  end_date   = ymd(c("20200112", "20211120", "20230720", "19891231", "20170102", "20180720", "20230720", "20230720"))
))

# A tibble: 8 × 4
  person_id house_id start_date end_date  
  <fct>     <fct>    <date>     <date>    
1 A10232    H1200E   2020-01-01 2020-01-12
2 A10232    H1243D   2020-01-12 2021-11-20
3 A10232    H3432B   2021-11-20 2023-07-20
4 A39211    HA7382   1980-01-01 1989-12-31
5 A39211    H53621   1990-01-01 2017-01-02
6 A28183    HC39EF   2017-03-03 2018-07-20
7 A28183    HA3A01   2019-02-02 2023-07-20
8 A10124    H222BA   1993-10-23 2023-07-20

Interval objects!

Notice how each transaction in the housing data has a start and end date, indicating when someone registered and deregistered at an address. A natural representation of this information is as a single object: a time interval. The package {lubridate} has support for specific interval objects, and several operations on intervals:

computing the length of an interval with int_length()
computing whether two intervals overlap with int_overlap()
and much more… as you can see here

So let’s transform these start and end columns into a single interval column!

house_df <- 
  house_df |> 
  mutate(
    # create the interval
    int = interval(start_date, end_date), 
    # drop the start/end columns
    .keep = "unused"                      
  )

house_df

# A tibble: 8 × 3
  person_id house_id int                           
  <fct>     <fct>    <Interval>                    
1 A10232    H1200E   2020-01-01 UTC--2020-01-12 UTC
2 A10232    H1243D   2020-01-12 UTC--2021-11-20 UTC
3 A10232    H3432B   2021-11-20 UTC--2023-07-20 UTC
4 A39211    HA7382   1980-01-01 UTC--1989-12-31 UTC
5 A39211    H53621   1990-01-01 UTC--2017-01-02 UTC
6 A28183    HC39EF   2017-03-03 UTC--2018-07-20 UTC
7 A28183    HA3A01   2019-02-02 UTC--2023-07-20 UTC
8 A10124    H222BA   1993-10-23 UTC--2023-07-20 UTC

We will want to compare this interval with a reference interval to compute the proportion of time that a person lived in the Netherlands within the reference interval. Therefore, we quickly define a new interval operation which truncates an interval to a reference interval. Don’t worry too much about it for now, we will use it later. Do notice that we’re always using the int_*() functions defined by {lubridate} to interact with the interval objects.

# utility function to truncate an interval object to limits (also vectorized so it works in mutate())
int_truncate <- function(int, int_limits) {
  int_start(int) <- pmax(int_start(int), int_start(int_limits))
  int_end(int)   <- pmin(int_end(int), int_end(int_limits))
  return(int)
}

Computing the proportion in the Netherlands

The next step is to define a function that computes for each person a proportion overlap for a reference interval. By creating a function, it will be easy later to do the same operation for different intervals (e.g., different reference years) to work with the rich nature of the Statistics Netherlands microdata. To compute this table, we make extensive use of the {tidyverse}, with verbs like filter(), mutate(), and summarize(). If you want to know more about these, take a look at the {dplyr} documentation (but of course you can also use your own flavour of data processing, such as {data.table} or base R).

# function to compute overlap proportion per person
proportion_tab <- function(housing_data, reference_interval) {
  
  # start with the housing data
  housing_data |> 
    
    # only retain overlapping rows, this makes the following
    # operations more efficient by only computing what we need
    filter(int_overlaps(int, reference_interval)) |> 
    
    # then, actually compute the overlap of the intervals
    mutate(
      
      # use our earlier truncate function
      int_tr = int_truncate(int, reference_interval),
      
      # then, it's simple to compute the overlap proportion
      prop = int_length(int_tr) / int_length(reference_interval)
      
    ) |> 
    
    # combine different intervals per person
    summarize(prop_in_nl = sum(prop), .by = person_id)
  
}

Now we’ve defined this function, let’s try it out for a specific year such as 2017!

int_2017  <- interval(ymd("20170101"), ymd("20171231"))
prop_2017 <- proportion_tab(house_df, int_2017)

prop_2017

# A tibble: 3 × 2
  person_id prop_in_nl
  <fct>          <dbl>
1 A39211       0.00275
2 A28183       0.832  
3 A10124       1

Now we’ve computed this proportion, notice that we only have three people. This means that the other person was living abroad in that time, with a proportion in the Netherlands of 0. To nicely display this information, we can join the proportion table with the original person dataset and replace the NA values in the proportion column with 0.

left_join(person_df, prop_2017, by = "person_id") |> 
  mutate(prop_in_nl = replace_na(prop_in_nl, 0))

# A tibble: 4 × 4
  person_id firstname income_avg prop_in_nl
  <fct>     <chr>          <dbl>      <dbl>
1 A10232    Aron           14001    0      
2 A39211    Beth           45304    0.00275
3 A28183    Carol         110123    0.832  
4 A10124    Dave           43078    1

Success! We now have a dataset for each person with the proportion of time they lived in the Netherlands in 2017. If you look at the original housing dataset, you may see the following patterns reflected in the proportion:

Aron indeed did not live in the Netherlands at this time.
Beth moved away on January 2nd, 2017.
Carol moved into the Netherlands on March 3rd, 2017 and remained there until 2018
Dave lived in the Netherlands this entire time.

Conclusion

In this post, we used interval objects and operations from the {lubridate} package to wrangle transactional housing data into a proportion of time spent living in the Netherlands. The advantage of using this package and its functions is that any particularities with timezones, date comparison, and leap years are automatically dealt with so that we could focus on the end result rather than the details.

If you are doing similar work and you have a different method, let us know! In addition, if you have further questions about working with Statistics Netherlands microdata or other complex or large social science datasets, do not hesitate to contact us on our website: https://odissei-soda.nl.

Bonus appendix: multiple time intervals

Because we created a function that takes in the transaction data and a reference interval, we can do the same thing for multiple time intervals (e.g., years) and combine the data together in one wide or long dataset. This is one way to do this:

library(glue) # for easy string manipulation

# initialize an empty dataframe with all our columns
nl_prop <- tibble(person_id = factor(), prop_in_nl = double(), yr = integer())

# then loop over the years of interest
for (yr in 2017L:2022L) {
  # construct reference interval for this year
  ref_int <- interval(ymd(glue("{yr}0101")), ymd(glue("{yr}1231")))
  # compute the proportion table for this year
  nl_prop_yr <- proportion_tab(house_df, ref_int) |> mutate(yr = yr)
  # append this year to the dataframe
  nl_prop <- bind_rows(nl_prop, nl_prop_yr)
}

# we can pivot it to a wide format
nl_prop_wide <- 
  nl_prop |> 
  pivot_wider(
    names_from = yr, 
    names_prefix = "nl_prop_", 
    values_from = prop_in_nl
  )

# and join it with the original person data, replacing NAs with 0 again
person_df |> 
  left_join(nl_prop_wide, by = "person_id") |> 
  mutate(across(starts_with("nl_prop_"), \(p) replace_na(p, 0)))

# A tibble: 4 × 9
  person_id firstname income_avg nl_prop_2017 nl_prop_2018 nl_prop_2019
  <fct>     <chr>          <dbl>        <dbl>        <dbl>        <dbl>
1 A10232    Aron           14001      0              0            0    
2 A39211    Beth           45304      0.00275        0            0    
3 A28183    Carol         110123      0.832          0.549        0.912
4 A10124    Dave           43078      1              1            1    
# ℹ 3 more variables: nl_prop_2020 <dbl>, nl_prop_2021 <dbl>,
#   nl_prop_2022 <dbl>