NetCBS: creating network measures using CBS networks (POPNET) in the RA

Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues. These data allow researchers to study how a person’s embeddedness in the network of social contexts affects their outcomes in health, education, and the labor market. For example, the characteristics of the parents of a student’s classmates can be used to study the relationship between social networks and educational outcomes. CBS makes these data available through the POPNET network files.

If you are interested in using these data, please refer to the CBS website for more information on how to access them.

How to work with the POPNET network files?

Analyzing the network files is not straightforward: the files are extremely large (hundreds of millions of observations), and studying them requires merging multiple files and aggregating the data in a specific way. The netCBS library is designed to simplify this process by providing a simple query system to specify the relationships between the main sample dataframe and the context data. The library then merges the network files and aggregates the data based on the query, returning the desired network measures.

First, you will need to install it in your RA environment:

pip install netcbs

Let’s imagine you are interested in understanding how educational attainment depends on the income and age of parents of the other children in the classroom.

We will need the following data:

Your sample df_sample: in this case the children

RINPERSOON     RINPERSOONS
1312231231     R           
2234523452     R           
2345234333     R         
4425345234     R    
...

The characteristics of the parents df_agg: in this case income and age

RINPERSOON     RINPERSOONS    Income  Age
2435235880     30000          23321   45          
8438423423     40000          74329   32
2345234333     50000          63123   41

The link between the children and the parents (dataset FAMILIENETWERKTAB), and between children and schoolmates (dataset KLASGENOTENNETWERKTAB): netCBS will take care of this for you.

You can then use the netcbs library to calculate the average income and age of the parents of the children’s classmates.

import netcbs
import polars as pl  # provides the aggregation functions used below

query = "[Income, Age] -> Family[301] -> Schoolmates[all] -> Sample"
df = netcbs.transform(query,
                      df_sample=df_sample,  # dataset with the sample to study
                      df_agg=df_agg,  # dataset with the variables to aggregate (income and age)
                      year=2021,  # year to study
                      cbsdata_path='G:/Bevolking',  # path to the CBS data
                      agg_funcs=[pl.mean, pl.sum, pl.count],  # calculate the mean, sum and count
                      return_pandas=False,  # if True, return a pandas dataframe instead of a polars dataframe
                      lazy=True  # use polars lazy evaluation (faster/less memory usage)
                      )

How does the query work?

The library uses a query system to specify the relationships between the main sample dataframe and the context data. The query consists of a series of context types separated by arrows (->), with relationship types in square brackets. For example, the query "[Income, Age] -> Family[301] -> Schoolmates[all] -> Sample" specifies that the income and age of the parents of each student’s classmates should be aggregated for every individual in the provided sample dataframe. Let’s break the query down:

[Income, Age] specifies the columns to be aggregated. In this case, we are interested in the income and age of the parents of the children’s classmates.
Family[301] specifies the relationship between the children and their parents. The number in square brackets indicates the relationship type, which is 301 for the parent-child relationship. The relationship types are listed in the CBS data documentation, and can also be found by printing netcbs.context2types and netcbs.codebook.
Schoolmates[all] specifies the relationship between the children and their classmates. The keyword all indicates that all classmates should be included in the calculation.
Sample is always the end of the query; it refers to the sample dataframe (df_sample).
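To make the structure of the query string concrete, the breakdown above can be sketched as a small parser. This is an illustration only, not the parser netCBS uses internally: the names parse_query and Step are hypothetical, and accepting several comma-separated relationship types inside one pair of brackets is an assumption of this sketch.

```python
import re
from typing import NamedTuple


class Step(NamedTuple):
    """One context step of the query (hypothetical helper type)."""
    context: str      # e.g. "Family"
    types: list[str]  # e.g. ["301"] or ["all"]


def parse_query(query: str) -> tuple[list[str], list[Step]]:
    """Split a query into aggregation columns and context steps (illustrative only)."""
    parts = [p.strip() for p in query.split("->")]
    if len(parts) < 3 or parts[-1] != "Sample":
        raise ValueError("expected '[cols] -> Context[types] -> ... -> Sample'")

    # First element: the bracketed list of columns to aggregate.
    cols_match = re.fullmatch(r"\[(.*)\]", parts[0])
    if cols_match is None:
        raise ValueError("query must start with a bracketed column list")
    columns = [c.strip() for c in cols_match.group(1).split(",") if c.strip()]

    # Middle elements: context name followed by relationship types in brackets.
    steps = []
    for part in parts[1:-1]:
        m = re.fullmatch(r"(\w+)\[(.*)\]", part)
        if m is None:
            raise ValueError(f"cannot parse step: {part!r}")
        types = [t.strip() for t in m.group(2).split(",") if t.strip()]
        steps.append(Step(m.group(1), types))
    return columns, steps


columns, steps = parse_query(
    "[Income, Age] -> Family[301] -> Schoolmates[all] -> Sample"
)
print(columns)  # ['Income', 'Age']
print(steps)
```

Reading the parsed steps from right to left gives the order in which the links are followed: from the sample to their schoolmates, then from the schoolmates to their parents, whose Income and Age are aggregated.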

The library has several parameters:

The aggregation functions are specified in the agg_funcs parameter. In this case, we are calculating the average (pl.mean), sum (pl.sum) and number (pl.count) for the income and age of the parents of the children’s classmates. The number allows us to distinguish children with 0, 1 or 2 parents alive.
year specifies the year of the CBS data to be used.
cbsdata_path specifies the path to the CBS data. Leave this unchanged.
return_pandas specifies whether to return a pandas dataframe instead of a polars dataframe. This can be useful for further analysis in pandas.
lazy specifies whether to use polars lazy evaluation. We recommend using lazy evaluation (lazy=True) to reduce memory usage and speed up the calculations. For debugging, it can be disabled by setting lazy=False.
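To illustrate what the three aggregation functions produce, here is a toy version of the example query written in plain Python. It is a conceptual sketch, not how netCBS works internally (the library operates on polars dataframes and the CBS network files). All identifiers and values below are made up, and aggregate_for is a hypothetical helper.

```python
from statistics import mean

# Hypothetical toy data; no real RINPERSOON identifiers or CBS files are used.
classmates = {
    "child_a": ["child_b", "child_c"],
    "child_b": ["child_a", "child_c"],
}
parents = {
    "child_a": ["p1", "p2"],
    "child_b": ["p3", "p4"],
    "child_c": ["p5"],  # only one parent on record
}
income = {"p1": 30000, "p2": 20000, "p3": 40000, "p4": 50000, "p5": 60000}


def aggregate_for(child: str) -> dict:
    """Mean, sum and count of the income of the parents of a child's classmates."""
    values = [
        income[parent]
        for mate in classmates.get(child, [])   # Sample -> Schoolmates
        for parent in parents.get(mate, [])     # Schoolmates -> Family (parents)
        if parent in income
    ]
    if not values:
        return {"mean": None, "sum": 0, "count": 0}
    return {"mean": mean(values), "sum": sum(values), "count": len(values)}


result = {child: aggregate_for(child) for child in ["child_a", "child_b"]}
print(result["child_a"])  # {'mean': 50000, 'sum': 150000, 'count': 3}
```

In this toy data the count for child_a is 3 rather than 4 because one of the two classmates has a single parent on record, which is the kind of information the count column adds on top of the mean and the sum.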

More examples

See this Jupyter notebook for accessible information and examples.

Citation

The package netCBS is published under an MIT license. When using netCBS for academic work, please cite:

Garcia-Bernardo, Javier (2024). netCBS: A Python library to efficiently create network measures using CBS networks (POPNET) in the RA (0.1). Zenodo. 10.5281/zenodo.13908120