NetCBS: creating network measures using CBS networks (POPNET) in the RA
Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contain information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues. These data allow researchers to study how a person’s embeddedness in the network of social contexts affects their outcomes in health, education, and the labor market. For example, the characteristics of the parents of a student’s classmates can be used to study the relationship between social networks and educational outcomes. CBS makes these data available through the POPNET network files.
If you are interested in using these data please refer to the CBS website for more information on how to access the data.
How to work with the POPNET network files?
Analyzing the network files is not straightforward: the files are extremely large (hundreds of millions of observations), and studying them requires merging multiple files and aggregating the data in a specific way. The netCBS library is designed to simplify this process by providing a simple query system to specify the relationships between the main sample dataframe and the context data. The library then merges the network files and aggregates the data based on the query, returning the desired network measures.
First, you will need to install it in your RA environment:
pip install netcbs
Let’s imagine you are interested in understanding how educational attainment depends on the income and age of parents of the other children in the classroom.
We will need the following data:
- Your sample df_sample: in this case the children

    RINPERSOON  RINPERSOONS
    1312231231  R
    2234523452  R
    2345234333  R
    4425345234  R
    ...

- The characteristics of the parents df_agg: in this case income and age

    RINPERSOON  RINPERSOONS  Income  Age
    2435235880  30000        23321   45
    8438423423  40000        74329   32
    2345234333  50000        63123   41

- The link between the children and the parents (dataset FAMILIENETWERKTAB), and between children and schoolmates (dataset KLASGENOTENNETWERKTAB): netCBS will take care of this for you.
You can then use the netcbs library to calculate the average income and age of the parents of the children’s classmates.
import polars as pl
import netcbs

query = "[Income, Age] -> Family[301] -> Schoolmates[all] -> Sample"
df = netcbs.transform(
    query,
    df_sample=df_sample,  # dataset with the sample to study
    df_agg=df_agg,  # dataset with the variables to aggregate (income and age)
    year=2021,  # year to study
    cbsdata_path='G:/Bevolking',  # path to the CBS data
    agg_funcs=[pl.mean, pl.sum, pl.count],  # calculate the average, sum and count
    return_pandas=False,  # if True, returns a pandas dataframe instead of a polars dataframe
    lazy=True,  # use polars lazy evaluation (faster/less memory usage)
)
How does the query work?
The library uses a query system to specify the relationships between the main sample dataframe and the context data. The query consists of a series of context types separated by arrows (->), with optional relationship types in square brackets. For example, the query "[Income, Age] -> Family[301] -> Schoolmates[all] -> Sample" specifies that the income and age of the parents of the student’s classmates should be calculated based on the provided sample dataframe. Let’s break the query down:
- [Income, Age] specifies the columns to be aggregated. In this case, we are interested in the income and age of the parents of the children’s classmates.
- Family[301] specifies the relationship between the children and their parents. The number in square brackets indicates the relationship type, which is 301 for the parent-child relationship. The relationship types are specified in the CBS data documentation, or can be found by printing netcbs.context2types and netcbs.codebook.
- Schoolmates[all] specifies the relationship between the children and their classmates. The keyword all indicates that all classmates should be included in the calculation.
- Sample is always the end of the query.
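To make the structure of the query concrete, here is a minimal sketch of how such a string could be split into its parts. It is not the parser netCBS uses: the function name parse_query and the returned dictionary layout are hypothetical, chosen only for illustration. Note that the query reads from right to left: starting from the sample, follow the schoolmates link, then the family link, and finally aggregate the listed columns.

```python
import re


def parse_query(query: str) -> dict:
    """Split a netCBS-style query string into its parts (illustrative only)."""
    steps = [step.strip() for step in query.split("->")]
    if len(steps) < 3:
        raise ValueError("expected '[columns] -> Context[type] -> ... -> Sample'")
    if steps[-1] != "Sample":
        raise ValueError("the query must end with 'Sample'")

    # First step: the bracketed list of columns to aggregate.
    first = steps[0]
    if not (first.startswith("[") and first.endswith("]")):
        raise ValueError("the query must start with a bracketed column list")
    columns = [col.strip() for col in first[1:-1].split(",") if col.strip()]

    # Middle steps: a context name followed by a relationship type in brackets.
    # The bracket content is kept as a plain string ("301", "all").
    contexts = []
    for step in steps[1:-1]:
        match = re.fullmatch(r"(\w+)\[([^\]]*)\]", step)
        if match is None:
            raise ValueError(f"cannot parse step: {step!r}")
        contexts.append((match.group(1), match.group(2).strip()))

    return {"columns": columns, "contexts": contexts}


parsed = parse_query("[Income, Age] -> Family[301] -> Schoolmates[all] -> Sample")
print(parsed)
```

In this sketch the contexts are listed in the order they are written, so the last entry (Schoolmates) is the first link followed from the sample.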
The library has several parameters:
- agg_funcs specifies the aggregation functions. In this case, we are calculating the average (pl.mean), sum (pl.sum) and number (pl.count) of the income and age of the parents of the children’s classmates. The count allows us to distinguish children with 0, 1 or 2 parents alive.
- year specifies the year of the CBS data to be used.
- cbsdata_path specifies the path to the CBS data. Leave this unchanged.
- return_pandas specifies whether to return a pandas dataframe instead of a polars dataframe. This can be useful for further analysis in pandas.
- lazy specifies whether to use polars lazy evaluation. We recommend using lazy evaluation (lazy=True) to reduce memory usage and speed up the calculations. For debugging, this can be disabled by setting lazy=False.
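As a conceptual illustration of what the three aggregation functions compute for the example query, the following pure-Python sketch walks the same path on a tiny made-up network. All identifiers and values here (classmates, parents, income, aggregate_income) are hypothetical toy data, not CBS data and not the library's implementation. How netCBS itself treats duplicates or missing values is not described here, so this sketch simply skips parents without an income record.

```python
from statistics import mean

# Hypothetical toy network: child -> classmates (the Schoolmates[all] link).
classmates = {
    "c1": ["c2", "c3"],
    "c2": ["c1", "c3"],
    "c3": ["c1", "c2"],
}

# Hypothetical toy network: child -> parents (the Family[301] link).
parents = {
    "c1": ["p1", "p2"],
    "c2": ["p3"],
    "c3": ["p4", "p5"],
}

# Hypothetical characteristic to aggregate (the role played by df_agg).
income = {"p1": 30000, "p2": 40000, "p3": 50000, "p4": 30000, "p5": 70000}


def aggregate_income(child: str) -> dict:
    """Mean, sum and count of the income of the parents of a child's classmates."""
    # Step 1: Sample -> Schoolmates
    mates = classmates.get(child, [])
    # Step 2: Schoolmates -> Family (parents of each classmate)
    mate_parents = [p for mate in mates for p in parents.get(mate, [])]
    # Step 3: aggregate the characteristic over everyone reached
    values = [income[p] for p in mate_parents if p in income]
    return {
        "mean": mean(values) if values else None,
        "sum": sum(values),
        "count": len(values),
    }


result = {child: aggregate_income(child) for child in ["c1", "c2"]}
print(result)
```

The count shows how many people the mean is based on, which helps when some children in the network have fewer parents in the data than others.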
More examples
See this Jupyter notebook for accessible information and examples.
Citation
The package netCBS is published under an MIT license. When using netCBS for academic work, please cite:
Garcia-Bernardo, Javier (2024). netCBS: A Python library to efficiently create network measures using CBS networks (POPNET) in the RA (0.1). Zenodo. 10.5281/zenodo.13908120