Projects

Here you can find descriptions and links to projects we worked on in the past. We monitor our projects to follow the principles of FAIR: Findable, Accessible, Interoperable, and Reusable. Read more here. We additionally provide an overview of consultations we have done. See here.

Judicial Signals

We study how the separation of powers between the judiciary and the legislature works in practice. To do so, we develop an NLP pipeline designed for legal texts.

Contributors: Wendy Yan, Qixiang Fang

Learn more about our fellowships

Project page/code

Time-use imputation for modelling emissions

How does gendered behaviour influence the carbon emissions of families in the Netherlands? In this project, we combine time-use surveys with expenditure surveys on the large-scale microdata of Statistics Netherlands to answer this question.

Contributors: Maike Weiper, Erik-Jan van Kesteren, Qixiang Fang

Learn more about our fellowships

Project page/code

Future Time Orientation and Life Project

This project focuses on creating an automated system for classifying life goals into theoretically defined categories using Large Language Models (LLMs).

Contributors: Shiyu Dong, Qixiang Fang

Project page/code

Using NLP techniques to analyze biodiversity narratives in financial reports

To extract information about biodiversity from a wide range of companies' financial reports, we are helping to develop an LLM-based annotation pipeline that parses unstructured PDF reports.

Contributors: Catalina Papari, Qixiang Fang

Project page/code

LLM for social sciences

This project aims to create learning materials, tools and guidelines on how social scientists can scientifically integrate large language models (LLMs) into their research workflows.

Contributors: Qixiang Fang

COVID-19 vaccination decision-making profiles

Can we leverage administrative records to predict survey-derived outcomes? In this project we aim to predict outcomes originally derived from the LISS panel (membership in decision-making profiles related to COVID-19 vaccination) using sociodemographic administrative data from CBS microdata.

Contributors: Isabelle Wolf, Javier Garcia-Bernardo, Taymara Abreu

LLMs for self-regulated learning

In this project we explore the use of a large language model (LLM) to apply a deductive coding process to measure the quality of self-regulated learning (SRL) behaviours reflected in textual data collected from higher education students.

Contributors: Gabrielle Martins van Jaarsveld, Qixiang Fang

Disease map

This project aims to create a historical disease database (19th-20th century) for municipalities in the Netherlands.

Contributors: Kristina Thompson, Qixiang Fang

Creating a historical disease database

How common was cholera in May 1866 in Rotterdam? What about the disease environment in Groningen in the early 1830s? In this project we are creating a historical database of common diseases such as cholera using Delpher newspaper data.

Contributors: Kristina Thompson, Erik-Jan van Kesteren, Qixiang Fang

Learn more about our fellowships

Harmonizing longitudinal datasets to estimate psychological impact of COVID

We harmonized measures of mental health recovery after the COVID-19 pandemic across different datasets. To do so, we built pipelines to harmonize the datasets and consulted on how to analyze and visualize the data.

Contributors: Keenan Ramsey, Erik-Jan van Kesteren, Javier Garcia-Bernardo

The causal impact of prenatal poverty

What are the impacts of poverty in the prenatal period on children's health outcomes? Which mechanisms might be suitable policy intervention targets for these potential negative effects?

Contributors: Nadya Ali, Erik-Jan van Kesteren, Javier Garcia-Bernardo, Marion van den Heuvel

Learn more about our fellowships

Policy intervention assessment in primary schools

We are helping to compute individual causal effects of a policy intervention in primary schools in Rotterdam. For this, we are exploring the use of advanced quasi-experimental methods such as synthetic controls.

Contributors: Gijs Custers, Erik-Jan van Kesteren, Oisín Ryan

Learn more about causal impact assessment

Machine learning pipeline for early life opportunity

We are helping with the creation of a data analysis and machine learning pipeline for the project "Kansrijke start", which aims to investigate which markers in the first 1000 days since conception are predictive of adverse events later in life.

Contributors: Wessel Kraaij, Erik-Jan van Kesteren, Anton Schreuder, Richard van Dijk

Computational models for word and non-word associations

We are collaborating on a project for computationally modelling people's intuitions about various associations of both real words and non-words (e.g., novel words or company names) for the Dutch language. The project will result in an easy-to-use, openly available application that analyzes (non-)words for the various associations they may evoke and lists semantically similar words.

Contributors: Giovanni Cassani, Erik-Jan van Kesteren, Aron Joosse

Learn more about our fellowships

Project page/code

Predicting fertility with interpretable machine learning

We collaborate on finding the limits of predictability for fertility intentions using a mass collaboration (benchmark). We are opening up the participants' models to understand what information is captured by highly predictive models, and to disaggregate the predictions of the models to assess geographical and demographic biases.

Contributors: Gert Stulp, Javier Garcia-Bernardo

Project page/code

Trust in public institutions

We are creating a pipeline and dashboard to evaluate how trust in European public institutions evolves during the pandemic.

Contributors: Patrick Brown, Javier Garcia-Bernardo, Matthijs Vollenbroek, Stijn Peeters, Marc Tuters

COVID-19 spread in social networks

Together with RIVM and the ministry of health, the SoDa team is co-authoring a scientific paper on the spread of COVID-19 in schools in the Netherlands.

Contributors: RIVM, Javier Garcia-Bernardo

Linking datasets based on company names

Linking databases on company names is a challenging task. Company names are usually not unique and can have many spelling variations. We helped conduct a sensitivity analysis for different methods of linking these databases, which can be used to answer many different social science research questions about companies in the Netherlands.
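
As an illustration only, one simple way to link company names is to normalize them and then compare them with a string-similarity score. The sketch below is not the method used in the project: the list of legal forms, the `normalize` and `link` helpers, and the `threshold` value are all assumptions made for this example, and it relies on `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical set of legal-form tokens to ignore when comparing names.
LEGAL_FORMS = {"bv", "nv", "vof", "holding"}


def normalize(name):
    """Lowercase, strip punctuation, and drop legal-form tokens."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower().replace(".", ""))
    tokens = [t for t in cleaned.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


def similarity(a, b):
    """Similarity in [0, 1] between two normalized company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link(names_a, names_b, threshold=0.8):
    """Map each name in names_a to its best match in names_b, or None.

    The threshold is an assumed value; varying it is one way to run
    a sensitivity analysis of the linkage.
    """
    links = {}
    for a in names_a:
        best = max(names_b, key=lambda b: similarity(a, b), default=None)
        if best is not None and similarity(a, best) >= threshold:
            links[a] = best
        else:
            links[a] = None
    return links


if __name__ == "__main__":
    left = ["Jansen & Zonen B.V.", "Acme Holding"]
    right = ["Janssen en Zonen BV", "Bakkerij de Vries"]
    print(link(left, right))
```

Lowering `threshold` links more spelling variants but also produces more false matches, which is the trade-off a sensitivity analysis would explore.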

Contributors: Peter Gerbrands, Jonathan de Bruin, Wim Coreynen

Project page/code

Synthetic register data for open science

We created a workflow using existing software packages to generate synthetic datasets for the Statistics Netherlands microdata architecture. These synthetic datasets can then be used as example datasets when sharing analyses (but not original data!) with researchers.

Contributors: Jan Kabatek, Erik-Jan van Kesteren, Kyuri Park

Benchmarking for social science

We helped to design and set up a benchmark for a social data science challenge at the end of the SICSS-ODISSEI summer school. The benchmark was based on microdata from Statistics Netherlands.

Contributors: Paulina Pankowska, Javier Garcia-Bernardo, Adrienne Mendrik

Metadata for synthetization

We are developing a metadata format which includes variable-level statistical information. This format can then be used to generate fake, synthetic datasets for testing purposes using a Python package.
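
To give a rough idea of the approach, the sketch below generates synthetic rows from variable-level metadata alone. It does not use the project's actual metadata format or package: the `METADATA` layout, the variable names, and the `synthesize` function are hypothetical and only serve as an example.

```python
import random

# Hypothetical metadata: summary statistics per variable, no real records.
METADATA = {
    "age": {"type": "int", "min": 18, "max": 90},
    "income": {"type": "float", "mean": 35000.0, "sd": 8000.0},
    "province": {
        "type": "category",
        "levels": {"Utrecht": 0.5, "Groningen": 0.3, "Zeeland": 0.2},
    },
}


def synthesize(meta, n_rows, seed=0):
    """Draw n_rows synthetic records, sampling each variable independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, spec in meta.items():
            if spec["type"] == "int":
                # Uniform integer between the stored minimum and maximum.
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "float":
                # Normal draw from the stored mean and standard deviation.
                row[name] = rng.gauss(spec["mean"], spec["sd"])
            elif spec["type"] == "category":
                levels = list(spec["levels"])
                weights = list(spec["levels"].values())
                row[name] = rng.choices(levels, weights=weights, k=1)[0]
            else:
                raise ValueError(f"Unknown variable type: {spec['type']}")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for record in synthesize(METADATA, 3):
        print(record)
```

Because every variable is sampled independently, relationships between variables are not reproduced in this sketch. Data like this is suitable for testing code, not for analysis.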

Contributors: Ricarda Braukmann, Erik-Jan van Kesteren, Raoul Schram

Project page/code

Deviance in art automatic data collection

We created an automated data collection program (a "scraper") to make a database of artworks with varying levels of deviance. This database will be used in research on how to measure deviance in art.

Contributors: Eftychia Stamkou, Javier Garcia-Bernardo, Raoul Schram

Project page/code

Supercomputing for social scientists

We co-created and co-taught a full-day workshop on high-performance computing for social scientists.

Contributors: Carlos Teijeiro Barjas, Erik-Jan van Kesteren, Benjamin Czaja

Project page/code

Anonymity-preserving collection of WhatsApp data

We are creating a script to extract information from WhatsApp data packages, making it possible to link data from different people while preserving their privacy. This project is part of the ODISSEI LISS grant “Assessing Mobile Instant Messenger Networks with Donated Data”.

Contributors: Laura Boeschoten, Javier Garcia-Bernardo, Parisa Zahedi, Shiva Nadi, Rense Corten

Project page/code

COVID-19 spread in social networks

Together with RIVM and the ministry of health, the SoDa team explored how to analyze the spread of COVID-19 in social networks, making use of the CBS social network files. The outcome of the project is a report for the ministry of health.

Contributors: RIVM, Javier Garcia-Bernardo

Population network data processing

We have helped with the implementation of the Statistics Netherlands population network data files in order to make them available to network researchers. These network data files can be used to develop network analysis models.

Contributors: Tom Emery, Javier Garcia-Bernardo

Project page/code

Empathy diagnostics dashboard

We created a pilot for an interactive questionnaire app which immediately generates a diagnostic report based on the inputs. This app is now used to study empathy in anti-social adolescents.

Contributors: Minet de Wied, Javier Garcia-Bernardo, Shiva Nadi, Parisa Zahedi

Inference from volunteer data

We created an analysis pipeline as part of a paper which outlines how to perform precise statistical inference (correcting for geospatial selection bias) using volunteer-generated data.

Contributors: Peter Lugtig, Erik-Jan van Kesteren, Annemarie Timmers, Javier Garcia-Bernardo

Project page/code

Housing market data engineering

We performed data engineering work to transform 10TB of online marketing (clicks) data from a large online housing platform into an analyzable format. These datasets are used in research surrounding search behaviour on the housing market in the Netherlands. We made the processed data available as an open dataset.

Contributors: Joep Steegmans, Jonathan de Bruin

Project page/code

Citizen science website

To get an overview of what citizen science projects are available in the Netherlands, we have created a website with an overview of such projects. The community can contribute their own projects via the GitHub page!

Contributors: Peter Lugtig, Jonathan de Bruin, Leonardo Vida, Annemarie Timmers

Project page/code

Geoenrichment docker images

Geo-enrichment requires transferring large amounts of data from a geospatial database to a computer program. Public APIs served over the internet are usually too slow for this purpose. Hence, we have created a Docker image so that the API for our osmenrich R package can be run locally.

Contributors: Peter Lugtig, Erik-Jan van Kesteren, Leonardo Vida

Project page/code

Kansenkaart analysis pipeline

Using large data sets from Statistics Netherlands, we developed a data pre-processing and analysis pipeline for estimating expectations concerning the inequality of opportunity in The Netherlands using the ODISSEI Secure Supercomputer (OSSC). These estimates will be available on the project website.

Contributors: Bastian Ravesteijn, Erik-Jan van Kesteren, Helen Lam

Find out more about ODISSEI OSSC

Project page/code

Geoenrichment

We have created an R package to perform geo-enrichment of datasets using OpenStreetMap. Enriching geo-coded (latitude/longitude) datasets with features from the physical surroundings enables researchers to take into account spatial surroundings in statistical models.

Contributors: Peter Lugtig, Erik-Jan van Kesteren, Leonardo Vida

Project page/code

Structural causal analysis & reconciliation

Map explorer

This repository contains a Vue.js web application that renders GeoJSON maps with dynamic region coloring. The application imports geographic boundary data in GeoJSON format and applies colors to regions based on external datasets (see /public). It can use any GeoJSON as the basis for the map and it can use a dataset to determine the coloring of the region. Everything runs locally in the browser with duckdb-wasm as the underlying SQL online analytical processing (OLAP) database.
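
The region-coloring step can be sketched as follows. This is a Python illustration of the general idea, not code from the repository (which is a Vue.js application). The `region_id` property name, the color ramp, and the `color_regions` helper are assumptions made for this example.

```python
def lerp_color(t, start=(247, 251, 255), end=(8, 48, 107)):
    """Linearly interpolate between two RGB colors; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def color_regions(geojson, values, key="region_id", missing="#cccccc"):
    """Return {region id: hex color} for each feature in a FeatureCollection.

    values maps region ids to numbers and must not be empty; regions
    without a value receive the `missing` color.
    """
    low, high = min(values.values()), max(values.values())
    span = (high - low) or 1.0  # avoid division by zero when all values are equal
    colors = {}
    for feature in geojson["features"]:
        region = feature["properties"][key]
        if region in values:
            colors[region] = lerp_color((values[region] - low) / span)
        else:
            colors[region] = missing
    return colors


if __name__ == "__main__":
    # Minimal hypothetical FeatureCollection; geometries are omitted for brevity.
    collection = {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "properties": {"region_id": "A"}, "geometry": None},
            {"type": "Feature", "properties": {"region_id": "B"}, "geometry": None},
            {"type": "Feature", "properties": {"region_id": "C"}, "geometry": None},
        ],
    }
    print(color_regions(collection, {"A": 0, "B": 10}))
```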

Contributors: Niek de Schipper, Erik-Jan van Kesteren

Consultations

Incorporating machine learning in land-use prediction models

We discussed how to incorporate machine learning approaches in the existing land-use prediction models in use at PBL. We also focussed on how to validate these models in a "fair" way without giving preference to one or the other approach.

Contributors: Irena Itova, Erik-Jan van Kesteren, Bas van Bemmel, Taymara Abreu

Migrants’ inclusion in the financial system

We brainstormed about ways to investigate financial inclusion using register data, focussing on study loans.

Contributors: Lisa van Dongen, Erik-Jan van Kesteren

Text analysis of primary school booklets

We discussed ways in which primary schools in the Netherlands could be analysed in terms of their substantive principles, using regulated information booklets the schools are required to make available.

Contributors: Sara Geven, Erik-Jan van Kesteren

Relating latent classes to substantive predictors

We went over an implemented latent class analysis and discussed how external variables could be related to these latent classes, ideally via latent class regression-type analyses.

Contributors: Christine Hedde-von Westernhagen, Erik-Jan van Kesteren

Causality and social tipping points

We discussed how to match individuals and how to use CBS data to study social diffusion at the spatial and family level.

Contributors: Christine Hedde-von Westernhagen, Javier Garcia-Bernardo

CBS longitudinal data

We discussed how to disentangle changes in network metrics that are due to changes in demography from those that are due to actual changes in mechanisms.

Contributors: Eszter Boyanki, Javier Garcia-Bernardo

LLM-driven annotation application and datasets for psychological research about life goals

We discussed a potential collaboration on developing an open-source, LLM-driven application for annotating open-ended survey responses about life goals and making the resulting dataset available.

Contributors: Vinicius Coscioni, Qixiang Fang

Measuring diversity at tiny geographical scales using CBS data

We discussed potential ways to use CBS data to measure diversity at different scales (street, neighborhood, radius). We discussed ways in which SoDa could support the project.

Contributors: Jona de Jong, Javier Garcia-Bernardo

How to deal with a slow and memory-intensive R package

We discussed the quality of a specific R package, and how to deal with problems of trying to analyze a huge dataset with it.

Contributors: Joep Keuzenkamp, Erik-Jan van Kesteren

Options for speeding up causal inference code on Snellius

We talked about how to run computationally intensive bootstrapping procedures on the Dutch national supercomputer Snellius.

Contributors: Jack Fitzgerald, Erik-Jan van Kesteren

How to collect data from public government communications

We discussed how to collect text from the official government repository https://zoek.officielebekendmakingen.nl/. We also made a quick example of a scraper using BeautifulSoup.
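
The core of such a scraper is extracting links and their text from an HTML page. The sketch below illustrates that step only and is not the example made during the consultation. So that it runs without extra dependencies, it uses `html.parser` from the Python standard library instead of BeautifulSoup, and it parses a made-up HTML snippet (`SAMPLE_HTML`) rather than a real page from the repository.

```python
from html.parser import HTMLParser

# Hypothetical markup; the real site's structure will differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/example-1.html">Example announcement 1</a></li>
  <li><a href="/example-2.html">Example announcement 2</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, "->", text)
```

In a real scraper, the HTML string would come from an HTTP request to the site, and the extracted links would be followed to download the individual documents.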

Contributors: Gita Huijgen, Javier Garcia-Bernardo

Project page/code

How to make research about transparent AI communication more accessible through LLMs

We discussed different options for making communication about AI more transparent for a wide audience by customizing a large language model, and allowing companies to ask the LLM for suggestions on their communications.

Contributors: Sarah Marschlich, Erik-Jan van Kesteren, Alexandra Schwinges

Debugging complex multilevel models in the CBS RA

We discussed what may cause models to fail during fitting, and a series of steps to test each of those causes.

Contributors: Gemma Geuke, Javier Garcia-Bernardo

Improving computationally intensive spatial prediction model

We consulted on which approach to take in improving a spatial prediction model, how to set up such a project computationally, and we discussed how to assess and analyze whether this model actually improves upon the currently used model.

Contributors: Irena Itova, Erik-Jan van Kesteren

Machine learning models predicting worse than simple models

We discussed issues of overfitting and discussed strategies on how to solve this issue.

Contributors: Jiamin Ou, Javier Garcia-Bernardo

Understanding diffusion in networks

We consulted on how to distinguish influence from selection in network contagion, and on what types of analysis are possible.

Contributors: Raphael Hoheisel, Javier Garcia-Bernardo

Using NLP techniques to analyze biodiversity narratives in financial reports

We consulted on which NLP techniques could be appropriate for determining how companies talk about sustainability / biodiversity in their official reports.

Contributors: Catalina Papari, Qixiang Fang

Conceptualizing theories about boreout at work using online text data

We discussed how to validate a newly developed theory of the concept of bore-out in a systematic way against how this concept is portrayed in popular media and on websites. Additionally, we discussed to what extent automated text mining techniques could be useful in this content analysis process.

Contributors: Madelon van Hooff, Javier Garcia-Bernardo, Erik-Jan van Kesteren

Discussing future data donation studies

We explored different research ideas for using data donation in studies around music listening behaviour.

Contributors: Hekmat Alrouh, Erik-Jan van Kesteren

Finding closest geographical points quickly

We created a short script to find the closest neighbors within an x km radius of each person in a geographical dataset (each person has latitude and longitude coordinates): https://github.com/sodascience/find_geo_peers_fast
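
As a rough illustration of the problem, and not the implementation in the linked repository, the sketch below finds all neighbors within a given distance using the haversine formula. Sorting by latitude lets the inner loop stop early, because two points can never be closer than their north-south separation. The function names, the `people` layout, and the example coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def neighbors_within(people, radius_km):
    """Map each person id to the ids of all others within radius_km.

    people is a dict of id -> (latitude, longitude) in degrees.
    """
    # Latitude difference (in degrees) that already exceeds the radius.
    max_dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    ordered = sorted(people.items(), key=lambda item: item[1][0])
    result = {pid: [] for pid in people}
    for i, (pid, (lat, lon)) in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            qid, (qlat, qlon) = ordered[j]
            if qlat - lat > max_dlat:
                break  # every later point is even further north
            if haversine_km(lat, lon, qlat, qlon) <= radius_km:
                result[pid].append(qid)
                result[qid].append(pid)
    return result


if __name__ == "__main__":
    example = {
        "utrecht": (52.0907, 5.1214),
        "amersfoort": (52.1561, 5.3878),
        "groningen": (53.2194, 6.5665),
    }
    print(neighbors_within(example, radius_km=25))
```

For very large datasets, a spatial index (for example a ball tree with a haversine metric) would typically scale better than this latitude-sorted scan.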

Contributors: Ajay Bhaskarabhatla, Javier Garcia-Bernardo

Analyzing the effect of engagement on online therapy

We developed an analysis plan for a randomized trial on the effect of online therapy with pre-post measurements and several relevant predictors.

Contributors: Iris ten Klooster, Erik-Jan van Kesteren

Project page/code

Classifying economic segregation

We discussed how to use Non-Negative Matrix factorization to cluster neighborhoods into "segregation profiles".

Contributors: Ignacio Urria Yanez, Javier Garcia-Bernardo

Assessing network contagion

We discussed how to assess network contagion (emotions being transmitted in networks) using longitudinal data.

Contributors: Yuanyuan Hu, Javier Garcia-Bernardo

Using CBS networks

We discussed the options for using CBS networks.

Contributors: Jona de Jong, Javier Garcia-Bernardo

Combining corporate microdata

We discussed how to make sure that the researcher was retrieving all pertinent information from the Orbis IP database.

Contributors: Catalina Papari, Javier Garcia-Bernardo

Merging geographical data

We discussed how to merge geographical data from different sources to predict the effect of coral bleaching on the economic impact of storms.

Contributors: Joep Keuzenkamp, Javier Garcia-Bernardo

Using ML to detect patterns in social data

We discussed how to use ML to model self-esteem change in education-to-work transitions.

Contributors: Ketaki Diwan, Javier Garcia-Bernardo

Extrapolating models between CBS datasets

We discussed how to combine survey data from two non-overlapping surveys at CBS.

Contributors: Maike Weiper, Javier Garcia-Bernardo

Extracting meaning from social media

We discussed which social media could be used to analyze geographical differences in emotions within Utrecht. We also discussed APIs and how to use them.

Contributors: Mimi Ramirez Aranda, Javier Garcia-Bernardo

Database of politicians’ faces

We discussed how to create a database of politicians' faces to understand how they are framed in social media.

Contributors: Wies Ruyters, Javier Garcia-Bernardo

Building a recommendation system for heat protection in homes

We discussed how to structure code to create a modular recommendation system based on a questionnaire about physical properties of study participants' homes.

Contributors: Maha Moustafa Habib Abdelraouf, Erik-Jan van Kesteren

Linking physical environment data to CBS microdata

We discussed how to combine spatial information with the ODIN travel survey at CBS to do spatial planning research on heat exposure.

Contributors: Maarten Hogeweij, Erik-Jan van Kesteren

Linking LISS and CBS data to perform network analysis

We discussed ODISSEI grant opportunities to access LISS and CBS, and how to practically use network files at CBS.

Contributors: Huyen Nguyen, Javier Garcia-Bernardo

Estimating gender bias using NLP methods

We discussed how to use LLMs to detect gender bias in movies (using subtitle files), at the individual movie level.

Contributors: Eftychia Stamkou, Erik-Jan van Kesteren, Javier Garcia-Bernardo

Implementing spatial data analysis for sociological research

We brainstormed about options and packages in R to incorporate spatial data in a sociological research project.

Contributors: Kevin Wittenberg, Erik-Jan van Kesteren, Javier Garcia-Bernardo

How to structure code when my scripts become big?

We discussed how to structure code and datasets to make a large project more reproducible / reusable and easier to maintain.

Contributors: Johannes Aengenheyster, Erik-Jan van Kesteren

Initializing transport ABMs with a synthetic population

We thought about which open and closed data sources and which methods would be best to create a synthetic population for initializing an agent-based model of transport behaviour. We also discussed how to validate how good the synthetic population was.

Contributors: Marco Pellegrino, Erik-Jan van Kesteren

Network analysis of symptoms

We discussed different methods of preprocessing a medical dataset for subsequent network analysis to create symptom networks.

Contributors: Willemijn van Waarden, Erik-Jan van Kesteren, Javier Garcia-Bernardo

Software sustainability for RSiena

We are helping to improve the sustainability of the RSiena network analysis software package, by helping to write a grant proposal and through a brainstorm session on efficient collaboration on GitHub.

Contributors: Tom Snijders, Erik-Jan van Kesteren, Christian Steglich, Javier Garcia-Bernardo, Jonathan de Bruin

Online Housing Market search strategy

Which search behaviour leads to finding a house quickly on the housing market? We brainstormed about how to perform analysis for a study on this topic using a large database of online housing search behaviour.

Contributors: Joep Steegmans, Erik-Jan van Kesteren

FIRMBACKBONE

We regularly consult on FIRMBACKBONE, an initiative to build an organically growing longitudinal data infrastructure with information on Dutch companies for academic research. The data will become available through ODISSEI to researchers affiliated with universities in the Netherlands. Our consulting focuses on the technical implementation of the project.

Contributors: Peter Gerbrands, Javier Garcia-Bernardo, Erik-Jan van Kesteren, Jonathan de Bruin

Computational efficiency

We brainstormed about how the analysis for a research project with big data could be set up and whether it could run on a personal computer.

Contributors: Thijs Lindner, Erik-Jan van Kesteren

Synthetic data for agent-based models

We brainstormed with researchers about working with Statistics Netherlands data and generating synthetic data that can serve as input for an agent-based model.

Contributors: Sanne Hettinga, Erik-Jan van Kesteren, Corentin Kuster

eScience consultations

We regularly join consultations done by the eScience center for projects that fall within the social sciences, for example in preparation for the ODISSEI-eScience grants.

Contributors: Various researchers, Jonathan de Bruin, Erik-Jan van Kesteren, Javier Garcia-Bernardo

ODISSEI eScience Grants