Projects

Here you can find descriptions and links to projects we worked on in the past.

Collaborations

Creating a historical disease database

How common was cholera in May 1866 in Rotterdam? What about the disease environment in Groningen in the early 1830s? In this project we are creating a historical database of common diseases such as cholera using Delpher newspaper data.
Contributors: Kristina Thompson, Erik-Jan van Kesteren, Qixiang Fang

Harmonizing longitudinal datasets to estimate psychological impact of COVID

This project studies recovery in mental health after the COVID-19 pandemic across different datasets. We built pipelines to harmonize these datasets, and consulted on how to analyze and visualize the data.
Contributors: Keenan Ramsey, Erik-Jan van Kesteren, Javier Garcia-Bernardo

The causal impact of prenatal poverty

What are the impacts of poverty in the prenatal period on children's health outcomes? Which mechanisms might be suitable policy intervention targets for these potential negative effects?
Contributors: Nadya Ali, Erik-Jan van Kesteren, Javier Garcia-Bernardo, Marion van den Heuvel

Policy intervention assessment in primary schools

We are helping to compute individual causal effects of a policy intervention in primary schools in Rotterdam. For this, we are exploring the use of advanced quasi-experimental methods such as synthetic controls.
Contributors: Gijs Custers, Erik-Jan van Kesteren, Oisín Ryan

Machine learning pipeline for early life opportunity

We are helping with the creation of a data analysis and machine learning pipeline for the project "Kansrijke start", which aims to investigate which markers in the first 1000 days since conception are predictive of adverse events later in life.
Contributors: Wessel Kraaij, Erik-Jan van Kesteren, Anton Schreuder, Richard van Dijk

Computational models for word and non-word associations

We are collaborating on a project for computationally modelling people's intuitions about various associations of both real words and non-words (e.g., novel words or company names) for the Dutch language. The project will result in an easy-to-use, openly available application in which (non-)words can be analyzed for the associations they may evoke, and which also returns a list of semantically similar words.
Contributors: Giovanni Cassani, Erik-Jan van Kesteren, Aron Joosse

Predicting fertility with interpretable machine learning

We collaborate on finding the limits of predictability for fertility intentions using a mass collaboration (benchmark). We are opening up the participants' models to understand what information is captured by highly predictive models, and to disaggregate the predictions of the models to assess geographical and demographic biases.
Contributors: Gert Stulp, Javier Garcia-Bernardo

Trust in public institutions

We are creating a pipeline and dashboard to evaluate how trust in European public institutions evolves during the pandemic.
Contributors: Patrick Brown, Javier Garcia-Bernardo, Matthijs Vollenbroek, Stijn Peeters, Marc Tuters

COVID-19 spread in social networks

Together with RIVM and the ministry of health, the SoDa team is co-authoring a scientific paper on the spread of COVID-19 in schools in the Netherlands.
Contributors: RIVM, Javier Garcia-Bernardo

Linking datasets based on company names

Linking databases on company names is a challenging task. Company names are usually not unique and can have many spelling variations. We helped conduct a sensitivity analysis for different methods of linking these databases, which can be used to answer many different social science research questions about companies in the Netherlands.
Contributors: Peter Gerbrands, Jonathan de Bruin, Wim Coreynen
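
To illustrate why spelling variations make name-based linking hard, here is a minimal sketch of one possible linking method: normalize the names, then score pairs with a string similarity from the Python standard library. The list of legal forms, the function names, and the threshold are assumptions made for this example, not the methods compared in the project. Varying the threshold and checking how the links change is one simple form of sensitivity analysis.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, incomplete list of legal-form tokens to ignore when comparing.
LEGAL_FORMS = {"bv", "nv", "vof", "holding"}

def normalize(name):
    """Lowercase, drop punctuation, and remove legal-form tokens.

    Note: this simple version also drops non-ASCII letters.
    """
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_FORMS)

def similarity(a, b):
    """Similarity in [0, 1] between two normalized company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link(left, right, threshold=0.85):
    """Map each name in `left` to its best match in `right`, or None.

    `right` must be non-empty. The threshold is an illustrative default.
    """
    links = {}
    for name in left:
        score, best = max((similarity(name, cand), cand) for cand in right)
        links[name] = best if score >= threshold else None
    return links

if __name__ == "__main__":
    print(link(["Philips B.V.", "Unilever"], ["PHILIPS BV", "Shell NV"]))
```

Comparing every name against every candidate scales quadratically, so a real pipeline on large registers would usually restrict the candidate pairs first (for example by postcode or first letters).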

Synthetic register data for open science

We created a workflow using existing software packages to generate synthetic datasets for the Statistics Netherlands microdata architecture. These synthetic datasets can then be used as example datasets when sharing analyses (but not original data!) with researchers.
Contributors: Jan Kabatek, Erik-Jan van Kesteren, Kyuri Park

Benchmarking for social science

We helped to design and set up a benchmark for a social data science challenge at the end of the SICSS-ODISSEI summer school. The benchmark was based on microdata from Statistics Netherlands.
Contributors: Paulina Pankowska, Javier Garcia-Bernardo, Adrienne Mendrik

Metadata for synthetization

We are developing a metadata format which includes variable-level statistical information. This format can then be used to generate fake, synthetic datasets for testing purposes using a Python package.
Contributors: Ricarda Braukmann, Erik-Jan van Kesteren, Raoul Schram
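
The idea of generating data from variable-level metadata can be sketched as follows. The metadata layout, the field names, and the functions below are hypothetical and invented for this illustration; they are not the project's actual format or the API of its Python package. The point is only that the generator sees summary information per variable and never the original records.

```python
import random

# Hypothetical metadata: per-variable statistical information, no real records.
metadata = {
    "n_rows": 5,
    "variables": {
        "age": {"type": "int", "min": 18, "max": 90},
        "income": {"type": "float", "mean": 30000.0, "sd": 8000.0},
        "region": {"type": "category", "levels": {"north": 0.3, "south": 0.7}},
    },
}

def draw(spec, rng):
    """Draw one value for a variable from its metadata description."""
    if spec["type"] == "int":
        return rng.randint(spec["min"], spec["max"])  # uniform, bounds inclusive
    if spec["type"] == "float":
        return rng.gauss(spec["mean"], spec["sd"])
    if spec["type"] == "category":
        levels = list(spec["levels"])
        weights = list(spec["levels"].values())
        return rng.choices(levels, weights=weights, k=1)[0]
    raise ValueError(f"unknown variable type: {spec['type']}")

def synthesize(meta, seed=0):
    """Generate a list of fake rows; each variable is drawn independently."""
    rng = random.Random(seed)
    return [
        {name: draw(spec, rng) for name, spec in meta["variables"].items()}
        for _ in range(meta["n_rows"])
    ]

if __name__ == "__main__":
    for row in synthesize(metadata):
        print(row)
```

Because every variable is drawn independently, relationships between variables are not reproduced. That is acceptable for testing code, but such data should not be used for substantive analysis.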

Deviance in art automatic data collection

We created an automated data collection program (a "scraper") to make a database of artworks with varying levels of deviance. This database will be used in research on how to measure deviance in art.
Contributors: Eftychia Stamkou, Javier Garcia-Bernardo, Raoul Schram

COVID-19 spread in social networks

Together with RIVM and the ministry of health, the SoDa team explored how to analyze the spread of COVID-19 in social networks, making use of the CBS social network files. The outcome of the project is a report for the ministry of health.
Contributors: RIVM, Javier Garcia-Bernardo

Anonymity preserving collection of WhatsApp data

We are creating a script to extract information from WhatsApp data packages, making it possible to link data from different people while preserving the privacy of those people. This project is part of the ODISSEI LISS grant “Assessing Mobile Instant Messenger Networks with Donated Data”.
Contributors: Laura Boeschoten, Javier Garcia-Bernardo, Parisa Zahedi, Shiva Nadi, Rense Corten
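
One common way to link records from different people without storing names is keyed hashing: the same identifier always maps to the same pseudonym, but the pseudonym cannot be recomputed without the secret key. The sketch below shows that general technique with the Python standard library. It is an assumption made for illustration and not a description of how the project's script works; the function name and the 16-character truncation are likewise hypothetical.

```python
import hashlib
import hmac

def pseudonym(identifier, secret_key):
    """Return a stable pseudonym for an identifier (e.g. a contact name).

    `secret_key` is a bytes value that must be kept private; anyone who
    has it can test guesses against the pseudonyms.
    """
    normalized = identifier.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)

if __name__ == "__main__":
    key = b"replace-with-a-long-random-secret"
    # The same person appearing in two donated packages gets the same ID.
    print(pseudonym("Alice", key), pseudonym(" alice ", key))
```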

Supercomputing for social scientists

We co-created and co-taught a full-day workshop on high-performance computing for social scientists.
Contributors: Carlos Teijeiro Barjas, Erik-Jan van Kesteren, Benjamin Czaja

Population network data processing

We have helped with the implementation of the Statistics Netherlands population network data files in order to make them available to network researchers. These network data files can be used to develop network analysis models.
Contributors: Tom Emery, Javier Garcia-Bernardo

Empathy diagnostics dashboard

We created a pilot for an interactive questionnaire app which immediately generates a diagnostic report based on the inputs. This app is now used to study empathy in anti-social adolescents.
Contributors: Minet de Wied, Javier Garcia-Bernardo, Shiva Nadi, Parisa Zahedi

Inference from volunteer data

We created an analysis pipeline as part of a paper which outlines how to perform precise statistical inference (correcting for geospatial selection bias) using volunteer-generated data.
Contributors: Peter Lugtig, Erik-Jan van Kesteren, Annemarie Timmers, Javier Garcia-Bernardo

Housing market data engineering

We performed data engineering work to transform 10TB of online marketing (clicks) data from a large online housing platform into an analyzable format. These datasets are used in research surrounding search behaviour on the housing market in the Netherlands. We made the processed data available as an open dataset.
Contributors: Joep Steegmans, Jonathan de Bruin

Citizen science website

We have created a website that gives an overview of the citizen science projects available in the Netherlands. The community can contribute their own projects via the GitHub page!
Contributors: Peter Lugtig, Jonathan de Bruin, Leonardo Vida, Annemarie Timmers

Geoenrichment

We have created an R package to perform geo-enrichment of datasets using OpenStreetMap. Enriching geo-coded (latitude/longitude) data sets with features from the physical surroundings enables researchers to take into account spatial surroundings in statistical models.
Contributors: Peter Lugtig, Erik-Jan van Kesteren, Leonardo Vida

Geoenrichment docker images

Geo-enrichment requires transferring large amounts of data from a geospatial database to a computer program. Public APIs served over the internet are usually too slow for this purpose. Hence, we have created a Docker image so that the API for our osmenrich R package can be run locally.
Contributors: Peter Lugtig, Erik-Jan van Kesteren, Leonardo Vida

Kansenkaart analysis pipeline

Using large data sets from Statistics Netherlands, we developed a data pre-processing and analysis pipeline for estimating expectations concerning the inequality of opportunity in the Netherlands using the ODISSEI Secure Supercomputer (OSSC). These estimates will be available on the project website.
Contributors: Bastian Ravesteijn, Erik-Jan van Kesteren, Helen Lam

Consultations

Understanding diffusion in networks

We consulted about how to distinguish influence from selection in network contagion, and what types of analysis are possible.
Contributors: Raphael Hoheisel, Javier Garcia-Bernardo

Using NLP techniques to analyze biodiversity narratives in financial reports

We consulted on which NLP techniques could be appropriate for determining how companies talk about sustainability / biodiversity in their official reports.
Contributors: Catalina Papari, Qixiang Fang

Conceptualizing theories about boreout at work using online text data

We discussed how to validate a newly developed theory of the concept of bore-out in a systematic way against how this concept is portrayed in popular media and on websites. Additionally, we discussed to what extent automated text mining techniques could be useful in this content analysis process.
Contributors: Madelon van Hooff, Javier Garcia-Bernardo, Erik-Jan van Kesteren

Discussing future data donation studies

We explored different research ideas for using data donation in studies around music listening behaviour.
Contributors: Hekmat Alrouh, Erik-Jan van Kesteren

Finding closest geographical points quickly

We created a short script to find the closest neighbors within an x km radius of each person in a geographical dataset (each person has latitude and longitude coordinates): https://github.com/sodascience/find_geo_peers_fast
Contributors: Ajay Bhaskarabhatla, Javier Garcia-Bernardo
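
A radius search of this kind can be sketched as follows. This is a minimal standard-library illustration and not the code in the linked repository; the function names and the example coordinates are assumptions. It sorts the points by latitude so that only points inside a latitude band are compared, which is exact because the great-circle distance between two points is never smaller than the distance implied by their latitude difference alone.

```python
import math
from bisect import bisect_left, bisect_right

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def neighbors_within(points, radius_km):
    """Return {id: sorted ids of other points within radius_km}.

    `points` maps an id to a (latitude, longitude) tuple in degrees.
    """
    ordered = sorted(points.items(), key=lambda kv: kv[1][0])
    lats = [latlon[0] for _, latlon in ordered]
    # Latitude difference (in degrees) that corresponds to radius_km.
    band = math.degrees(radius_km / EARTH_RADIUS_KM)
    result = {}
    for pid, (lat, lon) in ordered:
        lo = bisect_left(lats, lat - band)
        hi = bisect_right(lats, lat + band)
        result[pid] = sorted(
            other
            for other, (olat, olon) in ordered[lo:hi]
            if other != pid and haversine_km(lat, lon, olat, olon) <= radius_km
        )
    return result

if __name__ == "__main__":
    people = {
        "utrecht": (52.0907, 5.1214),
        "amsterdam": (52.3676, 4.9041),
        "groningen": (53.2194, 6.5665),
    }
    print(neighbors_within(people, radius_km=50))
```

The latitude band only prunes in one direction, so for very large datasets a proper spatial index (for example a ball tree or a grid) would likely be faster.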

Analyzing the effect of engagement on online therapy

We developed an analysis plan for a randomized trial on the effect of online therapy with pre-post measurements and several relevant predictors.
Contributors: Iris ten Klooster, Erik-Jan van Kesteren

Classifying economic segregation

We discussed how to use non-negative matrix factorization to cluster neighborhoods into "segregation profiles".
Contributors: Ignacio Urria Yanez, Javier Garcia-Bernardo

Assessing network contagion

We discussed how to assess network contagion (emotions being transmitted in networks) using longitudinal data.
Contributors: Yuanyuan Hu, Javier Garcia-Bernardo

Using CBS networks

What are the options for using CBS networks?
Contributors: Jona de Jong, Javier Garcia-Bernardo

Combining corporate microdata

We discussed how to make sure that the researcher was retrieving all pertinent information from the Orbis IP database.
Contributors: Catalina Papari, Javier Garcia-Bernardo

Merging geographical data

Merging geographical data from different sources to predict the effect of coral bleaching on the economic impact of storms.
Contributors: Joep Keuzenkamp, Javier Garcia-Bernardo

Using ML to detect patterns in social data

How to use ML to model self-esteem change in education-to-work transitions.
Contributors: Ketaki Diwan, Javier Garcia-Bernardo

Extrapolating models between CBS datasets

We discussed how to combine survey data from two non-overlapping surveys at CBS.
Contributors: Maike Weiper, Javier Garcia-Bernardo

Database of politicians' faces

We discussed how to create a database of politicians' faces to understand how they are framed in social media.
Contributors: Wies Ruyters, Javier Garcia-Bernardo

Extracting meaning from social media

We discussed which social media could be used to analyze geographical differences in emotions within Utrecht. We also discussed APIs and how to use them.
Contributors: Mimi Ramirez Aranda, Javier Garcia-Bernardo

Linking physical environment data to CBS microdata

We discussed how to combine spatial information with the ODIN travel survey at CBS to do spatial planning research on heat exposure.
Contributors: Maarten Hogeweij, Erik-Jan van Kesteren

Building a recommendation system for heat protection in homes

We discussed how to structure code to create a modular recommendation system based on a questionnaire about physical properties of study participants' homes.
Contributors: Maha Moustafa Habib Abdelraouf, Erik-Jan van Kesteren

Linking LISS and CBS data to perform network analysis

We discussed ODISSEI grant opportunities to access LISS and CBS, and how to practically use network files at CBS.
Contributors: Huyen Nguyen, Javier Garcia-Bernardo

Estimating gender bias using NLP methods

We discussed how to use LLMs to detect gender bias in movies (using subtitle files), at the individual movie level.
Contributors: Eftychia Stamkou, Erik-Jan van Kesteren, Javier Garcia-Bernardo

Implementing spatial data analysis for sociological research

We brainstormed about options and packages in R to incorporate spatial data in a sociological research project.
Contributors: Kevin Wittenberg, Erik-Jan van Kesteren, Javier Garcia-Bernardo

How to structure code when my scripts become big?

We discussed how to structure code and datasets to make a large project more reproducible / reusable and easier to maintain.
Contributors: Johannes Aengenheyster, Erik-Jan van Kesteren

Initializing transport ABMs with a synthetic population

We thought about which open and closed data sources and which methods would be best to create a synthetic population for initializing an agent-based model of transport behaviour. We also discussed how to validate how good the synthetic population was.
Contributors: Marco Pellegrino, Erik-Jan van Kesteren

Network analysis of symptoms

Discussing different methods of preprocessing a medical dataset for subsequent network analysis to create symptom networks.
Contributors: Willemijn van Waarden, Erik-Jan van Kesteren, Javier Garcia-Bernardo

Software sustainability for RSiena

We are helping to improve the sustainability of the RSiena network analysis software package, by helping to write a grant proposal and through a brainstorm session on efficient collaboration on GitHub.
Contributors: Tom Snijders, Erik-Jan van Kesteren, Christian Steglich, Javier Garcia-Bernardo, Jonathan de Bruin

Online Housing Market search strategy

Which search behaviour leads to finding a house quickly on the housing market? We brainstormed about how to perform analysis for a study on this topic using a large database of online housing search behaviour.
Contributors: Joep Steegmans, Erik-Jan van Kesteren

Firmbackbone

We regularly consult on FIRMBACKBONE, an initiative to build an organically growing longitudinal data infrastructure with information on Dutch companies for academic research. This data will become available for researchers affiliated with universities in the Netherlands through ODISSEI. We are consulting on the technical implementation of the FIRMBACKBONE project.
Contributors: Peter Gerbrands, Javier Garcia-Bernardo, Erik-Jan van Kesteren, Jonathan de Bruin

Computational efficiency

We brainstormed about how the analysis for a research project with big data could be set up and whether it could run on a personal computer.
Contributors: Thijs Lindner, Erik-Jan van Kesteren

Synthetic data for agent-based models

Brainstorming with researchers about working with Statistics Netherlands data and generating synthetic data that can serve as input in an agent-based model.
Contributors: Sanne Hettinga, Erik-Jan van Kesteren, Corentin Kuster

eScience consultations

We regularly join consultations done by the eScience Center for projects that fall within the social sciences, for example in preparation for the ODISSEI-eScience grants.
Contributors: Various researchers, Jonathan de Bruin, Erik-Jan van Kesteren, Javier Garcia-Bernardo