Collecting online platforms data for science: an example using WhatsApp

September 8, 2023

These days, our online presence leaves traces of our behavior everywhere. Data about what we do and say exists on platforms such as WhatsApp, Instagram, online stores, and many others. Of course, this so-called ‘digital trace data’ is of interest to social scientists: new, rich, enormous datasets that can be used to describe and understand our social world. However, this data is commonly owned by private companies. How can social scientists access and make sense of it?

In this tutorial, we use data donation and the Port software to get access to WhatsApp group-chat data in a way that completely preserves privacy of research participants. Our goal is to show a small peek of what can be achieved with these methods. If you have an idea for your own research that entails collecting digital trace data, don’t hesitate to contact us! We can help you think about data acquisition, analysis and more.

With data donation, it is possible to collect data from any online platform: under the General Data Protection Regulation (EU law), companies are required to give any citizen who requests it a copy of their own data. This data is available in so-called Data Download Packages (DDPs), which are rather cumbersome to work with and contain personal information. Therefore, the Port software processes these DDPs so that the data is in a format ready for analysis, while completely guaranteeing privacy of the respondents. The only thing research participants have to do is request their DDPs, review which information they are sharing, and consent to sharing it.

Since we do not go into much detail here, we refer to Port’s GitHub for more on how to get started with your own project. There you can find a full guide to installing and using Port, examples of past studies done with it, a tutorial for creating your own data donation workflow, and more. You can also read more about data donation in general here and here.

An application with WhatsApp data

In this example, we extract some basic information from WhatsApp group chats, such as how many messages, links, locations, and pictures were shared, as well as which person in the group the participant responded to most.

Note that this is the only information we want to collect from the participants of the study, not the whole group chat file!

The first step in creating a DDP processing script is to obtain an example DDP and examine it. This can be, for instance, your own DDP requested from WhatsApp. Usually, platforms provide a (compressed) folder with many different files; i.e., data in a format that is not ready to use. Once uncompressed, a WhatsApp group chat file could look like this:

[16/03/2022, 15:10:17] Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.
[16/03/2022, 15:20:25] person1: Hi shiva!
[16/03/2022, 15:25:38] person2: Hi 👋
[16/03/2022, 15:26:48] person3: Hoi!
[16/03/2022, 18:39:29] person2: https://youtu.be/KBmUTY6mK_E
[16/03/2022, 18:35:51] person1: ‎Location: https://maps.google.com/?q=52.089451,5.108469
[20/03/2022, 20:08:51] person4: I’m about to generate some very random messages so that I can make some screenshots for the explanation to participants
[24/03/2022, 20:19:38] person1: @user3 if you remove your Profile picture for a moment I will redo the screenshots 😁
[26/03/2022, 18:52:15] person2: Well done Utrecht 😁
[14/07/2020, 22:05:54] person4: 👍Bedankt

As part of a collaboration with data donation researchers using Port, we wrote a Python script¹ to convert this into the information we need. The main script is available here; in short, it does the following:

separate the header and the message itself
parse the date and time information in each message
remove unneeded information such as alert notifications
anonymize usernames
convert the extracted information to a nice data frame format to show the participant for consent
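The first four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the script used in the study: it assumes the bracketed `[dd/mm/yyyy, HH:MM:SS]` header shown in the sample chat, treats any line without a `sender: ` part as an alert notification, and uses a hypothetical `parse_chat` helper with made-up aliases such as `User1`.

```python
import re
from datetime import datetime

# Assumed header layout, taken from the sample chat above:
# "[dd/mm/yyyy, HH:MM:SS] rest of the line"
HEADER_RE = re.compile(r"^\[(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2})\] (.*)$")


def parse_chat(lines):
    """Return (timestamp, anonymized_user, text) tuples, skipping notifications."""
    aliases = {}  # real sender name -> alias, kept only in memory
    parsed = []
    for line in lines:
        match = HEADER_RE.match(line.strip())
        if match is None:
            continue  # not the start of a message; ignored in this sketch
        stamp, rest = match.groups()
        sender, sep, text = rest.partition(": ")
        if not sep:
            continue  # no "sender: " part, so treated as an alert notification
        alias = aliases.setdefault(sender, f"User{len(aliases) + 1}")
        timestamp = datetime.strptime(stamp, "%d/%m/%Y, %H:%M:%S")
        parsed.append((timestamp, alias, text))
    return parsed


sample = [
    "[16/03/2022, 15:10:17] Messages and calls are end-to-end encrypted.",
    "[16/03/2022, 15:20:25] person1: Hi shiva!",
    "[16/03/2022, 15:25:38] person2: Hi",
    "[16/03/2022, 18:39:29] person2: https://youtu.be/KBmUTY6mK_E",
]
messages = parse_chat(sample)
```

Real exports are messier than this: multi-line messages and notifications that happen to contain a colon would both need extra handling.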

One big problem we had to overcome is that messages and alert notifications cannot be identified in the same way (i.e., using the same regular expression) on every device. Through trial and error, we tailored the steps to work with every operating system, language, and device required for this study. Indeed, if you design a study like this, it is very important to try out your script on many different DDPs from different people and devices. That way you will make sure you have covered possible variation in DDPs before actually starting data collection. This is a process that can take quite a while, so keep this in mind when you want to run a data donation study!
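One way to cope with this variation is to keep a set of candidate regular expressions and pick the one that matches the most lines of a given export. The sketch below is an illustration only: both patterns are assumptions (the `bracketed` one follows the sample chat above, and the `dashed` one is a hypothetical layout for another device), and `pick_pattern` is a made-up helper name.

```python
import re

# Hypothetical header patterns; real exports vary by OS, language and settings.
CANDIDATE_PATTERNS = {
    "bracketed": re.compile(
        r"^\[(?P<date>[^\]]+)\] (?P<sender>[^:]+): (?P<text>.*)$"
    ),
    "dashed": re.compile(
        r"^(?P<date>\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}) - "
        r"(?P<sender>[^:]+): (?P<text>.*)$"
    ),
}


def pick_pattern(lines):
    """Return the name of the candidate pattern that matches the most lines."""
    counts = {
        name: sum(1 for line in lines if pattern.match(line))
        for name, pattern in CANDIDATE_PATTERNS.items()
    }
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no known pattern matches this export")
    return best


bracketed_export = [
    "[16/03/2022, 15:20:25] person1: Hi shiva!",
    "[16/03/2022, 15:25:38] person2: Hi",
]
dashed_export = [
    "16/03/2022, 15:20 - person1: Hi shiva!",
    "16/03/2022, 15:25 - person2: Hi",
]
bracketed_choice = pick_pattern(bracketed_export)
dashed_choice = pick_pattern(dashed_export)
```

Raising an error when nothing matches makes an unseen format visible during testing, instead of silently producing an empty dataset.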

The end result

In Table 1 you can see a (fictitious) snippet of the dataset obtained. This is what a dataset in which you combine donations from different users would look like. As can be seen, we have moved from a rather untidy text file to a tidy, directly analysable dataset, where each row corresponds to a user in a given data donation package, and the remaining columns give information about that user. In particular, the dataset displays the following information: data donation package id (ddp_id), an anonymized user name (user_name), number of words sent to the group chat by the user (nwords), dates of the first and last messages sent to the group chat by the user (date_first_mess, date_last_mess), number of URLs, files and locations sent to the group chat by the user (respectively, nurls, nfiles, nlocations), the (other) user who has replied to that user the most (replies_from), and the user that that user has replied to the most (replies_to).

Table 1. Snippet of fictitious dataset

ddp_id	user_name	nwords	date_first_mess	date_last_mess	nurls	nfiles	nlocations	replies_from	replies_to
1	User1_1	121	10/08/2023	27/08/2023	0	15	0	User1_2	User1_4
1	User1_2	17	11/08/2023	28/08/2023	3	1	2	User1_1	User1_1
1	User1_3	44	10/08/2023	28/08/2023	9	6	3	User1_2	User1_1
1	User1_4	50	12/08/2023	29/08/2023	0	3	1	User1_3	User1_1
2	User2_1	123	01/05/2022	01/11/2022	2	0	1	User2_2	User2_2
2	User2_2	250	01/05/2022	02/11/2022	0	32	3	User2_3	User2_1
2	User2_3	176	08/07/2022	04/12/2022	6	0	5	User2_2	User2_3
3	User3_1	12	05/06/2023	26/07/2023	12	2	0	User3_1	User3_2
3	User3_2	16	06/06/2023	26/07/2023	17	2	0	User3_2	User3_1
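To give an idea of how rows like these could be computed, here is a minimal sketch that aggregates already-parsed messages into one row per user. It is not the study's script: the `summarise` helper, the whitespace-based word count, and the rule that a message containing `Location: https://maps.google.com` counts as a location rather than a URL are all assumptions made for illustration. The nfiles, replies_from and replies_to columns are left out.

```python
import re
from datetime import datetime

URL_RE = re.compile(r"https?://\S+")
LOCATION_MARKER = "Location: https://maps.google.com"  # assumed, from the sample chat


def summarise(ddp_id, messages):
    """Aggregate (timestamp, user, text) tuples into one dict per user."""
    rows = {}
    for stamp, user, text in messages:
        row = rows.setdefault(user, {
            "ddp_id": ddp_id, "user_name": user, "nwords": 0,
            "date_first_mess": stamp, "date_last_mess": stamp,
            "nurls": 0, "nlocations": 0,
        })
        row["nwords"] += len(text.split())  # whitespace tokens, a simplification
        row["date_first_mess"] = min(row["date_first_mess"], stamp)
        row["date_last_mess"] = max(row["date_last_mess"], stamp)
        if LOCATION_MARKER in text:
            row["nlocations"] += 1
        else:
            row["nurls"] += len(URL_RE.findall(text))
    return list(rows.values())


parsed = [
    (datetime(2022, 3, 16, 15, 20, 25), "User1_1", "Hi shiva!"),
    (datetime(2022, 3, 16, 18, 39, 29), "User1_2", "https://youtu.be/KBmUTY6mK_E"),
    (datetime(2022, 3, 16, 18, 35, 51), "User1_1",
     "Location: https://maps.google.com/?q=52.089451,5.108469"),
    (datetime(2022, 3, 26, 18, 52, 15), "User1_2", "Well done Utrecht"),
]
rows = summarise(1, parsed)
```

Each dict in `rows` corresponds to one line of the table and can be handed to a data frame library or written out as CSV.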

In Figure 2 you can see a screenshot of how the Port software would display the data to be shared (number of words or messages, date stamps…) and ask research subjects for consent. As you can see, the Port software ensures that research subjects are aware of what information they are sharing and consent to it. The rest of the DDP, including sensitive data, is analyzed locally and does not leave the respondents’ devices.

Figure 2. How the Port software displays the data to be shared and asks for consent

Conclusion

The aim of this post was to illustrate how to use data donation with the Port software to extract online platform data. We illustrated this with the extraction of group-chat information from WhatsApp data. The main challenge of this project was to write a robust script that transforms this data into a nice, readily usable format while maintaining privacy. If you want to implement something similar but do not know how or where to start, let us know and we can help!

¹ This script uses a deprecated version of Port, but a large part of the script can be reused.