Collecting online platforms data for science: an example using WhatsApp September 8, 2023 | 6

These days, our online presence leaves traces of our behavior everywhere. There is data on what we do and say on platforms such as WhatsApp, Instagram, online stores, and many others. Of course, this so-called ‘digital trace data’ is of interest to social scientists: new, rich, enormous datasets that can be used to describe and understand our social world. However, this data is commonly owned by private companies. How can social scientists access and make sense of this data?

In this tutorial, we use data donation and the Port software to get access to WhatsApp group-chat data in a way that completely preserves the privacy of research participants. Our goal is to give a small peek at what can be achieved with these methods. If you have an idea for your own research that entails collecting digital trace data, don’t hesitate to contact us! We can help you think about data acquisition, analysis and more.

With data donation, it is possible to collect data about any online platform: under the General Data Protection Regulation (EU law), companies are required to provide people with a copy of their own data when they request it. This data is delivered in so-called Data Download Packages (DDPs), which are rather cumbersome to work with and contain personal information. Therefore, the Port software processes these DDPs so that the data is in a format ready for analysis, while completely guaranteeing the privacy of the respondents. The only thing research participants have to do is request their DDPs, review which information they would be sharing, and consent to sharing it.

Since we do not go into much detail here, we refer to Port’s GitHub for more information on how to get started with your own project. There you can find a full guide to installing and using Port, examples of past studies done with it, a tutorial for creating your own data donation workflow, and more. You can also read more about data donation in general here and here.

An application with WhatsApp data

In this example, we extract some basic information from WhatsApp group chats, such as how many messages, links, locations, and pictures were shared, as well as which person in the group the participant responded to most.

Note that this is the only information we want to collect from the participants of the study, not the whole group chat file!

The first step in creating a DDP processing script is to obtain an example DDP and examine it. This can be, for instance, your own DDP requested from WhatsApp. Usually, platforms provide a (compressed) folder with many different files; that is, data in a format that is not ready to use. Once uncompressed, a WhatsApp group chat file could look like this:

[16/03/2022, 15:10:17] Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.
[16/03/2022, 15:20:25] person1: Hi shiva!
[16/03/2022, 15:25:38] person2: Hi 👋
[16/03/2022, 15:26:48] person3: Hoi!
[16/03/2022, 18:39:29] person2: https://youtu.be/KBmUTY6mK_E
[16/03/2022, 18:35:51] person1: ‎Location: https://maps.google.com/?q=52.089451,5.108469
[20/03/2022, 20:08:51] person4: I’m about to generate some very random messages so that I can make some screenshots for the explanation to participants
[24/03/2022, 20:19:38] person1: @user3 if you remove your Profile picture for a moment I will redo the screenshots 😁
[26/03/2022, 18:52:15] person2: Well done Utrecht 😁
[14/07/2020, 22:05:54] person4: 👍Bedankt

As part of a collaboration with data donation researchers using Port, we wrote a Python script¹ to convert this into the information we need. The main script is available here; in short, it does the following:

  • separate the header and the message itself
  • parse the date and time information in each message
  • remove unneeded information such as alert notifications
  • anonymize usernames
  • convert the extracted information to a nice data frame format to show the participant for consent
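The steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not the script used in the study: it assumes every message follows the `[dd/mm/yyyy, hh:mm:ss] sender: text` layout of the example chat, treats any line without a `sender:` prefix as a notification, and uses a hypothetical `parse_chat` function and `User<ddp_id>_<n>` aliases. Multi-line messages are simply skipped here.

```python
import re
from datetime import datetime

# Assumed layout, based on the example chat: "[dd/mm/yyyy, hh:mm:ss] sender: text".
# The "sender: " part is optional so that notifications can be recognised.
LINE_RE = re.compile(r"^\[(?P<stamp>[^\]]+)\] (?:(?P<sender>[^:]+): )?(?P<text>.*)$")
STAMP_FORMAT = "%d/%m/%Y, %H:%M:%S"


def parse_chat(raw_text, ddp_id=1):
    """Return one dict per user message, with anonymized user names."""
    aliases = {}  # real name -> alias; never leaves this function
    rows = []
    for line in raw_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # e.g. continuation of a multi-line message (ignored in this sketch)
        sender = match.group("sender")
        if sender is None:
            continue  # no "sender:" prefix, so we assume an alert notification
        try:
            stamp = datetime.strptime(match.group("stamp"), STAMP_FORMAT)
        except ValueError:
            continue  # header does not follow the assumed date layout
        alias = aliases.setdefault(sender, f"User{ddp_id}_{len(aliases) + 1}")
        # Drop the invisible left-to-right mark that can precede "Location:".
        text = match.group("text").replace("\u200e", "")
        rows.append({"user_name": alias, "timestamp": stamp, "text": text})
    return rows


sample = "\n".join([
    "[16/03/2022, 15:10:17] Messages and calls are end-to-end encrypted.",
    "[16/03/2022, 15:20:25] person1: Hi shiva!",
    "[16/03/2022, 15:25:38] person2: Hi",
    "[16/03/2022, 18:35:51] person1: \u200eLocation: https://maps.google.com/?q=52.089451,5.108469",
])
for row in parse_chat(sample):
    print(row)
```

The resulting list of dictionaries can be handed to a data frame library of your choice to build the table that is shown to the participant.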

One big problem we had to overcome is that messages and alert notifications cannot be identified in the same way (i.e., using the same regular expression) on every device. Through trial and error, we tailored the steps to work with every operating system, language, and device required for this study. Indeed, if you design a study like this, it is very important to try out your script on many different DDPs from different people and devices. That way you can make sure you have covered the possible variation in DDPs before actually starting data collection. This process can take quite a while, so keep it in mind when you want to run a data donation study!
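One way to make this kind of testing systematic is to keep a list of candidate header patterns and measure how many lines of each test DDP they recognise. The sketch below is only an illustration: the "bracketed" pattern matches the example chat shown earlier, while the "dashed" pattern is an assumed variant included to show the idea, and `detect_layout` is a hypothetical helper name.

```python
import re
from datetime import datetime

# (name, header pattern, timestamp format). Only "bracketed" comes from the
# example chat; "dashed" is an assumed layout for illustration.
CANDIDATES = [
    ("bracketed", re.compile(r"^\[(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2})\] "), "%d/%m/%Y, %H:%M:%S"),
    ("dashed", re.compile(r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - "), "%d/%m/%Y, %H:%M"),
]


def detect_layout(lines):
    """Return (layout name, share of non-empty lines with a valid header)."""
    non_empty = [line for line in lines if line.strip()]
    best_name, best_share = None, 0.0
    for name, pattern, stamp_format in CANDIDATES:
        hits = 0
        for line in non_empty:
            match = pattern.match(line)
            if match is None:
                continue
            try:
                datetime.strptime(match.group(1), stamp_format)
            except ValueError:
                continue  # looks like a header but is not a valid date
            hits += 1
        share = hits / len(non_empty) if non_empty else 0.0
        if share > best_share:
            best_name, best_share = name, share
    return best_name, best_share


print(detect_layout(["[16/03/2022, 15:20:25] person1: Hi shiva!"]))
```

A low share for every candidate on a new test file is a signal that the file uses a layout the script does not cover yet.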

The end result

In Table 1 you can see a (fictitious) snippet of the dataset obtained. This is what a dataset that combines donations from different users would look like. As can be seen, we have moved from a rather untidy text file to a tidy, directly analysable dataset, where each row corresponds to a user in a given data donation package, and the columns give information about that user. In particular, the dataset contains the following information: the data donation package id (ddp_id), an anonymized user name (user_name), the number of words the user sent in the group chat (nwords), the dates of the first and last messages the user sent in the group chat (date_first_mess, date_last_mess), the number of URLs, files and locations the user sent in the group chat (nurls, nfiles and nlocations, respectively), the other user who replied to that user the most (replies_from), and the user that that user replied to the most (replies_to).

Table 1. Snippet of fictitious dataset

ddp_id  user_name  nwords  date_first_mess  date_last_mess  nurls  nfiles  nlocations  replies_from  replies_to
1       User1_1    121     10/08/2023       27/08/2023      0      15      0           User1_2       User1_4
1       User1_2    17      11/08/2023       28/08/2023      3      1       2           User1_1       User1_1
1       User1_3    44      10/08/2023       28/08/2023      9      6       3           User1_2       User1_1
1       User1_4    50      12/08/2023       29/08/2023      0      3       1           User1_3       User1_1
2       User2_1    123     01/05/2022       01/11/2022      2      0       1           User2_2       User2_2
2       User2_2    250     01/05/2022       02/11/2022      0      32      3           User2_3       User2_1
2       User2_3    176     08/07/2022       04/12/2022      6      0       5           User2_2       User2_3
3       User3_1    12      05/06/2023       26/07/2023      12     2       0           User3_1       User3_2
3       User3_2    16      06/06/2023       26/07/2023      17     2       0           User3_2       User3_1
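To give an idea of how rows like these could be computed from parsed messages, here is a rough sketch. It rests on several assumptions that are not taken from the study script: a message counts as a "reply" to the previous message when that message came from a different user, location messages are recognised by a `Location:` prefix as in the example chat, words are whitespace-separated tokens in non-location messages, and nfiles is left out because attachment markers are not shown in the example. The function names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

URL_RE = re.compile(r"https?://\S+")


def most_common_or_none(counter):
    """Return the key with the highest count, or None for an empty counter."""
    return counter.most_common(1)[0][0] if counter else None


def summarise(messages, ddp_id):
    """messages: (user_name, timestamp, text) tuples in chat order."""
    stats = {}
    replies_to = defaultdict(Counter)    # replies_to[a][b]: a wrote right after b
    replies_from = defaultdict(Counter)  # replies_from[b][a]: same event, seen from b
    previous_user = None
    for user, stamp, text in messages:
        row = stats.setdefault(user, {
            "ddp_id": ddp_id, "user_name": user, "nwords": 0,
            "date_first_mess": stamp, "date_last_mess": stamp,
            "nurls": 0, "nlocations": 0,
        })
        row["date_first_mess"] = min(row["date_first_mess"], stamp)
        row["date_last_mess"] = max(row["date_last_mess"], stamp)
        if text.startswith("Location:"):
            row["nlocations"] += 1
        else:
            row["nurls"] += len(URL_RE.findall(text))
            row["nwords"] += len(text.split())
        # Assumed reply rule: a message answers the previous one if the sender differs.
        if previous_user is not None and previous_user != user:
            replies_to[user][previous_user] += 1
            replies_from[previous_user][user] += 1
        previous_user = user
    for user, row in stats.items():
        row["date_first_mess"] = row["date_first_mess"].strftime("%d/%m/%Y")
        row["date_last_mess"] = row["date_last_mess"].strftime("%d/%m/%Y")
        row["replies_from"] = most_common_or_none(replies_from[user])
        row["replies_to"] = most_common_or_none(replies_to[user])
    return list(stats.values())


example_messages = [
    ("User1_1", datetime(2022, 3, 16, 15, 20, 25), "Hi shiva!"),
    ("User1_2", datetime(2022, 3, 16, 15, 25, 38), "Hi"),
    ("User1_2", datetime(2022, 3, 16, 18, 39, 29), "https://youtu.be/KBmUTY6mK_E"),
    ("User1_1", datetime(2022, 3, 17, 9, 0, 0), "Location: https://maps.google.com/?q=52.089451,5.108469"),
]
for row in summarise(example_messages, ddp_id=1):
    print(row)
```

Only these aggregated rows, and not the message texts themselves, would then be presented to the participant for consent.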

In Figure 2 you can see a screenshot of how the Port software would display the data to be shared (number of words or messages, date stamps…) and ask research subjects for their consent. As you can see, the Port software makes sure that research subjects are aware of what information they are sharing and consent to it. The rest of the DDP, including sensitive data, is processed locally and does not leave the respondent’s device.

[Figure 2. Screenshot of the Port consent screen]

Conclusion

The aim of this post was to illustrate how to use data donation with the Port software to extract online platform data. We illustrated this with the extraction of group-chat information from WhatsApp data. The main challenge of this project was to write a robust script that transforms this data into a nice, readily usable format while maintaining privacy. If you want to implement something similar but do not know how or where to start, let us know and we can help!


  1. This script uses a deprecated version of Port, but a large part of the script can be reused.