When working with text data in research projects, privacy is extremely important, especially when that data includes potentially personal or sensitive information. Whether you’re dealing with interview transcripts, survey responses, or scraped online content, failing to properly anonymise your data can have serious ethical consequences for both participants and researchers.
However, when dealing with large amounts of text data, the anonymisation process can be slow and inefficient, and large-scale manual anonymisation is prone to human error. One tool seeking to address this problem is Microsoft Presidio, an open-source solution for detecting and anonymising personally identifiable information in text. Presidio allows you to build a robust anonymisation pipeline using Named Entity Recognition models and configurable redaction policies, all running locally on your own device.
Getting started with Presidio can be a little overwhelming, especially if you’re looking for a clean, automated solution that fits into your research workflow. Many existing resources are written with developers in mind, and can be difficult for researchers to jump into. In this tutorial, we walk you through our solution using Microsoft Presidio, so that you can safely and efficiently anonymise your research data.
Step 1: Prepare Your Data
This step is simple: make sure the data you’d like to anonymise is saved in a CSV file. For the purpose of this tutorial, we will be using the following data set:
| ID | Response |
|---|---|
| 1 | I want to pass my exam, because I am going to France this summer and I don’t want to do a resit. |
| 2 | Brian, Thijs and I are meeting up at the cafe near Utrecht central station, to have a group study session. |
| 3 | I will ask Prof. Martins for feedback on my assignment by Friday. |
| 4 | I finished my assignment last week, and this week I just need to check it before submitting. |
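If you’d like to follow along, you can create this example file yourself with a few lines of Python (we assume the filename `test_data.csv`, which is the name used in the later steps):

```python
import csv

# Example rows matching the table above
rows = [
    (1, "I want to pass my exam, because I am going to France this summer "
        "and I don't want to do a resit."),
    (2, "Brian, Thijs and I are meeting up at the cafe near Utrecht central "
        "station, to have a group study session."),
    (3, "I will ask Prof. Martins for feedback on my assignment by Friday."),
    (4, "I finished my assignment last week, and this week I just need to "
        "check it before submitting."),
]

# Write the data to a CSV file with an ID and a Response column
with open("test_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Response"])
    writer.writerows(rows)
```

Any CSV with one or more text columns will work the same way; the column names themselves don’t matter to Presidio.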
Step 2: Install the Required Packages
First, make sure you have Python installed on your machine. If you don’t, you can find the most recent version on the Python website. Once you have this prepared, you can install the following packages:
```shell
# install packages
pip install polars presidio_analyzer presidio_anonymizer

# download English NER language model
python -m spacy download en_core_web_lg
```
The `polars` data frame library will be used for data import and export, but you are welcome to use any other package (such as `pandas`). On the second line, we use the `spacy` package (installed as a dependency of Presidio) to download `en_core_web_lg`: a large English NLP model which Presidio uses to recognise parts of speech and named entities, and thus provide context-dependent anonymisation. Once you’ve downloaded all these packages and prepped your data file, you’re ready to start the anonymisation process.
Step 3: Anonymise your Data
First, import the packages you just installed and will need for this process:
```python
import polars as pl
from presidio_analyzer import BatchAnalyzerEngine
from presidio_anonymizer import BatchAnonymizerEngine
```
Next, indicate the name of the data file you would like to anonymise, import it into a dataframe, and initialise the analyzer and anonymizer engines:
```python
# load the data
df = pl.read_csv("test_data.csv")

# turn data into a dictionary of columns
df_dict = df.to_dict(as_series=False)

# initialize the analyzer and anonymizer
batch_analyzer = BatchAnalyzerEngine()
batch_anonymizer = BatchAnonymizerEngine()
```
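To make the data shape concrete: `to_dict(as_series=False)` turns the table into a dictionary mapping each column name to a list of that column’s values, which is the structure the batch analyzer iterates over. A hand-written illustration of that structure for the first two rows (plain Python, no polars required):

```python
# What the dictionary of columns looks like for the first two example rows
# (a hand-written illustration of polars' to_dict(as_series=False) output)
df_dict = {
    "ID": [1, 2],
    "Response": [
        "I want to pass my exam, because I am going to France this summer "
        "and I don't want to do a resit.",
        "Brian, Thijs and I are meeting up at the cafe near Utrecht central "
        "station, to have a group study session.",
    ],
}

# Each key is a column; each value holds that column's cells in row order
assert list(df_dict) == ["ID", "Response"]
assert len(df_dict["ID"]) == len(df_dict["Response"])
```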
This final piece of code runs the anonymisation and writes the anonymised data to a new CSV file; in the code below, the input name with “_anon” appended (`test_data_anon.csv`). It is also where you indicate what kind of information you would like to anonymise, and Presidio provides a selection of built-in entity types which you can choose from. For this tutorial, we are using “PERSON” and “LOCATION”, meaning any references to specific names of people or names of places will be anonymised. However, multiple other entity types such as “EMAIL_ADDRESS”, “IP_ADDRESS”, or “DATE_TIME” can also be recognised and anonymised. A full list of supported entity types can be found in the Presidio documentation, and they can be added to the `entities=` argument below as needed:
```python
# find entities
analyzer_results = batch_analyzer.analyze_dict(
    input_dict=df_dict, entities=["PERSON", "LOCATION"], language="en"
)

# collect results into a list
analyzer_results = list(analyzer_results)

# perform anonymization on the detected entities
anonymizer_results = batch_anonymizer.anonymize_dict(analyzer_results)

# turn into dataframe
df_anon = pl.DataFrame(anonymizer_results)

# write to csv
df_anon.write_csv("test_data_anon.csv")
```
The result of this process is an anonymised CSV file, with the example data used in this tutorial now looking like this:
| ID | Response |
|---|---|
| 1 | I want to pass my exam, because I am going to &lt;LOCATION&gt; this summer and I don’t want to do a resit. |
| 2 | &lt;PERSON&gt;, &lt;PERSON&gt; and I are meeting up at the cafe near &lt;LOCATION&gt; central station, to have a group study session. |
| 3 | I will ask Prof. &lt;PERSON&gt; for feedback on my assignment by Friday. |
| 4 | I finished my assignment last week, and this week I just need to check it before submitting. |
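Before sharing the file, it’s worth spot-checking that the placeholders appear where you expect and that no obvious names slipped through. A minimal sketch of such a check, using the example output above hard-coded for illustration (in practice you would read the rows back from your anonymised CSV):

```python
import re

# Anonymised responses from the output table above
responses = [
    "I want to pass my exam, because I am going to <LOCATION> this summer "
    "and I don't want to do a resit.",
    "<PERSON>, <PERSON> and I are meeting up at the cafe near <LOCATION> "
    "central station, to have a group study session.",
    "I will ask Prof. <PERSON> for feedback on my assignment by Friday.",
    "I finished my assignment last week, and this week I just need to "
    "check it before submitting.",
]

# Count each placeholder type across all responses
counts = {}
for text in responses:
    for tag in re.findall(r"<([A-Z_]+)>", text):
        counts[tag] = counts.get(tag, 0) + 1

print(counts)  # → {'LOCATION': 2, 'PERSON': 3}
```

Automated checks like this complement, but do not replace, a manual review of a sample of the output: no NER model catches every identifier, and a final human pass remains good practice.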