
Pseudonymization best practices
Data security and privacy by design part 2
Every application and IT system needs to comply with privacy legislation such as the GDPR. This is especially true for our Data Hub solution, which stores and processes data of any kind, including personal data. In our previous article, we provided a high-level overview of security and privacy. In this one, we want to share some best practices around the concept and process of pseudonymization itself.
At the end of this article, we also describe the PySpark library we’ve developed to perform pseudonymization within the Data Hub.
Pseudonymization is a de-identification process applied to personal data so that the data can no longer be linked to a specific person without additional information, while remaining useful for data analysis and data processing. It is frequently used to meet GDPR requirements for the secure storage of personal information.
There are several ways to pseudonymize or anonymize your data so that it’s GDPR compliant. The common methods we will cover are masking, hashing, rounding, date truncation and simply removing the personal data. So, let’s dig right into it!
Masking
Masking is a de-identification process where an important or unique part of the data is hidden behind random characters or other data. The masked value can still be recognized for what it is, for example a credit card number, without exposing the identity behind it.
Credit card | Credit card masked |
---|---|
1234-5678-1234-5678 | ****-****-****-5678 |
Here we masked a credit card number up to the last four digits.
Consider carefully how to mask your data, because some forms of masking are still not GDPR compliant. For example, *******@gmail.com is GDPR compliant, but if the data subject has a personalized email address containing, let’s say, their last name, masking only the part before the @ yields *******@ockerse.com, which is not GDPR compliant because the domain still points to the person.
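As a minimal sketch of both cases in plain Python: the helper names `mask_keep_last` and `mask_email` are made up for this illustration and are not part of our library. The email helper masks the domain as well, which is one possible way to avoid the personalized-domain problem.

```python
def mask_keep_last(value: str, keep: int = 4, mask: str = "*") -> str:
    """Mask every letter or digit except the last `keep`; separators such as '-' stay."""
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def mask_email(address: str, mask: str = "*") -> str:
    """Mask both the local part and the domain, keeping only the '@' separator."""
    local, _, domain = address.partition("@")
    return mask * len(local) + "@" + mask * len(domain)


print(mask_keep_last("1234-5678-1234-5678"))  # ****-****-****-5678
print(mask_email("roberto@ockerse.com"))      # *******@***********
```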
Hashing
Hashing is a way of transforming data into a fixed-sized, unreadable string of information (hash value); the resulting hash value is always the same for the same input, allowing some data analysis to be possible.
Name | Hashed name |
---|---|
John Doe | Hash1 |
Roberto Ockerse | Hash2 |
John Doe | Hash1 |
Liam Li | Hash3 |
Here we hashed the full value of each name. As you can see, both occurrences of “John Doe” have the same hash value.
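The table can be reproduced with a few lines of Python. This is only a sketch: the article does not prescribe a hash function, so SHA-256 from the standard library is an assumption here, and `hash_value` is a hypothetical helper name.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the SHA-256 digest of `value` as a 64-character hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


names = ["John Doe", "Roberto Ockerse", "John Doe", "Liam Li"]
hashed = [hash_value(name) for name in names]

print(hashed[0] == hashed[2])  # True: the same input always gives the same hash
print(len(set(hashed)))        # 3: one distinct hash per distinct name
```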
Hashing should not be considered a suitable pseudonymization method if the range of possible inputs is too small, because the original value behind a hash could then be recovered with reasonable means, simply by hashing every candidate input and comparing the results.
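To make that risk concrete, here is a small sketch under the assumption that a column of birth years was hashed with plain SHA-256 (both the column and the hash function are illustrative choices). An attacker only needs to hash every plausible year once to build a reverse lookup table.

```python
import hashlib


def hash_value(value: str) -> str:
    """Plain, unkeyed SHA-256 of the input string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# A hashed birth year as it might appear in a "protected" data set.
leaked_hash = hash_value("1987")

# The attacker hashes all 130 candidate years and maps each hash back to its year.
lookup = {hash_value(str(year)): str(year) for year in range(1900, 2030)}

print(lookup[leaked_hash])  # 1987
```

A common mitigation is a keyed hash, for example HMAC with a secret key that is stored separately from the data, so that the lookup table cannot be built without the key.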
Rounding
Rounding numerical data is a simple but effective pseudonymization method. The idea is to remove some amount of precision from individual values by replacing them with a broader category.
In this example, we’ve put salary values in buckets of 10,000, with a maximum value of 60,000 and a minimum value of 18,000.
Salaries | Rounded salaries |
---|---|
45016 | 50000 |
12420 | 18000 |
87564 | 60000 |
37453 | 40000 |
124564 | 60000 |
14784 | 18000 |
When rounding numerical data, it’s a good practice to have a max and min value to default to, as data points that are way above or below the average could be used to identify a data subject.
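The salary table above can be reproduced with the following sketch. The function name `round_to_bucket` is hypothetical, and the code assumes integer inputs that are rounded to the closest bucket before the minimum and maximum are applied.

```python
def round_to_bucket(value: int, bucket: int, min_value: int, max_value: int) -> int:
    """Round to the closest multiple of `bucket`, then clamp to [min_value, max_value]."""
    rounded = ((value + bucket // 2) // bucket) * bucket  # exact halves round up
    return max(min_value, min(max_value, rounded))


salaries = [45016, 12420, 87564, 37453, 124564, 14784]
print([round_to_bucket(s, 10_000, 18_000, 60_000) for s in salaries])
# [50000, 18000, 60000, 40000, 60000, 18000]
```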
Date Truncating
Date truncating, similar to rounding numerical data, can be used as a form of de-identification by removing some precision from a date.
Birth date | Truncated Birth date |
---|---|
04/02/1987 | 01/01/1987 |
12/07/2008 | 01/01/2008 |
15/08/1993 | 01/01/1993 |
Here, we’ve truncated each individual birthday to the year and have removed the month and day of the data subject’s birth date.
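A minimal sketch of this truncation with Python's standard library, assuming the dates are written as day/month/year (the helper name `truncate_to_year` is made up for this example):

```python
from datetime import date, datetime


def truncate_to_year(value: date) -> date:
    """Keep only the year; month and day are reset to 1 January."""
    return value.replace(month=1, day=1)


birth_dates = ["04/02/1987", "12/07/2008", "15/08/1993"]
for text in birth_dates:
    parsed = datetime.strptime(text, "%d/%m/%Y").date()  # assumes day/month/year
    print(truncate_to_year(parsed).strftime("%d/%m/%Y"))
# 01/01/1987
# 01/01/2008
# 01/01/1993
```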
Nullify
Lastly, the most thorough way of de-identifying data is removing it.
Example:
Name | Nullified name |
---|---|
John Doe | Null |
Liam Li | Null |
Note: pseudonymization is a powerful tool for protecting personal data while keeping some information available for analysis. However, combining data can be risky, as the combination could indirectly re-identify a person. This is known as the “mosaic effect”.
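One simple way to get a feel for this risk is to count how many records are unique on a combination of already de-identified attributes. The sketch below uses invented records, and the department column is a hypothetical extra attribute; any combination that occurs only once still singles out one person.

```python
from collections import Counter

# Hypothetical pseudonymized records: (birth year, rounded salary, department)
records = [
    (1987, 50000, "Sales"),
    (1987, 50000, "Sales"),
    (1993, 40000, "Sales"),
    (1993, 40000, "Engineering"),
    (2008, 18000, "Engineering"),
]


def unique_combinations(rows):
    """Return the attribute combinations that occur exactly once."""
    counts = Counter(rows)
    return [combo for combo, n in counts.items() if n == 1]


print(len(unique_combinations(records)))  # 3 of the 5 records are still unique
```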
Pseudonymization library
Here at Anchormen we’ve developed a Spark library that generalizes and simplifies the process of pseudonymization within the Data Hub.
The library takes a JSON configuration file specifying:
- which column to alter
- which method to use (hashing, masking, rounding, date truncation or nullifying)
- any other information required by the pseudonymization technique.
Take for example this employee.csv filled with typical information you would have for employees:
Name | Email | Bank account | Phone number | Salary |
---|---|---|---|---|
John Doe | john@doel.com | 1234-5678-1234-5678 | 0668525359 | 45016 |
Bob Li | bob@outlook.com | 5678-1234-5678-1234 | 0638671046 | 12420 |
Robert Garcia | robert@yahoo.com | 5000-2000-9000-5555 | 0653791287 | 37453 |
Janice Doe | janice@gmail.com | 4545-7878-1212-3232 | 0610348233 | 87564 |
All of this data could be used to identify someone and must therefore be pseudonymized or anonymized in some way to prevent identification. To do this, we create a config.json file with instructions on how to de-identify each column.

    [
      {
        "column": "name",
        "strategy": "hash_by_token",
        "mode": "split",
        "token": " "
      },
      {
        "column": "email",
        "strategy": "mask_by_token",
        "mode": "split",
        "token": "@",
        "mask": "*"
      },
      {
        "column": "bank_account",
        "strategy": "mask_by_token",
        "mode": "split_first",
        "token": "-",
        "mask": "X",
        "n_tokens": 3
      },
      {
        "column": "phone_number",
        "strategy": "mask_n_chars",
        "n_chars": 8,
        "mask": "*",
        "mode": "last"
      },
      {
        "column": "salary",
        "strategy": "round_numerical",
        "mode": "closest",
        "max": 60000,
        "min": 18000,
        "bucket": 10000
      }
    ]

The pseudonymize function takes the data frame to de-identify and the path to the configuration file. It then returns a pseudonymized data frame that you can store safely in the Data Hub.
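To illustrate the idea of a configuration-driven approach, here is a rough sketch in plain Python. It is not the library's implementation: it works on a list of dictionaries instead of a Spark data frame, covers only two of the strategies, and the names `pseudonymize_rows` and `STRATEGIES` are hypothetical.

```python
import json


def mask_n_chars(value, rule):
    """Mask the last `n_chars` characters (only the "last" mode is sketched here)."""
    n = rule["n_chars"]
    text = str(value)
    return text[: max(len(text) - n, 0)] + rule["mask"] * min(n, len(text))


def round_numerical(value, rule):
    """Round to the closest bucket, then clamp to the configured min and max."""
    bucket = rule["bucket"]
    rounded = ((int(value) + bucket // 2) // bucket) * bucket
    return max(rule["min"], min(rule["max"], rounded))


# Maps the "strategy" field of a rule to the function that implements it.
STRATEGIES = {"mask_n_chars": mask_n_chars, "round_numerical": round_numerical}


def pseudonymize_rows(rows, rules):
    """Apply each rule to its column and return new rows; the input is not modified."""
    result = []
    for row in rows:
        new_row = dict(row)
        for rule in rules:
            column = rule["column"]
            new_row[column] = STRATEGIES[rule["strategy"]](new_row[column], rule)
        result.append(new_row)
    return result


rules = json.loads("""
[
  {"column": "phone_number", "strategy": "mask_n_chars", "n_chars": 8, "mask": "*", "mode": "last"},
  {"column": "salary", "strategy": "round_numerical", "mode": "closest", "max": 60000, "min": 18000, "bucket": 10000}
]
""")

rows = [{"phone_number": "0638671046", "salary": 12420}]
print(pseudonymize_rows(rows, rules))
# [{'phone_number': '06********', 'salary': 18000}]
```

Keeping the strategies in a lookup table means a new technique can be added without touching the loop that applies the rules.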
Employee_pseudonymized.csv
Name | Email | Bank account | Phone number | Salary |
---|---|---|---|---|
Hash1 Hash2 | ****@******* | XXXX-XXXX-XXXX-5678 | 06******** | 50000 |
Hash3 Hash4 | ***@*********** | XXXX-XXXX-XXXX-1234 | 06******** | 18000 |
Hash5 Hash6 | ******@********* | XXXX-XXXX-XXXX-5555 | 06******** | 40000 |
Hash7 Hash2 | ******@********* | XXXX-XXXX-XXXX-3232 | 06******** | 60000 |
As you can see, the data frame is now fully de-identified but still viable for analysis purposes. Taking a closer look at the name column, we told the pseudonymization library to split the value into sub-strings on every space character and then hash each sub-string. This preserves the relation between names: for example, the last name “Doe” was hashed as “Hash2”, and “Hash2” appears in both the first and the last row. We now know that these two employees share a last name, but not what the original value is.
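The splitting behaviour can be sketched as follows. This is an illustration rather than the library's code: SHA-256 is an assumed hash function, the helper `hash_by_token` only borrows its name from the strategy in the configuration, and the digests are shortened to 8 characters purely for readability, which makes collisions more likely than with full digests.

```python
import hashlib


def hash_by_token(value: str, token: str = " ") -> str:
    """Split on `token`, hash every part separately, and join the hashes again."""
    parts = value.split(token)
    hashed = [hashlib.sha256(p.encode("utf-8")).hexdigest()[:8] for p in parts]
    return token.join(hashed)


john = hash_by_token("John Doe")
janice = hash_by_token("Janice Doe")

print(john.split(" ")[1] == janice.split(" ")[1])  # True: both share the "Doe" hash
print(john.split(" ")[0] == janice.split(" ")[0])  # False: the first names differ
```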
GDPR requires that any personally identifiable data be pseudonymized or anonymized in such a way that it can no longer be linked to a single data subject. We saw how this can be done with common methods such as hashing or rounding, and why you should be mindful of which method you choose to de-identify your data, as sometimes the subject could still be linked.
The pseudonymization library cuts a lot of the development time needed to create and manage your own de-identification methods, giving you more time to focus on development and analysis of your data. We hope you’ll find it useful in your work! If you have any questions about the topic, best practices, the library or anything else, you can contact us here.