Pseudonymization best practices

Data security and privacy by design part 2

Every application and IT system needs to comply with privacy legislation such as GDPR. This is especially true for our Data Hub solution, which stores and processes data of any kind, including personal data. Our previous article gave a high-level overview of security and privacy. In this one, we share some best practices around the concept and process of pseudonymization itself.

At the end of this article, we also describe the PySpark library we developed to perform pseudonymization within the Data Hub.

Pseudonymization is a de-identification process applied to personal data so that the data can no longer be linked to a specific person without additional information, while remaining useful for data analysis and processing. It is frequently used to help meet GDPR requirements for the secure storage of personal information.

There are several ways to pseudonymize or anonymize your data so that it’s GDPR compliant. The common methods we cover are masking, hashing, rounding, date truncation, and simply removing the personal data. So, let’s dig right into it!

Masking

Masking is a de-identification process in which an important or unique part of the data is hidden behind random characters or other data. The masked value still shows what kind of data it is without revealing the actual identity behind it.

Credit card          Credit card masked
1234-5678-1234-5678  ****-****-****-5678

Here we masked every digit of the credit card number except the last four.

Consider carefully how best to mask your data, because some forms of masking are still not GDPR compliant. For example, *******@gmail.com is GDPR compliant, but if the data subject has a personalized email domain containing, let’s say, their last name, the masked email becomes *******@ockerse.com, which is not GDPR compliant because the domain still points to the person.
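
The two masking cases above can be sketched in plain Python. This is only an illustration: the function names `mask_card` and `mask_email` and their parameters are made up for this example and are not part of our library.

```python
def mask_card(number: str, mask: str = "*", keep_groups: int = 1, sep: str = "-") -> str:
    """Mask every separator-delimited group except the last `keep_groups`."""
    groups = number.split(sep)
    cut = max(len(groups) - keep_groups, 0)
    masked = [mask * len(group) for group in groups[:cut]] + groups[cut:]
    return sep.join(masked)


def mask_email(address: str, mask: str = "*", mask_domain: bool = True) -> str:
    """Mask the local part of an email address and, optionally, the domain too."""
    local, at, domain = address.partition("@")
    if mask_domain:
        domain = mask * len(domain)
    return mask * len(local) + at + domain


print(mask_card("1234-5678-1234-5678"))   # ****-****-****-5678
print(mask_email("roberto@ockerse.com"))  # *******@***********
```

Masking the domain by default is a deliberate choice in this sketch: it avoids the personalized-domain problem at the cost of losing the email provider as an analysis dimension.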

Hashing

Hashing transforms data into a fixed-size, unreadable string (the hash value). The same input always produces the same hash value, which keeps some data analysis possible.

Name             Hashed name
John Doe         Hash1
Roberto Ockerse  Hash2
John Doe         Hash1
Liam Li          Hash3

Here we hashed the full value of each name. As you can see, both occurrences of “John Doe” have the same hash value.

Hashing should not be considered a suitable pseudonymization method if the range of unique inputs is too small, because the original attribute behind a hash could then be recovered with reasonable means, for example by hashing every possible input and comparing the results.
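
The short Python sketch below illustrates both points: equal inputs give equal hashes, and a small input range (here ages 0 to 120, a made-up example) can be reversed with a simple lookup table. The `keyed_hash` function shows one common mitigation, a keyed hash (HMAC) with a secret key. That mitigation is our own addition for illustration, not something required by the text above, and all function names are hypothetical.

```python
import hashlib
import hmac


def hash_value(value: str) -> str:
    """Plain SHA-256 hash of a string, returned as hex."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, key: bytes) -> str:
    """HMAC-SHA256 of a string; cannot be recomputed without the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


names = ["John Doe", "Roberto Ockerse", "John Doe", "Liam Li"]
hashes = [hash_value(name) for name in names]
print(hashes[0] == hashes[2])  # True: same input, same hash

# Small input range: precompute the hash of every possible age.
lookup = {hash_value(str(age)): age for age in range(121)}
print(lookup[hash_value("34")])  # 34: the "hidden" value is recovered
```

With `keyed_hash`, an attacker who does not know the key cannot build the same lookup table, while equal inputs still map to equal outputs.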

Rounding

Rounding numerical data is a simple but effective pseudonymization method. The idea is to remove some amount of precision from individual values by replacing them with a broader category.

In this example, we’ve put salary values in buckets of 10,000 with a max value of 60,000 and a min value of 18,000.

Salaries  Rounded salaries
45016     50000
12420     18000
87564     60000
37453     40000
124564    60000
14784     18000

When rounding numerical data, it’s good practice to set a max and min value to default to, as data points far above or below the average could be used to identify a data subject.
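
As a minimal sketch, the rounding in the table above can be reproduced in plain Python by rounding to the closest bucket and then clamping to the min and max. The function name `round_to_bucket` and its parameters are hypothetical.

```python
def round_to_bucket(value: float, bucket: int, minimum: int, maximum: int) -> int:
    """Round to the closest multiple of `bucket`, then clamp to [minimum, maximum]."""
    rounded = int((value + bucket / 2) // bucket) * bucket
    return max(minimum, min(maximum, rounded))


salaries = [45016, 12420, 87564, 37453, 124564, 14784]
rounded = [round_to_bucket(s, bucket=10000, minimum=18000, maximum=60000) for s in salaries]
print(rounded)  # [50000, 18000, 60000, 40000, 60000, 18000]
```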

Date Truncating

Date truncating, similar to rounding numerical data, can be used as a form of de-identification by removing some precision from a date.

Birth date  Truncated birth date
04/02/1987  01/01/1987
12/07/2008  01/01/2008
15/08/1993  01/01/1993

Here, we’ve truncated each birth date to the year by resetting the month and day to 1 January.
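
A minimal Python sketch of this truncation, assuming the dates are strings in day/month/year format as in the table. The function name `truncate_to_year` is hypothetical.

```python
from datetime import datetime


def truncate_to_year(value: str, fmt: str = "%d/%m/%Y") -> str:
    """Keep only the year of a date by resetting month and day to 1 January."""
    parsed = datetime.strptime(value, fmt).date()
    return parsed.replace(month=1, day=1).strftime(fmt)


birth_dates = ["04/02/1987", "12/07/2008", "15/08/1993"]
print([truncate_to_year(d) for d in birth_dates])
# ['01/01/1987', '01/01/2008', '01/01/1993']
```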

Nullify

Lastly, the most thorough way to de-identify data is to remove it.

Example:

Name      Nullified name
John Doe  Null
Liam Li   Null

Note: pseudonymization is a powerful tool for protecting personal data while keeping some information available for analysis. However, it can be risky when data sets are combined, as the combination could indirectly re-identify a person. This is called “the mosaic effect”.
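
One simple way to get a feel for this risk is to count how many records share each combination of de-identified values: a combination that occurs only once singles out one record. The sketch below uses made-up records and a hypothetical helper, `unique_combinations`. It is a rough check, not a complete re-identification analysis.

```python
from collections import Counter

# Hypothetical, already de-identified records (not taken from a real data set).
records = [
    {"birth_year": 1987, "salary": 50000, "department": "Hash1"},
    {"birth_year": 1987, "salary": 50000, "department": "Hash1"},
    {"birth_year": 1993, "salary": 40000, "department": "Hash1"},
    {"birth_year": 1993, "salary": 40000, "department": "Hash2"},
    {"birth_year": 2008, "salary": 18000, "department": "Hash2"},
]


def unique_combinations(rows, columns):
    """Return the value combinations that occur exactly once in `rows`."""
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


print(unique_combinations(records, ["birth_year"]))
# [(2008,)]
print(len(unique_combinations(records, ["birth_year", "salary", "department"])))
# 3
```

Looking at one column, only one record stands out. Combining three columns makes three of the five records unique, even though each column on its own was de-identified.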

Pseudonymization library

Here at Anchormen, we’ve developed a Spark library that generalizes and simplifies the process of pseudonymization within the Data Hub.

The library takes a JSON configuration file specifying:

  • which column to alter
  • which method to use (hashing, masking, rounding, date truncation or nullifying)
  • any other information required by the pseudonymization technique.

Take, for example, this employee.csv, filled with the typical information you would hold about employees:

Name           Email             Bank account         Phone number  Salary
John Doe       john@doel.com     1234-5678-1234-5678  0668525359    45016
Bob Li         bob@outlook.com   5678-1234-5678-1234  0638671046    12420
Robert Garcia  robert@yahoo.com  5000-2000-9000-5555  0653791287    37453
Janice Doe     janice@gmail.com  4545-7878-1212-3232  0610348233    87564

All of this data could be used to identify someone and therefore must be pseudonymized or anonymized in some way to prevent identification. To do this, we create a config.json file with instructions on how to de-identify each column.

[
  {
    "column": "name",
    "strategy": "hash_by_token",
    "mode": "split",
    "token": " "
  },
  {
    "column": "email",
    "strategy": "mask_by_token",
    "mode": "split",
    "token": "@",
    "mask": "*"
  },
  {
    "column": "bank_account",
    "strategy": "mask_by_token",
    "mode": "split_first",
    "token": "-",
    "mask": "X",
    "n_tokens": 3
  },
  {
    "column": "phone_number",
    "strategy": "mask_n_chars",
    "n_chars": 8,
    "mask": "*",
    "mode": "last"
  },
  {
    "column": "salary",
    "strategy": "round_numerical",
    "mode": "closest",
    "max": 60000,
    "min": 18000,
    "bucket": 10000
  }
]

The pseudonymize function takes the data frame to de-identify and the path to the configuration file. It returns a pseudonymized data frame that you can then store safely in the Data Hub.
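
To make the idea of a configuration-driven pipeline concrete, here is a minimal sketch in plain Python rather than PySpark. It is not the library’s code: the `pseudonymize_rows` function, the `STRATEGIES` table and the behaviour of each strategy are assumptions inferred from the configuration keys shown above, and only two of the strategies are covered.

```python
import json


def mask_n_chars(value, rule):
    """Mask the first or last `n_chars` characters (assumed semantics)."""
    n = min(rule["n_chars"], len(value))
    if rule["mode"] == "last":
        return value[: len(value) - n] + rule["mask"] * n
    return rule["mask"] * n + value[n:]


def round_numerical(value, rule):
    """Round to the closest bucket, then clamp to [min, max] (assumed semantics)."""
    if rule["mode"] != "closest":
        raise ValueError("this sketch only supports mode 'closest'")
    bucket = rule["bucket"]
    rounded = int((value + bucket / 2) // bucket) * bucket
    return max(rule["min"], min(rule["max"], rounded))


# Hypothetical dispatch table: strategy name from the config -> function.
STRATEGIES = {"mask_n_chars": mask_n_chars, "round_numerical": round_numerical}


def pseudonymize_rows(rows, config):
    """Apply every rule in `config` to its column, row by row."""
    result = []
    for row in rows:
        new_row = dict(row)
        for rule in config:
            strategy = STRATEGIES[rule["strategy"]]
            new_row[rule["column"]] = strategy(row[rule["column"]], rule)
        result.append(new_row)
    return result


config = json.loads("""
[
  {"column": "phone_number", "strategy": "mask_n_chars",
   "n_chars": 8, "mask": "*", "mode": "last"},
  {"column": "salary", "strategy": "round_numerical",
   "mode": "closest", "max": 60000, "min": 18000, "bucket": 10000}
]
""")

rows = [{"phone_number": "0668525359", "salary": 45016}]
print(pseudonymize_rows(rows, config))
# [{'phone_number': '06********', 'salary': 50000}]
```

Keeping the strategies in a lookup table means a new method only needs one function and one entry in the table, while the configuration file stays the single place where columns are mapped to methods.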


Employee_pseudonymized.csv

Name         Email             Bank account         Phone number  Salary
Hash1 Hash2  ****@********     XXXX-XXXX-XXXX-5678  06********    50000
Hash3 Hash4  ***@***********   XXXX-XXXX-XXXX-1234  06********    18000
Hash5 Hash6  ******@*********  XXXX-XXXX-XXXX-5555  06********    40000
Hash7 Hash2  ******@*********  XXXX-XXXX-XXXX-3232  06********    60000

As you can see, the data frame is now fully de-identified but still viable for analysis. Taking a closer look at the name column, we told the pseudonymization library to split each value into sub-strings on every space character and then hash each sub-string. This preserves relations between first and last names. For example, the last name “Doe” was hashed as “Hash2”, and “Hash2” appears in both the first and the last row. We now know that the two last names are identical, but not what the original value is.
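
The split-then-hash behaviour can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the library’s implementation: the choice of SHA-256 and the truncation to eight hex characters (for readability only) are ours.

```python
import hashlib


def hash_by_token(value: str, token: str = " ", length: int = 8) -> str:
    """Split `value` on `token`, hash each part separately, and re-join the hashes."""
    parts = value.split(token)
    hashed = [hashlib.sha256(part.encode("utf-8")).hexdigest()[:length] for part in parts]
    return token.join(hashed)


john = hash_by_token("John Doe")
janice = hash_by_token("Janice Doe")

# The shared last name yields the same second token, the first names differ.
print(john.split(" ")[1] == janice.split(" ")[1])  # True
print(john.split(" ")[0] == janice.split(" ")[0])  # False
```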

GDPR requires that any personally identifiable data be pseudonymized or anonymized in such a way that it can no longer be linked to a single data subject. We saw how this can be done with common methods such as hashing or rounding, and why you should be mindful of which method you choose to de-identify your data, as the subject could sometimes still be linked.

The pseudonymization library cuts a lot of the development time needed to create and manage your own de-identification methods, giving you more time to focus on development and analysis of your data. We hope you’ll find it useful in your work! If you have any questions about the topic, best practices, the library or anything else, you can contact us here.
