Azure Data Hub: data security and privacy by design
Security and privacy are important aspects of all IT projects. This is especially true for platforms intended to store and process data for analytical purposes such as our Azure Data Hub. So, whenever we are implementing a data hub for a customer, those things are always in the back of our minds.
Not a single implementation we’ve done is the same, simply because our customers have different policies and must comply with (slightly) different legislation, depending on their location and domain. In this article, we will describe Anchormen’s process of adherence to ‘security and privacy by design’, the main concerns of our clients, and some best practices we’ve picked up throughout the years.
Before diving in, it’s worth mentioning the Azure Data Hub itself. At its core, it’s an Azure native data platform which provides a solid solution to quickly deliver data-driven applications & insights using various types, volumes and speeds of data. It’s an implementation of the left side of Microsoft’s ‘Advanced Analytics Architecture’ which uses Data Factory, Data Lake storage, and Databricks. This Azure PaaS & SaaS based platform stores and processes data of any size, structure, speed and life cycle in a central place with a pay-per-use cost model. On top of this ‘foundation’ we also have a number of ‘extensions’ for IoT and streaming cases, Business Intelligence, and containerized applications. Check out this page for more information about the data platform.
It must be noted that the Advanced Analytics Architecture mentioned in the previous paragraph shows an overview of the services used but does not provide any details on other aspects such as networking, business continuity, security or privacy. Those are some of the things we will tackle in this article.
Our standard Data Hub deployment strikes a good balance between information security and usability. I find this important because poor usability leads to workarounds and shadow IT, which are rarely secure. The trick is to create a secure data platform while still giving data professionals the freedom to do their job. So our standard deployment includes:
- Hub and spoke network architecture with the option to put storage within the network (private links). Every Data Hub environment is isolated within its own Spoke network.
- Data encryption at rest as well as in transit
- Three user groups with roles providing different permissions on services and data
- A Key Vault to store secrets to avoid them ending up in code and repositories
- Auditing of data access
- Only trusted IP ranges are allowed to access storage over the internet
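To illustrate the last item in the list above, the idea behind a trusted-range rule can be sketched in a few lines of Python. In an actual deployment this check is enforced by the storage firewall configuration, not by application code; the function name and the ranges below (reserved documentation ranges) are assumptions made purely for illustration.

```python
import ipaddress

# Hypothetical allow-list; these are reserved documentation ranges, not real customer ranges.
TRUSTED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_trusted(client_ip: str) -> bool:
    """Return True when the client address falls inside one of the trusted ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in TRUSTED_RANGES)


if __name__ == "__main__":
    for ip in ("203.0.113.42", "198.51.100.40"):
        print(ip, "allowed" if is_trusted(ip) else "denied")
```

Anything outside the listed ranges is denied by default, which is the behaviour you want from an allow-list.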
This standard can be customized to comply with any organizational policy. To get the customization right, our Data Hub implementations start with workshops with architects and with security and privacy officers. We provide details about the Data Hub so that we can jointly position it within the existing IT landscape and comply with security and privacy policies. Topics covered within these workshops include (physical) separation of data, networking, personal data, role-based access control, auditing, and the processes surrounding all of this. With all of these things in mind, we customize the Data Hub design and implementation. Some example customizations are:
- Multiple isolated Data Hub environments in order to separate data domains or data sensitivities (e.g. public and internal in one environment and sensitive/secret in another)
- More fine-grained role-based access control
- Advanced network security using firewalls or network security appliances.
- The use of jump-boxes which users need to access data (i.e. they are no longer allowed to work with data directly from their own laptop)
Obviously, this only provides a high-level overview but hopefully gives an impression of what is possible.
Handling personal data follows roughly the same process as security; we have our own template, but in the end the customer must decide how they want personal information to be handled. We typically see a combination of the following approaches:
- Do not store or use personal data within the Data Hub. This makes life easy but does limit the use of the Data Hub as it can no longer be used for use-cases concerning data from people.
- Anonymize or pseudonymize data with an ETL solution or middleware before loading it into the Data Hub. This provides the opportunity to use data related to people without knowing who the data is really about. And since the original data is not stored, it is impossible (or at least extremely hard) to invade people’s privacy within the Data Hub. This scenario is not always feasible, as the ETL solution or middleware responsible for pseudonymization is not always in place.
- Anonymize or pseudonymize data within the Data Hub itself. In this scenario original data is loaded into the integration/staging layer after which it is pseudonymized. We have developed a Spark library to do this at scale (briefly described below). It is possible to store both the original data and the pseudo-version. For obvious reasons the original data is rarely used, but it can be good to have it in case it’s really needed.
- It’s also possible to store data in an SQL Database and use masking policies to pseudonymize the data when it is being read. This has the benefit of storing the data only once, but can be quite expensive for large datasets.
Pseudonymization is described within the General Data Protection Regulation (GDPR) as an “appropriate technical and organizational measure” contributing to privacy by design. It refers to the process of replacing personal data with artificial identifiers. When done correctly, it is no longer possible to identify real people based on the data, while the data remains usable for analysis. However, it’s important to be aware that this does not work in all cases and people might become identifiable when data is combined and/or scrutinized. This is typically the result of outliers in the data, like large postal areas with only a few houses, very old people (i.e. year of birth is an outlier) or very low / high incomes.
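One common way to spot such outliers is to count how many records share the same combination of indirectly identifying fields and to flag combinations that occur only a few times (a k-anonymity style check). The sketch below is plain Python and not part of our library; the function name, field names, and threshold are assumptions chosen for illustration.

```python
from collections import Counter


def risky_groups(records, quasi_identifiers, k=5):
    """Return the quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


if __name__ == "__main__":
    # Hypothetical pseudonymized records: names are gone, but outliers remain.
    records = [
        {"postal_area": "1234", "birth_year": 1985},
        {"postal_area": "1234", "birth_year": 1985},
        {"postal_area": "1234", "birth_year": 1985},
        {"postal_area": "9999", "birth_year": 1921},  # remote area, very old person
    ]
    print(risky_groups(records, ["postal_area", "birth_year"], k=2))
```

Combinations flagged this way are candidates for coarser rounding, bucketing, or removal before the data is made available for analysis.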
As stated before, we have developed our own Spark library to anonymize or pseudonymize data using a config file. It supports the following strategies:
- Hashing of strings or parts of strings. For example, a name like ‘John Doe’ can be hashed entirely, or the first and last name can be hashed separately. The latter still allows matching on, for example, last name if needed. Another example is full or partial hashing of email addresses (firstname.lastname@example.org to hash@example.org or hash@hash)
- Masking of strings by replacing characters with ‘x’ or ‘*’. This can be used to mask credit card numbers (xxxx-xxxx-xxxx-1234) or remove precision from postal codes (1234AA to 1234**)
- Dates and times can be rounded, for example changing a birth date to a month or year of birth (1990-01-01 to 1990-01 or 1990)
- Rounding of numbers into large ‘buckets’, for example rounding of income into the following: below 20K, 20K-40K, 40K-60K, 60K and above
- Fields can also be ‘nullified’, meaning that the content is simply removed. Doing this anonymizes the data but makes it less useful from an analytics perspective.
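To make the strategies listed above concrete, here is a minimal sketch in plain Python. It is not the API of our Spark library, which applies comparable transformations to DataFrame columns based on a config file. All function names, the salt, and the rule mapping are assumptions made for illustration only.

```python
import hashlib
from datetime import date

SALT = "replace-with-a-secret"  # assumption: in practice a secret, e.g. kept in a Key Vault


def hash_value(value: str) -> str:
    """Replace a string with a salted SHA-256 digest (shortened for readability)."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]


def hash_name_parts(name: str) -> str:
    """Hash first and last name separately so matching on one part stays possible."""
    return " ".join(hash_value(part) for part in name.split())


def hash_email(email: str, keep_domain: bool = True) -> str:
    """Hash the local part only (hash@domain) or both parts (hash@hash)."""
    local, _, domain = email.partition("@")
    return f"{hash_value(local)}@{domain if keep_domain else hash_value(domain)}"


def mask(value: str, keep_first: int = 0, keep_last: int = 0, char: str = "x") -> str:
    """Replace letters and digits with `char`, keeping separators and the outer characters."""
    end = len(value) - keep_last
    return "".join(
        c if (i < keep_first or i >= end or not c.isalnum()) else char
        for i, c in enumerate(value)
    )


def round_date(value: date, precision: str = "year") -> str:
    """Reduce a date to its year, or to year and month."""
    return value.strftime("%Y" if precision == "year" else "%Y-%m")


def bucket_income(amount: float) -> str:
    """Round an income into one of four coarse buckets."""
    if amount < 20_000:
        return "below 20K"
    if amount < 40_000:
        return "20K-40K"
    if amount < 60_000:
        return "40K-60K"
    return "60K and above"


def nullify(_value):
    """Remove the content entirely."""
    return None


# Hypothetical stand-in for a config file: field name -> strategy.
RULES = {
    "name": hash_name_parts,
    "email": hash_email,
    "card": lambda v: mask(v, keep_last=4),
    "postal_code": lambda v: mask(v, keep_first=4, char="*"),
    "birth_date": round_date,
    "income": bucket_income,
    "phone": nullify,
}


def pseudonymize(record: dict, rules: dict) -> dict:
    """Apply the configured strategy per field; fields without a rule pass through unchanged."""
    return {key: rules[key](value) if key in rules else value for key, value in record.items()}


if __name__ == "__main__":
    record = {
        "name": "John Doe",
        "email": "firstname.lastname@example.org",
        "card": "1234-5678-9012-1234",
        "postal_code": "1234AA",
        "birth_date": date(1990, 1, 1),
        "income": 35_000,
        "phone": "0612345678",
        "country": "NL",
    }
    print(pseudonymize(record, RULES))
```

Note that an unsalted hash of a low-variety field, such as a last name, can be reversed by simply hashing a list of candidate values, which is why this sketch mixes in a secret salt. Whether and how to salt is a design choice that depends on who needs to be able to match records.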
We are working on a follow-up post which dives deeper on the use of this library, so stay tuned for that one!
Properly pseudonymized data can still be very usable and minimizes the risk of invading someone’s privacy. People working with data should always be aware of this risk.
Legislation such as GDPR also requires organizations to handle privacy-related ‘requests’, in addition to the obligation to protect personal information. People can ask for an overview of what data is stored and how it is used, and can request correction or deletion of their data. From my perspective, most of this boils down to having proper processes to handle these requests and using data catalogs to know what is stored and where. This is a topic of its own which I will leave for a possible future blog post on the subject.