Singapore & more in numbers

Numbers, data, and statistics made easy

Data in the iron mask: How to protect important and sensitive information

🎭 This is a special post co-authored with Matthew Yerbury and Nicholas Biller, friends from HSBC’s People Data Science team. We use a Hollywood film from the 1990s to introduce the concept of data masking in a fun way!1

In The Man in the Iron Mask film, King Louis XIV was kidnapped, put in an iron mask, and replaced by his secret identical twin brother. Putting King Louis behind the mask kept his identity hidden—until the mask was unlocked by a key.

This story can help us understand the concept of data masking, which involves hiding and replacing original data with a scrambled copy. Just like how no one could identify King Louis because his most important features were hidden behind a mask.

Original images from: Wikimedia Commons, Alamy

In this post, we introduce the basics of data masking. We share some common data masking techniques and a simple flow chart to help practitioners consider when and which masking techniques may be needed.

Why mask data?

Data are masked to protect important but sensitive personal information. Companies have to follow strict legal regulations and ethical obligations surrounding how they use and share sensitive personal data to respect peoples’ right to anonymity.

As more companies move towards storing data in the cloud, they need to make sure that sensitive customer or employee information remain protected.

Data masking methods

Data can be masked using anonymisation or pseudonymisation techniques, depending on the level of data protection needed.

Anonymisation is an irreversible process where the original information, or its granularity, is lost. This is like throwing away the key to unlock the King’s iron mask. These are some examples of anonymisation techniques.

Technique Description Before Example After Example
Adding noise Add known levels of inaccuracy to data Savings
£23,034
£94,304
£25,153

Savings
£24,943
£94,502
£24,205

K-aggregation Rolls up data into groups, summarising rows in minimum sample size K Location
London
Birmingham
Sheffield


Location

UK

*K=3

Generalisation Converts values into categories Location
London
Birmingham
Sheffield

Location
UK
UK
UK

Pseudonymisation is a reversible process where the original information is preserved. This has fewer limitations on data analysis but is less secure because a key file that maps the original values to the masked values must be saved separately.

This is like having a key to unlock the iron mask. These are some examples of pseudonymisation techniques.

Technique Description Before Example After Example
Hashing Converts data into a random character sequence that is difficult to reverse without the key
Location
London
Birmingham
Sheffield

Location
59ead8d1e1242
269cb914b20d2
30d85fa7d5632

Substitution Replaces original values with alternative values Location
London
Birmingham
Sheffield

Location
Giraffe
Eagle
Mercedes

L-Diversification Splits each attribute into L different values Location
London




Location
Giraffe
Springbok
Bison
*L=3

2Note: Each hashed location actually has 32 characters but only the first 12 characters are displayed in the table.

When and how to mask your data

This is a simple flow chart to help you start thinking about when and which masking techniques may be needed for your project.

Summary

Data masking is like The Man in the Iron Mask, you hide sensitive and important information behind a mask. It’s difficult to identify what’s behind the mask without unlocking it with a key. Companies need to mask important and sensitive personal information for privacy reasons, especially as they move towards storing data in the cloud. We’ve shared some data masking techniques and considerations for choosing appropriate techniques.

P.s. We wanted to use this image of King Louis but couldn’t afford the copyright fees (click to see) 🙂


  1. About the authors: Kenneth Lim is a Data Scientist with IBM’s Data Science Elite team, Nicholas Biller is the Senior People Data Science Lead at HSBC, Matthew Yerbury is the Group Head of People Data Science at HSBC.↩