k-Anonymity and database encryption

In ADAM, we have a parent portal which allows parents to see some of the academic information we store about their children. This information needs to be secure but allow untrained individuals easy access to their portal. We want to have minimal interaction with the school to lower their support burden.

The task is made easier by the fact that we already know a great deal of information about our parents. Our role in identifying them is based on whether they are able to tell us information that we know that, because it was given to us by them, they should also know. Our approach uses a three-prong registration process. They are asked for their national identification number and their cellphone number. If we are able to match this information to a parent, we send a password reset-link to the email address stored on record. Subsequent logins require their ID number and this password. Forgotten passwords require reset via email.

While that is all good, it does mean that we are storing their ID numbers in our database. It would be prudent to encrypt all of these, but that leaves us with a problem when parents log in: decrypting all of the ID numbers to find out which one matches is a computationally expensive task.

I am looking at using a k-anonymity model which I learned about from Troy Hunt in his pwned passwords API. In simple terms, k-anonymity uses the principle of safety in numbers.

An example of this: instead of decrypting every ID number in the database, let’s find those that match the first four digits and just check those. To allow this to happen, we would need to have the ID number stored in encrypted format plus an additional field that just stored the first four digits of the ID number in plain text. We could search for those and see which possible ID numbers are likely matches. If someone were to discover the first four digits, the argument is that they still wouldn’t know the whole ID number.

That decreases the number of ID numbers we need to decrypt by a factor of about 400-fold (you might expect it to reduce the number by a factor of 104 = 10000 but there is limited variation in the first four digits of an ID number of a parent: most of whom will be born in the 60s and 70s meaning that the majority will have a first digit of 6 or 7. The third digit in an ID number is the leading 0 or 1 of the month: 2 x 10 x 2 x 10 = 400).

Instead of having to decrypt, say, 1000 ID numbers, we need only decrypt, on average, 2.5 on each login.

However, storing the first 4 characters in plain text still reveals the year and month of birth for a parent. This is potentially still leaking personal information and so we need to go further. Enter hashing.

Hashing is a one-way operation which allows us to verify the original information without necessarily knowing what it was. When people hear “one-way mathematical operation”, they tend to respond in disbelief. All of algebra, for example, operates on the idea of reversible operations! To illustrate a very simple example of one-way operations, consider a four digit number: 4371. We could “hash” this in many different ways. Let’s use the trivial method of adding all the digits: 4 + 3 + 7 + 1 = 15. If you knew just that the hash was “15”, there is no way that you could know for certain what the original number was. (An academic aside: this is meant to illustrates a one-way operation only and is not meant to suggest a good option for hashing! It is left as an exercise for the reader to decide why this algorithm is bad).

A commonly used hashing algorithm, such as SHA1, could be used to hash the ID numbers. Because the resulting hash is hexadecimal, this means that based on the hash, just by storing the first three digits of the hash, we can reduce the number of ID numbers that need to be checked by approximately 16 x 16 x 16 = 4096-fold. This, mostly, means that we will probably need to only decrypt one ID number at the most to check if it’s valid.

We would only store the first three characters of the hash. There are limited ways of using this information to reverse engineer the ID numbers without generating a list of every ID number and its hash. Even then, with only 4096 three-digit hashes and 1,28 billion (realistic) ID numbers possible, this means that approximately 310 000 ID numbers would likely generate a hash with the same first 3 characters. So even if we knew the truncated hash, that means that our ID number is one of 310 000 possible ID numbers that would reduce to the same hash. Our original ID number is a proverbial needle in a hay-stack with no way to tell what it is.

With most schools only having a few thousands of parents in their database, the three-character hash is an effective way to filter the number of possibilities to an acceptably low level.