tutorial

Introducing field masking on privacy streams

With the masked fields feature, we're introducing more ways to advance privacy in your data without sacrificing data utility.

by STRM.team on

Head over to the docs for the technical intro with code examples.

Better privacy and utility in data all at once

At STRM Privacy, we aim to help data teams strike a better balance between innovation and data privacy, and build with more confidence as a result.

As the next step in that direction, we have just released a new tool you can apply to your privacy streams: field masking.

Refresher: consent binding is a major leap

When dealing with privacy in data, we often see that processing is limited to obfuscation, or that privacy is treated as an access problem and nothing more. We believe that is a very limited view of data privacy [1], often a missed opportunity, and a legal risk if “solved” that way.

That is why we approach it one level deeper: we bind consent to every event, so data is processed before it is consumed. This has many advantages beyond compliance, such as handling consent that changes over time and making the right to be forgotten (RTBF) a cheap operation (just throw away the encryption keys!).
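To make the “throw away the keys” idea concrete, here is a minimal Python sketch of that pattern (sometimes called crypto-shredding). Everything in it is an assumption for illustration: `ToyKeyStore` and its methods are hypothetical names, and the XOR pad only stands in for a real cipher. It is not how STRM Privacy implements encryption.

```python
import secrets


class ToyKeyStore:
    """Hypothetical in-memory key store; the XOR pad is a stand-in for a real cipher."""

    def __init__(self):
        self._pads = {}     # key_id -> random pad
        self._key_ids = {}  # user_id -> key_id

    def encrypt_id(self, user_id):
        data = user_id.encode("utf-8")
        key_id = self._key_ids.get(user_id)
        if key_id is None:
            # First event for this user: create a key and remember who owns it.
            key_id = secrets.token_hex(8)
            self._key_ids[user_id] = key_id
            self._pads[key_id] = secrets.token_bytes(len(data))
        pad = self._pads[key_id]
        return key_id, bytes(a ^ b for a, b in zip(data, pad))

    def decrypt_id(self, key_id, ciphertext):
        pad = self._pads[key_id]  # raises KeyError once the key is forgotten
        return bytes(a ^ b for a, b in zip(ciphertext, pad)).decode("utf-8")

    def forget(self, user_id):
        # RTBF: delete the key only; stored events stay where they are.
        key_id = self._key_ids.pop(user_id, None)
        if key_id is not None:
            del self._pads[key_id]


store = ToyKeyStore()
key_id, ciphertext = store.encrypt_id("customer-42")
print(store.decrypt_id(key_id, ciphertext))  # readable while the key exists

store.forget("customer-42")
try:
    store.decrypt_id(key_id, ciphertext)
except KeyError:
    print("key gone, ID can no longer be recovered")
```

The point of the design is that forgetting a user touches one small key record instead of every stored event.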

In the typical situations we see, a user gives a specific consent (e.g. “I consent to my data being used for marketing or personalization purposes”).

Based on this, we split the events into a privacy stream for every consent level, with the appropriate processing applied on a per-field basis. If a user does not consent, the link to that specific person is destroyed, and they show up as different users over time inside your data. If a user is OK with being identified, the customer ID is decrypted inside the privacy stream and ready to use.
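One way to read that split is sketched below in Python. The function `derive_privacy_stream` and the field names `customerId` and `consentLevels` are hypothetical, and replacing the ID with a random token per event is a simplification chosen for illustration, not a description of the actual STRM Privacy processing.

```python
import secrets


def derive_privacy_stream(events, level):
    """Return copies of the events as seen at one consent level."""
    stream = []
    for event in events:
        record = dict(event)  # copy, so the source event is untouched
        if level not in event["consentLevels"]:
            # No consent for this level: swap in a fresh random token per
            # event, so events from the same person can no longer be linked.
            record["customerId"] = secrets.token_hex(8)
        stream.append(record)
    return stream


events = [
    {"customerId": "customer-1", "consentLevels": [1, 2], "page": "/home"},
    {"customerId": "customer-2", "consentLevels": [1], "page": "/home"},
    {"customerId": "customer-2", "consentLevels": [1], "page": "/cart"},
]

# Level 2 is assumed here to mean "personalization".
personalization = derive_privacy_stream(events, level=2)
for record in personalization:
    print(record["customerId"], record["page"])
```

In this sketch, customer-1 keeps a usable ID, while the two events from customer-2 appear to come from two unrelated users.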

This is already a major leap in data privacy for many. However, many use cases require retaining a consistent link (at the user level) between events over time, without any need to expose the user. Or maybe you just want to go the extra mile to protect your users.

The extra mile: retain the user link, mask the identity

Imagine you are building a recommender system, and a large percentage of your users consent to personalization purposes (which requires you to retain the link to a user over time). This would expose the customer ID to any application and employee allowed to access the personalization privacy stream, like the scientists and engineers building the recommender. If you destroy the link to a user, you destroy the utility of the data for such a system. But the teams don’t need to know a user’s identity, just that it was a single person.

This extends beyond personalization, of course: in finance you may want to analyze across accounts or credit cards. In healthcare you could predict care needs. Online marketing is all about increasing the reach and effectiveness of your spend, not tracking single users for the sake of it. In personal mobility you may want to optimize logistics and predict that you need six scooters at 8am every Monday morning near the corner of Amsterdam and Rue de Parme.

All of these use cases require you to link a single user, but not to expose their identity.

Apply masking to stream decryption to link the user but hide their identity

With the new field masking, you can specify which fields should be masked before they are exposed. This gives you a value that is consistent over time, hides the user, and retains the pattern.
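The idea of a consistent value that hides the user can be sketched in a few lines of Python. This is an illustration under stated assumptions: the masking algorithm used by STRM Privacy is not described here, so HMAC-SHA256 keyed with the seed is swapped in as a stand-in, and `mask_value` and `mask_event` are hypothetical helpers. The field names are borrowed from the demo event contract.

```python
import hashlib
import hmac
from collections import Counter


def mask_value(value, seed):
    """Keyed hash: the same value and seed always give the same output."""
    return hmac.new(seed.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_event(event, masked_fields, seed):
    """Return a copy of the event with the listed fields masked."""
    masked = dict(event)
    for field in masked_fields:
        if masked.get(field) is not None:
            masked[field] = mask_value(str(masked[field]), seed)
    return masked


seed = "example-seed-not-for-production"  # assumption: a real seed is random and secret
events = [
    {"uniqueIdentifier": "unique-5", "notSensitiveValue": "clicked"},
    {"uniqueIdentifier": "unique-7", "notSensitiveValue": "viewed"},
    {"uniqueIdentifier": "unique-5", "notSensitiveValue": "purchased"},
]

masked_events = [mask_event(e, ["uniqueIdentifier"], seed) for e in events]

# The link between events survives: we can still count events per (masked) user.
events_per_user = Counter(e["uniqueIdentifier"] for e in masked_events)
print(sorted(events_per_user.values()))  # → [1, 2]
```

A recommender team working on `masked_events` can still tell that two events belong to one person, without ever seeing `unique-5`.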

Masking a field is done through our CLI (console will follow later). In defining a privacy stream, you can set which fields should be masked. This is what defining masked fields in our privacy demo event contract looks like:

❯ strm create stream example
❯ strm create stream --derived-from example \
--levels 3 --masked-fields \
strmprivacy/example/1.3.0:uniqueIdentifier,notSensitiveValue,someSensitiveValue,consistentValue \
--mask-seed=hi-there
❯ strm create stream --derived-from example --levels 3

Given a unique ID unique-5, this masks that ID in every event, so it appears in the privacy stream example-M3 as the same masked value every time:

"uniqueIdentifier": "1083e8169d7138e990cc30095578452"

Note that you have to set a seed value yourself. Make sure you treat it as you would an API key, e.g. generate it randomly and then store it away from users in a secret manager. We expect data engineers to prepare this for their teams, often from a central point. That’s why we chose to make the seed user-defined.
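As a rough sketch of that advice, the Python snippet below generates a random seed and reads it back from the environment rather than hard-coding it. The helper names and the `STRM_MASK_SEED` environment variable are assumptions made for this example; how the value gets from your secret manager into the environment depends on your setup.

```python
import os
import secrets


def generate_mask_seed(num_bytes=32):
    """Create a random seed from the OS's cryptographic random source."""
    return secrets.token_hex(num_bytes)


def load_mask_seed(env_var="STRM_MASK_SEED"):
    """Read the seed from the environment (hypothetical variable name)."""
    seed = os.environ.get(env_var)
    if not seed:
        raise RuntimeError(f"{env_var} is not set; fetch the seed from your secret manager first")
    return seed


seed = generate_mask_seed()
print(len(seed))  # 64 hex characters for 32 random bytes
```

The loaded value could then be passed to the `--mask-seed` flag shown earlier, instead of a guessable string.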

The full technical overview is described in our docs.

We look forward to hearing how you use the masked fields, and any implementation or feature requests you might have!

[1] We should emphasize this does not help you evade the need for consent (and it’s definitely not intended as such). “Identification” is not limited to singular IDs (be it a customer ID, some sort of UUID, a cookie, etc.). Patterns of behavior can very well lead to identification based on just the string of events itself. Imagine you capture sensor data of braking and turning the wheel in automotive. Even without a user ID it would take just two days outside of a weekend (at least before the Metaverse 😝) to understand where someone lives and works! Get in touch if you need help with the implications for your business and use case.

PS We’re hiring!

Want to work on features like field masking to help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?
