
In this post, we’ll show how you can integrate STRM’s privacy streams and privacy transformations with (native!) role-based access controls and foreign keys inside data warehouse solutions.
In short, this brings STRM privacy streams (which are localized, purpose-bound and use case specific data interfaces) to data warehousing (centralized + use case agnostic). There’s more background here.
On the fly decryption
So you have your records processed and transformed through STRM, and the encryption keys (the key stream
) are available in your databases. What’s next? In the following steps we’re going to show how to bring back the original plaintext data in Snowflake.
I’m using the UCI online retail dataset that we processed (described in this blogpost)
The processed data exist in this publicly accessible Gcloud Bucket. You must have any google account to view this bucket, although you could use any http client to download individual objects in the bucket without authentication.
Prepare
- make sure you have a Snowflake account, that helps a lot :-)
- download these two csv files
- download decrypter-1.0.0-all.jar that contains the thin wrapper around Google Tink that we wrote (github).
Execute these SQL commands (either in snowsql
or the web interface).
Load data
First create the tables. These map the shape of the uci dataset and the keys exports (from batch jobs)
create or replace TABLE "encrypted" (
"InvoiceNo" VARCHAR(500),
"StockCode" VARCHAR(500),
"Description" VARCHAR(500),
"Quantity" VARCHAR(500),
"InvoiceDate" VARCHAR(500),
"UnitPrice" VARCHAR(500),
"CustomerID" VARCHAR(500),
"Country" VARCHAR(500),
"strmMeta.eventContractRef" VARCHAR(500),
"strmMeta.nonce" VARCHAR(500),
"strmMeta.timestamp" VARCHAR(500),
"strmMeta.keyLink" VARCHAR(500),
"strmMeta.billingId" VARCHAR(500),
"strmMeta.consentLevels" VARCHAR(500)
);
create or replace TABLE "keys" (
"keyLink" VARCHAR(50),
"encryptionKey" VARCHAR(500)
);
Load the data with an ad-hoc Snowflake stage. The query assumes you’ve downloaded the files from the Prepare section and are using snowsql
from the directory that those files are stored in.
put file://encrypted-44MB.csv @~;
copy into "encrypted" from '@~/encrypted-44MB.csv.gz'
file_format=(type=CSV field_optionally_enclosed_by = '"' skip_header=1);
put file://keys.csv @~;
copy into "keys" from '@~/keys.csv.gz'
file_format=(type=CSV field_optionally_enclosed_by = '"' skip_header=1);
Have a look at the data
select * from "encrypted" limit 3;
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | strmMeta.eventContractRef | strmMeta.nonce | strmMeta.timestamp | strmMeta.keyLink | strmMeta.billingId | strmMeta.consentLevels |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | ASqc1Q0QalDEN+LHeyZSfGHE+s9Lqu4o+jM= | United Kingdom | strmprivacy/online-retail/1.0.0 | 0 | 1291188360000 | 0fd20015-40e4-484d-ab1f-182acff382ac | NULL | [2] |
536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | ASqc1Q0QalDEN+LHeyZSfGHE+s9Lqu4o+jM= | United Kingdom | strmprivacy/online-retail/1.0.0 | 0 | 1291188360000 | 0fd20015-40e4-484d-ab1f-182acff382ac | NULL | [2] |
536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | ASqc1Q0QalDEN+LHeyZSfGHE+s9Lqu4o+jM= | United Kingdom | strmprivacy/online-retail/1.0.0 | 0 | 1291188360000 | 0fd20015-40e4-484d-ab1f-182acff382ac | NULL | [2] |
select * from "keys" limit 2;
keyLink | encryptionKey |
---|---|
0fd20015-40e4-484d-ab1f-182acff382ac | {“primaryKeyId”:714921229,“key”:[{“keyData”:{"typeUr… |
1de9b1fc-9ad5-43a2-b501-af6933624f67 | {“primaryKeyId”:1369926497,“key”:[{“keyData”:{"typeUr… |
Create the decrypt UDF
This User Defined Function uses the compiled Kotlin code that contains the thin wrapper that we wrote around Google Tink (sources)[2].
CREATE OR REPLACE FUNCTION f_strm_decrypt(key VARCHAR, ciphertext VARCHAR) RETURNS VARCHAR
LANGUAGE JAVA
RETURNS NULL ON NULL INPUT
IMMUTABLE
IMPORTS=('@~/decrypter-1.0.0-all.jar')
HANDLER='Udf.decrypt'
AS
$$
import io.strmprivacy.aws.lambda.decrypter.Decrypt;
class Udf {
public static String decrypt(String key, String cipherText){
return Decrypt.INSTANCE.decrypt(key, cipherText);
}
}
$$;
And test it
select f_strm_decrypt('{"primaryKeyId":714921229,"key":[{"keyData":{"typeUrl":"type.googleapis.com/google.crypto.tink.AesSivKey","value":"EkBKGzBuy9C3UUmWaOzpe7NBEg6QK21FRhZ9MjuD5hpa0+hPJy0kn1HngA9QUT5aGbTNQQyow0V6qJCFoFRQNNTH","keyMaterialType":"SYMMETRIC"},"status":"ENABLED","keyId":714921229,"outputPrefixType":"TINK"}]}', 'ASqc1Q0QalDEN+LHeyZSfGHE+s9Lqu4o+jM=');
This should produce 17850 as output.
On-the-fly decryption
SELECT
f_strm_decrypt(k."encryptionKey", e."CustomerID") AS "decryptedCustomerID",
e."InvoiceNo", e."Description"
FROM
"encrypted" e
INNER JOIN "keys" k
ON e."strmMeta.keyLink" = k."keyLink"
LIMIT 10;
decryptedCustomerID | InvoiceNo | Description |
---|---|---|
17850 | 536365 | WHITE HANGING HEART T-LIGHT HOLDER |
17850 | 536365 | WHITE METAL LANTERN |
17850 | 536365 | CREAM CUPID HEARTS COAT HANGER |
17850 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE |
17850 | 536365 | RED WOOLLY HOTTIE WHITE HEART. |
17850 | 536365 | SET 7 BABUSHKA NESTING BOXES |
17850 | 536365 | GLASS STAR FROSTED T-LIGHT HOLDER |
17850 | 536366 | HAND WARMER UNION JACK |
17850 | 536366 | HAND WARMER RED POLKA DOT |
13047 | 536367 | ASSORTED COLOUR BIRD ORNAMENT |
Filtering on consent
First create a udf that figures out the maximum integer value from the strmMeta.consentLevels
column [3].
CREATE OR REPLACE FUNCTION f_json_max(json VARCHAR) RETURNS INT
LANGUAGE JAVA
RETURNS NULL ON NULL INPUT
IMMUTABLE
IMPORTS=('@~/decrypter-1.0.0-all.jar') -- only for the Gson library
HANDLER='Udf.maxFromJson'
AS
$$
import java.lang.reflect.Type;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.util.Collections;
import java.util.List;
class Udf {
static Gson gson = new Gson();
static Type token = new TypeToken<List<Integer>>(){}.getType();
public static Integer maxFromJson(String json){
try {
List<Integer> ints = gson.fromJson(json, token);
return Collections.max(ints);
}
catch (Exception e) {
return null;
}
}
}
$$;
SELECT
f_strm_decrypt(k."encryptionKey", e."CustomerID") AS "decryptedCustomerID",
e."InvoiceNo", e."Description",
f_json_max(e."strmMeta.consentLevels") as "maxConsent"
FROM
"encrypted" e
INNER JOIN "keys" k
ON e."strmMeta.keyLink" = k."keyLink"
WHERE "maxConsent" >= 2
LIMIT 10;
decryptedCustomerID | InvoiceNo | Description | maxConsent |
---|---|---|---|
17850 | 536365 | WHITE HANGING HEART T-LIGHT HOLDER | 2 |
17850 | 536365 | WHITE METAL LANTERN | 2 |
17850 | 536365 | CREAM CUPID HEARTS COAT HANGER | 2 |
17850 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE | 2 |
17850 | 536365 | RED WOOLLY HOTTIE WHITE HEART. | 2 |
17850 | 536365 | SET 7 BABUSHKA NESTING BOXES | 2 |
17850 | 536365 | GLASS STAR FROSTED T-LIGHT HOLDER | 2 |
17850 | 536366 | HAND WARMER UNION JACK | 2 |
17850 | 536366 | HAND WARMER RED POLKA DOT | 2 |
13047 | 536367 | ASSORTED COLOUR BIRD ORNAMENT | 2 |
Performance
SELECT
f_strm_decrypt(k."encryptionKey", e."CustomerID") AS "decryptedCustomerID",
e."InvoiceNo", e."Description",
f_json_max(e."strmMeta.consentLevels") as "maxConsent"
FROM
"encrypted" e
INNER JOIN "keys" k
ON e."strmMeta.keyLink" = k."keyLink"
WHERE "maxConsent" >= 2;
Executing this on a X-Small Virtual Warehouse executes 195000 rows in 9s so ca 22000 rows/s.
PS We’re hiring!
Want to work on features like query-time decryption for privacy streams to help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?
This file is a below the 50MB limit that exists for Snowflake ad-hoc loading stages. ↩︎
The trivial bit of inline Java has to do with the fact the Kotlin code doesn’t provide the zero-arguments constructor that the Snowflake interface requires. ↩︎
this UDF can probably be written in plain SQL, but I like this Java solution that leverages the decrypter file that we need anyway. ↩︎