On-the-fly decryption in AWS Redshift

In this post, we’ll show how you can integrate STRM’s privacy streams and privacy transformations with (native!) role-based access controls and foreign keys inside data warehouse solutions.

In short, this brings STRM privacy streams (which are localized, purpose-bound, use-case-specific data interfaces) to data warehousing (centralized and use-case-agnostic). There’s more background here.

On-the-fly decryption

So you have your records processed and transformed through STRM, and the encryption keys (the key stream) are available in your databases. What’s next? In the following steps we’re going to show how to bring back the original plaintext data in Redshift.

AWS Redshift

AWS Redshift provides SQL access to tables loaded from CSV files, but the integration and on-the-fly decryption are less trivial than in BigQuery, for the following reasons:

  1. There’s no schema auto-detection, which means you have to tell Redshift the types of your CSV columns.
  2. There is no built-in support for AEAD cryptographic functions.
  3. SQL UNNEST functions are not available, so parsing the JSON array in strmMeta.consentLevels is non-trivial.

Redshift User Defined Functions (UDFs)

One can add arbitrary UDFs to Redshift via AWS Lambda. We’ve created one in Kotlin, put its source on github, and put the resulting artifact that is required for the Lambda here on S3.
We’ve written this separate blogpost describing in detail how to make the f_strm_decrypt function available on your Redshift instance.
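
For orientation only: once the Lambda function is deployed, registering it in Redshift comes down to a single CREATE EXTERNAL FUNCTION statement, roughly like the sketch below. The signature is inferred from how f_strm_decrypt is used later in this post, and the Lambda function name and IAM role ARN are placeholders; the separate blogpost covers the actual setup.

CREATE EXTERNAL FUNCTION f_strm_decrypt (varchar, varchar)
RETURNS varchar
IMMUTABLE
-- placeholders: the name of your deployed Lambda function and a role Redshift can assume to invoke it
LAMBDA 'strm-decrypt-lambda'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-lambda-invoke-role>';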

Creating the tables

In Redshift, we need to create the tables (including column definitions) before we can import CSV files.

CREATE TABLE "keys" (keyLink VARCHAR(50), encryptionKey VARCHAR(500));

I’m using the UCI online retail dataset that we anonymized in this blogpost.

CREATE TABLE "encrypted" (
"InvoiceNo" VARCHAR(500),
"StockCode" VARCHAR(500),
"Description" VARCHAR(500),
"Quantity" VARCHAR(500),
"InvoiceDate" VARCHAR(500),
"UnitPrice" VARCHAR(500),
"CustomerID" VARCHAR(500),
"Country" VARCHAR(500),
"strmMeta.eventContractRef" VARCHAR(500),
"strmMeta.nonce" VARCHAR(500),
"strmMeta.timestamp" VARCHAR(500),
"strmMeta.keyLink" VARCHAR(500),
"strmMeta.billingId" VARCHAR(500),
"strmMeta.consentLevels" VARCHAR(500)
);

Copying CSV files

The resulting CSV files from our Batchjobs and Data Subjects blogpost are publicly available in an AWS S3 bucket.

Important: Make sure you change the IAM role to one that fits your Redshift cluster. See here for details. I’ve left the role I used in the calls below so you can see its typical format.

Copy encryption keys into your keys table.

COPY keys FROM 's3://strm-batch-demo/uci_online_retail/keys.csv'
IAM_ROLE 'arn:aws:iam::079646392542:role/strm-udf-test-2'
REGION 'eu-west-1' FORMAT CSV
IGNOREHEADER 1;

And encrypted data points into your encrypted table.

COPY encrypted FROM 's3://strm-batch-demo/uci_online_retail/encrypted.csv'
IAM_ROLE 'arn:aws:iam::079646392542:role/strm-udf-test-2'
REGION 'eu-west-1' FORMAT CSV
IGNOREHEADER 1;
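
A quick sanity check (a sketch, not part of the original walkthrough) confirms that both COPY commands loaded rows; if a COPY fails, stl_load_errors usually tells you why.

SELECT
  (SELECT count(*) FROM keys) AS key_rows,
  (SELECT count(*) FROM encrypted) AS encrypted_rows;

-- most recent load errors, if any
SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 5;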

Have a quick look at your data points and note the encryption of the customerid column [1].

SELECT * FROM encrypted LIMIT 2;
invoiceno | stockcode | description | quantity | invoicedate | unitprice | customerid | country | strmmeta.eventcontractref | strmmeta.nonce | strmmeta.timestamp | strmmeta.keylink | strmmeta.billingid | strmmeta.consentlevels
562211 | 22268 | DECORATION SITTING BUNNY | 48 | 2011-08-03 13:22:00 | 0.19 | AUNMuvUJ8Hf2u9O/HDOUn1tem2YEeOXNLHQ= | United Kingdom | strmprivacy/online-retail/1.0.0 | 0 | 1312370520000 | e12ec990-7112-4a10-8f72-14234602782b |  | [2]
562211 | 22266 | EASTER DECORATION HANGING BUNNY | 64 | 2011-08-03 13:22:00 | 0.19 | AUNMuvUJ8Hf2u9O/HDOUn1tem2YEeOXNLHQ= | United Kingdom | strmprivacy/online-retail/1.0.0 | 0 | 1312370520000 | e12ec990-7112-4a10-8f72-14234602782b |  | [2]

On-the-fly decryption

Note that the f_strm_decrypt() function in the query below is the AWS Lambda User Defined Function that we described in this blogpost. Make sure that is working first!

SELECT
  e."strmMeta.keyLink",
  f_strm_decrypt(k.encryptionKey, e.customerId) AS decryptedCustomerID,
  e.CustomerId,
  e.invoiceno, e.Description
FROM
  "encrypted" e
INNER JOIN keys k
  ON e."strmMeta.keyLink" = k.keyLink
LIMIT 100;

This query takes between 10 and 60 seconds to run, depending on:

  1. whether the Lambda needs a cold start
  2. whether the AEAD decryption primitives are already cached.

You cannot run the query in the query editor without a LIMIT clause because the result is too large [2]. You have to use UNLOAD to store your results in an S3 bucket.

UNLOAD('SELECT
e."strmMeta.keyLink",
f_strm_decrypt(k.encryptionKey, e.customerId) AS decryptedCustomerID,
e.CustomerId,
e.invoiceno, e.Description
FROM
"encrypted" e
INNER JOIN keys k
ON e."strmMeta.keyLink" = k.keyLink'
)
TO 's3://strm-batch-demo/uci_online_retail/decrypted-redshift.csv'
IAM_ROLE 'arn:aws:iam::079646392542:role/strm-udf-test-2'
REGION 'eu-west-1'
FORMAT CSV;
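
Note that by default UNLOAD writes multiple file parts without a header row. If a single, header-prefixed CSV is more convenient, the HEADER and PARALLEL OFF options can be appended; a sketch with placeholder bucket and role:

UNLOAD('SELECT e."strmMeta.keyLink",
  f_strm_decrypt(k.encryptionKey, e.customerId) AS decryptedCustomerID,
  e.CustomerId, e.invoiceno, e.Description
  FROM "encrypted" e
  INNER JOIN keys k ON e."strmMeta.keyLink" = k.keyLink')
TO 's3://<your-bucket>/uci_online_retail/decrypted-redshift.csv'
IAM_ROLE '<your-redshift-role-arn>'
REGION 'eu-west-1'
FORMAT CSV
HEADER        -- write a header row in each file
PARALLEL OFF; -- write a single file (up to 6.2 GB) instead of one part per slice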

With a warm start of the Lambda function, this query takes between 1 and 2 minutes.

Performance

The query of the full decrypted table took 1’45" for 500,000 rows, so ca. 5,000 rows per second. Query times vary.

Materialized Views

With this relatively slow performance (compared to BigQuery), a materialized view is the way to go!

More importantly, (materialized) views allow your organization to provide decrypted datasets only to those with a right-to-know, without making the decryption keys widely available in your organization.

Some columns

CREATE MATERIALIZED VIEW decrypted_some_columns
AUTO REFRESH YES
AS SELECT
  e."strmMeta.keyLink",
  f_strm_decrypt(k.encryptionKey, e.customerId) AS decryptedCustomerID,
  e.CustomerId,
  e.invoiceno, e.Description
FROM "encrypted" e
INNER JOIN keys k
  ON e."strmMeta.keyLink" = k.keyLink;

Performance is much better:

SELECT count(*) FROM "dev"."public"."decrypted_some_columns";

The results were available in 70ms.
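
This is also where the right-to-know argument from above becomes concrete: grant access to the decrypted view only to the people who need it, and keep the key material out of reach. A minimal sketch, assuming a hypothetical analysts group:

-- hypothetical group; adjust to your own users and groups
GRANT SELECT ON decrypted_some_columns TO GROUP analysts;
-- and make sure the same group cannot read the key material or the raw ciphertext
REVOKE ALL ON keys, encrypted FROM GROUP analysts;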

Including consent filtering

When it comes to JSON, Redshift is somewhat primitive. I’m quoting from this blogpost:

You need to figure out how to deal with that nasty JSON array living in the varchar(max) field you’re staring at. You’ve come to the right place. Redshift’s lack of an unnest, or flatten, function is a little frustrating given that Amazon’s other columnar SQL products, Athena and Spectrum, both have the ability to deal with arrays natively. Why no love for Redshift? Enough griping. Let’s write some SQL.

We found that by far the easiest solution to this limitation is a Python User Defined Function that finds the maximum value in a JSON array.

CREATE OR REPLACE FUNCTION f_json_max (j varchar) RETURNS int IMMUTABLE AS $$
def json_max(j):
    # maximum value in a JSON array; NULL for empty or invalid input
    import json
    try:
        return max(json.loads(j))
    except:
        return None
return json_max(j)
$$ LANGUAGE plpythonu;
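
A quick check (a sketch) that the function behaves as intended before baking it into a view:

SELECT f_json_max('[0, 1, 3]');  -- 3
SELECT f_json_max('not json');   -- NULL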

And the materialized view decrypted_consent_ge_2, which filters on the computed max_consent in an outer query: [3]

CREATE MATERIALIZED VIEW decrypted_consent_ge_2 AS
SELECT
  *
FROM (
  SELECT
    e.*,
    f_strm_decrypt(k.encryptionKey, e.CustomerId) AS decryptedCustomerID,
    f_json_max(e."strmMeta.consentLevels") AS max_consent
  FROM
    "encrypted" AS e
  INNER JOIN
    "keys" AS k
  ON
    e."strmMeta.keyLink" = k.keyLink
) AS sub
WHERE
  max_consent >= 2;

Let’s try a little query on the view:

SELECT customerId, count(*) AS ct FROM decrypted_consent_ge_2
GROUP BY customerId, invoiceNo ORDER BY ct DESC LIMIT 3;

The query took ca. 4 seconds.

PS We’re hiring!

Want to work on features like query-time decryption for privacy streams to help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?


  1. The value is actually a base64-encoded binary value. ↩︎

  2. Unlike BigQuery. ↩︎

  3. Consent greater than or equal to 2; hence the ge_2 suffix. ↩︎
