tutorial

On-the-fly decryption in Google BigQuery

On the fly decryption in Google BigQuery

In this post, we’ll show how you can integrate STRM’s privacy streams and privacy transformations with (native!) role-based access controls and foreign keys inside data warehouse solutions.

In short, this brings STRM privacy streams (which are localized, purpose-bound and use case specific data interfaces) to data warehousing (centralized + use case agnostic). There’s more background here.

On the fly decryption

So you have your records processed and transformed through STRM, and the encryption keys (the key stream) are available in your databases. What’s next? In the following steps we’re going to show how to bring back the original plaintext data in BigQuery.

BigQuery

This SQL database runs on Google Cloud. Importing the csv files (from a local file or from a cloud bucket) is trivial by selecting auto-detect for the schema.
Make sure you have the Google cli tools for BigQuery (bq), cloud storage (gsutil) and Google Cloud (gcloud) installed.

Table structure

I’m using the UCI online retail dataset that we processed (described in this blogpost
The processed data exist in this publicly accessible Gcloud Bucket. You must have any google account to view this bucket, although you could use any http client to download individual objects in the bucket without authentication.

In the commands below I’m assuming you’ve set your default project:

gcloud config set project ...

Creating the BigQuery table from the csv data goes as follows:

Create a dataset

First you’ll create a BigQuery dataset in your own project.

bq --location EU mk --dataset on_the_fly_decryption

Import the UCI online retail csv data

We’re creating three tables: source, encrypted and keys.

bq --location EU load --source_format CSV --autodetect \
on_the_fly_decryption.source \
gs://strm-batch-job-demo-eu/batch-demo/uci_online_retail.csv

bq query 'select * from on_the_fly_decryption.source limit 3'
+---------+-----------+-------------+----------+---------------------+-------+-------------+----------------+
| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | CustomerID | Country |
+---------+-----------+-------------+----------+---------------------+-------+-------------+----------------+
| 536414 | 22139 | NULL | 56 | 2010-12-01 11:52:00 | 0.0 | NULL | United Kingdom |
| 536545 | 21134 | NULL | 1 | 2010-12-01 14:32:00 | 0.0 | NULL | United Kingdom |
| 536547 | 37509 | NULL | 1 | 2010-12-01 14:33:00 | 0.0 | NULL | United Kingdom |
+---------+-----------+-------------+----------+---------------------+-------+-------------+----------------+

bq --location EU load --source_format CSV --autodetect \
on_the_fly_decryption.encrypted \
gs://strm-batch-job-demo-eu/batch-demo/uci_online_retail/encrypted.csv

The keys table doesn’t correctly figure out the column names so we need to provide an explicit schema.

bq load --schema=keyLink:STRING,encryptionKey:STRING \
--skip_leading_rows=1 \
--source_format=CSV on_the_fly_decryption.keys \
gs://strm-batch-job-demo-eu/batch-demo/uci_online_retail/keys.csv

Show some table information

bq show on_the_fly_decryption.encrypted
Table stream-machine-development:on_the_fly_decryption.encrypted

Last modified Schema Total Rows Total Bytes
----------------- -------------------------------------- ------------ -----------
20 Sep 14:48:06 |- Invoice: string 541910 110039356
|- StockCode: string
|- Description: string
|- Quantity: integer
|- InvoiceDate: timestamp
|- Price: float
|- CustomerID: string
|- Country: string
|- strmMeta_eventContractRef: string
|- strmMeta_nonce: integer
|- strmMeta_timestamp: integer
|- strmMeta_keyLink: string
|- strmMeta_billingId: string
|- strmMeta_consentLevels: string


bq show on_the_fly_decryption.keys
Table stream-machine-development:on_the_fly_decryption.keys

Last modified Schema Total Rows Total Bytes
----------------- -------------------------- ------------ -----------
20 Sep 15:35:44 |- keyLink: string 19573 6693265
|- encryptionKey: string

Sample query

SELECT
e.strmMeta_keyLink,
DETERMINISTIC_DECRYPT_STRING(k.key, FROM_BASE64(e.CustomerID), '') AS decryptedCustomerID,
e.CustomerID,
e.InvoiceNo,
e.Description
FROM
on_the_fly_decryption.encrypted e
INNER JOIN (
SELECT
k.keyLink,
KEYS.KEYSET_FROM_JSON( k.encryptionKey ) key
FROM
on_the_fly_decryption.keys k) k
ON e.strmMeta_keyLink = k.keyLink
LIMIT 3;
strmMeta_keyLink	decryptedCustomerID	CustomerID	InvoiceNo	Description
23db6086-feb1-464f-90a2-a821e66944bc 12446 AQFLfmNUSsp2j3XdntW3UVgbA7DfEGgBcEw= 571035 RED RETROSPOT CUP
23db6086-feb1-464f-90a2-a821e66944bc 12446 AQFLfmNUSsp2j3XdntW3UVgbA7DfEGgBcEw= 571035 PINK POLKADOT PLATE
23db6086-feb1-464f-90a2-a821e66944bc 12446 AQFLfmNUSsp2j3XdntW3UVgbA7DfEGgBcEw= 571035 SET OF 4 KNICK KNACK TINS DOILY

Handling consent in the query.

The strmMeta_consentLevels is filled in from a list of integer values, but is presented as a json array string [1]

SELECT * FROM (
SELECT
e.strmMeta_keyLink,
DETERMINISTIC_DECRYPT_STRING(k.key, FROM_BASE64(e.CustomerID), '') AS decryptedCustomerID,
e.CustomerID,
( SELECT MAX(CAST(num AS Int64)) FROM
UNNEST(JSON_VALUE_ARRAY(e.strmMeta_consentLevels)) AS num ) max_consent
FROM
on_the_fly_decryption.encrypted e
INNER JOIN (
SELECT
k.keyLink,
KEYS.KEYSET_FROM_JSON( k.encryptionKey ) key
FROM
on_the_fly_decryption.keys k) k
ON e.strmMeta_keyLink = k.keyLink
)
WHERE max_consent >= 1
LIMIT 3;

Performance

I’ve tested with this query

SELECT
*
FROM (
SELECT
e.strmMeta_keyLink,
DETERMINISTIC_DECRYPT_STRING(k.key, FROM_BASE64(e.CustomerID), '') AS decryptedCustomerID,
e.CustomerID,
( SELECT MAX(CAST(num AS Int64)) FROM
UNNEST(JSON_VALUE_ARRAY(e.strmMeta_consentLevels)) AS num ) max_consent
FROM
on_the_fly_decryption.encrypted e
INNER JOIN (
SELECT
k.keyLink,
KEYS.KEYSET_FROM_JSON( k.encryptionKey ) key
FROM
on_the_fly_decryption.keys k) k
ON
e.strmMeta_keyLink = k.keyLink)
WHERE
max_consent >= 2;

The 541826 rows were processed in 9 seconds, so over 60000 records/second.

(Materialized) View

CREATE MATERIALIZED VIEW my_view AS
SELECT ...;

The most important aspect of a (materialized) view is that it allows your organization to provide decrypted datasets to only those having a right-to-know, and without making the decryption keys widely available in your organization.

PS We’re hiring!

Want to work on features like bringing STRM to Big Query to help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?


  1. the CSV format does not support lists in a column ↩︎

Decrease risk and cost, increase speed encode privacy inside data with STRM.