
In this post, we’ll show how you can integrate STRM’s privacy streams and privacy transformations with (native!) role-based access controls and foreign keys inside Spark-based environments like Databricks.
In short, this brings STRM privacy streams (which are localized, purpose-bound and use case specific data interfaces) to data warehousing (centralized + use case agnostic). There’s more background here.
On-the-fly decryption
So you have your records processed and transformed through STRM, and the encryption keys (the key stream) are available in your databases. What’s next? In the following steps, we’ll show how to recover the original plaintext data.
Apache Spark
Put decrypter-1.0.0-all.jar in the jars subdirectory of your Spark installation.
Check decryption
spark-shell
...
import io.strmprivacy.aws.lambda.decrypter.Decrypt
val key="""{"primaryKeyId":714921229,"key":[{"keyData":{"typeUrl":"type.googleapis.com/google.crypto.tink.AesSivKey","value":"EkBKGzBuy9C3UUmWaOzpe7NBEg6QK21FRhZ9MjuD5hpa0+hPJy0kn1HngA9QUT5aGbTNQQyow0V6qJCFoFRQNNTH","keyMaterialType":"SYMMETRIC"},"status":"ENABLED","keyId":714921229,"outputPrefixType":"TINK"}]}"""
val cipher_text="ASqc1Q0QalDEN+LHeyZSfGHE+s9Lqu4o+jM="
Decrypt.INSTANCE.decrypt(key, cipher_text)
res0: String = 17850
Read encrypted data
import org.apache.spark.SparkFiles
val urlfile = "https://storage.googleapis.com/strm-batch-job-demo-eu/batch-demo/uci_online_retail/encrypted.csv"
spark.sparkContext.addFile(urlfile)
val encrypted = spark.read.option("header",true).option("inferSchema", true).csv(SparkFiles.get("encrypted.csv"))
encrypted.show(3)
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | strmMeta.eventContractRef | strmMeta.nonce | strmMeta.timestamp | strmMeta.keyLink | strmMeta.billingId | strmMeta.consentLevels |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEA… | 6 | 2010-12-01 08:26:00 | 2.55 | ASqc1Q0QalDEN+LHe… | United Kingdom | strmprivacy/onlin… | 0 | 1291188360000 | 0fd20015-40e4-484… | null | [2] |
536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | ASqc1Q0QalDEN+LHe… | United Kingdom | strmprivacy/onlin… | 0 | 1291188360000 | 0fd20015-40e4-484… | null | [2] |
536365 | 84406B | CREAM CUPID HEART… | 8 | 2010-12-01 08:26:00 | 2.75 | ASqc1Q0QalDEN+LHe… | United Kingdom | strmprivacy/onlin… | 0 | 1291188360000 | 0fd20015-40e4-484… | null | [2] |
Read the keys
spark.sparkContext.addFile(urlfile.replaceAll("encrypted", "keys"))
val keys = spark.read.option("header",true).option("inferSchema", true).option("escape","\"").csv(SparkFiles.get("keys.csv"))
keys.show(3)
val key = keys.take(1)(0).getAs[String]("encryptionKey")
Decrypt.INSTANCE.deterministicAead(key)
Decrypt on the fly
Prepare the UDF
val decrypt = (key: String, cipher_text: String) => { Decrypt.INSTANCE.decrypt(key, cipher_text) }
val decryptUdf = udf(decrypt)
And decrypt on the fly
encrypted.join(keys, encrypted("`strmMeta.keyLink`")===keys("keyLink"), "inner").
withColumn("decryptedCustomerID", decryptUdf(col("encryptionKey"),col("CustomerID"))).
select("keyLink", "CustomerID", "decryptedCustomerID").
show(3)
keyLink | CustomerID | decryptedCustomerID |
---|---|---|
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850 |
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850 |
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850 |
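To try the join-plus-UDF pattern without the STRM jar or the demo files, the sketch below may help. It is only an illustration: toyEncrypt and toyDecrypt are hypothetical stand-ins that use Base64 (not encryption) and ignore the key, and the table contents, link names and key names are made up. The column names mirror the ones used above.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// In spark-shell a session already exists; getOrCreate() simply returns it.
val session = SparkSession.builder().master("local[*]").appName("udf-join-sketch").getOrCreate()
import session.implicits._

// Hypothetical stand-ins for the real decrypter: Base64 only, the key is ignored.
val toyEncrypt = (key: String, plain: String) =>
  Base64.getEncoder.encodeToString(plain.getBytes(UTF_8))
val toyDecrypt = (key: String, cipherText: String) =>
  new String(Base64.getDecoder.decode(cipherText), UTF_8)
val toyDecryptUdf = udf(toyDecrypt)

// Two tiny tables shaped like the data above: events carry a key link, keys carry the key.
val events = Seq(
  ("536365", toyEncrypt("key-1", "17850"), "link-1"),
  ("536366", toyEncrypt("key-2", "13047"), "link-2")
).toDF("InvoiceNo", "CustomerID", "strmMeta.keyLink")
val keyTable = Seq(("link-1", "key-1"), ("link-2", "key-2")).toDF("keyLink", "encryptionKey")

// Join each event to its key, then apply the UDF row by row.
val decrypted = events.
  join(keyTable, events("`strmMeta.keyLink`") === keyTable("keyLink"), "inner").
  withColumn("decryptedCustomerID", toyDecryptUdf(col("encryptionKey"), col("CustomerID"))).
  select("keyLink", "decryptedCustomerID")

// Collect into a Map so the result is easy to inspect.
val result = decrypted.collect().map(r => r.getString(0) -> r.getString(1)).toMap
```

Swapping toyDecrypt for the real decrypt call gives the pipeline shown above; the join and the UDF wiring stay the same.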
Extract purpose
val maxConsent = (consentLevels:String) => { consentLevels.drop(1).dropRight(1).split(",").map( i => i.toInt ).max}
maxConsent("[1,7,8]")
res43: Int = 8
val maxConsentUdf = udf(maxConsent)
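The one-liner above assumes a non-empty, whitespace-free list such as "[1,7,8]". If your data might contain an empty list, spaces or nulls, a more defensive variant could look like the sketch below. The names parseConsentLevels and maxConsentOpt are ours, not part of STRM, and returning an Option is just one possible design choice.

```scala
import scala.util.Try

// Parse a consent-level string such as "[1,7,8]" into a list of integers.
// Tolerates whitespace and "[]"; entries that are not integers are skipped.
def parseConsentLevels(raw: String): List[Int] =
  raw.trim.stripPrefix("[").stripSuffix("]")
    .split(",")
    .map(_.trim)
    .filter(_.nonEmpty)
    .flatMap(s => Try(s.toInt).toOption)
    .toList

// Highest consent level, or None when the input is null or the list is empty.
def maxConsentOpt(raw: String): Option[Int] =
  Option(raw).map(parseConsentLevels).filter(_.nonEmpty).map(_.max)
```

Wrapped with udf, an Option result should surface as a nullable column, so rows without consent levels would simply drop out of a comparison like maxConsent > 1.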
Filter
encrypted.join(keys, encrypted("`strmMeta.keyLink`")===keys("keyLink"), "inner").
withColumn("decryptedCustomerID", decryptUdf(col("encryptionKey"),col( "CustomerID"))).
withColumn("maxConsent", maxConsentUdf(col("`strmMeta.consentLevels`"))).
where("maxConsent > 1").
select("keyLink", "CustomerID", "decryptedCustomerID","maxConsent").
show(3)
keyLink | CustomerID | decryptedCustomerID | maxConsent |
---|---|---|---|
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850 | 2 |
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850 | 2 |
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850 | 2 |
And if you used > 3 in the where clause instead, you wouldn’t see any data.
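To see the effect of the threshold in isolation, here is a plain-Scala sketch of the same filter, with no Spark involved. The Event case class, the aboveThreshold helper and the sample rows are hypothetical; only the maxConsent logic is taken from the snippet above.

```scala
// Hypothetical record shape holding only the fields this step touches.
final case class Event(customerId: String, consentLevels: String)

// Same logic as the maxConsent function above, repeated so the sketch stands alone.
def maxConsent(consentLevels: String): Int =
  consentLevels.drop(1).dropRight(1).split(",").map(_.toInt).max

// Mirrors where("maxConsent > threshold"): keep events strictly above the threshold.
def aboveThreshold(events: Seq[Event], threshold: Int): Seq[Event] =
  events.filter(e => maxConsent(e.consentLevels) > threshold)

// Made-up sample rows with different consent levels.
val sample = Seq(
  Event("cipher-a", "[2]"),
  Event("cipher-b", "[1,3]"),
  Event("cipher-c", "[1]")
)

val kept = aboveThreshold(sample, 1) // cipher-a and cipher-b pass
val none = aboveThreshold(sample, 3) // no event has a level above 3
```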
PS We’re hiring!
Want to work on features like on-the-fly decryption to help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?