
On-the-fly decryption in Apache Spark or Databricks


In this post, we’ll show how you can integrate STRM’s privacy streams and privacy transformations with (native!) role-based access controls and foreign keys inside Spark-based environments like Databricks.

In short, this brings STRM privacy streams (which are localized, purpose-bound and use case specific data interfaces) to data warehousing (centralized + use case agnostic). There’s more background here.

On-the-fly decryption

So you have your records processed and transformed through STRM, and the encryption keys (the key stream) are available in your databases. What’s next? In the following steps we’re going to show how to bring back the original plaintext data.

Apache Spark

Put decrypter-1.0.0-all.jar in the jars subdirectory of your Spark installation.
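For reference, the two common ways to make the jar available to Spark look like this (a sketch: it assumes `$SPARK_HOME` points at your Spark installation and the jar sits in your working directory):

```shell
# Copy the decrypter jar into the Spark installation,
# so every spark-shell session picks it up:
cp decrypter-1.0.0-all.jar "$SPARK_HOME/jars/"

# ...or, alternatively, supply it for a single session only:
spark-shell --jars decrypter-1.0.0-all.jar
```

The `--jars` route is handy on managed environments where you'd rather not touch the installation itself.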

Check decryption

spark-shell
...
import io.strmprivacy.aws.lambda.decrypter.Decrypt

val key="""{"primaryKeyId":714921229,"key":[{"keyData":{"typeUrl":"type.googleapis.com/google.crypto.tink.AesSivKey","value":"EkBKGzBuy9C3UUmWaOzpe7NBEg6QK21FRhZ9MjuD5hpa0+hPJy0kn1HngA9QUT5aGbTNQQyow0V6qJCFoFRQNNTH","keyMaterialType":"SYMMETRIC"},"status":"ENABLED","keyId":714921229,"outputPrefixType":"TINK"}]}"""
val cipher_text="ASqc1Q0QalDEN+LHeyZSfGHE+s9Lqu4o+jM="

Decrypt.INSTANCE.decrypt(key, cipher_text)
res0: String = 17850

Read encrypted data


import org.apache.spark.SparkFiles
val urlfile = "https://storage.googleapis.com/strm-batch-job-demo-eu/batch-demo/uci_online_retail/encrypted.csv"
spark.sparkContext.addFile(urlfile)
val encrypted = spark.read.option("header",true).option("inferSchema", true).csv(SparkFiles.get("encrypted.csv"))
encrypted.show(3)
InvoiceNo | StockCode | Description         | Quantity | InvoiceDate         | UnitPrice | CustomerID         | Country        | strmMeta.eventContractRef | strmMeta.nonce | strmMeta.timestamp | strmMeta.keyLink   | strmMeta.billingId | strmMeta.consentLevels
536365    | 85123A    | WHITE HANGING HEA…  | 6        | 2010-12-01 08:26:00 | 2.55      | ASqc1Q0QalDEN+LHe… | United Kingdom | strmprivacy/onlin…        | 0              | 1291188360000      | 0fd20015-40e4-484… | null               | [2]
536365    | 71053     | WHITE METAL LANTERN | 6        | 2010-12-01 08:26:00 | 3.39      | ASqc1Q0QalDEN+LHe… | United Kingdom | strmprivacy/onlin…        | 0              | 1291188360000      | 0fd20015-40e4-484… | null               | [2]
536365    | 84406B    | CREAM CUPID HEART…  | 8        | 2010-12-01 08:26:00 | 2.75      | ASqc1Q0QalDEN+LHe… | United Kingdom | strmprivacy/onlin…        | 0              | 1291188360000      | 0fd20015-40e4-484… | null               | [2]

Read the keys

spark.sparkContext.addFile(urlfile.replaceAll("encrypted", "keys"))
val keys = spark.read.option("header",true).option("inferSchema", true).option("escape","\"").csv(SparkFiles.get("keys.csv"))
keys.show(3)
val key = keys.take(1)(0).getAs[String]("encryptionKey")
Decrypt.INSTANCE.deterministicAead(key)

Decrypt on the fly

Prepare the UDF

val decrypt = (key: String, cipher_text: String) => { Decrypt.INSTANCE.decrypt(key, cipher_text) }
val decryptUdf = udf(decrypt)

And decrypt on the fly

encrypted.join(keys, encrypted("`strmMeta.keyLink`") === keys("keyLink"), "inner").
  withColumn("decryptedCustomerID", decryptUdf(col("encryptionKey"), col("CustomerID"))).
  select("keyLink", "CustomerID", "decryptedCustomerID").
  show(3)
keyLink            | CustomerID         | decryptedCustomerID
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850

Extract purpose


val maxConsent = (consentLevels:String) => { consentLevels.drop(1).dropRight(1).split(",").map( i => i.toInt ).max}
maxConsent("[1,7,8]")
res43: Int = 8
val maxConsentUdf = udf(maxConsent)
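The one-liner above assumes every row carries a well-formed, non-empty consent array. A slightly more defensive variant could look like this (the name `maxConsentSafe` and the `-1` fallback are our own choices for illustration, not part of STRM):

```scala
// Defensive variant of maxConsent: tolerates a null column value and an
// empty "[]" array by falling back to -1 ("no consent recorded").
val maxConsentSafe = (consentLevels: String) =>
  Option(consentLevels)
    .map(_.stripPrefix("[").stripSuffix("]").trim)
    .filter(_.nonEmpty)
    .map(_.split(",").map(_.trim.toInt).max)
    .getOrElse(-1)
```

Registered with udf(maxConsentSafe), it behaves identically on well-formed rows, but a single malformed row won't fail the whole job.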

Filter

encrypted.join(keys, encrypted("`strmMeta.keyLink`") === keys("keyLink"), "inner").
  withColumn("decryptedCustomerID", decryptUdf(col("encryptionKey"), col("CustomerID"))).
  withColumn("maxConsent", maxConsentUdf(col("`strmMeta.consentLevels`"))).
  where("maxConsent > 1").
  select("keyLink", "CustomerID", "decryptedCustomerID", "maxConsent").
  show(3)
keyLink            | CustomerID         | decryptedCustomerID | maxConsent
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850               | 2
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850               | 2
0fd20015-40e4-484… | ASqc1Q0QalDEN+LHe… | 17850               | 2

And if you had maxConsent > 3 in the where clause, you wouldn't see any data at all, since the highest consent level present in this data set is 2.

PS We’re hiring!

Want to work on features like on-the-fly decryption to help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?

Decrease risk and cost, increase speed: encode privacy inside data with STRM.