Batch Jobs and Data Subjects

How to work with Batch Jobs and the Datasubjects Service using the UCI online retail dataset

Reading time: about 20 minutes.

In the world of data privacy, working with existing data is an important use case. Data might have been collected years before privacy laws applied. It might lack important context, such as the original purpose under which it was collected. Or you might simply have existing and legacy workloads that ingest and process fresh data, written before a privacy lens was applied to the pipelines. At the same time, you have to comply with today’s privacy laws for that same data, and be able to fulfill important obligations like Data Subject Access Requests (which data do we have on a single individual?) or the Right To Be Forgotten (RTBF). With STRM Privacy, you can do exactly this: (re-)process data in batches and apply a data contract that classifies its privacy implications and transforms the data into a privacy-safe and usable form. Once processed with our platform, fulfilling obligations under GDPR becomes trivial through our Data Subjects interface.

Process data for privacy and fulfill GDPR obligations cheaply and effectively

In this tutorial, we will do just that. We’ll use STRM Privacy’s Batch Jobs to process the well-known UCI online retail dataset.

The processing consists of de-identifying Personally Identifiable Information (PII) fields via rotating encryption keys, and we’ll show how the Data Subjects API can be used to locate all the data points associated with a certain data subject (in the case of the UCI dataset, identified by the Customer ID).

For the scope of this post, the Data Subjects API provides all the information we need to handle GDPR Right To Be Forgotten (RTBF) requests. As we have processed everything with a rotating key mechanism, we can easily destroy (and so ‘forget’) data via crypto-shredding: simply delete the encryption keys whose links were returned by the Data Subjects API. A much cheaper operation than a traditional full table scan on the data itself with delete/modify operations!
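
The crypto-shredding idea can be sketched in a few lines of Python. This is a toy illustration with no real cryptography; the point is that destroying a subject's key makes their records permanently unreadable without touching the data files themselves:

```python
# Toy illustration of crypto-shredding (no real encryption here):
# each data subject's records are encrypted with their own key,
# referenced by a keylink. Deleting the key renders the ciphertext
# permanently unreadable, while the data files stay untouched.
keys = {"keylink-1": "key-for-12346", "keylink-2": "key-for-12348"}
records = [
    {"keyLink": "keylink-1", "customer": "<ciphertext>"},
    {"keyLink": "keylink-2", "customer": "<ciphertext>"},
]

def crypto_shred(keylink: str) -> None:
    """Forget a data subject by destroying their encryption key."""
    keys.pop(keylink, None)

def can_decrypt(record: dict) -> bool:
    """A record is readable only while its key still exists."""
    return record["keyLink"] in keys

crypto_shred("keylink-1")
print([can_decrypt(r) for r in records])  # [False, True]
```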

This blog post has an accompanying Jupyter Notebook that you can download from GitHub. The repository also contains all the CSV files produced by the processing, so you don’t even need to execute the steps yourself; we only reduced the size of the CSV files (from 500k to 20k rows).

If you are more interested in working with the result data, you can skip ahead to Working with the resulting data.

If you already know about Batch Job processing, you can skip ahead to the Data Subjects API section.


What you need

  1. An account on our platform to authenticate
  2. Access to AWS S3 or equivalent storage
  3. An up-to-date install of our CLI: `brew install strmprivacy/cli/strm` or `brew upgrade strm`
  4. Google’s Tink library to work with encrypted data

What we show in this blog post

  1. We create a data connector that defines access to a cloud bucket. Here we use AWS S3, but Google Cloud and Azure are also supported.
  2. We download the dataset, convert it to CSV, and store it in the cloud bucket associated with the data connector.
  3. We use a Data Contract that matches the column layout of the CSV, defines which PII attributes exist, and which column holds the data subject. We show how we would create this Data Contract, but it already exists as a public contract.
  4. We write the batch-job configuration file that:
    • defines the filenames of the various CSV files in one (or multiple) data connector(s)
    • defines how to parse timestamps in the input file
    • defines how to parse consent from the input file rows, or uses a fixed consent for all rows in the file (the latter is the case in this blog post)
    • defines which derived datasets to create
  5. We run the batch job.
  6. We observe some information about the results.
  7. We decrypt the encrypted data in the exported encrypted.csv file using the exported encryption keys in keys.csv.
  8. We show what information the Data Subjects API can provide.

Step 1. The data connector

This is a generic interface that defines access to a cloud bucket where batch jobs operate. See our documentation on how to create one. In this post I’ve created a data connector named s3-batch-demo that has access to a bucket with the same name.

I have actually mounted the S3 bucket on my local filesystem with s3fs into a local directory named s3 via s3fs s3-batch-demo s3, which means I can access these S3 objects as if they were files on my machine.

Step 2. Excel to CSV

The dataset is an Excel (xlsx) file that needs to be converted to CSV, because CSV is what STRM Privacy batch jobs support (at the time of writing).

I’ve created a Jupyter Notebook that deals with all the steps in this post. The relevant steps are:

import pandas as pd
df = pd.read_excel("https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx")
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['UnitPrice'] = df['UnitPrice'].astype(float)
# This only works because of the s3fs mounted filesystem!
df.to_csv("s3/uci_online_retail.csv", index=False)

The first two rows of the dataframe look like this:

  InvoiceNo StockCode                         Description  Quantity         InvoiceDate  UnitPrice  CustomerID         Country
0    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER         6 2010-12-01 08:26:00       2.55       17850  United Kingdom
1    536365     71053                 WHITE METAL LANTERN         6 2010-12-01 08:26:00       3.39       17850  United Kingdom

And a sanity check

head -2 s3/uci_online_retail.csv
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom

Step 3. The Data contract

We’ve already prepared the strmprivacy/online-retail/1.0.0 Data Contract. Its essence is that:

  • the schema that defines the shape of the data exactly matches the column headers of the CSV file
  • the PII fields of the contract are filled in (in this case only the CustomerID column)
  • the key field is also CustomerID, so this is what ties sequences of events together. We could also have used InvoiceNo; in that case a new encryption key would have been created for every new invoice. This would have made a difference in the processing of customers with multiple invoices on the same date.
  • the datasubject field is also CustomerID. This is the identifier that allows us to find all encryption keys created for processing this data subject.

This datacontract was created via our cli with the following commands:

strm create schema strmprivacy/online-retail/1.0.0 --definition=online-retail.json --public
strm activate schema strmprivacy/online-retail/1.0.0
strm create event-contract strmprivacy/online-retail/1.0.0 --public \
-F online-retail-contract.json -S strmprivacy/online-retail/1.0.0
strm activate event-contract strmprivacy/online-retail/1.0.0

Note: the sequential steps of creating a ‘schema’ first and then classifying via an ‘event-contract’ will soon change to the simpler strm create data-contract - but we’re not there yet.

Step 4. Configuring the STRM batch job

In order to define a batch job, we need to

  1. figure out the bucket locations and access; this is a matter of setting up the correct data connector
  2. figure out the timestamp format; this uses the Java time format
  3. determine the consent that was given, per batch job or even per record
  4. write the batch-job configuration file; this file has a JSON-Schema-defined format, so if your text editor knows about JSON Schema, it can use strmprivacy-batch-job-configuration from schemastore.org

The data-connector

Follow along with the documentation

The time format

The way we converted the Excel file to CSV determined the timestamp format, so make sure you look at the timestamps in the CSV file. A sample timestamp in the UCI CSV file is 2010-12-01 08:26:00, which suggests the format pattern yyyy-MM-dd HH:mm:ss. Since this timestamp has no timezone information, it’s necessary to add a default timezone[1].
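
Before submitting the job, you can sanity-check the pattern locally. The Java pattern yyyy-MM-dd HH:mm:ss corresponds to Python's %Y-%m-%d %H:%M:%S:

```python
from datetime import datetime

# Quick local check that the timestamp format matches what's in the CSV,
# before the batch job fails on row #1. The Java time pattern
# yyyy-MM-dd HH:mm:ss is %Y-%m-%d %H:%M:%S in Python's strptime.
sample = "2010-12-01 08:26:00"
parsed = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S")
print(parsed.year, parsed.hour)  # 2010 8
```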

If you get the format wrong, you’ll notice after the batch job has started:

strm list batch-jobs
... ERROR Invalid timestamp [Text '2010-12-01 08:26:00' could not be parsed at index 4] in row #1

In this case, your batch-job will remain; you’ll have to manually delete it with strm delete batch-job .... [2]

Defining data purpose

An important requirement for lawful data processing is that you obtained the data under a legitimate legal ground: the purpose. Such purposes can be legitimate interest or contractual obligations, or simply consent from a data subject.

To properly apply the privacy lens to your data, the batch job needs information about the purpose under which each record was collected. For now we’ll focus on explicit instructions, but purpose can also be derived via pattern matching (that’s for another post).

In the sample dataset the original data purpose is absent, so we provide a single bulk purpose for all the records in the whole file. Please note that the formal name “consent” should be read as “purpose” (of which consent is one type):

"consent": { "default_consent_levels": [ 2 ] },

After processing, every event/datapoint has a strmMeta.consentLevels value, which in this case contains [2]. This consent is used for any further processing in your organization, and from here on is irrevocably connected to this datapoint. The integers we assign are freely defined: whatever 2 means in your organization. See our documentation for more information about defining and mapping purpose.
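
Downstream processing can then gate on that value. A minimal sketch, where the event layout is illustrative rather than the exact CSV serialization:

```python
# Hypothetical sketch: keep only datapoints whose consent levels include
# the level required for this particular processing purpose. The event
# structure below is illustrative, not the exact export format.
events = [
    {"CustomerID": "17850", "strmMeta": {"consentLevels": [2]}},
    {"CustomerID": "13047", "strmMeta": {"consentLevels": []}},
]
REQUIRED_LEVEL = 2
usable = [e for e in events if REQUIRED_LEVEL in e["strmMeta"]["consentLevels"]]
print(len(usable))  # 1
```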

Step 5. Assemble the batch-job-config.json file

The batch job is configured with a JSON file containing all the variables the job looks for; its schema is available from Schemastore.org:

{
  "source_data": {
    // a connector to your cloud bucket
    "data_connector_ref": { "name": "s3-batch-demo" },
    "file_name": "uci_online_retail.csv",
    "data_type": { "csv": { "charset": "UTF-8" } }
  },
  // all rows are given these consent levels
  "consent": { "default_consent_levels": [ 2 ] },
  "encryption": {
    /* this defines a collection of encryption keys to be re-used for
       successive runs of batch jobs. If you leave this field out,
       it defaults to your project id. */
    "batch_job_group_id": "7824e975-20e1-4995-b129-2f9582728ca5"
  },
  "timestamp_config": {
    // the column that contains the timestamp of the data point
    "field": "InvoiceDate",
    // how to parse its contents
    "format": "yyyy-MM-dd HH:mm:ss",
    // a timezone in case the field does not have timezone information
    "default_time_zone": { "id": "Europe/Amsterdam" }
  },
  // the data contract that defines the event
  "event_contract_ref": { "handle": "strmprivacy", "name": "online-retail", "version": "1.0.0" },
  // this is where the encrypted events are exported
  "encrypted_data": {
    "target": {
      "data_connector_ref": { "name": "s3-batch-demo" },
      "data_type": { "csv": { "charset": "UTF-8" } },
      "file_name": "uci_online_retail/encrypted.csv"
    }
  },
  // all the encryption keys created or re-used in this batch job run
  "encryption_keys_data": {
    "target": {
      "data_connector_ref": { "name": "s3-batch-demo" },
      "data_type": { "csv": { "charset": "UTF-8" } },
      "file_name": "uci_online_retail/keys.csv"
    }
  },
  // a list of derived data, decrypted and/or masked
  "derived_data": [ {
    "target": {
      "data_connector_ref": { "name": "s3-batch-demo" },
      "data_type": { "csv": { "charset": "UTF-8" } },
      "file_name": "uci_online_retail/decrypted-0.csv"
    },
    "consent_levels": [ 2 ],
    "consent_level_type": "CUMULATIVE"
  } ]
}

Step 6. Run the batch job

And with all instructions defined, we’re ready to actually run our batch job!

strm create batch-job -F batch-job-config.json
# shows a uuid that you use to access its progress.
strm get batch-job <uuid>

I like to do the following to see how the batch job is progressing

watch "strm get batch-job <uuid> -o json  | jq '.batchJob.states  | last'"

which gives you near-realtime information on how the job is progressing. When finished, you should see something like

"stateTime": "2022-09-13T10:06:47.113Z",
"state": "FINISHED",
"message": "Processed 541909 records in 194 s. which is 2790 records/s."

Once the job has finished, you’ll see output files in your bucket. Remember, I used s3fs to mount the bucket into my local file system so I can just list the file tree as if it was on my local disk:

$> ls -lRh s3/
s3/:
total 88M
drwxr-x--- 1 bvdeenen bvdeenen 0 1 jan 1970 uci_online_retail
-rw-r--r-- 1 bvdeenen bvdeenen 46M 13 sep 15:36 uci_online_retail.csv

s3/uci_online_retail:
total 234M
-rw-r----- 1 bvdeenen bvdeenen 106M 14 sep 15:29 decrypted-0.csv
-rw-r----- 1 bvdeenen bvdeenen 122M 14 sep 15:29 encrypted.csv
-rw-r----- 1 bvdeenen bvdeenen 7.0M 14 sep 15:29 keys.csv

$> wc -l s3/uci_online_retail/encrypted.csv
541910 s3/uci_online_retail/encrypted.csv

We see that the encrypted and decrypted files are substantially bigger than the original file. This is because the encrypted strings are much longer than the original five-character CustomerID, and because of the added information in strmMeta. The good news is that CSV is a very inefficient format anyway: this kind of data is typically stored in a compressed, columnar format. Just a regular zip compression of the 106MB decrypted file brings it down to 11MB.
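
A quick illustration of why this compresses so well: rows full of long, repetitive strings (like encrypted values and strmMeta columns) shrink dramatically even under plain gzip. The data below is synthetic, just for the sketch:

```python
import csv
import gzip
import io

# Build an in-memory CSV with highly repetitive rows, mimicking the
# shape of the encrypted export (long repeated base64-like strings).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["InvoiceNo", "CustomerID"])
for i in range(10_000):
    writer.writerow([536365 + i, "AVeryLongEncryptedStringThatRepeatsALot=="])

raw = buf.getvalue().encode()
packed = gzip.compress(raw)
print(len(packed) < len(raw) / 5)  # True: well over 5x smaller
```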

Working with the resulting data in an example notebook

The world’s favorite interface to data is still a notebook, so we follow along with the published Jupyter Notebook. Download the repository (git clone or download repository as zip file). Start the notebook [3]:

jupyter notebook

and in the opened web page click on the file named Batch_Jobs_and_Datasubjects_blogpost.ipynb. You should be able to execute Cell → Run All.

Decrypting the encrypted CSV file

Our configuration was successful. Our batch job ran. I could retrieve the data. And now I’m looking at… tons of encrypted strings. Let’s fix that. But first:

This is the heart of what STRM Privacy is about. Data in decrypted form is readable, even when it shouldn’t be under GDPR. Blind trust, or even access control, is not sufficient to comply with privacy laws. So we will show how to decrypt, but by the principle of data minimization you should only use (partially) decrypted sets for specific processing purposes. In other words, these decrypted files can become toxic waste in your organization if you keep them.

The general idea is that you keep decrypted data only as long as you need it for a specific goal (like training a model, or providing a personalized recommendation), and then discard it within one month (the maximum GDPR RTBF processing time). Using the combination of encrypted data and on-the-fly decryption in common databases (like GCP BigQuery, AWS Athena, etc.) is the preferred way to work with sensitive data.

The essence of decrypting an encrypted field value is:

  1. look up the keyLink value in the strmMeta.keyLink column
  2. find the associated encryption key. In a batch job this is typically exported to a keys.csv file, but in production the database that holds the keys will likely be part of your production system. We strongly suggest physically separating the key table from the data itself
  3. execute a Google Tink method to decrypt.
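
Steps 1 and 2 amount to a join between the data export and the key export on the keylink. A sketch with toy dataframes; the exact column names in your own encrypted.csv and keys.csv exports may differ, so check the headers first:

```python
import pandas as pd

# Join each encrypted row with its key material via the keylink.
# Column names and values below are illustrative stand-ins.
encrypted = pd.DataFrame({
    "strmMeta.keyLink": ["863cd2bf", "925a0397"],
    "CustomerID": ["<ciphertext-a>", "<ciphertext-b>"],
})
keys = pd.DataFrame({
    "keyLink": ["863cd2bf", "925a0397"],
    "encryptionKey": ['{"primaryKeyId": 1}', '{"primaryKeyId": 2}'],
})
joined = encrypted.merge(keys, left_on="strmMeta.keyLink", right_on="keyLink")
print(len(joined))  # 2
```

Each row of `joined` now carries both the ciphertext and the key needed to decrypt it (step 3).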

In the notebook you can find all the details; here I’m just showing the essence of the decryption.

keys = pd.read_csv("s3/uci_online_retail/keys.csv")

A typical key object looks like this:

{'primaryKeyId': 714921229,
'key': [{'keyData': {'typeUrl': 'type.googleapis.com/google.crypto.tink.AesSivKey',
'value': 'EkBKGzBuy9C3UUmWaOzpe7NBEg6QK21FRhZ9MjuD5hpa0+hPJy0kn1HngA9QUT5aGbTNQQyow0V6qJCFoFRQNNTH',
'keyMaterialType': 'SYMMETRIC'},
'status': 'ENABLED',
'keyId': 714921229,
'outputPrefixType': 'TINK'}]}

You need the Tink Python library. Here encryptionKey holds the JSON-encoded encryption key, and cipher_text the base64-encoded encrypted original text.

import base64

import tink
from tink import daead, cleartext_keyset_handle

# Register the deterministic AEAD primitives before use.
daead.register()

reader = tink.JsonKeysetReader(encryptionKey)
keyset_handle = cleartext_keyset_handle.read(reader)
primitive = keyset_handle.primitive(daead.DeterministicAead)
clear_text = primitive.decrypt_deterministically(
    base64.b64decode(cipher_text), b'').decode('utf-8')

In the notebook we show the details.

Locating and destroying data on a subject using the Data Subject Service

As we outlined in the beginning, an important set of obligations under GDPR and other privacy laws is that you need to retrieve the subset of data inside your systems that belongs to a single individual if they request it. Traditionally, these are expensive and cumbersome operations: imagine having to scan all your data every time a request is made, with the data subject potentially present in every day of data you have ever gathered. Approaches to optimize this, such as keeping indexes per data subject, simplify these operations, but to act on the data itself you still need to operate on every stored byte.

Using our Data Subjects API, you can find the keylinks associated with a single data subject, providing you with an automated overview of which data, about whom, is where in your systems!

The Data Subjects API has one goal: find all the encryption keys that were ever created via STRM Privacy for a certain data subject.

A data subject is the owner of the data; see this GDPR definition. Data subjects typically have some label that allows us to identify them unequivocally: think of a customer id, a vehicle license plate, a Dutch burgerservicenummer, …

When STRM Privacy processes data, and the data contains a data subject identifier in its event data, it stores newly created encryption keys via the Data Subjects API [4].

In the UCI set, the CustomerID field is obviously this identifier. We also see this in the data contract. After running the batch job, we expect to see every distinct value of CustomerID as a data subject in the Data Subjects API.

strm list data-subjects --page-size 3
<next-page-token>

12346
12348
Note: in most organizations these lists of ids can quickly grow pretty large. That’s why the first row is actually the next-page-token, which can be used for paginated access to the full list. See the CLI manual for details. The empty line is caused by empty CustomerID strings that were also processed: all non-identified customers are encrypted as the same entity. This is surprisingly useful for detecting overall patterns, such as missing customer ids (compliance alert! If you can’t trace the data back to an individual, you can never demonstrate it was obtained lawfully. Ouch).

The next two lines are the first two data subjects in the Data Subjects API (not necessarily from the UCI dataset, but in this case they are).

Let’s see which key links were created for these two ids:

DATASUBJECT   KEYLINK                                EXPIRY
12346         863cd2bf-60f7-4047-84b6-4d9f08e9868c   2011-01-19T01:00:00.000000+0100
12348         925a0397-628d-4495-9e60-c9d772677417   2010-12-17T01:00:00.000000+0100
12348         3ef801c7-8aec-49a5-89c4-f2a1543a9953   2011-01-26T01:00:00.000000+0100
12348         d96de0e7-36d0-4530-996a-a49ee553e5da   2011-04-06T02:00:00.000000+0200
12348         0039a177-37a9-4447-81da-dc3d1d36fcdc   2011-09-26T02:00:00.000000+0200

So we see that data subject 12346 has been active on one day, and 12348 on four different days in this dataset.

Now how can we operate on the data?

In case of a Right To Be Forgotten request for customer 12348 you…

  1. … call the Data Subjects API to get all the keylinks associated with this data subject
  2. … remove those encryption keys from the keys database in your organization (hopefully, the keys export file has long since been deleted!). If the organization has never stored any derived datasets with decrypted data, it has now complied with the RTBF request. This mechanism is called crypto-shredding.
  3. … to fully complete the request, we also have to forget this was a customer, by deleting the data subject from the Data Subjects API with strm delete data-subjects 12348. Note that different retention periods can mean the same datapoint is allowed to be stored for different lengths of time!
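
The whole RTBF flow can be sketched as follows. The key store and lookup here are stand-ins, not part of the STRM API; the keylinks are abbreviated from the listing above:

```python
# Hypothetical end-to-end sketch of an RTBF request: look up the
# subject's keylinks (step 1) and destroy those keys (step 2).
# Function and variable names are illustrative, not a STRM API.
key_store = {
    "925a0397": "<key>", "3ef801c7": "<key>",
    "d96de0e7": "<key>", "0039a177": "<key>",
    "863cd2bf": "<key>",  # belongs to another subject (12346)
}
subject_keylinks = {
    "12346": ["863cd2bf"],
    "12348": ["925a0397", "3ef801c7", "d96de0e7", "0039a177"],
}

def forget(subject: str) -> int:
    """Crypto-shred a data subject: destroy every key linked to them."""
    removed = 0
    for keylink in subject_keylinks.pop(subject, []):
        if key_store.pop(keylink, None) is not None:
            removed += 1
    return removed

print(forget("12348"), len(key_store))  # 4 1
```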

Et voilà

And that’s it! We have used the UCI example dataset to configure and process a STRM batch job. We then showed how processing this data makes important legal obligations much simpler (and so cheaper and lower risk) through the Data Subjects API.

Curious how STRM Privacy can work and help structure your privacy operations? Reach out or request a demo to learn more.

Thanks for reading.

PS We’re hiring!

Care for privacy and want to help in building the fast food of privacy infrastructure? Come build STRM to help data teams deliver data products without sacrificing privacy in the process. We are hiring!

  1. Look here for a list. ↩︎

  2. make sure you install our shell completion ↩︎

  3. if you have issues setting up a local Python installation for data science, I suggest you give Anaconda a try. ↩︎

  4. it will only do so if the associated data contract explicitly mentions what part of an event contains the data subject identifier. ↩︎

Decrease risk and cost, increase speed: encode privacy inside data with STRM.