Streaming + privacy by design = STRM
When we started STRM, our vision was to build a stunningly simple (compared to DIY) streaming data platform with integrated privacy. This way, an advanced data capability is available to a much wider audience without the high barrier to entry. We also believe privacy in data should “shift left”, and data should not be collected without privacy encoded into it.
All in all, there are three premises underlying our thinking:
- Encode privacy and legal ground inside every data point
- Apply privacy to newly collected data
- Streaming is the future of data
This is the foundation of what we're doing today: define and register data contracts (= data shape + privacy implications), set up a streaming pipeline in just a few minutes, and consume privacy-transformed data in real time.
Build faster, cheaper and at lower risk
This already brings important benefits, not just to tech-focused teams but across organizations.
Unlocking use cases previously blocked by privacy concerns opens up new (revenue) opportunities. Considerable cost savings come from reducing the coordination between legal/security and data (science) teams to a single set of decisions you can take up front.
Streaming is the future of data. And batch, too!
But when working with clients, we noticed that not every organization has Netflix-level maturity when it comes to building and benefiting from event-based architectures. Even for advanced ML teams, Chip's approach to real-time machine learning, for which seriously low-latency data pipelines are a prerequisite, is often out of reach.
In fact, in most STRM integrations we see, clients stream data in but batch it back into their own perimeter, moving to full streaming only for completely new applications.
Streaming doesn’t address existing use cases and routines
So perhaps encoding privacy into "new data" should not be limited to event-based systems. Many processes already exist and are time-sensitive only within the boundaries of (micro-)batch architectures.
Moreover, we see organizations struggle with questions around existing data and privacy-by-design. Their data is a treasure trove, but there is a lot of uncertainty around privacy (which adds coordination and limits use cases considerably).
These cases often boil down to a lack of proper administration of how the data was obtained (i.e., the legal ground that applies), posing serious challenges for compliant data processing and consumption.
Introducing Batch mode: re-process data or integrate into existing routines.
Internally, this fueled a discussion: what if we expand into the domain of existing data and data routines?
Fast forward to our release of batch mode today: more freedom in choosing and integrating the data architecture to suit your technical setup and use cases.
With batch mode, you can set up data routines that, based on the data contract, pull in data from (currently) a bucket, transform it according to the contract, and make it available for downstream processing.
An important limitation of this early release is that your data needs to be temporally spaced in order for us to achieve anonymization; the tutorial explains this in more depth. We plan to offer more privacy transforms, but if you need to process something like user-profile data (where every user occupies only a single row), please be aware of this limitation.
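To build intuition for why temporal spacing matters, here is a toy sketch of windowed k-anonymity. This is an illustration only, not STRM's actual transform; the `k` threshold, window size, and the `ts`/`zip` field names are all assumptions for the example. Events that share a quasi-identifier within a time window can be released as a group once enough of them accumulate, while a lone record has no group to hide in:

```python
from collections import defaultdict

K = 3          # minimum group size before release (illustrative threshold)
WINDOW = 60    # seconds per time window (illustrative)

def anonymize_window(events, k=K, window=WINDOW):
    """Toy windowed k-anonymity: keep an event only if its
    (time window, quasi-identifier) group holds at least k events.
    Each event is a dict with 'ts' (epoch seconds) and 'zip'."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["ts"] // window, e["zip"])].append(e)
    # Suppress groups that are too small to provide anonymity.
    return [e for g in groups.values() if len(g) >= k for e in g]

events = [
    {"ts": 10, "zip": "1011"},
    {"ts": 20, "zip": "1011"},
    {"ts": 35, "zip": "1011"},
    {"ts": 40, "zip": "9999"},  # lone quasi-identifier in its window: suppressed
]
kept = anonymize_window(events)  # only the three "1011" events survive
```

With a single-row-per-user profile table there is no stream of events to accumulate per window, which is why this style of anonymization does not apply there.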
Creating a STRM Batch job
In order for us to pick up and transform your batch data, you need to:
- Create a data connection to retrieve and store the data.
- Define the data contract your data adheres to.
- Define a batch job in the CLI.
- Watch the magic!
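Conceptually, the routine those steps configure looks something like the sketch below. This is not the STRM API: the `CONTRACT` mapping, field names, and helper functions are hypothetical stand-ins for "read from a bucket, transform each record per the contract, hand it downstream":

```python
import csv
import io

# Toy "data contract": maps each field to a privacy transform.
# Field names and transforms are illustrative assumptions.
CONTRACT = {
    "email": lambda v: "****",    # mask a PII field
    "country": lambda v: v,       # non-PII passes through unchanged
}

def transform_row(row, contract=CONTRACT):
    """Apply the contract's per-field transform to one record."""
    return {field: contract.get(field, lambda v: v)(value)
            for field, value in row.items()}

def run_batch(source_csv: str) -> list:
    """Pull rows from an input (a bucket object in STRM's case)
    and transform them per the contract for downstream pickup."""
    reader = csv.DictReader(io.StringIO(source_csv))
    return [transform_row(r) for r in reader]

out = run_batch("email,country\nalice@example.com,NL\n")
```

The key design point carried over from streaming mode is that the data contract, not the job, decides what happens to each field.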
Head over to the tutorial on creating batch pipelines for a walkthrough of all the steps and an example dataset.
PS We’re hiring!
Want to help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?