
Creating data schemas and classifying data in event contracts

Data schemas define the shape (WHAT) of your data, while event contracts specify HOW to classify each field. Let's dive deeper into these two core concepts and how you can create your own schemas.

by STRM.team on

For the more technical explanations and examples of strm create schema and strm create event-contract, see our docs.

For starters

At STRM, our aim is to make sure data teams can build with confidence when it comes to privacy. We approach this by helping you codify privacy policies inside your data: you define the privacy implications once, and as soon as data is generated it needs to adhere to this definition.

This way, you can build against what we call privacy streams (data egress) and don’t have to worry again about the allowed purposes of the data inside, e.g. for analytics or ML.

To achieve this, what better place to start than the shape of the data, i.e. the schema?

In this post we’ll outline:

  • what the concepts and thoughts behind the data schema and event contract are,
  • how to classify data through an event contract,
  • and how you can add and use your own data schemas through our CLI and console.

Data hates but loves schemas

From databases to streaming to analytics to… schemas are everywhere. They define what data (should) look like. In reality, data tends to deviate from the schema from time to time. This is only natural: many producers across many teams, a just-before-coffee commit or so-so review (LGTM!), or even users entering strange stuff into fields…

Even if you have the schema well-defined and aligned across the organization, real-life data is messy. This is a serious problem when you need to guarantee what’s inside the data, like when dealing with privacy!

This is where strict schemas like ours come in.

Being strict is being nice

It’s a bit like raising children: if you’re strict on boundaries, you can let go on what’s within.

Being strict about the shape your data can take has tons of advantages:

  • data can only have a specific type and shape,
  • you catch errors in data early, during your development process,
  • it’s helpful in being compliant.

It’s also very annoying (as with any strict type) if you are sloppy or in a hurry, or need to facilitate data flow with generic schemas across many departments.

Data schemas define shape…

The data schema defines the shape of the data: what are you collecting or generating?

Basically, you can think of it as the column headers of a spreadsheet. In technical terms, it’s formally referred to as the serialization schema (but we prefer to just call it a schema). Here’s an example of our privacy demo schema:

The fields are all strings, and they could hold e.g. information about which recommendation received a click (notSensitiveValue) in a browsing session (consistentValue) by a customer (uniqueIdentifier).
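To make the idea of a strict shape a bit more concrete, here is a minimal Python sketch. It is not STRM code: DEMO_FIELDS and violations are hypothetical names, and only the three field names are taken from the description above (the real demo schema may contain more fields).

```python
# Hypothetical, simplified stand-in for a strict schema: field name -> required type.
# Only the field names come from the demo schema described in the text.
DEMO_FIELDS = {
    "uniqueIdentifier": str,
    "consistentValue": str,
    "notSensitiveValue": str,
}


def violations(event: dict) -> list[str]:
    """Return the reasons why `event` does not match DEMO_FIELDS (empty if it conforms)."""
    problems = []
    # Every declared field must be present and have the declared type.
    for name, expected_type in DEMO_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"wrong type for {name}: {type(event[name]).__name__}")
    # Being strict also means rejecting fields the schema does not know about.
    for name in event:
        if name not in DEMO_FIELDS:
            problems.append(f"unknown field: {name}")
    return problems
```

An event that passes this check has exactly the declared fields with the declared types. Anything else is reported instead of silently slipping through.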

Schemas have a reference and version, so our system knows where to look for the source: strmprivacy/demo/1.0.2

You can also retrieve a schema through our CLI:

❯ strm get schema --output json strmprivacy/demo/1.0.2
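The reference follows a handle/name/version pattern. As a small illustration, the Python sketch below splits such a reference into its parts. SchemaRef and parse_schema_ref are hypothetical names made up for this example; they are not part of any STRM library.

```python
from typing import NamedTuple


class SchemaRef(NamedTuple):
    """Hypothetical container for the three parts of a schema reference."""
    handle: str
    name: str
    version: str


def parse_schema_ref(ref: str) -> SchemaRef:
    """Split 'handle/name/version' into its parts, rejecting anything else."""
    parts = ref.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected handle/name/version, got {ref!r}")
    return SchemaRef(*parts)


ref = parse_schema_ref("strmprivacy/demo/1.0.2")
print(ref.handle, ref.name, ref.version)  # prints: strmprivacy demo 1.0.2
```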

… and event contracts define how to process the data.

Once we have the schema, we can define what the privacy implications are, e.g. which of your consent settings corresponds to being allowed to identify a customer. The event contract defines how data should be processed and how it can be used: each field has a type, and that type requires a consent value.

Note: depending on the consent type (granular or cumulative), the consent value means either that exactly that consent is needed (granular), or that this consent or a higher one is needed (cumulative).
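The difference between the two consent types can be sketched in a few lines of Python. This is an illustration only: it assumes consent levels are integers where a higher number means broader consent, and is_allowed is a hypothetical helper, not STRM’s implementation.

```python
def is_allowed(required_level: int, given_levels: set[int], consent_type: str) -> bool:
    """Decide whether a field that requires `required_level` may be used.

    granular:   exactly the required level must have been given.
    cumulative: the required level or any higher level suffices.
    Assumption: levels are integers and higher means broader consent.
    """
    if consent_type == "granular":
        return required_level in given_levels
    if consent_type == "cumulative":
        return any(level >= required_level for level in given_levels)
    raise ValueError(f"unknown consent type: {consent_type!r}")
```

For example, a customer who gave only level 2 passes a cumulative check for a field that requires level 1, but fails the granular check for that same field.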

Let’s take a look at the event contract for the privacy demo schema:

In the top row, we see the key field, which indicates which field can be used to connect events. The fields in the demo schema are all classified as PII (Personally Identifiable Information) fields, so each needs to have a consent level defined. In the clickstream demo event, not every field is sensitive. If a field is not sensitive, simply leave it out of the event contract. Next, we have the validations field, which is a bit of a special one.

Setting validations on field level

With STRM, you can validate field contents using regular expressions. If the regex fails to match, the event will bounce. This is a simple but pretty powerful way of validating data at the gate! It also means you really need to know your way around regexes, so handle with care. 😉
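The bounce-on-mismatch behaviour can be sketched in Python as follows. The VALIDATIONS mapping, the pattern itself and the accept function are hypothetical, and the sketch assumes the regex must match the whole field value; check the docs for the exact matching semantics STRM applies.

```python
import re

# Hypothetical validation rules: field name -> regular expression.
VALIDATIONS = {
    "consistentValue": re.compile(r"[a-z0-9-]{1,36}"),
}


def accept(event: dict) -> bool:
    """Return True if every validated field fully matches its regex.

    False means the event would bounce at the gate.
    """
    for field, pattern in VALIDATIONS.items():
        value = event.get(field)
        # A missing or non-string value cannot match, so the event bounces.
        if not isinstance(value, str) or pattern.fullmatch(value) is None:
            return False
    return True
```

Using fullmatch rather than search is a deliberate choice in this sketch: a pattern that only matches part of the value would otherwise let malformed data through.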

Adding schemas and event contracts

With schemas and event contracts at the core of how you send data, let’s see how you add and use your own.

Schema library: public or private

Schemas are added to our registry, which we call the library.

With the schema library, we intend to provide examples and get you started quicker: schemas generally differ on specifics, not 100%.¹ Paid users have the option to submit private schemas.

Adding schemas through the CLI

Adding schemas and event contracts through the CLI takes a single command each (provided you have your definition files ready):

❯ strm create schema (handle/name/version) (definition-file)
❯ strm create event-contract (definition-file)

View the existing schemas as code and copy them as an example. You can add schemas as JSON or in the efficient AVRO format. We love using the Javro editor.

Console

Adding schemas and event contracts in the console is just as simple.

Under the schemas tab, hit create schema:

You can select a handle, name the schema, choose the format and edit, drop or upload a schema:

Adding event contracts

The steps are the same for event contracts, plus you can add the key field, PII fields and validations:

Once you add a PII field, you need to indicate the path inside the schema to that field. E.g. for someSensitiveValue in the demo schema the path is simply someSensitiveValue. If your schemas are more complex and nested, make sure to enter the full path.
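To illustrate what a full path means for nested data, here is a minimal Python sketch. The dot as a path separator, the resolve_path helper and the nested customer example are assumptions made for illustration; check the docs for the exact path notation STRM expects.

```python
def resolve_path(event: dict, path: str, separator: str = "."):
    """Follow `path` through nested dicts and return the value found there."""
    current = event
    for key in path.split(separator):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"path {path!r} not found at {key!r}")
        current = current[key]
    return current


# Flat schema: the path is simply the field name.
print(resolve_path({"someSensitiveValue": "abc"}, "someSensitiveValue"))

# Hypothetical nested event: the full path is needed to reach the field.
nested = {"customer": {"address": {"city": "Amsterdam"}}}
print(resolve_path(nested, "customer.address.city"))
```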

YAML, AVRO and JSON

For now, you can only add AVRO and JSON schemas. We are working on a simpler way to define and add schemas (YAML all the way!), but that’s not publicly available yet.

Wrap-up 🌯

When we started, we promised to outline

  • the concepts and thoughts behind the data schema and event contract,
  • how to classify data through an event contract,
  • how you can add and use your own data schemas through our CLI and console.

Which we did!

Head over to our console or sign up if you haven’t already, and make sure to drop us a line with your feedback.

¹ If there’s 61% overlap between DNA in fruit flies and human DNA, there must be overlap between your use cases ;-)

PS We’re hiring!

Want to work on features like adding data schemas/event contracts and help data teams build awesome data products without sacrificing privacy in the process? There’s plenty of cool work left. Did we mention we are hiring!?
