Product Update - Supercharging Data Quality with Volume Assertions in Acryl Observe

It’s a day before your company’s quarterly business review and you receive an urgent email from the e-commerce team – the website sales numbers are unusually impressive.

That sets off alarm bells for you as a data engineer. You know that when something seems too good to be true, it probably is.

After doing some digging, you find out that the upstream “purchases” table that initially tracked purchases made only on your website now includes purchases made on your company’s wildly popular new mobile app.

No one had even told you about this. You dig a little more; it’s worse than you’d initially thought.

The issue started weeks ago, and you’ve just found out now.

A day before the quarter ends. From a partner team, no less.

We all know the saying: Data users should NOT be the first to encounter data issues.

Unfortunately, conversations with Acryl customers and partners have shown that they often ARE the first to encounter hard-to-detect data quality issues, usually as a result of:

Unexpected Schema Changes: Structural changes (columns being added or removed, name changes, type changes, etc.) that break the downstream consumer's expectations of the data.
Delayed Data: When data is not updated on the expected timeline. This can happen for many reasons, such as application code bugs or scalability issues.
Unexpected Semantic Changes: When the meaning of a particular row or column in a table changes unexpectedly.

Frequently, these kinds of issues aren't inherently problematic or erroneous. They may not even be preventable.

The real issue is that they can unknowingly break assumptions made by downstream stakeholders or consumers, as we saw in the horror story above.

However, when total prevention isn't possible, early detection is the next best thing.

At Acryl, we're focused on building observability features that help organizations reduce the time to detect these types of data quality problems so that the people responsible for the data (the data team) are made aware of issues before anyone else (partner teams).

The first feature we launched was Freshness Monitoring, which lets you continuously monitor the freshness of your data and catch issues before they reach anyone else. (Read about it here: Preventing Data Freshness Problems with Acryl Observe)

The next feature we’re rolling out is Data Volume Assertions.

Volume of a Table as a Data Health Indicator

Let’s rewind to our example of the “purchases” table.

The table suddenly changed in meaning. Each row in the table initially represented a purchase made through the website. Then, without warning, each row came to represent a purchase made via either the website or the mobile app.

This change – although unexpected – is an example of something that didn’t need prevention. There may have been good reasons why the upstream team decided to extend the table.

But it certainly could have been more proactively identified, giving us the opportunity and time to prevent the spread of negative impacts.

A table's row count – or volume – can be a critical indicator of data health. For instance, a sudden increase or decrease in the number of rows being added to or removed from a table could indicate that our assumptions about the data are no longer true, or that something is seriously wrong with our data.

Monitoring the volume, or row count, of your tables, and their growth rate, can be a simple and effective way to detect data quality issues early, before things get worse. With Acryl DataHub’s Volume Assertions, you can do just that.

What is a Volume Assertion in Acryl DataHub?

Volume Assertions allow you to:

Define expectations about the normal volume, or row count, of a particular warehouse table

and

Monitor those expectations as the table changes over time

They can be particularly useful when you have frequently changing tables that have a predictable pattern of growth or decline, or when you have a table that remains relatively stable over time.

At the most basic level, Volume Assertions consist of a few important components:

1. Evaluation Schedule

This defines how often to check a given warehouse Table for its volume. This should usually be configured to match the expected change frequency of the Table, although it can also be less frequent depending on the requirements. You can also restrict checks to specific days of the week, hours in the day, or even minutes in an hour.
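To make the idea of an evaluation schedule concrete, here is a minimal Python sketch. The EvaluationSchedule class and its fields are hypothetical names invented for illustration and are not part of the DataHub API; the sketch only shows how a check could be limited to certain days, hours, and minutes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet


@dataclass(frozen=True)
class EvaluationSchedule:
    """Hypothetical schedule: a check is due only when day, hour, and minute all match."""
    days_of_week: FrozenSet[int]  # 0 = Monday ... 6 = Sunday (datetime.weekday convention)
    hours: FrozenSet[int]         # 0-23
    minutes: FrozenSet[int]       # 0-59

    def is_due(self, now: datetime) -> bool:
        return (
            now.weekday() in self.days_of_week
            and now.hour in self.hours
            and now.minute in self.minutes
        )


# Assumed example: check the table on weekdays at 06:00 and 18:00.
weekday_schedule = EvaluationSchedule(
    days_of_week=frozenset(range(5)),
    hours=frozenset({6, 18}),
    minutes=frozenset({0}),
)

monday_morning = datetime(2024, 1, 1, 6, 0)    # a Monday
saturday_morning = datetime(2024, 1, 6, 6, 0)  # a Saturday
```

With this schedule, the Monday timestamp is due for a check and the Saturday one is not.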

2. Volume Condition

This dictates when the Volume Assertion should fail. Options include checking for a fixed range, a fixed row count condition (e.g., more or fewer rows than expected), or even evaluating the growth rate of the table.

There are two categories of conditions:

Row Count (Based on the total/absolute volume of the table)

These are defined against the point-in-time total row count for a table.

Examples include:

The table should always have < 1000 rows
The table should always have > 1000 rows
The table should always have between 1000 and 2000 rows.
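The three example conditions above can be sketched in a few lines of Python. RowCountCondition is a hypothetical helper written for this post, not a DataHub class, and treating both bounds as exclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowCountCondition:
    """Hypothetical condition on the point-in-time row count of a table."""
    min_rows: Optional[int] = None  # exclusive lower bound (assumption); None = no lower bound
    max_rows: Optional[int] = None  # exclusive upper bound (assumption); None = no upper bound

    def passes(self, row_count: int) -> bool:
        if self.min_rows is not None and row_count <= self.min_rows:
            return False
        if self.max_rows is not None and row_count >= self.max_rows:
            return False
        return True


fewer_than_1000 = RowCountCondition(max_rows=1000)
more_than_1000 = RowCountCondition(min_rows=1000)
between_1000_and_2000 = RowCountCondition(min_rows=1000, max_rows=2000)
```

A table with 1,500 rows would pass the second and third conditions and fail the first.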

Growth Rate (Based on the change in volume)

These are defined against the growth or decline rate of a table, measured between subsequent checks of the table volume.

For example, you could specify conditions like:

When the table volume is checked, it should have < 1000 more rows than it had during the previous check (for when the table growth is too fast)
When the table volume is checked, it should have > 1000 more rows than it had during the previous check (for when the table growth is too slow)
When the table volume is checked, it should have between 1000 and 2000 more rows than it had during the previous check.

For Growth Rate conditions, DataHub can use both absolute row count deltas and relative percentage deltas to identify tables with an abnormal pattern of growth.
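The difference between an absolute delta and a percentage delta is easiest to see with numbers. The functions below are hypothetical helpers for illustration, not DataHub functions, and the inclusive range check is an assumption.

```python
def absolute_growth(previous_count: int, current_count: int) -> int:
    """Rows gained (positive) or lost (negative) since the previous check."""
    return current_count - previous_count


def percentage_growth(previous_count: int, current_count: int) -> float:
    """Growth relative to the previous row count, in percent."""
    if previous_count == 0:
        raise ValueError("percentage growth is undefined when the previous count is 0")
    return (current_count - previous_count) * 100.0 / previous_count


def growth_within(delta: float, low: float, high: float) -> bool:
    """True when the delta lies in the expected range (inclusive bounds, an assumption)."""
    return low <= delta <= high


# Assumed example: the table grew from 10,000 rows to 11,500 rows between checks.
rows_added = absolute_growth(10_000, 11_500)       # 1500 rows
percent_added = percentage_growth(10_000, 11_500)  # 15 percent
```

Here the table satisfies "between 1000 and 2000 more rows than the previous check", but it would fail a condition that expects growth of at most 10 percent.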

3. Volume Source

This is the mechanism that Acryl DataHub uses to determine the table volume (row count).

The supported source types vary based on the platform you use, but generally fall into these categories:

Information Schema: The system metadata or information schema tables exposed by the data warehouse are used to determine the row count.

Query: A COUNT(*) query is used to retrieve the latest row count for a table, with optional SQL filters applied (depending on the platform). While this approach is more portable because it does not involve system warehouse tables, it can be time- and cost-intensive, since each check may require a full scan across tens of thousands of rows. To reduce that cost, you can add a custom SQL fragment that narrows the check to a particular slice of the data.

DataHub Dataset Profile: The DataHub Dataset Profile aspect is used to retrieve the latest row count information for a table. Using this option avoids contacting your data platform, and instead uses the DataHub Dataset Profile metadata to evaluate Volume Assertions. This is the only option available if you have not configured an ingestion source through DataHub.
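As a rough illustration of the Query source, the sketch below counts rows with COUNT(*) and an optional filter fragment. It uses an in-memory SQLite database as a stand-in for a warehouse. The count_rows helper, the table layout, and the channel column are all assumptions made for this example, and this is not how DataHub issues its queries.

```python
import sqlite3
from typing import Optional


def count_rows(conn: sqlite3.Connection, table: str, where: Optional[str] = None) -> int:
    """Return the row count for a table, optionally narrowed by a SQL filter fragment.

    The table name and filter are interpolated directly into the statement,
    so only pass trusted values.
    """
    sql = f'SELECT COUNT(*) FROM "{table}"'
    if where:
        sql += f" WHERE {where}"
    return conn.execute(sql).fetchone()[0]


# Stand-in for the purchases table from the story above (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (id INTEGER PRIMARY KEY, channel TEXT)")
conn.executemany(
    "INSERT INTO purchases (channel) VALUES (?)",
    [("web",), ("web",), ("mobile",)],
)

total = count_rows(conn, "purchases")
web_only = count_rows(conn, "purchases", where="channel = 'web'")
```

The filtered count looks only at website purchases, which is the kind of narrowing a custom SQL fragment makes possible.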

Using Volume Assertions to Stay Updated on Changes in Real Time

Once you set the parameters for volume monitoring, you can decide whether you want to:

Raise an incident when the Assertion fails
Auto-resolve the incident when the Assertion passes

This allows you to easily broadcast data issues to the right stakeholders in real time, via the DataHub Subscriptions & Notifications feature, as soon as an assertion fails.
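The two options above amount to a small state machine. The sketch below is one way to picture it; IncidentTracker and its fields are hypothetical names for illustration and do not reflect how DataHub implements incidents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentTracker:
    """Hypothetical tracker that opens and closes one incident per assertion."""
    raise_on_failure: bool = True   # "Raise an incident when the Assertion fails"
    resolve_on_pass: bool = True    # "Auto-resolve the incident when the Assertion passes"
    incident_open: bool = False
    events: List[str] = field(default_factory=list)

    def record(self, assertion_passed: bool) -> None:
        if not assertion_passed and self.raise_on_failure and not self.incident_open:
            # First failure opens an incident; repeated failures do not open another.
            self.incident_open = True
            self.events.append("raised")
        elif assertion_passed and self.resolve_on_pass and self.incident_open:
            # A passing run closes the open incident.
            self.incident_open = False
            self.events.append("resolved")


tracker = IncidentTracker()
for passed in [True, False, False, True]:
    tracker.record(passed)
```

In this run, the two consecutive failures produce a single incident, and the final passing check resolves it.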

For a detailed how-to, check out our updated Volume Assertions feature guide.

Acryl Observe’s Smart Volume Assertions

Sometimes, you don't know right away what the "normal" row count for a table looks like. Luckily, the Acryl Observe module offers Smart Assertions out of the box. These are dynamic, AI-powered Volume Assertions that you can use to monitor the volume of important warehouse Tables, without requiring any manual setup. They are generated from historical patterns in your data and adapt as your data changes over time.
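To give a feel for what "learning the norm from history" can mean, here is a deliberately simple Python sketch that derives an expected range from past row counts using the mean and standard deviation. This is not how Acryl Observe generates Smart Assertions, since the underlying model is not described here. The function names and the three-standard-deviation tolerance are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def expected_range(history: Sequence[int], tolerance: float = 3.0) -> Tuple[float, float]:
    """Derive a (low, high) band from past row counts: mean +/- tolerance * stdev."""
    if len(history) < 2:
        raise ValueError("need at least two historical row counts")
    center = mean(history)
    spread = stdev(history)
    return center - tolerance * spread, center + tolerance * spread


def is_anomalous(history: Sequence[int], latest: int, tolerance: float = 3.0) -> bool:
    """True when the latest row count falls outside the band learned from history."""
    low, high = expected_range(history, tolerance)
    return not (low <= latest <= high)


# Assumed example: row counts observed at the five previous checks.
history = [10_000, 10_200, 9_900, 10_100, 10_050]
```

With this history, a new count of 10,150 falls inside the band, while a jump to 20,000 rows (as might happen when mobile purchases start landing in the table) is flagged.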

Acryl Observe: A Direct Line of Sight into Data Health & Integrity

With Acryl Observe, you can be sure that you're informed of unexpected table volumes as soon as they occur, so that YOU are the one who can first triage data quality problems, instead of your downstream data users.

More importantly, your downstream data users are never the first to encounter data issues.

Data monitoring is a tricky problem to solve, but it's one we're excited to tackle with Acryl Observe. We're building a host of new capabilities to help you understand the technical and governance health of your data assets.

John Joyce

Stay tuned for more, or get in touch for a demo of Acryl DataHub for a sneak peek into what we’re building.