Oct 2, 2023
That sets off alarm bells for you as a data engineer. You know that when something seems too good to be true, it probably is.
After doing some digging, you find out that the upstream “purchases” table that initially tracked purchases made only on your website now includes purchases made on your company’s wildly popular new mobile app.
No one had even told you about this. You dig a little more; it’s worse than you’d initially thought.
The issue started weeks ago, and you’ve just found out now.
A day before the quarter ends. From a partner team, no less.
We all know the saying: Data users should NOT be the first to encounter data issues.
Unfortunately, conversations with Acryl customers and partners have shown that they often ARE the first to encounter hard-to-detect data quality issues, usually a result of
Frequently, these kinds of issues aren't inherently problematic or erroneous. They may not even be preventable.
The real issue is that they can unknowingly break assumptions made by downstream stakeholders or consumers, as we saw in the horror story above.
However, when total prevention isn't possible, early detection is the next best thing.
At Acryl, we're focused on building observability features that help organizations reduce the time to detect these types of data quality problems so that the people responsible for the data (the data team) are made aware of issues before anyone else (partner teams).
The first feature we launched was Freshness Monitoring, which lets you continuously monitor the freshness of your data and catch issues before they reach anyone else. (Read about it here: Preventing Data Freshness Problems with Acryl Observe)
The next feature we’re rolling out is Data Volume Assertions.
Let’s rewind to our example of the “purchases” table.
The table suddenly changed in meaning. Each row in the table initially represented a purchase made through the website. But suddenly, each row became indicative of a purchase made either via the website or the mobile app.
This change – although unexpected – is an example of something that didn’t need prevention. There may have been good reasons why the upstream team decided to extend the table.
But it certainly could have been more proactively identified, giving us the opportunity and time to prevent the spread of negative impacts.
A table's row count – or volume – can be a critical indicator of data health. For instance, sudden increases or decreases in the rows being added or removed from a table could indicate that our assumptions about the data are no longer true, or that something is seriously wrong with our data.
Monitoring the volume, or row count, of your tables, and their growth rate, can be a simple and effective way to detect data quality issues early, before things get worse. With Acryl DataHub’s Volume Assertions, you can do just that.
Volume Assertions allow you to:
They can be particularly useful when you have frequently changing tables that have a predictable pattern of growth or decline, or when you have a table that remains relatively stable over time.
At the most basic level, Volume Assertions consist of a few important components:
This defines how often to check a given warehouse Table for its volume. This should usually be configured to match the expected change frequency of the Table, although it can also be less frequent depending on the requirements. You can also restrict checks to particular days of the week, hours in the day, or even minutes in an hour.
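As a rough illustration of what such a schedule expresses, here is a minimal Python sketch. The CheckSchedule class and its fields are hypothetical names made up for this example; they are not part of the DataHub API, which has its own scheduling configuration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CheckSchedule:
    """Hypothetical schedule: the weekdays, hours, and minutes on which a check runs."""
    weekdays: frozenset = frozenset(range(7))  # 0 = Monday ... 6 = Sunday
    hours: frozenset = frozenset(range(24))
    minutes: frozenset = frozenset({0})

    def is_due(self, now: datetime) -> bool:
        # A check is due only when all three parts of the schedule match.
        return (
            now.weekday() in self.weekdays
            and now.hour in self.hours
            and now.minute in self.minutes
        )


# Check the table at 06:00 and 18:00, Monday through Friday.
weekday_twice_daily = CheckSchedule(
    weekdays=frozenset(range(5)),
    hours=frozenset({6, 18}),
)
```

A table that is loaded once per day would typically get one check shortly after the load finishes, while a continuously updated table might be checked hourly.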
This dictates when the Volume Assertion should fail. Options include checking for a fixed range, a fixed row count condition (e.g., more or fewer rows than expected), or even evaluating the growth rate of the table.
There are two categories of conditions:
Row Count (Based on the total/absolute volume of the table)
These are defined against the point-in-time total row count for a table.
Growth Rate (Based on the change in volume)
These are defined against the growth or decline rate of a table, measured between subsequent checks of the table volume.
For example, you could specify conditions like:
For Growth Rate conditions, DataHub can evaluate both absolute row count deltas and relative percentage deltas to identify tables with an abnormal pattern of growth.
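To make the two categories concrete, the sketch below shows one way such conditions could be evaluated in Python. The function names, parameters, and the handling of a zero previous count are assumptions made for illustration; this is not how DataHub implements its checks.

```python
from typing import Optional, Tuple


def row_count_ok(row_count: int,
                 min_rows: Optional[int] = None,
                 max_rows: Optional[int] = None) -> bool:
    """Row Count condition: the absolute total must fall inside a fixed range."""
    if min_rows is not None and row_count < min_rows:
        return False
    if max_rows is not None and row_count > max_rows:
        return False
    return True


def growth(previous: int, current: int) -> Tuple[int, Optional[float]]:
    """Return (absolute delta, percentage delta) between two consecutive checks."""
    delta = current - previous
    # The percentage delta is undefined when the previous count was zero.
    pct = None if previous == 0 else 100.0 * delta / previous
    return delta, pct


def growth_ok(previous: int, current: int,
              max_abs_delta: Optional[int] = None,
              max_pct_delta: Optional[float] = None) -> bool:
    """Growth Rate condition: the change since the last check must stay within limits."""
    delta, pct = growth(previous, current)
    if max_abs_delta is not None and abs(delta) > max_abs_delta:
        return False
    if max_pct_delta is not None:
        # Assumption: an undefined percentage fails unless nothing changed.
        if pct is None:
            return delta == 0
        if abs(pct) > max_pct_delta:
            return False
    return True
```

In the "purchases" story, a table that normally grows by about 10% between checks and suddenly grows by 50% would fail growth_ok(previous, current, max_pct_delta=20.0), even though its total row count might still look plausible on its own.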
The Volume Source: This is the mechanism that Acryl DataHub can use to determine the table volume (row count).
The supported source types vary based on the platform you use, but generally fall into these categories:
Information Schema: The system metadata or information schema tables exposed by the data warehouse are used to determine the row count.
Query: A COUNT(*) query is used to retrieve the latest row count for a table, with optional SQL filters applied (depending on the platform). While this approach is more portable as it does not involve system warehouse tables, it can pose efficiency concerns.
DataHub Dataset Profile: The DataHub Dataset Profile aspect is used to retrieve the latest row count information for a table. Using this option avoids contacting your data platform, and instead uses the DataHub Dataset Profile metadata to evaluate Volume Assertions.
It can be time- and cost-intensive to run a full scan across tens of thousands of rows to retrieve the row count each time you need to check for changes. However, you can add a custom SQL fragment to narrow down the check to a particular subset of the data when you query the dataset. This is the only option available if you have not configured an ingestion source through DataHub.
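The Query approach with an optional filter can be sketched as follows. This example uses Python's built-in sqlite3 module as a stand-in for a data warehouse; the count_rows helper and the channel column are hypothetical, and the SQL that DataHub actually issues may differ.

```python
import sqlite3
from typing import Optional


def count_rows(conn: sqlite3.Connection, table: str,
               where: Optional[str] = None) -> int:
    """Run COUNT(*) on a table, optionally narrowed by a SQL filter fragment.

    The table name and filter are interpolated directly into the statement,
    so they must come from trusted configuration, never from user input.
    """
    sql = f'SELECT COUNT(*) FROM "{table}"'
    if where:
        sql += f" WHERE {where}"
    return conn.execute(sql).fetchone()[0]


# In-memory stand-in for the warehouse, with a hypothetical "channel" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (id INTEGER PRIMARY KEY, channel TEXT)")
conn.executemany(
    "INSERT INTO purchases (channel) VALUES (?)",
    [("web",), ("web",), ("mobile",)],
)

total = count_rows(conn, "purchases")                        # every row
web_only = count_rows(conn, "purchases", "channel = 'web'")  # filtered subset
```

On a real warehouse, a filter on a partition or date column is what usually keeps the scan small; a filter on an unindexed column may still read the whole table.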
Once you set the parameters for volume monitoring, you can decide if you want to,
This allows you to easily broadcast data issues to the right stakeholders in real-time when the assertion fails via the DataHub Subscriptions & Notifications feature.
For a detailed how-to, check out our updated Volume Assertions feature guide.
Sometimes, you don't know right away what the "normal" row count for a table looks like. Luckily, the Acryl Observe module offers Smart Assertions out of the box. These are dynamic, AI-powered Volume Assertions that you can use to monitor the volume of important warehouse Tables, without requiring any manual setup. They are generated by looking at historical norms in your data and will adapt as your data changes over time.
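The general idea of deriving a threshold from historical norms can be illustrated with a deliberately simple statistical band. This is only a sketch of the concept: the source does not describe how Smart Assertions are computed, and the mean and standard deviation rule below, along with the function names and the factor k, are assumptions chosen for brevity.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def expected_band(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) as mean +/- k population standard deviations of past values."""
    mu = mean(history)
    sigma = pstdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(history: Sequence[float], observed: float, k: float = 3.0) -> bool:
    """Flag an observation that falls outside the band learned from history."""
    low, high = expected_band(history, k)
    return not (low <= observed <= high)


# Hypothetical daily counts of new rows added to a table.
daily_new_rows = [100, 102, 98, 101, 99]
```

With this history, a day that adds 103 rows stays inside the band, while a day that adds 180 rows is flagged. Recomputing the band as new observations arrive is what lets the threshold move with the data.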
With Acryl Observe, you can be sure that you're informed of unexpected table volumes as soon as they occur, so that YOU are the one who can first triage data quality problems, instead of your downstream data users.
More importantly, your downstream data users are never the first to encounter data issues.
Data monitoring is a tricky problem to solve, but it's one we're excited to tackle with Acryl Observe. We're building a host of new capabilities to help you understand the technical and governance health of your data assets.
Stay tuned for more, or get in touch for a demo of Acryl DataHub for a sneak peek into what we’re building.