Data Engineering
Metadata Management
Data Governance
John Joyce
Sep 11, 2023
You’re a data engineer who manages a Snowflake table that tracks all user click events from your company’s e-commerce website. This data originates from an online web application upstream of Snowflake, and lands in your data warehouse every day. At least it’s supposed to.
One day, someone introduces a bug in the upstream ETL job that copies the click event data to the data warehouse, causing the pipeline to copy zero click events for the day. Suddenly, your Snowflake table is missing crucial click event data for yesterday, and you don’t have a clue.
Next thing you know, you receive a frantic Slack message from your Head of Product. She’s puzzled by the site usage dashboard that shows zero user views or purchases on the previous day!
As data engineers, it’s all too common for end users – decision makers looking at internal dashboards or worse, users navigating our product – to be the first ones to discover data issues when they happen.
This is because some types of issues are particularly hard to detect in advance, like missing data from upstream sources.
Unfortunately, we’ve all been there.
The data you depend on (or are responsible for!) went days, weeks, or even months before someone noticed that it wasn’t being updated as often as it should.
Maybe it was due to an application code bug, a seemingly harmless SQL query change, or someone leaving the company. There are many reasons why a table on Snowflake, Redshift, or BigQuery may fail to get updated as frequently as stakeholders expect.
Such issues can have severe consequences: inaccurate insights, misinformed decision-making, and bad user experience to name a few.
For this reason, it’s critical that organizations try to get ahead of these types of issues, with a focus on protecting the most mission-critical data assets.
What if you could reduce the time to detect data freshness incidents?
What if you could *continuously* monitor the freshness status of your data and catch issues before they reach anyone else?
While many data catalogs disregard the freshness problem altogether, we at Acryl believe the central data catalog should be the single source of truth for the technical health and the governance or compliance health of a data asset – a one-stop-shop for establishing trust in data, suitable for use by your entire organization.
Based on this belief, and our deep experience extracting metadata from systems like Snowflake, BigQuery, and Redshift, we felt well-positioned to tackle the data freshness challenge.
Through many conversations with our customers and community, we honed our initial approach and built the Freshness Assertion monitoring feature on Acryl DataHub.
With Acryl Freshness Assertions, data producers or consumers can define how often a table should change and get alerted the moment an expected update fails to arrive.
It's like having an automated watchdog that continuously ensures that your most important tables are being updated on time, and that alerts you when things go off-track.
Let’s take a closer look at how Acryl Observe Freshness Assertions work.
Monitoring the freshness of your warehouse tables might sound straightforward, but it’s a bit more complicated than initially meets the eye.
The first challenge is determining what constitutes a “change”.
Is it an INSERT operation? What about a DELETE, even if the total number of rows hasn’t changed? Is it based on rows being explicitly added or removed? Or perhaps the arrival of rows with a new, higher value than has previously been observed in the table?
More specifically, we may rely on:
- Warehouse metadata, such as the information schema or audit log, which records when a table was last altered
- A last modified timestamp column maintained in the table itself
- A high watermark column, such as a date partition or monotonically increasing ID, whose maximum value advances whenever new rows arrive
So what is the right way to determine whether a table has changed?
The answer: it depends.
Selecting the right approach hinges on your data consumer's expectations.
Each scenario needs a different approach. For instance, if you expect new date partitions daily, the high watermark column would be the right choice. If any change (INSERT, UPDATE, DELETE) is valid, the information schema might be suitable. If you already track row changes using a last modified timestamp column, that could be the simplest and most accurate option.
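To make those trade-offs concrete, here is a rough Python sketch of how each detection strategy could be implemented against a warehouse. The connection object, schema, table, and column names (analytics.click_events, updated_at, event_date) are hypothetical, and this illustrates the general idea rather than how DataHub evaluates assertions internally.

```python
from datetime import datetime, timedelta, timezone

def scalar(conn, sql):
    """Run a query and return the single value it produces (DB-API style connection)."""
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchone()[0]

def changed_via_information_schema(conn, window_start):
    # 1) Warehouse metadata: when did the warehouse last register a write?
    #    Column names vary by warehouse; Snowflake exposes LAST_ALTERED.
    last_altered = scalar(conn, """
        SELECT last_altered
        FROM information_schema.tables
        WHERE table_schema = 'ANALYTICS' AND table_name = 'CLICK_EVENTS'
    """)
    return last_altered is not None and last_altered >= window_start

def changed_via_last_modified_column(conn, window_start):
    # 2) Last modified column: did any row get touched inside the window?
    max_modified = scalar(conn, "SELECT MAX(updated_at) FROM analytics.click_events")
    return max_modified is not None and max_modified >= window_start

def changed_via_high_watermark(conn, previous_watermark):
    # 3) High watermark: has the max partition/ID advanced past what we saw last time?
    current_watermark = scalar(conn, "SELECT MAX(event_date) FROM analytics.click_events")
    return current_watermark is not None and current_watermark > previous_watermark

# Example: has the table changed within the last 24 hours?
window_start = datetime.now(timezone.utc) - timedelta(hours=24)
# is_fresh = changed_via_last_modified_column(conn, window_start)
```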
Our conversations with customers and partners revealed the need for configurability and customizability across these approaches.
The outcome of these conversations is the Freshness Assertion—a configurable Data Quality rule that monitors your table over time to detect deviations from the anticipated update schedule.
This ensures that you and your team are the first to know when freshness issues inevitably arise.
Acryl DataHub supports creating and scheduling ‘Freshness Assertions’ to monitor the freshness of the most important tables in your warehouse.
A Freshness Assertion is a configurable data quality rule used to determine if a table in the data warehouse has been updated within a given period. It is particularly useful when you have frequently changing tables.
At the most basic level, a Freshness Assertion consists of:
- An evaluation schedule: how often the assertion runs and the window within which the table is expected to change
- An evaluation source: the mechanism used to detect a change, such as warehouse metadata (information schema or audit log), a Last Modified Column, or a High Watermark Column
Using the Last Modified Column or High Watermark approach is especially useful when you want to monitor a table for specific types of changes, e.g. particular inserts or updates.
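As a mental model, here is a minimal sketch of what such an assertion boils down to: an expected change window plus an evaluation source, checked against the most recent change that can be observed. The class and field names are made up for illustration and are not DataHub's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class FreshnessAssertion:
    """Illustrative shape of a freshness assertion; not DataHub's actual API."""
    dataset: str                 # e.g. "snowflake.analytics.click_events"
    expected_within: timedelta   # the window in which a change must be observed
    evaluation_source: str       # "information_schema" | "last_modified_column" | "high_watermark"

def evaluate(assertion: FreshnessAssertion, last_change_at: datetime) -> bool:
    """Pass if the most recently observed change falls inside the expected window."""
    window_start = datetime.now(timezone.utc) - assertion.expected_within
    return last_change_at >= window_start

# Example: click_events must change at least once every 24 hours.
clicks_fresh = FreshnessAssertion(
    dataset="snowflake.analytics.click_events",
    expected_within=timedelta(hours=24),
    evaluation_source="last_modified_column",
)
```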
There’s more.
As part of the Acryl Observe module, DataHub also comes with Smart Assertions, which are AI-powered Freshness Assertions that you can use out of the box to monitor the freshness of important warehouse tables.
This means that if DataHub can detect a pattern in the change frequency of a Snowflake, Redshift, or BigQuery table, you’ll find a recommended Smart Assertion under the Validations tab on the asset’s profile page.
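To give a flavor of the idea behind Smart Assertions, the sketch below infers an expected change window from a table's recent change history using a simple median-interval heuristic. This is purely illustrative and is not the model DataHub actually uses.

```python
from datetime import datetime, timedelta
from statistics import median

def infer_expected_window(change_times: list[datetime],
                          tolerance: float = 1.5) -> timedelta | None:
    """Infer how often a table is expected to change from its change history.

    Returns None if there is no clear pattern worth recommending an assertion for.
    """
    if len(change_times) < 5:           # not enough history to see a pattern
        return None
    times = sorted(change_times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    typical = median(gaps)
    spread = max(gaps) / max(min(gaps), 1)
    if spread > 4:                      # changes are too irregular to recommend
        return None
    # Allow some slack around the typical interval before flagging staleness.
    return timedelta(seconds=typical * tolerance)
```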
In this section, we’ll see how simple it is to set up a Freshness Monitoring Assertion for a table using the DataHub UI.
Navigate to the table to be monitored, and create a new DataHub Assertion for Freshness Monitoring using the Validations tab.
For this, you’ll need to configure the evaluation schedule: how often you expect the table to change and how frequently the assertion should be evaluated.
Lastly, you can customize the evaluation source, i.e. the mechanism used to determine whether the table has changed.
Once you set the parameters for monitoring, you can decide how you want the assertion to automatically trigger an incident when it fails. This allows you to broadcast a health issue to all relevant stakeholders.
You can also enable the Auto-Resolve option so the incident is closed automatically once the issue has passed.
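Conceptually, the incident lifecycle around a failing assertion looks something like the toy state machine below: raise an incident on the first failure, notify stakeholders, and, if Auto-Resolve is enabled, close the incident once the assertion passes again. The function names here are placeholders, not DataHub APIs.

```python
def notify_stakeholders(message: str) -> None:
    # Placeholder for your alerting channel (e.g. a Slack webhook).
    print(message)

def update_incident_state(assertion_passed: bool, incident_open: bool,
                          auto_resolve: bool = True) -> bool:
    """Return whether an incident should be open after this evaluation."""
    if not assertion_passed and not incident_open:
        notify_stakeholders("Freshness assertion failed: table is stale")
        return True                     # raise an incident on the first failure
    if assertion_passed and incident_open and auto_resolve:
        notify_stakeholders("Freshness restored: incident auto-resolved")
        return False                    # close the incident once data is fresh again
    return incident_open                # otherwise keep the current state
```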
For a detailed guide on setting up Freshness Assertions on DataHub, check out this guide to Freshness Assertions on Managed DataHub (https://datahubproject.io/docs/managed-datahub/observe/freshness-assertions/).
Depending on the result of the Assertion’s evaluation, DataHub displays a health indicator right next to the asset to show whether it is healthy or not.
There are two ways for you to stay updated on the freshness of your data warehouse using the Subscriptions & Notification feature offered by Acryl DataHub.
To be notified when things go wrong, simply subscribe to receive notifications for assertion status changes and incidents raised on the asset.
These notifications are delivered via Slack today, with support for other platforms coming soon.
DataHub goes beyond observability – it is the central source of truth for the health of your data, encompassing both technical health (e.g. day-to-day data quality) and governance and compliance health (documented purpose, classification, accountability via ownership, and so on).
Simply using the Assertions or Incidents filter in the Search feature can help you surface assets that have freshness issues.
You could even use the Observe module in Acryl DataHub to surface the health issues of your assets.
With DataHub, you have a snapshot view of the real-time health of your data that’s accessible and useful to anyone in your company – be it a marketer, a business analyst, or even the CEO.
Data integrity isn't a one-size-fits-all challenge, and that's what sets DataHub’s Freshness Assertion Monitoring apart. Unlike traditional approaches that might rely on point-in-time checks, it offers continuous and real-time monitoring in a hands-free, no-code manner.
So, whether you're crunching numbers, building dashboards, or making critical decisions, DataHub's Freshness Assertion Monitoring can help ensure that the data you rely on is fresh, accurate, and ready when you need it.
If this sounds like something your team needs, get in touch with me for a demo of Acryl DataHub (https://www.acryldata.io/sign-up).
PS: Freshness Assertion Monitoring is just the beginning. As we continue to iterate on our observability offering, we're excited to bring more data health-focused features to the table. Stay tuned for our upcoming Volume Monitoring feature, which will help you identify unexpected shifts in the row count of your most important tables.
Join us on Slack • Sign up for our Newsletter • Follow us on LinkedIn