Data Engineering

Metadata Management

Data Governance

Preventing Data Freshness Problems with Acryl Observe

John Joyce

Sep 11, 2023

Imagine this...

You’re a data engineer who manages a Snowflake table that tracks all user click events from your company’s e-commerce website. This data originates from an online web application upstream of Snowflake, and lands in your data warehouse every day. At least it’s supposed to.

One day, someone introduces a bug in the upstream ETL job that copies the click event data to the data warehouse, causing the pipeline to copy zero click events for the day. Suddenly, your Snowflake table is missing crucial click event data for yesterday, and you don’t have a clue.

Next thing you know, you receive a frantic Slack message from your Head of Product. She’s puzzled by the site usage dashboard that shows zero user views or purchases on the previous day!

Oops.

As data engineers, it’s all too common for end users – decision makers looking at internal dashboards or worse, users navigating our product – to be the first ones to discover data issues when they happen.

This is because some types of issues are particularly hard to detect in advance, like missing data from upstream sources.

The Data Freshness Problem Is Real

Unfortunately, we’ve all been there.

The data you depend on (or are responsible for!) went days, weeks, or even months before someone noticed that it wasn’t being updated as often as it should be.

Maybe it was due to an application code bug, a seemingly harmless SQL query change, or someone leaving the company. There are many reasons why a table on Snowflake, Redshift, or BigQuery may fail to get updated as frequently as stakeholders expect.

Such issues can have severe consequences: inaccurate insights, misinformed decision-making, and bad user experience to name a few.

For this reason, it’s critical that organizations try to get ahead of these types of issues, with a focus on protecting the most mission-critical data assets.

What if you could reduce the time to detect data freshness incidents?

What if you could *continuously* monitor the freshness status of your data and catch issues before they reach anyone else?

Introducing Freshness Monitoring on Acryl DataHub

While many data catalogs disregard the freshness problem altogether, we at Acryl believe the central data catalog should be the single source of truth for the technical health and the governance or compliance health of a data asset – a one-stop-shop for establishing trust in data, suitable for use by your entire organization.

Based on this belief, and our deep experience extracting metadata from systems like Snowflake, BigQuery, and Redshift, we felt well-positioned to tackle the data freshness challenge.

Through many conversations with our customers and community, we honed our initial approach and built the Freshness Assertion monitoring feature on Acryl DataHub.

With Acryl Freshness Assertions, data producers or consumers can

  1. Define expectations about when a particular table should change
  2. Continuously monitor those expectations over time
  3. Get notified when things go wrong

It's like having an automated watchdog that continuously ensures that your most important tables are being updated on time, and that alerts you when things go off-track.

Let’s take a closer look at how Acryl Observe Freshness Assertions work.

The Tricky Bit: Determining Whether a Table Has Changed

Monitoring the freshness of your warehouse tables might sound straightforward, but it’s a bit more complicated than initially meets the eye.

The first challenge is determining what constitutes a “change”.

Is it an INSERT operation? What about a DELETE, even if the total number of rows hasn’t changed? Is it based on rows being explicitly added or removed? Or perhaps on the presence of rows with a new, higher value than has previously been observed in the table?

More specifically, to detect a change we may rely on:

  • An audit log, which contains information about the operations performed on each table
  • An information schema, which contains live database and table information
  • A last modified column, storing the last modification time for a given row
  • A high watermark column that has increasing values each time new data is introduced to a table (any continuously increasing value)

So what is the right way to determine whether a table has changed?

The answer: it depends.

Selecting the right approach hinges on your data consumer's expectations.

Each scenario needs a different approach. For instance, if you expect new date partitions daily, the high watermark column would be the right choice. If any change (INSERT, UPDATE, DELETE) is valid, the information schema might be suitable. If you already track row changes using a last modified timestamp column, that could be the simplest and most accurate option.
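
To make the difference between two of these approaches concrete, here is a minimal Python sketch. It is not how Acryl DataHub implements change detection; the in-memory `rows` list, the column names `event_date` and `last_modified`, and both helper functions are hypothetical stand-ins for queries you would run against your warehouse.

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for a warehouse table. Each row carries a
# last-modified timestamp and a date partition used as a high watermark.
rows = [
    {"event_date": "2023-09-09", "last_modified": datetime(2023, 9, 9, 2, 0)},
    {"event_date": "2023-09-10", "last_modified": datetime(2023, 9, 10, 2, 0)},
]


def changed_by_last_modified(rows, window_start, window_end):
    """True if any row was touched inside [window_start, window_end]."""
    return any(window_start <= r["last_modified"] <= window_end for r in rows)


def changed_by_high_watermark(rows, previous_max):
    """True if the table now holds a value above the previously seen maximum."""
    current_max = max((r["event_date"] for r in rows), default=None)
    if current_max is None:
        return False  # an empty table has no new high watermark
    return previous_max is None or current_max > previous_max


now = datetime(2023, 9, 11, 3, 0)

# No row was modified in the last 24 hours, so the table looks stale.
stale_by_timestamp = not changed_by_last_modified(
    rows, now - timedelta(hours=24), now
)

# The newest partition is still 2023-09-10, the value seen at the last check.
stale_by_watermark = not changed_by_high_watermark(rows, previous_max="2023-09-10")
```

In this example both checks flag the table as stale, but they answer different questions: the timestamp check asks "was anything touched recently?", while the watermark check asks "did a new partition arrive?".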

Our conversations with customers and partners revealed the need for configurability and customizability across these approaches.

The outcome of these conversations is the Freshness Assertion—a configurable Data Quality rule that monitors your table over time to detect deviations from the anticipated update schedule.

This ensures that you and your team are the first to know when freshness issues inevitably arise.

The Anatomy of an Acryl Freshness Assertion

Acryl DataHub supports creating and scheduling ‘Freshness Assertions’ to monitor the freshness of the most important tables in your warehouse.

What exactly is a Freshness Assertion in DataHub?

A Freshness Assertion is a configurable data quality rule used to determine if a table in the data warehouse has been updated within a given period. It is particularly useful when you have frequently changing tables.

At the most basic level, a Freshness Assertion consists of:

  • An evaluation schedule: This defines how often to check a given warehouse table for new updates. This is usually configured to match the expected change frequency of the table, although you can choose to evaluate it more frequently.
  • A change window: This defines the window of time that is used when determining whether a change has been made to a table.
  • A change source: This is the mechanism that Acryl DataHub should use to determine whether the table has changed. The options are:
    • Audit Log (Default): A metadata API or table that is exposed by the data warehouse which contains information about the operations that have been performed on each table.
    • Information Schema: A system table exposed by the data warehouse that contains live information about the databases and tables stored inside the warehouse.
    • Last Modified Column: A Date or Timestamp column that represents the last time that a specific row was touched or updated. Adding a Last Modified Column to each warehouse table is a pattern often used for existing use cases around change management.
    • High Watermark Column: A column that contains a continuously increasing value, like a date, a time, or any other such value.

Figure: DataHub Change Source for Freshness Monitoring

Using the Last Modified Column or High Watermark approach is especially useful when you want to monitor for specific types of changes, e.g. special inserts or updates, for a table.
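
One way to picture the three parts of a Freshness Assertion is as a small configuration object. The sketch below is purely illustrative: the class name, field names, cron-style schedule string, and table name are assumptions made for this example and are not the DataHub API or its data model.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class ChangeSource(Enum):
    """The four change sources described above (names are illustrative)."""
    AUDIT_LOG = "audit_log"
    INFORMATION_SCHEMA = "information_schema"
    LAST_MODIFIED_COLUMN = "last_modified_column"
    HIGH_WATERMARK_COLUMN = "high_watermark_column"


@dataclass(frozen=True)
class FreshnessAssertionSketch:
    """Hypothetical container for the parts of a Freshness Assertion."""
    table: str
    evaluation_schedule: str          # assumed here to be a cron expression
    change_window: timedelta          # how far back to look for a change
    change_source: ChangeSource = ChangeSource.AUDIT_LOG
    column: Optional[str] = None      # only needed for column-based sources

    def __post_init__(self):
        # Column-based sources are meaningless without a column to inspect.
        column_sources = (
            ChangeSource.LAST_MODIFIED_COLUMN,
            ChangeSource.HIGH_WATERMARK_COLUMN,
        )
        if self.change_source in column_sources and not self.column:
            raise ValueError(f"{self.change_source.value} requires a column name")


# Check a (hypothetical) click events table every day at 08:00 for a change
# in the preceding 24 hours, using the default audit log source.
daily_clicks = FreshnessAssertionSketch(
    table="analytics.click_events",
    evaluation_schedule="0 8 * * *",
    change_window=timedelta(hours=24),
)
```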

There’s more.

As part of the Acryl Observe module, DataHub also comes with Smart Assertions, which are AI-powered Freshness Assertions that you can use out of the box to monitor the freshness of important warehouse tables.

This means that if DataHub can detect a pattern in the change frequency of a Snowflake, Redshift, or BigQuery table, you'll find recommended Smart Assertions for frequently changing tables under the Validations tab on the asset’s profile page.

Using Freshness Assertions to Monitor Tables on Your Data Warehouse

In this section, we’ll see how simple it is to set up a Freshness Monitoring Assertion for a table using the DataHub UI.

Step 1: Creating the Assertion

Navigate to the table to be monitored, and create a new DataHub Assertion for Freshness Monitoring using the Validations tab.

Step 2: Configuring the Assertion evaluation parameters


For this, you’ll need to configure the

  • Evaluation schedule: This is the frequency at which the table will be checked for changes (your expectation about how often the table should be updated).
  • Evaluation period: This defines the period between subsequent evaluations of the check. You can:
    • Check whether the table has changed in a specific window of time
    • Check whether the table has changed between subsequent evaluations of the check

Lastly, you can customize the evaluation source to configure the mechanism you want to use to evaluate the check.
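
The two evaluation period options boil down to how the start of the look-back window is chosen. Here is a rough Python sketch of that choice; the function name and parameters are hypothetical and only illustrate the idea, not DataHub's internals.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def evaluation_window(
    now: datetime,
    fixed_window: Optional[timedelta] = None,
    last_evaluated_at: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return the (start, end) interval in which a change is expected.

    - With fixed_window, look back a specific amount of time from now.
    - Otherwise, look back to the previous evaluation of the check.
    """
    if fixed_window is not None:
        return now - fixed_window, now
    if last_evaluated_at is not None:
        return last_evaluated_at, now
    raise ValueError("provide either fixed_window or last_evaluated_at")


now = datetime(2023, 9, 11, 8, 0)

# Option 1: has the table changed in the last 6 hours?
fixed = evaluation_window(now, fixed_window=timedelta(hours=6))

# Option 2: has the table changed since the check last ran (yesterday at 08:00)?
since_last = evaluation_window(now, last_evaluated_at=datetime(2023, 9, 10, 8, 0))
```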

Step 3: Triggering an Incident

Once you set the parameters for monitoring, you can decide how you want the assertion to automatically trigger an incident when it fails. This allows you to broadcast a health issue to all relevant stakeholders.

You can also enable the Auto-Resolve option so that the incident is resolved automatically once the assertion passes again.
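
The relationship between assertion results and incidents can be thought of as a tiny state machine. The sketch below is a simplified mental model under the assumption that a failure raises at most one open incident and that auto-resolve closes it on the next pass; `IncidentTracker` is a hypothetical class, not part of DataHub.

```python
from dataclasses import dataclass


@dataclass
class IncidentTracker:
    """Hypothetical model of incident handling for a single assertion."""
    auto_resolve: bool = True
    incident_open: bool = False

    def record(self, assertion_passed: bool) -> str:
        """Apply one assertion result and report what happened to the incident."""
        if not assertion_passed and not self.incident_open:
            self.incident_open = True
            return "raised"
        if assertion_passed and self.incident_open and self.auto_resolve:
            self.incident_open = False
            return "resolved"
        # Repeated failures, passes with no open incident, or auto-resolve off.
        return "unchanged"


tracker = IncidentTracker(auto_resolve=True)
# Results of four consecutive evaluations: pass, fail, fail, pass.
history = [tracker.record(passed) for passed in (True, False, False, True)]
```

With auto-resolve turned off in this model, the final pass would leave the incident open until someone closes it by hand.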

For a detailed guide on setting up Freshness Assertions on DataHub, check out this guide to Freshness Assertions on Managed DataHub (https://datahubproject.io/docs/managed-datahub/observe/freshness-assertions/).

Depending on the result of the Assertion, DataHub displays a health indicator right next to the asset to show whether it is healthy or not.

Staying on Top of Data Freshness Issues with Subscriptions and Notifications

There are two ways for you to stay updated on the freshness of your data warehouse using the Subscriptions & Notifications feature offered by Acryl DataHub.


To be notified when things go wrong, simply subscribe to receive notifications for

  • Assertion status changes: Get notified when an Assertion fails or passes for a specific table
  • Incident status changes: Get notified when an Incident is raised or closed for a specific table

These notifications are available via Slack, with support for other platforms coming soon.

DataHub as a Health Indicator for Your Data

DataHub goes beyond observability: it is the central source of truth for the health of your data, encompassing both technical health (e.g. day-to-day data quality) and governance or compliance health (e.g. documented purpose, classification, and accountability via ownership).

Simply using the Assertions or Incidents filter in the Search feature can help you surface assets that have freshness issues.

You could even use the Observe module on Acryl DataHub to surface the health issues of your assets.

With DataHub, you have a snapshot view of the real-time health of your data that’s accessible and useful to anyone in your company – be it a marketer, a business analyst, or even the CEO.

Experience a Fresh Approach to Data Quality and Integrity

Data integrity isn't a one-size-fits-all challenge, and that's what sets DataHub’s Freshness Assertion Monitoring apart. Unlike traditional approaches that might rely on point-in-time checks, it offers continuous and real-time monitoring in a hands-free, no-code manner.

So, whether you're crunching numbers, building dashboards, or making critical decisions, DataHub's Freshness Assertion Monitoring can help ensure that

  • You're informed of freshness issues as soon as they occur
  • You prevent downstream data users from encountering data issues first

If this sounds like something your team needs, get in touch with me for a demo of Acryl DataHub (https://www.acryldata.io/sign-up).

PS: Freshness Assertion Monitoring is just the beginning. As we continue to iterate on our observability offering, we're excited to bring more data health-focused features to the table. Stay tuned for our upcoming Volume Monitoring feature, which will help you identify any unexpected shifts in the row count of your most important tables.
