
Preventing Data Freshness Problems with Acryl Observe

Data Engineering

Metadata Management

Data Governance

John Joyce

Sep 11, 2023


Imagine this...

You’re a data engineer who manages a Snowflake table that tracks all user click events from your company’s e-commerce website. This data originates from an online web application upstream of Snowflake, and lands in your data warehouse every day. At least it’s supposed to.

One day, someone introduces a bug in the upstream ETL job that copies the click event data to the data warehouse, causing the pipeline to copy zero click events for the day. Suddenly, your Snowflake table is missing crucial click event data for yesterday, and you don’t have a clue.

Next thing you know, you receive a frantic Slack message from your Head of Product. She’s puzzled by the site usage dashboard that shows zero user views or purchases on the previous day!

Oops.

As data engineers, it’s all too common for end users – decision makers looking at internal dashboards or worse, users navigating our product – to be the first ones to discover data issues when they happen.

This is because some types of issues are particularly hard to detect in advance, like missing data from upstream sources.


The Data Freshness Problem Is Real

Unfortunately, we’ve all been there.

The data you depend on (or are responsible for!) went days, weeks, or even months without anyone noticing that it wasn’t being updated as often as it should.

Maybe it was due to an application code bug, a seemingly harmless SQL query change, or someone leaving the company. There are many reasons why a table on Snowflake, Redshift, or BigQuery may fail to get updated as frequently as stakeholders expect.

Such issues can have severe consequences: inaccurate insights, misinformed decision-making, and bad user experience to name a few.

For this reason, it’s critical that organizations try to get ahead of these types of issues, with a focus on protecting the most mission-critical data assets.

What if you could reduce the time to detect data freshness incidents?

What if you could *continuously* monitor the freshness status of your data and catch issues before they reach anyone else?

Introducing Freshness Monitoring on Acryl DataHub

While many data catalogs disregard the freshness problem altogether, we at Acryl believe the central data catalog should be the single source of truth for the technical health and the governance or compliance health of a data asset – a one-stop-shop for establishing trust in data, suitable for use by your entire organization.

Based on this belief, and our deep experience extracting metadata from systems like Snowflake, BigQuery, and Redshift, we felt well-positioned to tackle the data freshness challenge.

Through many conversations with our customers and community, we honed our initial approach and built the Freshness Assertion monitoring feature on Acryl DataHub.

With Acryl Freshness Assertions, data producers or consumers can:

  1. Define expectations about when a particular table should change
  2. Continuously monitor those expectations over time
  3. Get notified when things go wrong

It's like having an automated watchdog that continuously ensures that your most important tables are being updated on time, and that alerts you when things go off-track.

Let’s take a closer look at how Acryl Observe Freshness Assertions work.

The Tricky Bit: Determining Whether a Table Has Changed

Monitoring the freshness of your warehouse tables might sound straightforward, but it’s a bit more complicated than initially meets the eye.

The first challenge is determining what constitutes a “change”.

Is it an INSERT operation? What about a DELETE, even if the total number of rows hasn’t changed? Is it based on rows being explicitly added or removed? Or perhaps the presence of rows with a new, higher value than has previously been observed in the table?

The second challenge is detecting that change. Depending on the warehouse and the table, we may rely on:

  • An audit log, which contains information about the operations performed on each table
  • An information schema, which contains live database and table information
  • A last modified column, storing the last modification time for a given row
  • A high watermark column that has increasing values each time new data is introduced to a table (any continuously increasing value)
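To make the two column-based options concrete, here is a minimal Python sketch (not Acryl's implementation; column names and sample rows are illustrative) of how a last modified column and a high watermark column each answer the question "has this table changed?":

```python
from datetime import datetime

# Sample rows standing in for a warehouse table; in practice these
# checks would be issued as SQL against Snowflake, Redshift, or BigQuery.
rows = [
    {"event_id": 101, "last_modified": datetime(2023, 9, 9, 8, 0)},
    {"event_id": 102, "last_modified": datetime(2023, 9, 10, 8, 5)},
]

def changed_since_last_modified(rows, window_start):
    """Last modified column: was any row touched after window_start?"""
    return any(r["last_modified"] > window_start for r in rows)

def changed_since_high_watermark(rows, previous_watermark):
    """High watermark column: has the max of an increasing column grown?"""
    current = max(r["event_id"] for r in rows)
    return current > previous_watermark, current

window_start = datetime(2023, 9, 10)
print(changed_since_last_modified(rows, window_start))
print(changed_since_high_watermark(rows, previous_watermark=101))
```

Note the difference in state: the last-modified check only needs the start of the window, while the high-watermark check must remember the watermark observed at the previous evaluation.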

So what is the right way to determine whether a table has changed?

The answer: it depends.

Selecting the right approach hinges on your data consumer's expectations.

Each scenario needs a different approach. For instance, if you expect new date partitions daily, the high watermark column would be the right choice. If any change (INSERT, UPDATE, DELETE) is valid, the information schema might be suitable. If you already track row changes using a last modified timestamp column, that could be the simplest and most accurate option.

Our conversations with customers and partners revealed the need for configurability and customizability across these approaches.

The outcome of these conversations is the Freshness Assertion—a configurable Data Quality rule that monitors your table over time to detect deviations from the anticipated update schedule.

This ensures that you and your team are the first to know when freshness issues inevitably arise.

The Anatomy of an Acryl Freshness Assertion

Acryl DataHub supports creating and scheduling ‘Freshness Assertions’ to monitor the freshness of the most important tables in your warehouse.

What exactly is a Freshness Assertion in DataHub?

A Freshness Assertion is a configurable data quality rule used to determine if a table in the data warehouse has been updated within a given period. It is particularly useful when you have frequently changing tables.

At the most basic level, a Freshness Assertion consists of:

  • An evaluation schedule: This defines how often to check a given warehouse table for new updates. This is usually configured to match the expected change frequency of the table, although you can choose to evaluate it more frequently.
  • A change window: This defines the window of time that is used when determining whether a change has been made to a Table.
  • A change source: This is the mechanism that Acryl DataHub should use to determine whether the table has changed. Supported sources include:
    • Audit Log (Default): A metadata API or table exposed by the data warehouse which contains information about the operations that have been performed on each table.
    • Information Schema: A system table exposed by the data warehouse that contains live information about the databases and tables stored inside the warehouse.
    • Last Modified Column: A Date or Timestamp column that represents the last time a specific row was touched or updated. Adding a Last Modified Column to each warehouse table is a pattern often used for existing change-management use cases.
    • High Watermark Column: A column that contains a continuously increasing value, like a date, a time, or any other such value.
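The anatomy above can be modeled as a small data structure. This is a hypothetical sketch for illustration only, not Acryl's actual API; the field and class names are assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ChangeSource(Enum):
    """The four change-source mechanisms described above."""
    AUDIT_LOG = "audit_log"                      # warehouse operation history (default)
    INFORMATION_SCHEMA = "information_schema"    # live table metadata
    LAST_MODIFIED_COLUMN = "last_modified_column"
    HIGH_WATERMARK_COLUMN = "high_watermark_column"

@dataclass
class FreshnessAssertion:
    table: str
    evaluation_schedule: str        # cron-style, e.g. evaluate every morning
    change_window_hours: int        # how far back to look for a change
    change_source: ChangeSource = ChangeSource.AUDIT_LOG
    source_column: Optional[str] = None  # needed for the column-based sources

# Example: expect the click-events table to change at least once a day.
clicks_fresh = FreshnessAssertion(
    table="analytics.user_click_events",
    evaluation_schedule="0 9 * * *",
    change_window_hours=24,
    change_source=ChangeSource.HIGH_WATERMARK_COLUMN,
    source_column="event_date",
)
print(clicks_fresh.change_source.value)
```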

Using the Last Modified Column or High Watermark approach is especially useful when you want to monitor for specific types of changes, e.g. special inserts or updates, for a table.

There’s more.

As part of the Acryl Observe module, DataHub also comes with Smart Assertions, which are AI-powered Freshness Assertions that you can use out of the box to monitor the freshness of important warehouse tables.

This means that if DataHub can detect a pattern in the change frequency of a Snowflake, Redshift, or BigQuery table, you'll find recommended Smart Assertions for frequently changing tables under the Validations tab on the asset’s profile page.

Using Freshness Assertions to Monitor Tables on Your Data Warehouse

In this section, we’ll see how simple it is to set up a Freshness Monitoring Assertion for a table using the DataHub UI.

Step 1: Creating the Assertion

Navigate to the table to be monitored, and create a new DataHub Assertion for Freshness Monitoring using the Validations tab.

Step 2: Configuring the Assertion evaluation parameters


For this, you’ll need to configure the following:

  • Evaluation schedule: The frequency at which the table will be checked for changes (your expectation about how often the table should be updated).
  • Evaluation period: The window of time considered in each evaluation of the check. You can either:
    • Check whether the table has changed in a specific window of time, or
    • Check whether the table has changed between subsequent evaluations of the check.
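The two evaluation-period options above differ only in how the inspection window is anchored. A minimal sketch (function and mode names are assumptions, not Acryl's API):

```python
from datetime import datetime, timedelta

def window_for_evaluation(now, mode, window_hours=None, last_evaluation=None):
    """Return the (start, end) time range a freshness check inspects.

    mode="fixed":      look back a fixed number of hours from now.
    mode="since_last": look back to the previous evaluation of the check.
    """
    if mode == "fixed":
        return now - timedelta(hours=window_hours), now
    if mode == "since_last":
        return last_evaluation, now
    raise ValueError(f"unknown mode: {mode}")

now = datetime(2023, 9, 11, 9, 0)
# A fixed 24-hour window and a "since last run" window coincide when the
# check runs exactly once a day, but diverge if an evaluation is delayed.
print(window_for_evaluation(now, "fixed", window_hours=24))
print(window_for_evaluation(now, "since_last",
                            last_evaluation=datetime(2023, 9, 10, 9, 0)))
```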

Lastly, you can customize the evaluation source to configure the mechanism you want to use to evaluate the check.

Step 3: Triggering an Incident

Once you set the parameters for monitoring, you can decide how you want the assertion to automatically trigger an incident when it fails. This allows you to broadcast a health issue to all relevant stakeholders.

You can also enable the Auto-Resolve option to automatically close the incident once the issue has passed.
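The incident lifecycle this enables is a simple state machine: a failing assertion opens an incident, and with auto-resolve on, a subsequent pass closes it. A sketch of that logic (illustrative only, not Acryl's implementation):

```python
def update_incident(assertion_passed, incident_open, auto_resolve=True):
    """Decide the incident state after one assertion evaluation."""
    if not assertion_passed:
        return True                 # open (or keep open) an incident
    if incident_open and auto_resolve:
        return False                # assertion recovered: auto-resolve
    return incident_open            # leave manually managed incidents alone

# A failure opens an incident; the next pass auto-resolves it.
state = False
state = update_incident(assertion_passed=False, incident_open=state)
state = update_incident(assertion_passed=True, incident_open=state)
print(state)
```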

For a detailed guide on setting up Freshness Assertions on DataHub, check out this guide to Freshness Assertions on Managed DataHub (https://datahubproject.io/docs/managed-datahub/observe/freshness-assertions/).

Depending on the evaluation of the Assertion, DataHub displays a health indicator right next to the asset to show whether it is healthy or not.

Staying on Top of Data Freshness Issues with Subscriptions and Notifications

There are two ways for you to stay updated on the freshness of your data warehouse using the Subscriptions & Notification feature offered by Acryl DataHub.


To be notified when things go wrong, simply subscribe to receive notifications for

  • Assertion status changes: Get notified when an Assertion fails or passes for a specific table
  • Incident status changes: Get notified when an Incident is raised or closed for a specific table

These notifications are delivered via Slack, with support for additional platforms coming soon.

DataHub as a Health Indicator for Your Data

DataHub goes beyond observability – it is the central source of truth for the health of your data – encompassing both technical health (e.g. day-to-day data quality) and governance or compliance health (documented purpose, classification, accountability via ownership, and so on).

Simply using the Assertions or Incidents filter in the Search feature can help you surface assets that have freshness issues.

You could even use the Observe module on Acryl DataHub to surface the health issues of your assets.

With DataHub, you have a snapshot view of the real-time health of your data that’s accessible and useful to anyone in your company – be it a marketer, a business analyst, or even the CEO.

Experience a Fresh Approach to Data Quality and Integrity

Data integrity isn't a one-size-fits-all challenge, and that's what sets DataHub’s Freshness Assertion Monitoring apart. Unlike traditional approaches that might rely on point-in-time checks, it offers continuous and real-time monitoring in a hands-free, no-code manner.

So, whether you're crunching numbers, building dashboards, or making critical decisions, DataHub's Freshness Assertion Monitoring can help ensure that

  • You're informed of freshness issues as soon as they occur
  • You prevent downstream data users from encountering data issues first

If this sounds like something your team needs, get in touch with me for a demo of Acryl DataHub (https://www.acryldata.io/sign-up).

PS: Freshness Assertion Monitoring is just the beginning. As we continue to iterate on our observability offering, we're excited to bring more data health-focused features to the table. Watch this space to stay tuned for our upcoming Volume Monitoring feature that will help you identify any unexpected shifts in the row count of your most important tables.

Connect with DataHub

Join us on Slack · Sign up for our Newsletter · Follow us on LinkedIn

