Data Governance



Data Engineering


PII Classification just got easier with DataHub

Maggie Hays

Feb 14, 2023

Managing sensitive data lies at the core of modern data governance. Whether you’re navigating the intricacies of GDPR & CCPA or on the hook for responsibly granting data access to others, it’s critical to have a strategy around tagging sensitive data.


Photo by Philipp Katzenberger on Unsplash

DataHub’s popular Business Glossary is a powerful way to model PII and compliance types and classify data entities across your data stack. In addition to manually assigning these classifications, DataHub can now automatically classify and tag sensitive data or PII — right at ingestion — making data discovery and access seamless, scalable, and secure.

What PII classification in DataHub looks like

DataHub’s automated PII classification identifies sensitive columns and the tables containing them during ingestion, so these columns are automatically associated with predefined PII-related glossary terms.

[Figure: DataHub flow]

Currently, DataHub’s automated PII detection covers the following info types: full name, gender, email, phone, street address, credit card number, SSN (Social Security Number), driver’s license number, IBAN (International Bank Account Number), bank SWIFT code, and IP address.

TLDR: How DataHub’s automated PII classification works

Simply put, this opt-in functionality analyzes your metadata at the column level and tags each column with the PII-related glossary term (info type) detected for it, if any.

At ingestion, the DataHub PII detection module analyzes each column for the presence of an info type. For each column, it

  • checks for the presence of certain factors (referred to as Prediction Factors),
  • assigns a configurable weighting to each Prediction Factor,
  • computes an overall confidence score for the info type’s presence, and
  • proposes the relevant info type as a glossary term for that column, if the score exceeds the confidence-level threshold you set.

As a DataHub admin, you have complete control over

  • enabling the PII classification functionality
  • deciding the info types to be processed, and
  • setting the confidence-level threshold for the auto-classification of info types

Check out DataHub’s PII Classification in action in this video:

A detailed look at DataHub’s PII classification workflow

DataHub’s classifier implementation uses a standalone library to predict PII info types. The library uses the following factors (referred to as Prediction Factors) to propose the info type applicable to each column:

  • Name
  • Description
  • Datatype
  • Values

The presence of each Prediction Factor is detected using simple rule-based matching and libraries like spaCy (or other common ML libraries), and a confidence score is assigned for the presence of each Prediction Factor.
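As a rough illustration of rule-based matching, the sketch below scores two Prediction Factors for a hypothetical Email info type: the column name is checked against a regex list, and sample values are scored by the fraction that match a value regex. The function names, the patterns, and the scoring rules are assumptions made for this example; they are not taken from DataHub's classifier library.

```python
import re

# Hypothetical patterns for an "Email" info type (assumptions for illustration).
NAME_PATTERNS = [r"^.*e[-_ ]?mail.*$"]
VALUE_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"


def name_confidence(column_name, patterns):
    """Return 1.0 if any pattern fully matches the column name, else 0.0."""
    for pattern in patterns:
        if re.fullmatch(pattern, column_name, flags=re.IGNORECASE):
            return 1.0
    return 0.0


def values_confidence(sample_values, pattern):
    """Return the fraction of non-null sample values that fully match the pattern."""
    values = [str(v) for v in sample_values if v is not None]
    if not values:
        return 0.0
    matches = sum(1 for v in values if re.fullmatch(pattern, v))
    return matches / len(values)


print(name_confidence("customer_email", NAME_PATTERNS))
print(values_confidence(["a@example.com", "b@example.org", "n/a", None], VALUE_PATTERN))
```

Scoring values as a match ratio is only one possible choice; a library-based factor (for example, named-entity recognition for full names) would return its own confidence instead.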

The module then uses a customizable weighted combination of these confidence scores to compute an overall score that determines whether the proposed info type applies to the column. You can configure the weight of each Prediction Factor to control how much it influences the final score.

The resulting score is compared against a configurable threshold (0.7 in the default configuration) to determine whether the info type should be applied to the column.
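The weighted combination and threshold check can be sketched as follows. This is a minimal example that assumes the overall score is a weighted average of per-factor confidence scores in the range 0 to 1; the weights, the normalization by total weight, and the function names are illustrative assumptions, not DataHub's actual implementation.

```python
# Hypothetical weights for the four Prediction Factors (illustrative values only).
WEIGHTS = {"name": 0.4, "description": 0.0, "datatype": 0.0, "values": 0.6}
THRESHOLD = 0.7  # the default threshold mentioned in the text above


def overall_confidence(factor_scores, weights):
    """Weighted average of per-factor confidence scores, each in [0, 1].

    Factors missing from factor_scores are treated as a score of 0.0.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[f] * factor_scores.get(f, 0.0) for f in weights)
    return weighted / total_weight


def should_apply(factor_scores, weights, threshold):
    """True if the overall confidence exceeds the threshold."""
    return overall_confidence(factor_scores, weights) > threshold


strong = {"name": 1.0, "values": 0.9}  # name matches, most values match
weak = {"name": 0.0, "values": 0.5}    # name does not match, half the values match
print(should_apply(strong, WEIGHTS, THRESHOLD))
print(should_apply(weak, WEIGHTS, THRESHOLD))
```

With these example weights, a column whose name does not match can never pass a 0.7 threshold on values alone, which shows how the weights and the threshold together control how aggressive the classification is.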

Configuring classification info types

As a DataHub admin, you can customize your YAML recipe to configure how each info type is automatically classified during ingestion.

All you need to do is configure the following parameters as they apply to your use case:

  • Prediction Factor weights: the weight given to each Prediction Factor in the final computation of the info type classification score
  • Name: the regex list to be matched against the column name
  • Description: the regex list to be matched against the column description
  • Datatype: the datatypes to be matched against the column datatype
  • Prediction Type: regex or library, which determines how column values are evaluated
      • regex: the regex list to be matched against column values
      • library: the name of the library used to evaluate column values

Here’s an example:

Code snippet

Using DataHub’s PII Classification Module

To use classification, add the classification section to your recipe and enable it.

Code snippet

Here’s an example of how you can customize and configure your recipe for auto-classifying the ‘Email’ info type based on the criteria and confidence threshold you set.

Code snippet

To understand how you can use more advanced configurations for your info types, check out our Classification Feature Guide.

What’s next?

DataHub’s PII classification feature is currently available for Snowflake; we are excited to extend it to other SQL-based sources and are eager for feedback from the Community about how we can improve the integration experience.

We’re looking for contributors — join the DataHub Community to make this happen!

Connect with DataHub

Join us on Slack | Sign up for our Newsletter | Follow us on Twitter
