DataHub Project Updates

Data Engineering

Metadata

Open Source

Data Governance

Project Updates

Shirshanka Das

Nov 3, 2021

DataHub Sep 2021 Project Update

Introduction

We’re back with our eighth post covering monthly project updates for the open-source metadata platform, LinkedIn DataHub. This post captures the developments for the month of September 2021. To read the August 2021 edition, go here. To learn about DataHub’s latest developments, read on!

Community Update

September was an exciting month for the DataHub Community. We welcomed Maggie Hays as the DataHub Community Product Manager (learn more about her journey here) and our Slack community grew by 250 members (1,350 total!). We saw 18 contributors from 11 companies contribute to the DataHub project, and 68 people joined our DataHub community town hall. There’s so much momentum growing within this group and I know we will have a strong Q4 to round out 2021.

Project Update

We had 112 commits in September, continuing our 100+ commits/month rate. We had contributions from 18 different contributors from 11 companies (3 new contributors!).

The September 2021 town hall had 68 attendees. We saw demos of the new Faceted Search experience, Stateful Ingestion, and improvements to the Looker Connector, plus a case study from the team at Adevinta about why they are adopting DataHub within their company. Join us on Slack and subscribe to our YouTube channel to keep up to date with this fast-growing project.

Read on to find out more about the September highlights!

Product and Integration Improvements

We saw some very exciting improvements in the DataHub user experience. Let’s dig in!

Improvements to Glossary Term management in the UI

As a reminder, we partnered with Saxo Bank this summer to introduce Glossary Terms as a way to manage governed terms via YAML and to link terms to related entities within DataHub. This is separate from the existing free-form Tags, which provide more flexibility for logical grouping of similar entities as use cases evolve.

Based on community feedback, we’ve made it possible for DataHub users to add and remove Glossary Terms via the DataHub UI. We’ve also visually separated Tags and Glossary Terms to emphasize the difference between the two and to minimize confusion.

DataHub Tags and Glossary Terms

Support for Redshift Usage

We now support the ingestion of query history and usage stats for Redshift in addition to Snowflake and BigQuery. This helps DataHub users to better understand the popularity and relevance of datasets during discovery by displaying top users, monthly query count, and recent queries.
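To make these usage statistics concrete, here is a minimal sketch of how top users, monthly query counts, and recent queries could be derived from query-history records. The `QueryEvent` record, the `summarize_usage` function, and the sample rows are hypothetical and exist only for illustration; this is not DataHub's ingestion code.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryEvent:
    """One row of query history (hypothetical shape, for illustration only)."""
    dataset: str
    user: str
    sql: str
    timestamp: datetime


def summarize_usage(events, dataset, top_n=3, recent_n=2):
    """Aggregate query history for one dataset into discovery-friendly stats."""
    rows = [e for e in events if e.dataset == dataset]
    # Rank users by how many queries they ran against the dataset.
    top_users = [user for user, _ in Counter(e.user for e in rows).most_common(top_n)]
    # Bucket queries by calendar month, e.g. "2021-09".
    monthly = Counter(e.timestamp.strftime("%Y-%m") for e in rows)
    # Newest queries first.
    newest_first = sorted(rows, key=lambda e: e.timestamp, reverse=True)
    return {
        "top_users": top_users,
        "monthly_query_count": dict(monthly),
        "recent_queries": [e.sql for e in newest_first[:recent_n]],
    }


# Made-up sample history for two datasets.
events = [
    QueryEvent("sales.orders", "alice", "SELECT count(*) FROM sales.orders", datetime(2021, 8, 30)),
    QueryEvent("sales.orders", "alice", "SELECT * FROM sales.orders LIMIT 10", datetime(2021, 9, 2)),
    QueryEvent("sales.orders", "bob", "SELECT id FROM sales.orders", datetime(2021, 9, 15)),
    QueryEvent("sales.users", "carol", "SELECT * FROM sales.users", datetime(2021, 9, 20)),
]
usage_summary = summarize_usage(events, "sales.orders")
```

A real connector would read these records from the warehouse's own query logs rather than from an in-memory list.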

Looker Integration Improvements

The DataHub and Looker integration is only getting better! DataHub v0.8.14.2 introduced:

  • Fixes to view naming conventions to resolve naming collisions
  • Improvements to extracting Explores from LookML and Owners from the Looker API
  • Organization of DataHub entities to mimic Looker’s structure of Explores & Views

Watch the video below to check it out:

Primary Key/Foreign Key Mapping

We recently rolled out support to display primary key and foreign key relationships that are defined within any data store that supports these constraints. Want to see more? Check out this demo from Gabe Lyons (Acryl Data), or see it live in the DataHub Demo site here.

Developer Tooling and Operations Improvements

Official Release of our GraphQL API

September marked the launch of our GraphQL API! This will be the primary programmatic interface for the metadata graph, and is where we will be building out an ecosystem of client SDKs to make it very easy to interact with DataHub wherever you might be.

Check out our GraphQL API docs here for rich documentation on all GraphQL queries, mutations, and types.
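As a rough illustration of what calling the API from a script could look like, the sketch below builds an HTTP POST carrying a GraphQL search query without sending it. The `/api/graphql` path, the `localhost:9002` base URL, and the exact shape of the `search` query are assumptions here, and `build_search_request` is a hypothetical helper; please confirm the field names and endpoint against the GraphQL API docs for your deployment.

```python
import json
import urllib.request

# Assumption: the GraphQL endpoint is served under this path on your deployment.
GRAPHQL_PATH = "/api/graphql"

# Assumption: the schema exposes a `search` query with this input shape.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(base_url, text, entity_type="DATASET", count=10):
    """Build (but do not send) an HTTP POST carrying a GraphQL search query."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": entity_type, "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        base_url.rstrip("/") + GRAPHQL_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local instance; replace with your own base URL.
request = build_search_request("http://localhost:9002", "orders")

# To execute against a running instance (authentication not shown):
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Keeping request construction separate from sending makes the query easy to inspect and test before pointing it at a live instance.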

Additional Improvements

  • DataHub CLI now supports env variables, so there is no more sitting at your terminal and confirming all prompts
  • Common data platforms are bootstrapped on startup, so logos are available as soon as you ingest metadata
  • Frontend and backend monitoring is built out through Prometheus + Grafana; watch Dexter Lee (Acryl Data) give a demo here

New User Experience: Faceted Search

During the September 2021 Community Town Hall, Gabe Lyons from Acryl Data gave us a walk-through of the new, faceted search experience within DataHub.

Our main goal was to help folks find the information they need in as few clicks as possible and to condense search into a unified experience across all entity types (Datasets, Pipelines, Dashboards, etc.).

Instead of separating search results by entity types, they are now blended together so the top-ranked results will appear first. Users can refine their search by filtering by entity type, platform, environment, tags, and more.
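The blended ranking and facet filtering described above can be sketched in a few lines. This is only an illustration of the idea, not DataHub's search implementation: `SearchHit`, `faceted_search`, `facet_counts`, and the relevance scores are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchHit:
    """A search result of any entity type (hypothetical shape)."""
    name: str
    entity_type: str
    platform: str
    score: float
    tags: frozenset = frozenset()


def faceted_search(hits, entity_type=None, platform=None, tag=None):
    """Apply any selected facet filters, then rank all entity types together."""
    def matches(hit):
        return (
            (entity_type is None or hit.entity_type == entity_type)
            and (platform is None or hit.platform == platform)
            and (tag is None or tag in hit.tags)
        )
    # One blended list: the highest-scoring result comes first, whatever its type.
    return sorted((h for h in hits if matches(h)), key=lambda h: h.score, reverse=True)


def facet_counts(hits, facet):
    """Count hits per facet value, as a UI might show next to each filter."""
    return dict(Counter(getattr(h, facet) for h in hits))


# Made-up results with made-up relevance scores.
hits = [
    SearchHit("orders", "DATASET", "snowflake", 0.72, frozenset({"pii"})),
    SearchHit("orders_dashboard", "DASHBOARD", "looker", 0.91),
    SearchHit("orders_etl", "PIPELINE", "airflow", 0.55),
    SearchHit("orders_raw", "DATASET", "kafka", 0.40),
]
blended = [h.name for h in faceted_search(hits)]
datasets_only = [h.name for h in faceted_search(hits, entity_type="DATASET")]
type_counts = facet_counts(hits, "entity_type")
```

With no filters selected, the dashboard outranks the datasets because results are ordered by score alone; selecting the `DATASET` facet narrows the same ranked list rather than switching to a separate per-type view.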

Watch the full demo here:

Case Study: DataHub + Adevinta

Iker Martinez de Apellaniz joined our Town Hall to share Adevinta’s journey in evaluating and adopting DataHub to support their international family of localized digital marketplaces. Adevinta has thousands of data assets spanning Kafka, S3, Athena & Redshift, plus a legacy data inventory system to assist with self-service access requests, GDPR compliance, and managing data ownership.

DataHub Architecture at Adevinta

Introducing DataHub to the company offered additional functionality: a way to manage data lineage, documentation, glossary terms, and dashboards while providing robust search functionality and indicators of data health from their thriving community of data practitioners.

Watch Iker’s full presentation here:

New Functionality: Stateful Ingestion

As DataHub adopters expand the volume of integrations and ingestion jobs they support, it becomes increasingly important to optimize run time and to minimize redundancy.

Stateful ingestion: the new flow

I teamed up with Surya Lanka from Acryl Data to introduce the incubating Stateful Ingestion feature during the September Town Hall. It is DataHub’s mechanism for letting Sources remember where they left off in the last ingestion run. You can check out the video here.

We shared our design and implementation considerations and gave a live demo to show how Stateful Ingestion can reduce load on source systems by extracting only the most relevant changes. This feature will be rolled out to different ingestion sources in the upcoming releases.
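The general idea of "remembering where you left off" can be sketched with a simple watermark checkpoint. This is an illustration of the technique under stated assumptions, not the design we presented: `CheckpointStore`, `run_ingestion`, the JSON file, and the integer `modified_at` field are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class CheckpointStore:
    """Persists the last-seen watermark per job in a JSON file (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)

    def _read(self):
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def load(self, job_id):
        # Returns None when the job has never run before.
        return self._read().get(job_id)

    def save(self, job_id, watermark):
        state = self._read()
        state[job_id] = watermark
        self.path.write_text(json.dumps(state))


def run_ingestion(job_id, source_rows, store):
    """Emit only rows modified after the stored watermark, then advance it."""
    last_seen = store.load(job_id)
    changed = [r for r in source_rows if last_seen is None or r["modified_at"] > last_seen]
    if changed:
        store.save(job_id, max(r["modified_at"] for r in changed))
    return [r["name"] for r in changed]


with tempfile.TemporaryDirectory() as tmp:
    store = CheckpointStore(Path(tmp) / "checkpoints.json")
    rows = [{"name": "orders", "modified_at": 100}, {"name": "users", "modified_at": 200}]
    first_run = run_ingestion("demo_job", rows, store)   # nothing stored yet: everything is new
    rows.append({"name": "payments", "modified_at": 300})
    second_run = run_ingestion("demo_job", rows, store)  # only the row added since the first run
    third_run = run_ingestion("demo_job", rows, store)   # nothing changed: nothing to emit
```

The second and third runs touch far fewer rows than the first, which is the kind of load reduction on source systems that motivates keeping state between runs.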

Looking Forward

I’m so excited to see the continuing growth in momentum in this project and am looking forward to delivering big things in Q4 of 2021. From the level of engagement on Slack to the increased velocity of contributions from the community, it has been great to build together.

Over the next month, we expect to roll out some of these improvements to the project and start building out new capabilities like recommendations, improving our support for nested structs in systems like Hive and Trino, and giving operators more controls for managing data profiling with DataHub. Until next time!
