Data in Context: Lineage Explorer in DataHub

Data Engineering

Metadata

Data Lineage

DataHub

Features

Gabriel Lyons

Jun 11, 2021

DataHub aims to empower users to discover, trust and take action on data in their organizations. Understanding where a data product comes from and how it is being used is critical for these goals. To give these insights to data professionals, we built the DataHub Lineage Explorer.

DataHub Lineage Explorer

With the Lineage Explorer, DataHub can trace the flow of data from its creation, through all of its transformations, to the point where it is consumed as a data product. In this post we’ll go into why we built this, how you can use it, and what is on the horizon for lineage metadata.

Why lineage is important for data professionals

Build trust in data

Lineage is critical to the refinement step of data discovery. You have found a data product by issuing a search query or perhaps browsing a taxonomy. You can see its title and description, and you can spot-check data in some rows or statistics on its columns. However, these can all appear correct even if the input data has issues. Examining lineage is an important input for knowing whether you can trust a dataset.

Downstream and upstream lineage build different types of trust.

Looking at downstream lineage lets you validate the quality of a data product. If an executive dashboard consumes a dataset, this indicates it has already been vetted by someone else. Looking at upstream lineage tells you whether the sources of truth for your data product are trustworthy. Certified, reliable, and well-maintained upstream dependencies let you verify that the data product in question is built on a stable foundation. Combining downstream and upstream lineage validates that a data product is what it appears to be.

Act decisively with data

Lineage is crucial even for datasets you are familiar with, including data products you created or maintain. When you notice a data quality issue in your data product, how can you identify the source? If you haven’t changed the logic that produces the chart or table, an upstream dependency must be the culprit. Lineage allows you to trace issues back to the source.

Sometimes, the source is a change in an upstream dependency’s contract. In other cases, it is a transient issue. Modern data stacks include pipelines, feature generation tools, streams and other operational components. Issues with a downstream dataset can often be due to operational issues with these components. That makes capturing timeliness, frequency and success of your data transformations all the more important. Using operational lineage, owners can identify when infrastructure issues create problems in the data products they maintain.

Lineage also comes into play when updating a data product. You must take downstream dependencies into account when making breaking changes. Using DataHub lineage, you can get up-to-date information on who depends on the data product you are changing and work with the respective owners to ensure a smooth transition.

How DataHub fits in

DataHub collects lineage data from across your data ecosystem. To start, we built native integrations with Airflow, dbt and Superset. Native integrations with BigQuery, AWS Glue and Dagster are coming soon. DataHub allows any other source to be enriched with lineage metadata at ingestion time, as long as such metadata is available.
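
For sources without a native integration, lineage can also be pushed programmatically with the DataHub Python library. The snippet below is a minimal sketch, not the only approach: it assumes a locally running DataHub instance at http://localhost:8080, and the BigQuery table names are placeholders.

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Placeholder URNs; substitute your own platform and table names.
upstream_urns = [builder.make_dataset_urn("bigquery", "my_project.staging.raw_events")]
downstream_urn = builder.make_dataset_urn("bigquery", "my_project.analytics.daily_events")

# Build a lineage metadata change event and send it to DataHub over REST.
lineage_mce = builder.make_lineage_mce(upstream_urns, downstream_urn)
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)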

The dbt and Superset sources will automatically ingest lineage metadata. To use Airflow’s Lineage API, the Airflow source requires an addition to airflow.cfg:

[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "capture_ownership_info": true,
    "capture_tags_info": true,
    "graceful_exceptions": true }

Then, add a DataHub connection via the Airflow CLI:

# For REST-based:
airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
# For Kafka-based (standard Kafka sink config can be passed via extras):
airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'

It’s as simple as that: now you are ingesting Airflow lineage data into DataHub each time your Airflow pipelines run, keeping your lineage metadata as up to date as possible. You can read the full DataHub Airflow integration docs here.
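
If an operator’s lineage cannot be inferred automatically, you can also declare inlets and outlets directly on the task. Below is a minimal DAG sketch; the Snowflake table names are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset

with DAG("lineage_example", start_date=datetime(2021, 6, 1), schedule_interval=None) as dag:
    # tableA and tableB feed tableC; the lineage backend reports this edge to DataHub on every run.
    transform = BashOperator(
        task_id="run_transformation",
        bash_command="echo 'run your data tooling here'",
        inlets=[
            Dataset("snowflake", "mydb.schema.tableA"),
            Dataset("snowflake", "mydb.schema.tableB"),
        ],
        outlets=[Dataset("snowflake", "mydb.schema.tableC")],
    )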

Lineage metadata can be explored via the Lineage Explorer in DataHub. This tool allows you to visualize how data flows across your entire data ecosystem. It shows you the dependencies of a given data product, as well as a list of its downstream consumers. Using the Explorer, you can gain trust in data products and take confident action on them.

To test whether DataHub’s lineage feature captures and displays lineage information effectively, we created our own metadata analytics pipeline and integrated it with DataHub. The pipeline analyzes the metadata of the test datasets we have loaded into demo.datahubproject.io. Using the Lineage Explorer, we can see all the transformations our data flows through from its creation until it is consumed in the form of a Superset chart:

Use the Lineage Explorer to understand how a data product was created

You can explore the pipeline further on the DataHub demo site. A good place to start is the snapshot of our demo database loaded into S3.

Design decisions we made along the way

Visualizing lineage metadata is a challenge. This metadata can be complex and highly interconnected. Show too little, and users won’t be able to get answers to their questions. Show too much detail, and your lineage visualization may end up a mess:

an unhelpful lineage visualization

When creating the Lineage Explorer we thought carefully about what to show. We used the two cases outlined above, building trust and enabling decisive action, to direct our design decisions.

The use cases we addressed centered around answering questions about specific entities. The Lineage Explorer therefore focuses on a single central entity. It shows all entities upstream of that central node and all entities downstream of it. This allows a user to quickly scan the lineage visualization and answer their questions without the noise of irrelevant entities. A double click recenters the visualization on another node. We used the React vx library as a foundation for the layout logic, which helped the dependency graph stay easy to understand, even with large numbers of nodes.

To avoid overloading the graph with too many nodes, DataHub downsamples relationships for nodes that have hundreds of related entities. We are working to rank entities in order of relevance using features such as usage, tags, ownership, relationships, and metadata quality. These features will rank entities in both search and the Lineage Explorer, meaning DataHub will present users with the most relevant upstream and downstream dependencies in the lineage graph. Separately, we also plan to allow searching through downsampled lists of relationships; this way, even if an edge is not shown, it is still discoverable.

Finding upstream and downstream nodes addressed part of our initial use cases. The use cases also involved digging into the details of related nodes. Is this upstream entity dependable? Is this downstream entity in the critical path? Answering these questions often requires referencing multiple properties. It would be impossible to fit all of this information into a node in the Lineage Explorer graph. However, we didn’t want to force users to switch between pages and lose context in the transition.

To solve this, we added a side panel to the Lineage Explorer. It provides additional detail when a node is clicked, surfacing metadata such as description, ownership, and tags.

The side panel shows additional context about a selected entity

A final challenge came on the ingestion front. High quality metadata is critical for building trust and enabling decisive action. We needed to make the metadata easy to collect but also reliable. In the beginning, we stayed away from SQL parsing. Although it can give quick wins, it is difficult to get absolutely right. Minimizing our margin of error was critical to building trust and confidence in data.

Sources that explicitly declare dependencies were our first target: their metadata is easy to collect and sure to be reliable. Airflow’s lineage API followed shortly after, as it provides a clear and discoverable interface for configuring inlets and outlets. It also allows DataHub to ingest lineage whenever pipelines run, which means DataHub gives users an up-to-date picture of the status of their data ecosystem. When your pipeline changes, DataHub will reflect that immediately. Since DataHub is built around a stream-based architecture, users can listen to these events and trigger automated real-time responses as well.
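
As a rough illustration of that last point, here is a sketch of a consumer that polls DataHub’s metadata change stream with the confluent-kafka client. The broker address and the MetadataChangeEvent_v4 topic name are assumptions based on a default quickstart deployment, and the actual payloads are Avro-encoded, so a production consumer would decode them via the schema registry:

from confluent_kafka import Consumer

# Assumed defaults: a local broker and DataHub's metadata change event topic.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "lineage-reactor",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["MetadataChangeEvent_v4"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # The payload is Avro-encoded; decode it with the schema registry before
        # triggering automation such as alerts or tag propagation.
        print(f"metadata change received at offset {msg.offset()}")
finally:
    consumer.close()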

We are now working on integrations with AWS Glue, BigQuery and Dagster. Each provides explicit interfaces for capturing dependencies, yielding more lineage data users will be able to trust.

On the horizon for lineage metadata exploration

The main use cases we addressed were building trust and enabling action. These are just the beginning of applications for lineage metadata.

We understand that viewing the lineage graph is only the first piece of the puzzle. The next step-function change will come from automating processes through lineage metadata. Tag propagation would enable users to set rules and see their tags propagate to downstream entities. For example, a user may want data governance tags to propagate from source tables to their downstream consumers, preventing sensitive data from accidentally becoming publicly available. Alerting downstream consumers of an upstream schema change or breakage is another powerful use case. If we know one table has a data quality issue, we should leverage lineage to inform other data products and their stakeholders of potential issues.

Another improvement we intend to bring to the Lineage Explorer is the ability to adjust granularity. At the moment, DataHub’s lineage graph only shows a single level of granularity. This is the ideal granularity in many cases. However, the ability to zoom in and out when needed provides more flexibility.

We intend to extend DataHub’s lineage model to support column-level lineage. Zooming in should show dependencies between sets of columns. This gives users a more fine-grained understanding of dependencies between data products.

As users zoom out, elements of their metadata graph would collapse into one another, giving an increasingly high-level picture of their data ecosystem. A slight zoom-out might result in groups of tasks being replaced with pipelines, groups of charts replaced by dashboards, and so on. Zooming out further could reduce your graph to relationships between domains. Collapsing and expanding groups of entities dynamically in the Lineage Explorer expands the set of questions the tool can answer.

DataHub is committed to advancing the way the world works with data, and lineage metadata is a key component of working with data. What I discussed in this post is only the beginning of the journey, and we’d love for you to be a part of it. Want to get involved? Come say hi in our Slack, check out our GitHub, and attend our next Town Hall to learn about the latest in DataHub.
