Acryl Cloud for ML and AI Practitioners

Machine Learning

MLOps

Artificial Intelligence

Data Management

Data Governance

Metadata Management

Data Quality

Harshal Sheth

Jan 31, 2024

When organizations struggle to operationalize ML or AI solutions, the root causes are usually data-related. ML and AI teams can’t find the data they need to define use cases, engineer features, or train their models. When they can find it, they can’t always use it—because they don’t know what it is, where it came from, who created it, when, or for what purpose. Lacking context, any dataset is a black box.

The shift to generative AI and large language models (LLMs) compounds these problems. Access to high-quality, context-rich data is more important than ever, yet teams still struggle to find and use it.

Acryl Cloud, the modern data catalog and metadata platform based on the open-source DataHub project, is tailor-made for these problems. Acryl Cloud gives you the power and flexibility of DataHub as a fully managed SaaS offering, eliminating the operational burden of scaling and maintaining DataHub yourself. It also extends DataHub’s core feature set with advanced search, discovery, and collaboration features, along with capabilities data leaders can use to define, apply, and enforce metadata-driven governance policies.

In a nutshell, Acryl Cloud provides a control plane for the data siloed across your ecosystem, empowering data exploration, discovery, sharing and collaboration, management, and governance.


Best of all, Acryl Cloud serves personas of all kinds, from business subject matter experts (SMEs) and analysts to ML and AI Architects and Leads, Data Scientists, and others.

This article unpacks the benefits of Acryl Cloud for ML and AI development—exploring why a modern data catalog and metadata platform is a foundational element of any ML or AI platform.

ML and AI Leads


The ML models and LLMs that power AI solutions are trained on large datasets. They often incorporate sensitive data that must be anonymized to respect privacy and ensure compliance. Most importantly, ML models tend to be critically dependent on the quality of the data they're trained on.

The challenge for ML and AI Leads, then, is to offer ML and AI teams just enough governance.

Too little, and the quality, transparency, and understandability of source data suffers; too much, and the iterative process of improving the performance, accuracy, and fairness of production ML models grinds to a halt. This is why ML and AI Leads have come to rely on Acryl Cloud to empower their teams.

Acryl Cloud ensures the right balance of data governance while providing the flexibility teams need to research, develop, operationalize, and improve production ML and AI solutions. Its advanced search, discovery, collaboration, and sharing features make it easier for ML and AI teams to discover and access high-quality datasets. And its real-time metadata monitoring capabilities equip leaders to better manage the development, operationalization, maintenance, and improvement of ML and AI solutions.

ML and AI Leads can use Acryl Cloud to align the goals and objectives of their teams with organizational priorities. It exposes an adaptable, self-service user experience that empowers domain experts to collaborate with data practitioners in exploring, discovering, and describing data. The same collaborative features help promote team building and skill development, making it easier for teams to share data, models, methods, knowledge, and best practices, fostering a culture of continuous learning.

Finally, Acryl Cloud makes it easier for ML and AI Leads to direct the activities of their teams. They can define policies that enforce shift-left governance, incorporating data schema, data quality, data validation, data masking or anonymization, and other checks into the workflows teams use to build and test their AI and ML prototypes. With Acryl Cloud, ML and AI Leads can feel confident teams have the resources they need to quickly iterate and operationalize ML and AI solutions. Just as important, they can see, understand, and manage exactly what their teams are doing and how they’re doing it.
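To make the idea of shift-left checks concrete, here is a minimal sketch in plain Python of a schema and masking check that could run before a record reaches a training pipeline. It does not use Acryl Cloud's or DataHub's API; the schema, the field names, and the `check_record` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical data contract: field name -> expected Python type.
EXPECTED_SCHEMA = {"user_id": int, "email_hash": str, "age": int}

# Hypothetical raw-PII fields that should have been masked upstream.
FORBIDDEN_FIELDS = {"email", "ssn"}


def check_record(record):
    """Return a list of policy violations for one record (empty list = pass)."""
    violations = []
    # Schema check: every expected field is present and has the expected type.
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    # Masking check: no raw sensitive field may appear in the record.
    for field in sorted(FORBIDDEN_FIELDS.intersection(record)):
        violations.append(f"unmasked sensitive field: {field}")
    return violations


good = {"user_id": 1, "email_hash": "9f2c", "age": 30}
bad = {"user_id": "1", "email": "a@example.com", "age": 30}

print(check_record(good))
print(check_record(bad))
```

Running a check like this early in the workflow, rather than after a model is trained, is the essence of the shift-left approach described above: violations surface while they are still cheap to fix.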

ML and AI Architects


Acryl Cloud’s best-in-class data catalog and metadata platform capabilities provide ML and AI Architects with the rich set of functions required to support, enable, and accelerate ML and AI work.

A modern data catalog like Acryl Cloud is an essential enabling technology for scoping, exploratory data analysis (EDA), preprocessing/feature engineering, and other stages of ML and AI development.

Acryl Cloud makes it easier for data scientists to surface the critical metadata that accelerates these tasks—including lineage and provenance, helpful statistics (min, max, median, etc.), data quality and usage metrics, and (not least) semantic and contextual information. And because ML and AI development is a team endeavor, practitioners in a wide variety of roles can use Acryl Cloud to discover, collaboratively explore (essential for cross-functional teams), and share relevant data.

Acryl Cloud’s advanced metadata management features also equip data leaders with the capabilities they need to observe, manage, and govern the processed datasets and artifacts (like raw source files, features, etc.) generated during the engineering and training of ML models, including LLMs. Acryl Cloud gives data leaders a holistic view of ML- and AI-related datasets, artifacts, and metadata, along with the stack components—feature stores, model repositories, knowledge graphs, etc.—teams use to develop and productionize ML and AI solutions.

By integrating Acryl Cloud into their platforms, ML and AI Architects can provide data leaders with the capabilities they need to create and enforce metadata-driven data management and governance policies. This is critical not just to ensure compliance and ethical usage (preserving data, artifacts, and metadata for legal reasons), but to enable reproducibility and to simplify model retraining.

Data Scientists and ML/AI Engineers


For data scientists and ML/AI engineers, Acryl Cloud offers essential insight into the lineage and permutations of datasets, providing information they would otherwise have to reconstruct themselves.

It surfaces metadata about the age, history, quality metrics, usage, and semantics of aggregated or derived datasets, and also compiles helpful statistics about their data types, cardinality and distribution metrics, and row counts. From BI tools, Acryl Cloud extracts metadata about hierarchies, dimensions, attributes, and metrics. And from feature stores, it collects vital metadata about feature vectors—like feature names, types, usage statistics, and dimensionality.

Ready access to this and other metadata simplifies the process of formalizing business use cases for ML and AI. It also helps eliminate the more tedious aspects of EDA. For example, by generating data distribution and data quality metrics, or collecting null or missing data statistics, Acryl Cloud spares data scientists and ML/AI engineers the work of generating these values themselves.
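As a rough illustration of the profiling work this saves, the sketch below computes the kinds of per-column statistics mentioned above (row count, null count, cardinality, min, max, median) using only the Python standard library. It is an assumption-laden stand-in, not how Acryl Cloud computes its profiles; the `profile_column` helper and the sample data are hypothetical.

```python
from statistics import median


def profile_column(values):
    """Compute simple profile statistics for one column (None = missing)."""
    present = [v for v in values if v is not None]
    total = len(values)
    null_count = total - len(present)
    profile = {
        "row_count": total,
        "null_count": null_count,
        "null_fraction": null_count / total if total else 0.0,
        "distinct_count": len(set(present)),  # cardinality of non-null values
    }
    # Numeric summaries only make sense when every non-null value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(min=min(present), max=max(present), median=median(present))
    return profile


# Hypothetical sample column with two missing values.
ages = [34, 29, None, 41, 29, None, 52]
print(profile_column(ages))
```

Multiply this by every column of every candidate dataset and the appeal of having the statistics already collected in the catalog becomes clear.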

This feature-store metadata helps accelerate feature engineering. From model repositories, Acryl Cloud likewise collects metadata about model versions, training parameters, and lineage.
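Lineage is the piece practitioners would otherwise have to reconstruct by hand. The sketch below shows, under simplifying assumptions, what that reconstruction amounts to: a breadth-first walk over upstream edges. The edge map, the dataset names, and the `all_upstreams` helper are hypothetical and do not reflect Acryl Cloud's or DataHub's actual data model or API.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it derives from.
UPSTREAMS = {
    "churn_features": ["sessions_daily", "customers_clean"],
    "sessions_daily": ["raw_events"],
    "customers_clean": ["raw_customers"],
}


def all_upstreams(dataset, upstreams=UPSTREAMS):
    """Return every transitive upstream of `dataset`, in breadth-first order."""
    seen, order = set(), []
    queue = deque(upstreams.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        order.append(current)
        queue.extend(upstreams.get(current, []))
    return order


print(all_upstreams("churn_features"))
# ['sessions_daily', 'customers_clean', 'raw_events', 'raw_customers']
```

With the edges already captured in a catalog, answering "which raw sources feed this feature set?" becomes a lookup rather than an investigation.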

Finally, Acryl Cloud integrates into the preferred workflows of data scientists, ML engineers, and other practitioners. Its browser extension means experts don’t need to go outside of their preferred workflows during EDA, feature engineering, or model retraining: instead, they can explore and discover features in the same application—a web browser—that hosts their notebooks.

Prompt Engineers


For prompt engineers, a modern data catalog and metadata platform like Acryl Cloud provides detailed insights into data characteristics and relationships. The data catalog, rich with metadata, serves as a guide to the specific context, structure, and content of disparate data assets. By leveraging this information, prompt engineers can design prompts that are finely tuned to the characteristics of the data they’re using.

For example, insight into the provenance of the data used to train an LLM, the methods used to generate tokens, and the algorithms used to create embeddings, combined with annotations or field notes, helps prompt engineers create prompts that produce more accurate results. In addition, an understanding of the semantics of the data used to train an LLM gives prompt engineers the contextual information they need to design prompts that are not only more effective but also reduce the frequency of LLM hallucinations.

ML and AI Work Is a Team Sport


This list of personas is far from exhaustive! Data Engineers, Governance Leads, Data Platform Leads, and others are all integral to the process of developing, operationalizing, and maintaining ML and AI solutions. Acryl Data CEO Swaroop Jagadish discussed Acryl Cloud’s relevance for these practitioners in a separate blog post.

However, in the context of ML and, especially, AI work, their roles tend to change in significant ways.

  • Data engineers have to deal with more complex and varied data types (like unstructured documents, or audio, video, and image files), requiring them to implement techniques for data labeling, in addition to advanced techniques for preprocessing and transforming data.
  • Data Platform Leads, working with ML and AI Architects, have to grapple with the challenge of designing and scaling the infrastructure used to develop and serve ML and AI workloads, as well as managing and governing the work of ML and AI teams, the data and artifacts generated during development, and the data consumed by models in production.
  • Governance Leads are concerned with monitoring production AI solutions for fairness, accuracy, transparency, and alignment with human values. Along with ML/AI Leads, they also require AI models to be reproducible and explainable.
  • Business subject matter experts (SMEs) and domain experts are absolutely integral to the work of ML and AI development. They need to be able to work closely with data scientists and ML engineers to research, define, and refine business use cases for ML and AI. Similarly, they play an active role in monitoring, maintaining, and improving production ML and AI solutions.

For all of these people, a modern data catalog and metadata platform like Acryl Cloud is a necessity, not a luxury, empowering everybody who supports and enables ML and AI work.

The Acryl Cloud Advantage

Acryl Cloud’s advanced search, discovery, collaboration, and annotation capabilities, combined with its automated data profiling and “smart” classification features, make it easier for data practitioners to identify, access, and prepare the data they need to scope and formalize ML and AI use cases, preprocess data, engineer features, and train their models. The same discovery and collaborative features facilitate ongoing communication and coordination between hands-on data practitioners (like data engineers, data scientists, and ML engineers) and business SMEs and domain experts.

Similarly, Acryl Cloud enables data leaders to apply metadata-driven data management and governance policies to both the developmental and operational phases of their ML and AI programs. They can create policies that track the data and artifacts created during model preprocessing and training or ensure that production AI solutions consume only timely, high-quality data.

To engineer production AI solutions that are scalable, accurate, and trustworthy, you need trustworthy data. There’s no way around it. As a best-in-class modern data catalog and metadata platform, Acryl Cloud provides the foundational capabilities you need to build, operationalize, and maintain effective, transparent, and trustworthy ML and AI solutions. Discover the Acryl Cloud advantage!
