Jan 31, 2024
When organizations struggle to operationalize ML or AI solutions, the root causes are usually data-related. ML and AI teams can’t find the data they need to define use cases, engineer features, or train their models. When they can find it, they can’t always use it—because they don’t know what it is, where it came from, who created it, when, or for what purpose. Lacking context, any dataset is a black box.
The shift to generative AI and large language models (LLMs) compounds these problems. Access to high-quality, context-rich data is more important than ever, but teams struggle to find and use it.
Acryl Cloud, the modern data catalog and metadata platform based on the open-source DataHub project, is tailor-made for these problems. Acryl Cloud gives you the power and flexibility of DataHub as a fully managed SaaS service, eliminating the operational burden of scaling and maintaining DataHub yourself. And Acryl Cloud improves upon and extends DataHub’s core feature set with advanced search, discovery, and collaboration features, along with capabilities data leaders can use to define, apply, and enforce metadata-driven governance policies.
In a nutshell, Acryl Cloud provides a control plane for the data siloed across your ecosystem, empowering data exploration, discovery, sharing and collaboration, management, and governance.
The best thing is that Acryl Cloud can be used by personas of all kinds, from business subject matter experts (SMEs) and analysts to ML and AI Architects and Leads, Data Scientists, and others.
This article unpacks the benefits of Acryl Cloud for ML and AI development—exploring why a modern data catalog and metadata platform is a foundational element of any ML or AI platform.
The ML models and LLMs that power AI solutions are trained on large datasets. They often incorporate sensitive data that must be anonymized to respect privacy and ensure compliance. Most importantly, ML models tend to be critically dependent on the quality of the data they're trained on.
The challenge for ML and AI Leads, then, is to offer ML and AI teams just enough governance.
Too little, and the quality, transparency, and understandability of source data suffer; too much, and the iterative process of improving the performance, accuracy, and fairness of production ML models grinds to a halt. This is why ML and AI Leads have come to rely on Acryl Cloud to empower their teams.
Acryl Cloud ensures the right balance of data governance while providing the flexibility teams need to research, develop, operationalize, and improve production ML and AI solutions. Its advanced search, discovery, collaboration, and sharing features make it easier for ML and AI teams to discover and access high-quality datasets. And its real-time metadata monitoring capabilities equip leaders to better manage the development, operationalization, maintenance, and improvement of ML and AI solutions.
ML and AI Leads can use Acryl Cloud to align the goals and objectives of their teams with organizational priorities. It exposes an adaptable, self-service user experience that empowers domain experts to collaborate with data practitioners in exploring, discovering, and describing data. The same collaborative features help promote team building and skill development, making it easier for teams to share data, models, methods, knowledge, and best practices, fostering a culture of continuous learning.
Finally, Acryl Cloud makes it easier for ML and AI Leads to direct the activities of their teams. They can define policies that enforce shift-left governance, incorporating data schema, data quality, data validation, data masking or anonymization, and other checks into the workflows teams use to build and test their AI and ML prototypes. With Acryl Cloud, ML and AI Leads can feel confident teams have the resources they need to quickly iterate and operationalize ML and AI solutions. Just as important, they can see, understand, and manage exactly what their teams are doing and how they’re doing it.
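To make the idea of shift-left checks concrete, here is a minimal sketch in plain Python of a schema check and a masking step that run before data reaches a prototype. It does not use any Acryl Cloud or DataHub API. The schema, the field names, and the function names are all hypothetical, and hashing stands in for whatever masking or anonymization method a real policy would require.

```python
import hashlib

# Hypothetical expected schema for an illustrative training table.
EXPECTED_SCHEMA = {"customer_id": int, "email": str, "order_total": float}
# Hypothetical set of fields that a policy marks as sensitive.
SENSITIVE_FIELDS = {"email"}


def validate_schema(row, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema violations for one row."""
    problems = []
    for field, expected_type in schema.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            problems.append(
                f"wrong type for {field}: {type(row[field]).__name__}"
            )
    return problems


def mask_sensitive(row, fields=SENSITIVE_FIELDS):
    """Return a copy of the row with sensitive values replaced by a
    truncated SHA-256 digest (pseudonymization, not full anonymization)."""
    masked = dict(row)
    for field in fields:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8"))
            masked[field] = digest.hexdigest()[:12]
    return masked


def shift_left_check(rows):
    """Validate every row first; raise if any row fails, else mask all rows."""
    failures = {}
    for index, row in enumerate(rows):
        problems = validate_schema(row)
        if problems:
            failures[index] = problems
    if failures:
        raise ValueError(f"schema check failed: {failures}")
    return [mask_sensitive(row) for row in rows]
```

Running a check like this at the start of a notebook or pipeline means a malformed or unmasked dataset fails early, before it is used to build or test a model.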
Acryl Cloud’s best-in-class data catalog and metadata platform capabilities provide ML and AI Architects with the rich set of functions required to support, enable, and accelerate ML and AI work.
A modern data catalog like Acryl Cloud is an essential enabling technology for scoping, exploratory data analysis (EDA), preprocessing/feature engineering, and other stages of ML and AI development.
Acryl Cloud makes it easier for data scientists to surface the critical metadata that accelerates these tasks—including lineage and provenance, helpful statistics (min, max, median, etc.), data quality and usage metrics, and (not least) semantic and contextual information. And because ML and AI development is a team endeavor, practitioners in a wide variety of roles can use Acryl Cloud to discover, collaboratively explore (essential for cross-functional teams), and share relevant data.
Acryl Cloud’s advanced metadata management features also equip data leaders with the capabilities they need to observe, manage, and govern the processed datasets and artifacts (like raw source files, features, etc.) generated during the engineering and training of ML models, including LLMs. Acryl Cloud gives data leaders a holistic view of ML- and AI-related datasets, artifacts, and metadata, along with the stack components—feature stores, model repositories, knowledge graphs, etc.—teams use to develop and productionize ML and AI solutions.
By integrating Acryl Cloud into their platforms, ML and AI Architects can provide data leaders with the capabilities they need to create and enforce metadata-driven data management and governance policies. This is critical not just to ensure compliance and ethical usage (preserving data, artifacts, and metadata for legal reasons), but to enable reproducibility and to simplify model retraining.
For data scientists and ML/AI engineers, Acryl Cloud offers essential insight into the lineage and permutations of datasets, providing information they would otherwise have to reconstruct themselves.
It surfaces metadata about the age, history, quality metrics, usage, and semantics of aggregated or derived datasets, and also compiles helpful statistics about their data types, cardinality and distribution metrics, and row counts. From BI tools, Acryl Cloud extracts metadata about hierarchies, dimensions, attributes, and metrics. And from feature stores, it collects vital metadata about feature vectors—like feature names, types, usage statistics, and dimensionality.
Ready access to this and other metadata simplifies the process of formalizing business use cases for ML and AI. It also helps eliminate the more tedious aspects of EDA. For example, by generating data distribution and data quality metrics, or collecting/generating null or missing data statistics, Acryl Cloud spares data scientists and ML/AI engineers the work of generating these values themselves.
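For a sense of the work this saves, the sketch below shows the kind of per-column profile a data scientist might otherwise compute by hand during EDA: row count, missing-value statistics, cardinality, and min/max/median for numeric columns. It is illustrative only and is not how Acryl Cloud computes its profiles. The function name and the choice to treat None as "missing" are assumptions.

```python
import statistics


def profile_column(values):
    """Summarize one column; None is treated as a missing value (assumption)."""
    present = [v for v in values if v is not None]
    null_count = len(values) - len(present)
    profile = {
        "row_count": len(values),
        "null_count": null_count,
        "null_fraction": null_count / len(values) if values else 0.0,
        "cardinality": len(set(present)),  # number of distinct non-null values
    }
    # Only numeric columns get min, max, and median.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(
            min=min(present),
            max=max(present),
            median=statistics.median(present),
        )
    return profile


if __name__ == "__main__":
    print(profile_column([3, None, 1, 2, 2]))
```

Multiply this by every column of every candidate dataset and the value of having these figures already collected alongside the data becomes clear.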
To accelerate feature engineering, Acryl Cloud collects metadata about feature vectors from feature stores. From model repositories, Acryl Cloud collects metadata about model versions, training parameters, and lineage.
Finally, Acryl Cloud integrates into the preferred workflows of data scientists, ML engineers, and other practitioners. Its browser extension means experts don’t need to go outside of their preferred workflows during EDA, feature engineering, or model retraining: instead, they can explore and discover features in the same application—a web browser—that hosts their notebooks.
For prompt engineers, a modern data catalog and metadata platform like Acryl Cloud provides detailed insights into data characteristics and relationships. The data catalog, rich with metadata, serves as a guide to the specific context, structure, and content of disparate data assets. By leveraging this information, prompt engineers can design prompts that are finely tuned to the characteristics of the data they’re using.
For example, insight into the provenance of the data used to train an LLM, the methods used to generate tokens, the algorithms used to create embeddings—combined with annotations or field notes—helps prompt engineers to create prompts that produce more accurate results. Plus, an understanding of the semantics of the data used to train an LLM equips prompt engineers with the contextual information they need to design prompts that aren’t just more effective, but reduce the frequency of LLM hallucinations.
This list of personas is far from exhaustive! Data engineers, Governance Leads, Data Platform Leads, and others are all integral to the process of developing, operationalizing, and maintaining ML and AI solutions. Acryl Data CEO Swaroop Jagadish discussed Acryl Cloud’s relevance for these practitioners in a separate blog.
However, in the context of ML and, especially, AI work, their roles tend to change in significant ways.
For all of these people, a modern data catalog and metadata platform like Acryl Cloud is a necessity, not a luxury, empowering everybody who supports and enables ML and AI work.
Acryl Cloud’s advanced search, discovery, collaboration, and annotation capabilities, combined with its automated data profiling and “smart” classification features, make it easier for data practitioners to identify, access, and prepare the data they need to scope and formalize ML and AI use cases, preprocess data, engineer features, and train their models. The same discovery and collaborative features facilitate ongoing communication and coordination between hands-on data practitioners (like data engineers, data scientists, and ML engineers) and business SMEs and domain experts.
Similarly, Acryl Cloud enables data leaders to apply metadata-driven data management and governance policies to both the developmental and operational phases of their ML and AI programs. They can create policies that track the data and artifacts created during model preprocessing and training or ensure that production AI solutions consume only timely, high-quality data.
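As a rough illustration of what "timely, high-quality data" can mean in practice, the sketch below gates a dataset on two pieces of metadata: when it was last updated and what fraction of its values are missing. The function name, the thresholds, and the inputs are hypothetical and are not taken from Acryl Cloud's policy features. A real policy would read these values from the catalog rather than take them as arguments.

```python
from datetime import datetime, timedelta, timezone


def is_fit_for_production(
    last_updated,
    null_fraction,
    now=None,
    max_age=timedelta(hours=24),   # hypothetical freshness threshold
    max_null_fraction=0.05,        # hypothetical quality threshold
):
    """Return True only if the dataset is both fresh and clean enough."""
    now = now or datetime.now(timezone.utc)
    fresh = (now - last_updated) <= max_age
    clean = null_fraction <= max_null_fraction
    return fresh and clean


if __name__ == "__main__":
    checked_at = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
    updated_at = checked_at - timedelta(hours=3)
    print(is_fit_for_production(updated_at, 0.01, now=checked_at))
```

Keeping the thresholds as parameters lets different teams apply stricter or looser rules to the same check.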
To engineer production AI solutions that are scalable, accurate, and trustworthy, you need trustworthy data. There’s no way around it. As a best-in-class modern data catalog and metadata platform, Acryl Cloud provides the foundational capabilities you need to build, operationalize, and maintain effective, transparent, and trustworthy ML and AI solutions. Discover the Acryl Cloud advantage!