
Metadata’s Role in Sustainable, Cost-Effective AI Development

Metadata Management

Opinion

Data Governance

Blog

Artificial Intelligence

Machine Learning

Stephen Swoyer

Mar 18, 2024


Hear that sound? It’s the sound a turntable makes when its stylus jumps the groove and skids across a record. It’s a sign the party has suddenly and irreversibly come to an end.

Something like this may be happening with AI, as the best-laid plans of teams building AI solutions crash up against the operational complexity involved in scaling and maintaining them. A related problem is the challenge of managing and governing the resources teams use to train their AI models, especially large language models (LLMs).

This doesn’t mean AI is proving to be a bust—much less that another AI winter is coming.

It does mean AI teams are discovering the agony and the ecstasy of data management and governance.

This blog explores how these disciplines are both different and more challenging in AI work.

Beyond Raw Input Data

Neural networks are the foundation of generative AI.

They, and the deep learning models built on them, are mostly designed to consume unstructured data—documents, images, and audio and video files. A specific example is the transformer architecture at the heart of GPT-4, which works on plain text extracted from books, articles, web sources, etc. Unstructured data isn’t tabular—nor anything like it.

It's more complicated than this, however.

With LLMs, it isn’t just a matter of preserving (1) the raw input data on which they were trained (i.e., documents), (2) the plain text extracted from these documents, (3) the tokens generated prior to model training, and (4) any embeddings created during training. For various reasons, organizations need to be able to store, manage, and govern all of these assets.
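To make this asset chain concrete, here’s a toy sketch (in Python, using the open-source tiktoken library) of the middle step: turning extracted plain text into the tokens a model consumes. The sample text and encoding choice are placeholders, not recommendations.

```python
# Illustrating one link in the asset chain: extracted plain text -> tokens.
# Uses the open-source tiktoken library; the encoding is illustrative.
import tiktoken

plain_text = "Metadata makes AI training data traceable."  # text extracted from a source document
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode(plain_text)          # the integer token IDs a model would consume
print(len(tokens), tokens[:5])           # how many tokens, plus a peek at the first few
print(enc.decode(tokens) == plain_text)  # tokenization round-trips back to the original text
```

Each of these intermediate representations is an asset worth preserving and governing in its own right.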


Ultimately, then, organizations that want to reliably operationalize AI prototypes will need to formalize and refine a set of AI-specific data management and governance policies, practices, and patterns that are aligned with the overall AI lifecycle.

At a minimum, AI teams will need to generate, collect, and track metadata about their training datasets (e.g., source, date of creation, author, etc.) and their processing methods—e.g., how data was tokenized and transformed, along with any preprocessing steps. Metadata tracking will allow AI teams and other stakeholders to trace and understand the context of the data, helping maintain transparency and accountability in model training and output.
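As a rough illustration, here’s a minimal sketch of what such a metadata record might look like in Python. The fields and values are hypothetical; in practice, this information would live in a catalog or metadata platform rather than in application code.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TrainingDatasetMetadata:
    """Minimal record of a training dataset's provenance and processing."""
    name: str
    source: str                 # where the raw data came from
    created: date
    author: str
    tokenizer: str              # how the raw text was tokenized
    preprocessing_steps: list[str] = field(default_factory=list)

# Hypothetical example record:
corpus_meta = TrainingDatasetMetadata(
    name="support-tickets-2023",
    source="s3://corp-data/raw/support-tickets/",
    created=date(2023, 11, 2),
    author="data-platform-team",
    tokenizer="cl100k_base",
    preprocessing_steps=["strip PII", "deduplicate", "normalize whitespace"],
)
print(corpus_meta)
```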

I’ll return to this theme, because it’s an important one.

But first, I want to look at another emerging trend in ML and AI development…

AI Sticker Shock

Not to throw more cold water on AI-fueled exuberance, but organizations that have successfully operationalized AI investments may also be in for a surprise or two as this year unfolds.

Remember how cloud was cheap…until it wasn’t? This year, organizations are starting to have similar epiphanies about their AI workloads. Ironically, it’s being successful with AI that often forces companies to come to terms with AI Sticker Shock. Take generative AI, for example, where LLM development is supposed to be front-loaded—with most costs accruing during the process of training an LLM foundation model from scratch. This is why most adopters opt to use prebuilt foundation models.

But operating a production LLM isn’t cheap. Heavy-duty inferencing, in particular, drives up costs.

There are two primary cost factors. In the first place, there’s the cost of generating data to train your LLM. Even if you use a prebuilt foundation model (like the open-source Mistral 7B, or OpenAI’s proprietary GPT-4 model) and optimize it using techniques like Low-Rank Adaptation (LoRA), you’ve still got to extract and validate the data you will use (with LoRA) to fine-tune your model. If you have a large enough corpus of data (hundreds of gigabytes or terabytes of document files, for example), the costs associated with this workload could be significant. (And even higher if you create synthetic data.) Expect to rinse and repeat these costs for every AI asset you build that requires a fine-tuned model.
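For a sense of what this looks like in practice, here’s a minimal fine-tuning setup sketched with the Hugging Face transformers and peft libraries. The base model, target modules, and hyperparameters are illustrative assumptions, not recommendations.

```python
# A minimal LoRA setup, assuming the Hugging Face transformers and peft
# libraries. Model name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # LoRA trains only a small fraction of the weights
```

Note that the configuration itself is the cheap part; extracting, validating, and curating the fine-tuning corpus is where the real cost accrues.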

And that’s just one part of it. Whether you roll your own generative AI stack or pay for a commercial one, operating it can quickly get expensive. It isn’t just that executing each prompt requires a non-trivial amount of compute power—or incurs a fixed per-execution cost—but that per-execution costs tend to increase dramatically as the number of both prompt and generated tokens increases. If consumers are confined to asking very simple questions, with just a couple of dozen (or fewer) tokens per prompt, that’s one thing; however, prompts that consist of hundreds of tokens are comparatively costly. In addition, some commercial AI services (like OpenAI) also charge subscribers a premium to fine-tune their models. In such cases, the per-execution cost of using a fine-tuned model can come in at several times more per token than that of a basic model.
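To see how these variables interact, consider a back-of-the-envelope estimate. The per-token prices and fine-tuning premium below are hypothetical placeholders, not any vendor’s actual rates.

```python
# Back-of-the-envelope inference cost model. All prices are hypothetical.
PROMPT_PRICE_PER_1K = 0.01       # USD per 1,000 prompt tokens
COMPLETION_PRICE_PER_1K = 0.03   # USD per 1,000 generated tokens
FINE_TUNED_MULTIPLIER = 4        # premium some services charge for fine-tuned models

def request_cost(prompt_tokens: int, completion_tokens: int, fine_tuned: bool = False) -> float:
    cost = (prompt_tokens / 1_000) * PROMPT_PRICE_PER_1K
    cost += (completion_tokens / 1_000) * COMPLETION_PRICE_PER_1K
    return cost * (FINE_TUNED_MULTIPLIER if fine_tuned else 1)

# A simple question vs. a long, context-heavy prompt against a fine-tuned model:
print(f"${request_cost(25, 100):.4f}")         # a couple dozen prompt tokens
print(f"${request_cost(800, 400, True):.4f}")  # hundreds of tokens, fine-tuned
```

Multiply the difference by millions of requests per month and the budget impact becomes obvious.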

Thus, the irony: organizations that succeed in operationalizing AI solutions may find they’re too expensive to use at scale. This won’t necessarily cause them to abandon these efforts, but it will lead to some soul-searching, and could give birth to YAOT—i.e., yet another ops-ified thing: AIFinOps.

Metadata to the Rescue?


I’m biased, but I believe these and other drivers will fuel even more interest in metadata management, for the reasons outlined above. In ML and AI work especially, managing metadata is essential for maintaining the traceability of training data. This is necessary not just for internal auditing purposes, but also for compliance with regulatory and statutory requirements.

In addition, access to (and knowledge of) high-quality data not only accelerates ML and AI development, but also makes it easier for teams to operationalize AI solutions. Training LLMs on high-quality data improves their reliability and accuracy, and lets you use smaller models to achieve comparable results—reducing your operational costs. While training on high-quality data won't completely eliminate hallucinations, it will reduce their frequency, in turn reducing the need for costly interventions, like intensive human-in-the-loop review, or repetitive prompting.

And as organizations that struggle to operationalize ML and AI solutions grapple with the complexity and cost of deploying, scaling, and maintaining this software in production environments, metadata provides essential traceability and rich contextual understanding.

It enables AI teams, data platform engineers, SREs, and others to identify and correct problematic data pipelines, workflows, and processes. It allows them to understand the lineage of training datasets, and track changes to them over time. It also provides insights into how transformations and training affect model outputs. Teams can capture parameters, code versions, environment variables, and other metadata associated with their models, which helps make the resultant ML or AI solutions more transparent and reproducible.
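As one concrete (and hypothetical) illustration, here’s how a team might capture run-level metadata using the open-source MLflow tracking API. The parameter names and values are placeholders.

```python
# Capturing model-training metadata with MLflow; values are illustrative.
import mlflow

with mlflow.start_run(run_name="lora-finetune-v1"):
    mlflow.log_param("base_model", "mistralai/Mistral-7B-v0.1")
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("lora_rank", 8)
    mlflow.set_tag("code_version", "git:abc1234")          # hypothetical commit
    mlflow.set_tag("training_dataset", "support-tickets-2023")
    mlflow.log_metric("eval_loss", 1.87)                   # placeholder value
```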

Metadata is also essential for controlling the costs associated with AI development.

By collecting and analyzing metadata, teams and data leaders can (see the sketch after this list):

Track and manage their training datasets;
Distinguish actively used training datasets and artifacts from inactive ones; and
Identify redundant, relict, or forgotten datasets and/or artifacts.
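Here’s a hedged sketch of what that last point might look like in code, flagging training datasets that haven’t been touched in six months. The records and threshold are hypothetical.

```python
# Flagging stale training datasets from their metadata; records are hypothetical.
from datetime import date, timedelta

datasets = [
    {"name": "support-tickets-2023", "last_used": date(2024, 3, 1),  "size_gb": 220},
    {"name": "marketing-copy-2021",  "last_used": date(2022, 6, 15), "size_gb": 340},
]

STALE_AFTER = timedelta(days=180)
today = date.today()

stale = [d for d in datasets if today - d["last_used"] > STALE_AFTER]
for d in stale:
    print(f"Candidate for archival: {d['name']} ({d['size_gb']} GB)")
```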

Left ungoverned, the resources required to store and manage the datasets and artifacts used to train AI models will contribute to increased cloud costs, especially in larger AI programs—where projects tend to be bigger, training datasets more voluminous, and models more numerous.

Reducing these costs won’t eliminate those associated with training and tuning AI models, or supporting production inferencing workloads, but it could make them more justifiable.

Conclusion

Metadata gives organizations a way to understand which data gets collected, processed, and used—by whom, and for which purposes—and where this data lives in their sprawling data ecosystems. It enables them to identify and monitor the data sources and assets feeding their critical KPIs and metrics, along with any reports, dashboards, and analytics that depend on these metrics. These are well-known use cases for metadata. But these same use cases also extend to the world of AI, where metadata is no less critical, and no less useful. Here, too, it provides a rich contextual lens organizations can rely on to more effectively understand, manage, and govern not just their models and data assets, but the production solutions that depend on them.



Discover the Power of a Modern Data Catalog and Metadata Platform

Interested in learning more about how metadata-driven data management and governance can transform your AI development efforts, paving the way for reproducible, sustainable production deployments? Check out this resource to discover more!


