
Metadata’s Role in Sustainable, Cost-Effective AI Development

Metadata Management

Opinion

Data Governance

Blog

Artificial Intelligence

Machine Learning

Stephen Swoyer

Mar 18, 2024


Hear that sound? It’s the sound a turntable makes when its stylus jumps the groove and skids across a record. It’s a sign the party has suddenly and irreversibly come to an end.

Something like this may be happening with AI, as the best-laid plans of teams building AI solutions crash up against the operational complexity involved in scaling and maintaining them. A related problem is the challenge of managing and governing the resources teams use to train their AI models, especially large language models (LLMs).

This doesn’t mean AI is proving to be a bust—much less that another AI winter is coming.

It does mean AI teams are discovering the agony and the ecstasy of data management and governance.

This blog explores how these disciplines are both different and more challenging in AI work.

Beyond Raw Input Data

Neural networks are the foundation of generative AI.

They, and the deep learning models built on them, are mostly designed to consume unstructured data—documents, images, and audio and video files. A specific example is the transformer architecture at the heart of GPT-4, which works on plain text extracted from books, articles, web sources, etc. Unstructured data isn’t tabular—nor anything like it.

It's more complicated than this, however.

With LLMs, it isn’t just a matter of preserving (1) the raw input data on which they were trained (i.e., documents), (2) the plain text extracted from these documents, (3) the tokens generated prior to model training, and (4) any embeddings created during training. For various reasons, organizations need to be able to store, manage, and govern all of these assets.
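To make this asset chain concrete, here’s a toy sketch (in Python, using the open-source tiktoken library) of the middle step: turning extracted plain text into the tokens a model consumes. The sample text and encoding choice are placeholders, not recommendations.

```python
# Illustrating one link in the asset chain: extracted plain text -> tokens.
# Uses the open-source tiktoken library; the encoding is illustrative.
import tiktoken

plain_text = "Metadata makes AI training data traceable."  # text extracted from a source document
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode(plain_text)          # the integer token IDs a model would consume
print(len(tokens), tokens[:5])           # how many tokens, plus a peek at the first few
print(enc.decode(tokens) == plain_text)  # tokenization round-trips back to the original text
```

Each of these intermediate representations is an asset worth preserving and governing in its own right.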


Ultimately, then, organizations that want to reliably operationalize AI prototypes will need to formalize and refine a set of AI-specific data management and governance policies, practices, and patterns that are aligned with the overall AI lifecycle.

At a minimum, AI teams will need to generate, collect, and track metadata about their training datasets (e.g., source, date of creation, author, etc.) and their processing methods—e.g., how data was tokenized and transformed, along with any preprocessing steps. Metadata tracking will allow AI teams and other stakeholders to trace and understand the context of the data, helping maintain transparency and accountability in model training and output.
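As a rough illustration, here’s a minimal sketch of what such a metadata record might look like in Python. The fields and values are hypothetical; in practice, this information would live in a catalog or metadata platform rather than in application code.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TrainingDatasetMetadata:
    """Minimal record of a training dataset's provenance and processing."""
    name: str
    source: str                 # where the raw data came from
    created: date
    author: str
    tokenizer: str              # how the raw text was tokenized
    preprocessing_steps: list[str] = field(default_factory=list)

# Hypothetical example record:
corpus_meta = TrainingDatasetMetadata(
    name="support-tickets-2023",
    source="s3://corp-data/raw/support-tickets/",
    created=date(2023, 11, 2),
    author="data-platform-team",
    tokenizer="cl100k_base",
    preprocessing_steps=["strip PII", "deduplicate", "normalize whitespace"],
)
print(corpus_meta)
```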

I’ll return to this theme, because it’s an important one.

But first, I want to look at another emerging trend in ML and AI development…

AI Sticker Shock

Not to throw more cold water on AI-fueled exuberance, but organizations that have successfully operationalized AI investments may also be in for a surprise or two as this year unfolds.

Remember how cloud was cheap…until it wasn’t? This year, organizations are starting to have similar epiphanies about their AI workloads. Ironically, it’s being successful with AI that often forces companies to come to terms with AI Sticker Shock. Take generative AI, for example, where LLM development is supposed to be front-loaded—with most costs accruing during the process of training an LLM foundation model from scratch. This is why most adopters opt to use prebuilt foundation models.

But operating a production LLM isn’t cheap. Heavy-duty inferencing, in particular, drives up costs.

There are two primary cost factors. In the first place, there’s the cost of generating data to train your LLM. Even if you use a prebuilt foundation model (like the open-source Mistral 7B, or OpenAI’s proprietary GPT-4 model) and optimize it using techniques like Low-Rank Adaptation (LoRA), you’ve still got to extract and validate the data you will use (with LoRA) to fine-tune your model. If you have a large enough corpus of data (hundreds of gigabytes or terabytes of document files, for example), the costs associated with this workload could be significant. (And even higher if you create synthetic data.) Expect to rinse and repeat these costs for every AI asset you build that requires a fine-tuned model.
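For a sense of what this looks like in practice, here’s a minimal fine-tuning setup sketched with the Hugging Face transformers and peft libraries. The base model, target modules, and hyperparameters are illustrative assumptions, not recommendations.

```python
# A minimal LoRA setup, assuming the Hugging Face transformers and peft
# libraries. Model name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # LoRA trains only a small fraction of the weights
```

Note that the configuration itself is the cheap part; extracting, validating, and curating the fine-tuning corpus is where the real cost accrues.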

And that’s just one part of it. Whether you roll your own generative AI stack or pay for a commercial one, operating it can quickly get expensive. It isn’t just that executing each prompt requires a non-trivial amount of compute power—or incurs a fixed per-execution cost—but that per-execution costs tend to increase dramatically as the number of both prompt and generated tokens increases. If consumers are confined to asking very simple questions, with just a couple of dozen (or fewer) tokens per prompt, that’s one thing; however, prompts that consist of hundreds of tokens are comparatively costly. In addition, some commercial AI services (like OpenAI) also charge subscribers a premium to fine-tune their models. In such cases, the per-execution cost of using a fine-tuned model can come in at several times more per token than that of a basic model.
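To see how these variables interact, consider a back-of-the-envelope estimate. The per-token prices and fine-tuning premium below are hypothetical placeholders, not any vendor’s actual rates.

```python
# Back-of-the-envelope inference cost model. All prices are hypothetical.
PROMPT_PRICE_PER_1K = 0.01       # USD per 1,000 prompt tokens
COMPLETION_PRICE_PER_1K = 0.03   # USD per 1,000 generated tokens
FINE_TUNED_MULTIPLIER = 4        # premium some services charge for fine-tuned models

def request_cost(prompt_tokens: int, completion_tokens: int, fine_tuned: bool = False) -> float:
    cost = (prompt_tokens / 1_000) * PROMPT_PRICE_PER_1K
    cost += (completion_tokens / 1_000) * COMPLETION_PRICE_PER_1K
    return cost * (FINE_TUNED_MULTIPLIER if fine_tuned else 1)

# A simple question vs. a long, context-heavy prompt against a fine-tuned model:
print(f"${request_cost(25, 100):.4f}")         # a couple dozen prompt tokens
print(f"${request_cost(800, 400, True):.4f}")  # hundreds of tokens, fine-tuned
```

Multiply the difference by millions of requests per month and the budget impact becomes obvious.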

Thus, the irony: organizations that succeed in operationalizing AI solutions may find they’re too expensive to use at scale. This won’t necessarily cause them to abandon these efforts, but it will lead to some soul-searching, and could give birth to YAOT—i.e., yet another ops-ified thing: AIFinOps.

Metadata to the Rescue?


I’m biased, but I believe these and other drivers will fuel even more interest in metadata management, for the reasons outlined above. In ML and AI work especially, managing metadata is essential for maintaining the traceability of training data. This is necessary not just for internal auditing purposes, but also for compliance with regulatory and statutory requirements.

In addition, access to (and knowledge of) high-quality data not only accelerates ML and AI development, but also makes it easier for teams to operationalize AI solutions. Training LLMs on high-quality data improves their reliability and accuracy, and lets you use smaller models to achieve comparable results—reducing your operational costs. While training on high-quality data won't completely eliminate hallucinations, it will reduce their frequency, in turn reducing the need for costly interventions, like intensive human-in-the-loop review, or repetitive prompting.

And as organizations that struggle to operationalize ML and AI solutions grapple with the complexity and cost of deploying, scaling, and maintaining this software in production environments, metadata provides essential traceability and rich contextual understanding.

It enables AI teams, data platform engineers, SREs, and others to identify and correct problematic data pipelines, workflows, and processes. It allows them to understand the lineage of training datasets, and track changes to them over time. It also provides insights into how transformations and training affect model outputs. Teams can capture parameters, code versions, environment variables, and other metadata associated with their models, which helps make the resultant ML or AI solutions more transparent and reproducible.
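As one concrete (and hypothetical) illustration, here’s how a team might capture run-level metadata using the open-source MLflow tracking API. The parameter names and values are placeholders.

```python
# Capturing model-training metadata with MLflow; values are illustrative.
import mlflow

with mlflow.start_run(run_name="lora-finetune-v1"):
    mlflow.log_param("base_model", "mistralai/Mistral-7B-v0.1")
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("lora_rank", 8)
    mlflow.set_tag("code_version", "git:abc1234")          # hypothetical commit
    mlflow.set_tag("training_dataset", "support-tickets-2023")
    mlflow.log_metric("eval_loss", 1.87)                   # placeholder value
```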

Metadata is also essential for controlling the costs associated with AI development.

By collecting and analyzing metadata, teams and data leaders can (see the sketch after this list):

Track and manage their training datasets;
Distinguish actively used training datasets and artifacts from inactive ones; and
Identify redundant, relict, or forgotten datasets and/or artifacts.
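Here’s a hedged sketch of what that last point might look like in code, flagging training datasets that haven’t been touched in six months. The records and threshold are hypothetical.

```python
# Flagging stale training datasets from their metadata; records are hypothetical.
from datetime import date, timedelta

datasets = [
    {"name": "support-tickets-2023", "last_used": date(2024, 3, 1),  "size_gb": 220},
    {"name": "marketing-copy-2021",  "last_used": date(2022, 6, 15), "size_gb": 340},
]

STALE_AFTER = timedelta(days=180)
today = date.today()

stale = [d for d in datasets if today - d["last_used"] > STALE_AFTER]
for d in stale:
    print(f"Candidate for archival: {d['name']} ({d['size_gb']} GB)")
```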

Left ungoverned, the resources required to store and manage the datasets and artifacts used to train AI models will contribute to increased cloud costs, especially in larger AI programs—where projects tend to be bigger, training datasets more voluminous, and models more numerous.

Reducing these costs won’t eliminate those associated with training and tuning AI models, or supporting production inferencing workloads, but it could make them more justifiable.

Conclusion

Metadata gives organizations a way to understand which data gets collected, processed, and used—by whom, and for which purposes—and where this data lives in their sprawling data ecosystems. It enables them to identify and monitor the data sources and assets feeding their critical KPIs and metrics, along with any reports, dashboards, and analytics that depend on these metrics. These are well-known use cases for metadata. But these same use cases also extend to the world of AI, where metadata is no less critical, and no less useful. Here, too, it provides a rich contextual lens organizations can rely on to more effectively understand, manage, and govern not just their models and data assets, but the production solutions that depend on them.



Discover the Power of a Modern Data Catalog and Metadata Platform

Interested in learning more about how metadata-driven data management and governance can transform your AI development efforts, paving the way for reproducible, sustainable production deployments? Check out this resource to discover more!


