If we broadly compare classical machine learning and generative AI workflows, we find that the general workflow steps remain similar between the two. Both require data collection, feature engineering, model optimization, deployment, evaluation, and so on, but the execution details and time allocations are fundamentally different. Most importantly, generative AI introduces unique sources of technical debt that can accumulate quickly if not properly managed, including tool sprawl, ever-growing mega-prompts, missing tracing and observability, quality debt from a lack of human feedback, and insufficient stakeholder communication.
In this blog, we will address each form of technical debt in turn. Ultimately, teams transitioning from classical ML to generative AI need to be aware of these new debt sources and adjust their development practices accordingly, spending more time on evaluation, stakeholder management, subjective quality monitoring, and instrumentation rather than on the data cleaning and feature engineering that dominated classical ML projects.
To appreciate where the field is now, it’s useful to compare our workflows for generative AI with those we use for classical machine learning problems. The following is a high-level overview. As this comparison reveals, the broad workflow steps remain the same, but differences in the execution details lead to different steps being emphasized. As we’ll see, generative AI also introduces new forms of technical debt, which have implications for how we maintain our systems in production.
| Workflow Step | Classical ML | Generative AI |
|---|---|---|
| Data collection | Collected data represents real-world events, such as retail sales or equipment failures. Structured formats, such as CSV and JSON, are often used. | Collected data represents contextual knowledge that helps a language model provide relevant responses. Both structured data (often in real-time tables) and unstructured data (images, videos, text files) can be used. |
| Feature engineering/ Data transformation | Data transformation steps involve either creating new features to better reflect the problem space (e.g., creating weekday and weekend features from timestamp data) or doing statistical transformations so models fit the data better (e.g., standardizing continuous variables for k-means clustering and doing a log transform of skewed data so it follows a normal distribution). | For unstructured data, transformation involves chunking, creating embedding representations, and (possibly) adding metadata such as headings and tags to chunks. For structured data, it might involve denormalizing tables so that large language models (LLMs) don’t have to consider table joins. Adding table and column metadata descriptions is also important (see the sketch after this table). |
| Model pipeline design | Usually covered by a basic pipeline with three steps (typically preprocessing, training, and prediction). | Usually involves a query rewriting step, some form of information retrieval, possibly tool calling, and safety checks at the end. Pipelines are much more complex, involve more complex infrastructure such as databases and API integrations, and are sometimes handled with graph-like structures. |
| Model optimization | Model optimization involves hyperparameter tuning using methods such as cross-validation, grid search, and random search. | While some hyperparameters, such as temperature, top-k, and top-p, may be changed, most effort is spent tuning prompts to guide model behavior. Since an LLM chain may involve many steps, an AI engineer may also experiment with breaking down a complex operation into smaller components. |
| Deployment | Models are much smaller than foundation models such as LLMs. Entire ML applications can be hosted on CPUs, with no GPUs needed. Model versioning, monitoring, and lineage are important considerations. Model predictions rarely require complex chains or graphs, so traces are usually not used. | Because foundation models are very large, they may be hosted on central GPU infrastructure and exposed as an API to several user-facing AI applications. Those applications act as “wrappers” around the foundation model API and are hosted on smaller CPU-based compute. Application version management, monitoring, and lineage are important considerations. Additionally, because LLM chains and graphs can be complex, proper tracing is needed to identify query bottlenecks and bugs. |
| Evaluation | For model performance, data scientists can use defined quantitative metrics such as F1 score for classification or root mean square error for regression. | The correctness of an LLM output relies on subjective judgments, e.g., the quality of a summary or translation. Therefore, response quality is usually judged with guidelines rather than quantitative metrics. |
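To make the data transformation row above concrete, here is a minimal sketch of chunking, metadata tagging, and embedding for unstructured documents. The paragraph-based chunker, the file name, and the sentence-transformers model are illustrative assumptions, not a prescribed pipeline.

```python
from sentence_transformers import SentenceTransformer

def chunk_document(text: str, max_chars: int = 1000) -> list[dict]:
    """Split a document into paragraph-based chunks and attach simple metadata."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    # Metadata (source, position) helps retrieval and debugging later on.
    return [
        {"text": c, "source": "employee_handbook.txt", "chunk_id": i}
        for i, c in enumerate(chunks)
    ]

# Embed each chunk so it can be stored in a vector index (model name is illustrative).
model = SentenceTransformer("all-MiniLM-L6-v2")
records = chunk_document(open("employee_handbook.txt").read())
embeddings = model.encode([r["text"] for r in records])
```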
From first-hand experience balancing a price forecasting project with a project building a tool-calling agent, we found that there are some major differences in the model development and deployment steps.
The inner development loop typically refers to the iterative process that machine learning developers go through when building and refining their model pipelines. It usually occurs before production testing and model deployment.
Here’s how classical ML and GenAI professionals spend their time differently in this step:
Classical ML model development time sinks
Generative AI model and pipeline development time sinks
The following diagrams compare classical ML and generative AI time allocations for the model development loop.
Unlike the model development loop, the model deployment loop does not focus on optimizing model performance. Instead, engineers are focused on systematic testing, deployment, and monitoring in production environments.
Here, developers might move configurations into YAML files to make project updates easier. They might also refactor static data processing pipelines to run in a streaming fashion, using a more robust framework such as PySpark instead of Pandas. Finally, they need to consider how to set up testing, monitoring, and feedback processes to maintain model quality.
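As a hedged illustration of moving configuration out of code, the sketch below loads pipeline settings from a YAML file into a typed object. The file name and fields are hypothetical.

```python
from dataclasses import dataclass
import yaml  # pip install pyyaml

@dataclass
class PipelineConfig:
    model_name: str
    temperature: float
    max_retries: int
    vector_index: str

def load_config(path: str = "config.yaml") -> PipelineConfig:
    """Read settings from YAML so they can change without touching code."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    return PipelineConfig(**raw)

# config.yaml (hypothetical):
# model_name: my-llm-endpoint
# temperature: 0.1
# max_retries: 3
# vector_index: main.rag.docs_index

config = load_config()
```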
At this point, automation is essential, and continuous integration and delivery is a nonnegotiable requirement. For managing CI/CD for data and AI projects on Databricks, Databricks Asset Bundles are usually the tool of choice. They make it possible to describe Databricks resources (such as jobs and pipelines) as source files, and provide a way to include metadata alongside your project’s source files.
As in the model development stage, the activities that take the most time in this stage differ between generative AI and classical ML projects.
Classical ML model deployment time sinks
Generative AI model deployment time sinks
The MLflow trace decorator can be used on any function to capture its inputs and outputs. At the same time, custom MLflow Traces spans can be created within specific code blocks to log more granular operations. Only after using instrumentation to aggregate a reliable source of truth from agent outputs can developers begin to identify failure modes and design tests accordingly, for example using tools such as the make_judge API or the SIMBAAlignmentOptimizer.

Technical debt builds up when developers implement a quick-and-dirty solution at the expense of long-term maintainability.
Classical ML Technical Debt
D. Sculley et al. have provided a great summary of the types of technical debt these systems can accumulate. In their paper “Machine Learning: The High-Interest Credit Card of Technical Debt,” they break these down into three broad areas.
Generative AI introduces new forms of technical debt, many of which may not be obvious. This section explores the sources of this hidden technical debt.
Tools are a powerful way to extend an LLM’s capabilities. However, as the number of tools used increases, they can become hard to manage.
Tool sprawl doesn’t only present a discoverability and reuse problem; it can also negatively affect the quality of a generative AI system. When tools proliferate, two key failure points come up.
The cleanest solution for tool sprawl is to be strategic and minimal with the tools a team uses.
However, the right governance strategy can make managing multiple tools, and who has access to them, scalable as more and more teams integrate GenAI into their projects and systems. Databricks products such as Unity Catalog and AI Gateway are built for this type of scale.
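As a framework-agnostic illustration of being strategic and minimal with tools (this is not a Unity Catalog or AI Gateway API), the sketch below keeps every tool in one curated registry with a clear description and owner, so duplicates are rejected rather than quietly accumulating.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str   # what the LLM sees when deciding whether to call the tool
    owner: str         # team responsible for maintaining it
    func: Callable

class ToolRegistry:
    """A single, curated place to register tools so overlap is easy to spot."""
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"Tool '{tool.name}' already exists; consolidate instead of duplicating.")
        self._tools[tool.name] = tool

    def catalog(self) -> list[dict]:
        # The minimal set of tool definitions exposed to the agent.
        return [{"name": t.name, "description": t.description} for t in self._tools.values()]

registry = ToolRegistry()
registry.register(
    Tool("lookup_order", "Fetch an order by ID from the orders table.", "commerce-team",
         lambda order_id: {"order_id": order_id})
)
```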
Even though state-of-the-art models can handle pages of instructions, prompts that are overly complex can introduce issues such as contradicting instructions or out-of-date information. This is especially the case when prompts are not edited, but are just appended to over time by different domain experts or developers.
As different failure modes come up, or new queries are added to the scope, it’s tempting to just keep adding more and more instructions to an LLM prompt. For example, a prompt might start by providing instructions to handle questions related to finance, and then branch out to questions related to product, engineering, and human resources.
Just as a “god class” in software engineering is not a good idea and should be broken up, mega-prompts should be separated into smaller ones. In fact, Anthropic mentions this in its prompt engineering guide, and as a general rule, having multiple smaller prompts rather than a long, complex one helps with clarity, accuracy, and troubleshooting.
Frameworks can help keep prompts manageable by tracking prompt versions and enforcing expected inputs and outputs. An example of a prompt versioning tool is MLflow Prompt Registry, while prompt optimizers such as DSPy can be run on Databricks to decompose a prompt into self-contained modules that can be optimized individually or as a whole.
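As a rough sketch of that decomposition idea, the DSPy program below splits one mega-prompt into two smaller modules, each with its own signature. The model name and the department-routing design are illustrative assumptions, not the only way to break up a prompt.

```python
import dspy

# Point DSPy at an LLM endpoint (the model identifier here is illustrative).
dspy.configure(lm=dspy.LM("databricks/databricks-meta-llama-3-70b-instruct"))

class RoutedAssistant(dspy.Module):
    """Instead of one prompt handling finance, product, and HR questions,
    route the question first and then answer with a smaller, focused module."""
    def __init__(self):
        super().__init__()
        self.route = dspy.Predict("question -> department")
        self.answer = dspy.ChainOfThought("question, department -> answer")

    def forward(self, question: str):
        department = self.route(question=question).department
        return self.answer(question=question, department=department)

assistant = RoutedAssistant()
# print(assistant(question="How do I submit a travel expense?").answer)
```

Each module can then be versioned, tested, and optimized on its own, rather than editing one sprawling prompt every time a new failure mode appears.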
There’s a reason why tracing has been receiving attention lately, with most LLM libraries and tracking tools offering the ability to trace the inputs and outputs of an LLM chain. When a response returns an error—the dreaded “I’m sorry, I can’t answer your question”—examining the inputs and outputs of intermediate LLM calls is crucial for pinpointing the root cause.
I once worked on an application where I initially assumed that SQL generation would be the most problematic step of the workflow. However, inspecting my traces told a different story: the biggest source of errors was actually a query rewriter step, where we updated entities in the user question to match our database values. The LLM would rewrite queries that didn’t need rewriting, or stuff the original query with all sorts of extra information, which would then regularly derail the subsequent SQL generation step. Tracing made it possible to identify the problem quickly.
Tracing the right LLM calls can take time. It’s not enough to implement tracing out of the box. Properly instrumenting an app with observability, using a framework such as MLflow Traces, is a first step to making agent interactions more transparent.
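Here is a minimal sketch of that kind of instrumentation with MLflow Tracing, assuming a simple rewrite-retrieve-generate chain; the helper functions are stand-ins for your own pipeline steps.

```python
import mlflow

def rewrite_query(question: str) -> str:
    # A granular span for just the rewriting step, so bad rewrites show up in the trace.
    with mlflow.start_span(name="query_rewriter") as span:
        span.set_inputs({"question": question})
        rewritten = question  # placeholder for an LLM call
        span.set_outputs({"rewritten": rewritten})
    return rewritten

def retrieve(query: str) -> list[str]:
    return ["..."]  # placeholder for vector search

def generate_answer(question: str, docs: list[str]) -> str:
    return "..."  # placeholder for an LLM call

@mlflow.trace  # captures the inputs, outputs, and child spans of the whole chain
def answer_question(question: str) -> str:
    rewritten = rewrite_query(question)
    docs = retrieve(rewritten)
    return generate_answer(question, docs)

answer_question("Which team scored the most points last season?")
```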
LLMs are remarkable because you can pass them a few simple prompts, chain the results together, and end up with something that seems to understand nuance and instructions really well. But go too far down this path without grounding responses in user feedback, and quality debt can build up quickly. This is where creating a “data flywheel” as soon as possible can help: collect feedback from real usage, turn it into evaluation data, and use those evaluations to improve the system.
I was reminded of the importance of human feedback when developing a text-to-SQL application to query sports statistics. The domain expert was able to explain how a sports fan would want to interact with the data, clarifying what they would care about and providing other insights that I, as someone who rarely watches sports, would never have been able to think of. Without their input, the application I created likely would not have met the users’ needs.
Although capturing human feedback is invaluable, it’s usually painfully time-consuming. One first needs to schedule time with domain experts, then create rubrics to reconcile differences between experts, and then evaluate the feedback for improvements. If the feedback UI is hosted in an environment that business users can’t access, going back and forth with IT administrators to provide the right level of access can feel like an interminable process.
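As a small, hypothetical illustration of reconciling expert feedback, the sketch below aggregates per-response ratings from multiple domain experts into a majority label and flags responses where the experts disagree, which are exactly the cases worth discussing against a shared rubric.

```python
from collections import Counter

# Hypothetical ratings: response_id -> {expert_name: "pass" | "fail"}
ratings = {
    "resp-001": {"alice": "pass", "bob": "pass", "carol": "fail"},
    "resp-002": {"alice": "fail", "bob": "fail", "carol": "fail"},
}

def reconcile(ratings: dict) -> list[dict]:
    reconciled = []
    for response_id, votes in ratings.items():
        counts = Counter(votes.values())
        label, top = counts.most_common(1)[0]
        reconciled.append({
            "response_id": response_id,
            "label": label,                    # majority vote across experts
            "agreement": top / len(votes),     # fraction of experts who agree
            "needs_review": top < len(votes),  # disagreement -> revisit the rubric
        })
    return reconciled

for row in reconcile(ratings):
    print(row)
```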
Regularly consulting with end users, business sponsors, and adjacent teams to see whether you are building the right thing is table stakes for all kinds of projects. However, with generative AI projects, stakeholder communication is more crucial than ever before.
Why frequent, high-touch communication is important
There are many other forms of technical debt that may need to be addressed in generative AI projects, including enforcing proper data access controls, putting guardrails in place to manage safety and prevent prompt injections, keeping costs from spiraling, and more. I’ve covered only the ones that seem most important here and most easily overlooked.
Classical ML and generative AI are different flavors of the same technical domain. While it’s important to be aware of the differences between them and consider the impact of these differences on how we build and maintain our solutions, certain truths remain constant: communication still bridges gaps, monitoring still prevents catastrophes, and clean, maintainable systems still outperform chaotic ones in the long run.
Want to assess your organization’s own AI maturity? Read our guide: Unlock AI value: The enterprise guide to AI readiness.