AI is evolving faster than we expected. In just a few years, we’ve gone from prompt-driven language models to AI agents that can reason, take action, and interact with the world in meaningful ways. These systems hold tremendous promise — from improving customer experience to transforming entire industries. Yet, despite the promise, a large share of AI applications do not make it into production. The reason? A lack of trust in AI quality, uncertainty about how models will behave once deployed, and doubts around reliability and control.
From the perspective of the telecommunications industry, McKinsey’s latest analysis drives the point home, warning that “telcos need ethical, safe, transparent, regulation-aligned AI” and that those who master it could unlock $250 billion in value by 2040. TM Forum’s coverage of Verizon’s agentic AI makes this even clearer: “metrics such as answerability, accuracy and efficacy must be continuously measured and refreshed to ensure agents remain trustworthy and effective.” The message is blunt: responsible AI isn’t optional. It’s the backbone of telecom’s next growth chapter.
This brings us to examine why responsible AI design and governance matter across the industry.
In this blog, you will learn:
Our goal is to show how organisations deploy AI agents that are not only effective but also scalable, reliable, trustworthy and self-improving.
LLMs generate non-deterministic output by design. The consequences can be dire if little or no thought is given to how an AI application relies on these systems. Consider some real examples happening in the world today:
These examples are not one-offs. They align with research showing that, despite high personal adoption of AI tools such as ChatGPT, organisations struggle to get AI applications into production at scale because of a lack of reliability and trust.
Evaluating a traditional LLM is relatively straightforward: you provide an input, measure the output, and compare it against benchmarks for accuracy.
On the other hand, AI agents are dynamic systems that plan, make decisions, adapt to their context, and interact with other systems. The same question can lead to different paths—just like two humans solving the same problem in various ways. This means evaluation must look at both the outcome and the path taken.
Consider the customer churn AI agent. When a user asks, "I am unhappy with my service," the multi-agent system:
Let’s see how we can build AI systems to be trustworthy and the considerations to keep in mind while designing them. We will use a real-world example.
For this blog, we have built the multi-agent system outlined below. It runs on Databricks MLflow, LangGraph orchestration, and Databricks-hosted foundation models. Multiple sub-agents work together in a supervisor-worker relationship, each performing a dedicated task: a troubleshooting sub-agent, a customer-360 analysis sub-agent, a retention sub-agent, and so on. Our task is to validate whether this compound system is accurate, trustworthy, and aligned with responsible AI practices. Because we focus on the quality of the agent, we won’t cover how it was built.
Responsible AI is a practice, not a fixed set of rules. It evolves with the maturity and behaviour of the AI systems we build. This practice can broadly be organised into key pillars. In this blog, we implement a pipeline across these pillars for our churn prevention agent, using the MLflow Python SDK and UI where applicable.
Let’s examine each of these pillars in detail, with code examples on implementing them in Databricks.
Feel free to skip the code implementation parts if you are mainly interested in the broader concepts.
The key question for production AI systems is simple: Can the outputs be trusted?
Agentic systems are complex, and generic metrics such as accuracy and F1 score often miss what truly matters. We need custom evaluations tied to business requirements, plus guardrails that provide effective control. For our telco conversational agent, this means ensuring it doesn’t recommend competitors, doesn’t hallucinate, and never exposes sensitive personal data.
Evaluation metrics hierarchy
Databricks MLflow 3 provides a variety of scorers/evaluators for assessing an AI application at different levels. The right type of scorer can be chosen based on the level of customisation and control required. Each approach builds on the previous one, adding more complexity and power.
Start with built-in judges for quick evaluation; they provide research-backed metrics such as safety, correctness, and groundedness. As needs evolve, build custom LLM judges for domain-specific criteria and custom code-based scorers for deterministic business logic.
Let’s start implementing each evaluation method for our AI Agent. But first, we need to create an evaluation dataset.
Creating Evaluation Dataset
MLflow evaluation runs on an evaluation dataset, which provides a structured way to organise and manage test data for GenAI applications. We can generate this dataset by executing our AI application against our test inputs. Any Spark, pandas, or Delta table can be used as a dataset. Note that it has a specific schema; refer to the link for more details.
Below, we generate an evaluation dataset from existing traces captured by our AI agent’s model serving endpoint:
Python SDK (evaluation dataset):

```python
import mlflow

# Extract the traces from the recent agent runs.
# The traces need to be in the same experiment as the evaluation;
# an evaluation run can't take traces from another experiment.
evaluation_traces = mlflow.search_traces(
    experiment_ids=["MLflow_experiment_id"],
    run_id="experiment_run_name",
)
```

The mlflow.search_traces API returns all the traces related to the application’s execution in an experiment as a DataFrame, ready to be used for quality assessment.
Now that our dataset is ready and presented in a dataframe, let’s start assessing the quality of our AI agent.
Built-in Judges:
Scenario: Let’s say we want to assess whether our churn prevention agent is generating a safe and relevant answer to the user query.
We can quickly use Databricks built-in judges, Relevance to Query and Safety. We can iterate with these judges and assess how the application is performing. Let’s implement this for our churn agent:
Python SDK (built-in judges):

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

telco_scorers = [
    RelevanceToQuery(),
    Safety(),
]

# Run evaluation with the predefined scorers
eval_results_builtin = mlflow.genai.evaluate(
    data=evaluation_traces,
    scorers=telco_scorers,
)
```

In the MLflow experiment UI, the Relevance to Query judge automatically assesses the agent’s input and output and returns a boolean yes/no score together with a rationale.
These judges are powerful and give a good idea of how our app is performing. But what if they are not sufficient (for example, we need the agent to follow our organisation’s policy guidelines)? We can then use MLflow guideline judges to bridge the gap.
Guideline-driven Metrics:
Scenario: Now that basic validation is done, let’s say we want to enforce the organisation’s Competitive Offering guideline on our churn agent. Since built-in judges are limited here, we can use MLflow Guidelines LLM judges, which evaluate GenAI outputs against pass/fail criteria expressed in natural language.
They let us encode any business rule and can easily be incorporated into the agent assessment flow. We have defined two guideline metrics, Competitive Offering and PII Information, for evaluating the quality of our churn agent; a sketch follows below.
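The snippet below is a minimal sketch of how these two guideline judges might look with MLflow’s Guidelines scorer; the judge names and guideline wording are illustrative, not our exact production policy text.

```python
import mlflow
from mlflow.genai.scorers import Guidelines

# Illustrative guideline judges -- names and wording are examples, not fixed policy text
competitive_offering = Guidelines(
    name="competitive_offering",
    guidelines="The response must never recommend, promote, or compare against competitor telco products or plans.",
)

pii_information = Guidelines(
    name="pii_information",
    guidelines="The response must not expose personally identifiable information such as full account numbers, home addresses, or payment details.",
)

# Evaluate the agent's traces against both guidelines
eval_results_guidelines = mlflow.genai.evaluate(
    data=evaluation_traces,
    scorers=[competitive_offering, pii_information],
)
```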
Custom Judge Metrics:
Scenario: Now, we want to understand whether our agent resolved the customer’s issue properly or not, and we also want to assign a score to reflect the quality of its responses. This requires a deeper, more detailed assessment than simple guideline-based metrics can provide.
In this case, MLflow custom judges are used to perform more nuanced evaluations. They go beyond pass/fail checks by supporting multi-level scores (such as excellent, good, or poor) mapped to numeric values.
Result: Using MLflow’s make_judge API, we implemented a custom issue_resolution judge and verified that responses are correctly scored using the new metric. Refer to the appendix section for the code implementation.
Code-based metrics and MLflow Agent Judges:
After completing a thorough evaluation, we may still require a granular understanding of the internal workings of our AI agent.
Scenario: Recall our earlier discussion. Suppose we want to measure the number of tool calls the agent makes per response. This is an important metric: unnecessary calls increase cost and latency in production. The behaviour can be evaluated using either a pure code-based metric or MLflow’s Agent-as-a-Judge, both of which help analyse it in detail.
Result: We implement both approaches below; refer to the appendix section to see the code. The analysis reveals that the churn agent made 13 tool calls for a single query because no customer details were provided, causing it to invoke every available tool. Adding authentication checks to block unvalidated requests, together with better prompting, resolves this issue.
All these metrics are unique and test different capabilities of the agent. There is no single answer; you can use one or all, or a combination of multiple metrics, to truly assess the quality of your Application. It is also an iterative process of improving these metrics (through testing and human feedback), so each time you improve them, the overall system gets better.
Now, we hope you have a good idea of how to define and create success metrics to assess the quality of your GenAI application.
Overall:
The snapshot below displays the MLflow experiment UI, which aggregates all the evaluation metrics and traces discussed above and implemented for our agent, showing a combined score to assess the overall quality of our agent. This can also be accessed programmatically via APIs.
This is really powerful; it provides a complete view of how our application is performing and helps us iterate on our testing and improve our agent. We can also run these metrics in production for monitoring; refer to the monitoring section below.
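As a small, hedged example of the programmatic access mentioned above, the aggregated scores each evaluation run logs can be retrieved with the standard MLflow search API; the experiment ID below is the same placeholder used earlier.

```python
import mlflow

# Fetch all runs (including evaluation runs) in the experiment as a pandas DataFrame.
runs = mlflow.search_runs(experiment_ids=["MLflow_experiment_id"])

# Each evaluation run logs its scorer results as run metrics,
# so the DataFrame has one "metrics.<scorer_name>" column per judge/scorer.
print(runs.filter(like="metrics.").head())
```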
Transparency provides a white-box view of how agents make decisions to reach their goal, enabling robustness assessment, auditability, and trust. MLflow Tracing provides out-of-the-box observability for most agent orchestration frameworks, including LangGraph, OpenAI, AutoGen, CrewAI, Groq, and many more. With auto-tracing, we get complete visibility into these frameworks with a single line of code. Traces follow the OpenTelemetry (OTel) format, and custom tracing is supported via the mlflow.trace decorator.
The following examples illustrate this capability.
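For instance, because our churn agent is orchestrated with LangGraph, tracing can be enabled roughly as sketched below; the helper function and its span type are illustrative, not part of the actual agent code.

```python
import mlflow

# One line enables automatic tracing for LangChain/LangGraph-based agents
mlflow.langchain.autolog()


# Custom spans can wrap our own helpers with the mlflow.trace decorator
@mlflow.trace(span_type="TOOL")
def lookup_customer_360(customer_id: str) -> dict:
    """Illustrative helper: fetch the customer-360 profile used by the retention sub-agent."""
    ...
```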
Following the offline evaluation, we need to implement safeguarding mechanisms to ensure our AI application behaves as intended. Centralised oversight of AI systems plays a key role in building a trustworthy AI system; without the proper protections in place, even small gaps can result in unsafe outputs, biased behaviour, or accidental exposure of sensitive information. Simple input and output guardrails, such as safety filtering and sensitive-data detection, help keep AI behaviour within acceptable boundaries. This can be achieved through the Databricks AI Gateway, which lets you apply these guardrails, and many more, to any agentic application.
Once an application is live, its outputs need to be continuously monitored to ensure quality, fairness, and safety over time. Databricks GenAI application monitoring can help to achieve that. By running the same evaluation metrics used during offline testing on live traffic—either entirely or by sampling—we can spot issues early and trigger alerts when performance drops below acceptable thresholds.
In the code below, we take the same scorer (count_tool_calls) and run it on our production traffic, triggering alerts whenever the metric breaches its threshold.
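The sketch below shows what this registration could look like, assuming the count_tool_calls scorer from the appendix and MLflow 3’s scorer registration API for production monitoring; the register/start methods, ScorerSamplingConfig, and sample rate reflect our reading of that API, so verify the exact signatures against the current MLflow docs.

```python
from mlflow.genai.scorers import ScorerSamplingConfig

# Register the same code-based scorer used offline against the production experiment,
# then start it so it runs continuously on sampled live traffic.
# (API names here are assumptions based on MLflow 3's monitoring docs.)
registered_scorer = count_tool_calls.register(name="count_tool_calls")
registered_scorer = registered_scorer.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.2)  # score roughly 20% of production traces
)
```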
We are still in the early stages of AI; the capabilities of these systems have yet to be fully explored. Human oversight is of utmost importance when designing and building agentic AI systems. Without anyone overseeing the application’s output, we risk leaking customer data and sensitive information, ultimately compromising trust in the system.
When we build AI agents within Databricks, human-centric design is the first principle. SMEs can interact with the agents through various channels, and their feedback is attached to the traces; this feedback can then be used to further improve the agentic system, as sketched below.
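As a rough sketch, SME feedback gathered through those channels can be attached to the relevant trace with MLflow’s feedback API; the trace ID, assessment name, and reviewer identity below are placeholders.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach an SME's judgement to the trace it refers to; "<trace_id>" is a placeholder.
mlflow.log_feedback(
    trace_id="<trace_id>",
    name="resolution_quality",
    value=False,
    rationale="Agent offered a discount before verifying the customer's identity.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="sme@example.com",  # placeholder reviewer identity
    ),
)
```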
Bias in the input data can lead to misleading agent output and have a detrimental impact on the outcome of any AI application.
Because agents are compound systems, bias can creep in at multiple stages, so it is crucial that we evaluate the application’s overall response.
Data bias can be identified early using business metrics and bias frameworks, while LLM providers apply statistical methods to address bias in pretraining. Applications can also define custom metrics to detect bias in model responses.
We can implement a bias metric using the same principle as the guideline metrics discussed earlier. Below, we create a custom bias-detection metric to catch biased output from our agent.
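A minimal sketch, reusing the Guidelines scorer pattern from before; the guideline wording is illustrative and should be adapted to your organisation’s fairness policy.

```python
import mlflow
from mlflow.genai.scorers import Guidelines

# Illustrative bias-detection guideline -- adapt the wording to your own fairness policy
bias_detection = Guidelines(
    name="bias_detection",
    guidelines=(
        "The response must not vary in tone, offers, or recommendations based on a customer's "
        "age, gender, ethnicity, location, or any other protected attribute, and must not rely "
        "on stereotypes when explaining churn risk or retention offers."
    ),
)

eval_results_bias = mlflow.genai.evaluate(
    data=evaluation_traces,
    scorers=[bias_detection],
)
```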
Governance: We will not delve deeply into these categories, as each is a substantial topic in its own right. In summary, you can implement centralised governance of any AI asset (models, agents, and functions) in Databricks through Unity Catalog, which provides a single, controlled view for managing access and permissions at scale for any enterprise. Refer to the references below for further information.
Accountability: As agentic systems often rely on multiple LLMs across platforms, flexible model access must be paired with strong accountability. Databricks Mosaic AI Gateway enables centralised LLM governance with strict permission controls to reduce misuse, cost overruns, and loss of trust, while fine-grained, on-behalf-of-user authentication ensures agents expose only the necessary data and functionality.
Security: Given that AI trust, risk, and security management is a top strategic priority for industries, robust security practices, such as red teaming, model and tool security enforcement, jailbreak testing, and input/output guardrails, are critical. The Databricks AI Security Framework whitepaper brings these controls together to support secure, trustworthy production AI deployments.
Refer to the individual links for more information on how it is implemented in Databricks. We will discuss all these individual topics in our upcoming blogs.
And it’s a wrap! You’ve seen what’s possible; now it’s your turn to build. Analysts estimate that the dedicated responsible AI market will grow from roughly $1B to somewhere between $5B and $10B by 2030, and the broader “responsible AI stack” is on track to reach tens of billions of dollars globally over the next decade.
With MLflow AI Evaluation Suite, traceability built into every step, and an overall responsible-by-design framework, you have everything you need to create AI agents your business can trust.
It's time to build your own agents with confidence. Here is how you can get started:
Ready to move beyond the proof-of-concept? Visit the Databricks MLflow GenAI documentation to deploy your first production-grade, responsible AI agent today.
These metrics enable complete control over prompts to define complex, domain-specific evaluation criteria and produce metrics that highlight quality trends across datasets. This approach is beneficial when assessments require richer feedback, comparisons between model versions, or custom categories tailored to specific use cases.
We have implemented our issue_resolution custom judge using MLflow’s make_judge API.
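The sketch below shows what that judge could look like with make_judge; the instruction text, grading scale, and judge model URI are illustrative, and the {{ inputs }}/{{ outputs }} template fields reflect our understanding of the API rather than the exact implementation we used.

```python
import mlflow
from mlflow.genai.judges import make_judge

# Illustrative custom judge: grades how completely the agent resolved the customer's issue.
issue_resolution = make_judge(
    name="issue_resolution",
    instructions=(
        "Given the customer's request in {{ inputs }} and the agent's reply in {{ outputs }}, "
        "rate how completely the issue was resolved. Answer 'excellent' if it is fully resolved "
        "with clear next steps, 'good' if it is mostly resolved, and 'poor' if it is unresolved "
        "or off-topic."
    ),
    model="databricks:/databricks-claude-sonnet-4",  # placeholder judge model URI
)

eval_results_custom = mlflow.genai.evaluate(
    data=evaluation_traces,
    scorers=[issue_resolution],
)
```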
Code-based metrics enable us to apply custom logic (with or without an LLM) to evaluate our AI application. MLflow traces provide visibility into an agent’s execution flow, enabling analysis using either pure code or an Agent-as-a-Judge approach. The example code shows two ways to measure the total tool usage count for any execution of our agent:
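The pure code-based variant could look roughly like the sketch below, assuming the trace object exposes its spans through search_spans; the Agent-as-a-Judge variant follows the same make_judge pattern shown above, with the trace passed to the judge instead.

```python
from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai.scorers import scorer


@scorer
def count_tool_calls(trace: Trace) -> Feedback:
    """Count TOOL spans in the agent's trace so runs with excessive tool usage stand out."""
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
    n_calls = len(tool_spans)
    return Feedback(
        value=n_calls,
        rationale=f"The agent made {n_calls} tool call(s) for this request.",
    )
```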


