OpenTelemetry traces in Unity Catalog create a continuous improvement flywheel for AI agents through analytics, evals, and monitoring.
by Firas Farah, Bruno Faria and Anoop Sunke
As AI applications move into production, traces become one of the clearest ways to understand how agents actually behave by capturing prompts, tool calls, responses, latency, and execution paths. Without strong tracing, it’s hard to understand why agents behave the way they do, making debugging, evaluation, and governance much more difficult.
AI traces quickly become valuable for analytics, evaluation, and monitoring workflows beyond traditional debugging and observability use cases. Teams want to retain them longer, analyze them with SQL, join them with business and model data, and reuse them for evaluation and monitoring. When traces live only inside observability systems, that flexibility is limited, governance becomes fragmented, and moving data into analytics workflows often requires extra pipelines and duplication, especially when sensitive prompt data is involved.
Databricks now supports writing OTel traces directly to Unity Catalog using the OpenTelemetry (OTel) format. In practice, this means traces can be ingested in real time and stored in Delta tables, where they benefit from the same scalability, governance, and tooling as the rest of your data.
This changes how teams can use trace data:
So why not rely entirely on a SaaS observability tool?
Databricks supports ingesting OpenTelemetry (OTel) traces, logs, and metrics directly into Unity Catalog tables, using the OTel standard to separate instrumentation from storage.
Databricks removes the operational complexity of traditional, multi-hop telemetry pipelines by providing a managed ingestion layer, transparently powered by Zerobus Ingest. Zerobus Ingest acts as a fully managed, serverless ingestion engine that natively supports standard OpenTelemetry protocols (OTLP) via gRPC for open-source collectors, while its REST API capabilities enable seamless integration with application frameworks like MLflow. Applications can easily export spans, logs, and metrics directly to Unity Catalog tables, where the data is stored in Delta format. With a “single-sink” architecture, Zerobus Ingest simplifies observability by streaming data directly to the lakehouse. Existing OLTP-compatible collectors can point directly to this endpoint via gRPC, entirely bypassing intermediate message buses like Kafka. Zerobus Ingest acts as your high-throughput telemetry pipeline, handling ingestion and durability with zero infrastructure overhead. Any OTel-compatible client can export traces to this endpoint, including popular AI agent frameworks across many programming languages.
From there, traces, logs, and metrics become first-class data in the Lakehouse, powering ad-hoc SQL analysis, dashboards, downstream analytics, and MLflow evaluation and monitoring workflows. Unifying your telemetry creates a continuous improvement flywheel where production behavior feeds evaluation and analysis, which in turn drives faster iteration and better agent performance.

For this blog, we’ll create a simple support manager assistant that we can use to demonstrate tracing end-to-end. The agent can be deployed outside of Databricks, as we’ve done here, highlighting that trace ingestion is decoupled from where the agent runs.
We built a LangGraph agent powered by a Databricks-hosted Claude Sonnet 4.6 model for reasoning and response generation. The agent calls a Genie Space as a tool, which you can deploy here.
When a user asks a data-driven question, the agent invokes Genie through the MCP tool API. Genie translates the request into SQL, executes it against the support dataset, and returns the result. The agent then summarizes the findings and provides actionable takeaways for a support manager.

Before instrumenting the agent, we first configure the tables in UC that will store OpenTelemetry traces. In this example, we use MLflow to create the underlying OpenTelemetry tables in Unity Catalog and link them to an MLflow experiment so traces can be searched, analyzed, and annotated from the UI. Start by identifying (or creating) a SQL warehouse and an MLflow experiment, then use the MLflow Python library to provision the Unity Catalog tables and associate the schema with the experiment. For full steps, follow the docs here.
This setup creates Unity Catalog tables for OpenTelemetry spans, logs, and metrics. The underlying data is stored in OpenTelemetry-compliant table formats, and the MLflow service automatically creates Databricks SQL views alongside them that transform the OpenTelemetry data into an MLflow-friendly format for easier querying and analysis. These include:
<table_prefix>_otel_spans: detailed span-level execution data for each request<table_prefix>_otel_logs: structured log/event data captured during execution<table_prefix>_otel_metrics: numerical telemetry captured during execution<table_prefix>_otel_annotations: MLflow-specific trace data that is not a standard OTel signal, including metadata, tags, assessments/feedback, expectations, and run links<table_prefix>_trace_unified: a consolidated view that assembles trace data into a single record per trace, including raw span data and trace metadata<table_prefix>_trace_metadata: MLflow tags, metadata, and assessments grouped by trace ID; more performant than the unified view when you only need MLflow trace metadata
After setting up the experiment, agent instrumentation remains the same. Any OTel-compatible instrumentation library can export traces to the configured endpoint. You can do automatic and/or manual tracing as described here. In our example, we rely on mlflow.langchain.autolog() to capture the detailed LangGraph execution (model calls and tool calls). We also wrap the entrypoint with @MLflow.trace to establish a request-level root span, allowing each invocation to be observed as a single end-to-end execution.
Now that the agent is instrumented and traces are flowing into Unity Catalog, let’s look at a real execution.
For this example, we asked the Support Manager Assistant:
"Which support engineer should I put up for promotion?"
The agent evaluated the request, called the Genie space multiple times to gather supporting data, and returned a recommendation based on performance metrics.

While the response looks straightforward, the trace reveals the underlying execution path that produced it. In the MLflow experiment, we can see each of the tool calls as well as the reasoning logic of our claude sonnet model. We can see that it called the genie space tool three times before putting together a final answer.

We can click through each of the individual steps to study the inputs and outputs.

Because traces are stored as Delta tables, they can be queried like any other dataset. We can start with the mlflow_experiment_trace_unified view, where we will find a record that includes the request, response, trace metadata, and an array of the spans.

Now that traces are stored in Unity Catalog, they become immediately available for both batch and streaming analytics.
Prompts and responses, however, often contain sensitive information, so treating trace data as governed data is critical. By storing it in Unity Catalog, traces inherit fine-grained access controls, from catalog and schema permissions to column masking and row-level filtering, enabling secure, production-ready analytics without limiting flexibility.
Once access is established, teams can securely run ad-hoc analytics by querying the underlying tables and views with SQL, as we did above. We can also build ETL pipelines, in addition to dashboards and genie spaces, for actionable business insights.
The MLflow Experiment UI now ships with native observability dashboards for traces in Unity Catalog, including views for trace volume, errors, latency, token usage, and cost. For most teams, that's enough to monitor day-to-day agent health.

When you need a view that goes beyond the native visuals, the trace tables are still just Delta tables in Unity Catalog. You can build a custom AI/BI Dashboard against them and write standard SQL (with help from AI) to model whatever your team cares about.
To show what custom dashboards can add on top of the native views, we built an AI Operations Center on our trace tables. Below are a couple of capabilities worth mentioning.
Custom Cost Analysis with Contract Pricing
Native cost metrics rely on standard list prices, which can be off for teams that have negotiated rates or run fine-tuned models with different pricing. Because we control the SQL, we embedded our pricing logic directly into the query. The dashboard tracks token usage by model type (for example, GPT 5.5 vs. Claude 4.6 Sonnet) and applies our contract rates to produce an Estimated Cost per Trace that reflects what we actually pay. That makes it easy to catch expensive outliers, like a single complex query that costs $0.50 because of a retrieval loop.

Component-Level Performance
Native latency views show P50/P99 at the trace level. To go a layer deeper and see which tool is slow, we built a Tool Performance widget that breaks down latency (P50, P99) and error rates per individual tool in the agent (for example, retrieve_docs vs. generate_response). That tells us whether the LLM, a Genie tool call, or another step is the bottleneck, so we can pinpoint exactly where the user experience is degrading.

Both business and technical stakeholders often want to explore agent behavior without writing SQL. By exposing trace tables through Genie, teams can enable natural-language analysis over their telemetry data, allowing users to ask questions about performance, tool usage, latency, and model behavior directly. In our example, this could include questions such as:

Because traces are stored as Delta tables, they can feed downstream ETL pipelines just like any other dataset. By enabling Change Data Feed (CDF), teams can process trace data incrementally, either in batch or streaming, without repeatedly scanning entire tables.
This makes it possible to operationalize observability. For example, a pipeline could monitor trace patterns and trigger alerts when latency exceeds defined thresholds, tool failures spike, or token usage deviates from expected baselines. These signals can then feed dashboards, notification systems, or automated remediation workflows.
Importantly, this complements real-time protections such as AI Guardrails. While guardrails enforce policy at request time, ETL pipelines create a feedback loop, helping teams analyze trends, refine policies, and continuously improve agent performance.
Once traces are available, they can power the full MLflow evaluation stack, enabling teams to measure, improve, and maintain the quality of their GenAI applications across the entire lifecycle. Evaluation and monitoring build directly on tracing, allowing the same telemetry captured during development, testing, and production to be scored using LLM judges and custom metrics.
MLflow allows us to run evaluations against an evaluation dataset, applying built-in or custom judges to score response quality. One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases.
Below, we create an evaluation dataset from recently captured traces. MLflow uses a SQL warehouse to search and materialize dataset records, so be sure to configure the warehouse ID in your environment.
With the dataset in place, we can define the judges that will score our application. MLflow provides a set of built-in judges, and also allows us to define custom guidelines tailored to our agent’s expected behavior.
And we can now see the results in the MLflow experiment.

Development evaluations help us validate behavior before release, but production monitoring shows us how the application performs with real users. MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns. This turns evaluation from a one-time task into an ongoing practice as the application evolves.
Experian
The transition to MLflow tracing for our Eva virtual assistant and Latte automated email system has been seamless. With Traces in Unity Catalog, our data science team runs hundreds of thousands of traces through governed Delta tables and evaluates agent quality at scale - all without leaving Databricks. As we onboard more serious evaluation workflows, having tracing and evals in one governed platform means we're not maintaining separate tools for each stage of the agent lifecycle.— James Lin, Head of AI/ML Innovation, Experian
Superhuman (Grammarly)
We're standardizing on MLflow tracing as the observability layer for all of our AI agents at Superhuman. We prefer the broader platform integration over building and maintaining a custom or point solution - that maintenance burden was a real pain point for our teams. With MLflow Traces in Unity Catalog, we can scale to hundreds of thousands of traces per day, and our researchers can self-serve and explore agent behavior directly in the MLflow UI with no engineering support. Having tracing, evaluation, and monitoring all in one governed platform is exactly what we needed to move our agents into production with confidence.— Martin Jewell, Lead MLE AI Infrastructure, Superhuman
SmartSheet
We chose Databricks as our platform for GenAI, and MLflow is how our team builds and evaluates AI agents. During a three-day co-build with Databricks, we stood up two production agents using MLflow tracing, evaluations, custom judges, and labeling - and with traces stored in Unity Catalog, we can run tens of thousands of evaluations and iterate on quality with confidence as we scale.— Kapil Ashar, VP of Engineering, Smartsheet
The Standard
The Standard helps our customers achieve financial well-being and peace of mind. Data and AI are key to delivering that experience at scale. By embedding AI agent functionality - such as extracting key information from inbound underwriting documents and claim submittals - across important business functions, we are able to provide exceptional service to our customers and partners. With production tracing and monitoring, our teams can quickly understand how systems behave and make reliable updates. By governing traces in Unity Catalog alongside the rest of our data on the Databricks Data Intelligence Platform, we can query, monitor and iterate securely - without adding unnecessary complexity.— Porter Orr, AVP of AI and Automation, The Standard
Q: Can I use this for agents running outside of Databricks?
A: Yes, the agent can be running anywhere. In fact the support assistant agent example that was used for this blog is deployed locally.
Q: What are the throughput and storage limits of this solution?
A: Ingestion throughput limit starts at 200 QPS. There is no limit on storage. Previous limits on traces per experiment are no longer applicable. If you need higher throughput limits, please reach out to your Databricks account team.
Q: What can I do to ensure my search queries, MLflow experiment experience, and downstream analytics remain performant?
A: With the latest product update, the tables are automatically liquid clustered to keep the data optimally organized. For larger trace volumes, however, you should create a materialized view on top of the derived views and incrementally refresh it to maintain query performance.
Q: How does this handle PII found in user prompts?
A: This feature does not apply any special handling to PII. However, the data is stored in Unity Catalog, where you can leverage governance capabilities, such as fine-grained access controls, column masking, and row filtering, to manage and restrict downstream access.
To get started, follow along with the documentation.
Subscribe to our blog and get the latest posts delivered to your inbox.