May 22, 2026

Observability for any agent, anywhere: Production-ready tracing with OpenTelemetry & Unity Catalog on Databricks

OpenTelemetry traces in Unity Catalog create a continuous improvement flywheel for AI agents through analytics, evals, and monitoring.

by Firas Farah, Bruno Faria and Anoop Sunke

The Problem: AI agents generate massive volumes of trace data, but traditional observability tools make that data expensive to retain, difficult to govern, and hard to use in evaluation and analytics workflows.
The Solution: Databricks now supports writing OpenTelemetry (OTel) traces directly to Unity Catalog tables via a fully managed, serverless ingestion path.
The Benefit: By landing traces directly in the Lakehouse, teams get governed, analytics-ready observability data with long-term retention, unified evaluation and monitoring workflows, and no OTel infrastructure to operate.
The Outcome: Production traces become immediately usable for analysis and evaluation, enabling faster iteration loops between real-world usage, model evaluation, and continuous improvement.

Why AI Tracing Breaks Traditional Observability

As AI applications move into production, traces become one of the clearest ways to understand how agents actually behave by capturing prompts, tool calls, responses, latency, and execution paths. Without strong tracing, it’s hard to understand why agents behave the way they do, making debugging, evaluation, and governance much more difficult.

AI traces quickly become valuable for analytics, evaluation, and monitoring workflows beyond traditional debugging and observability use cases. Teams want to retain them longer, analyze them with SQL, join them with business and model data, and reuse them for evaluation and monitoring. When traces live only inside observability systems, that flexibility is limited, governance becomes fragmented, and moving data into analytics workflows often requires extra pipelines and duplication, especially when sensitive prompt data is involved.

OTel Trace Ingestion

Databricks now supports writing OTel traces directly to Unity Catalog using the OpenTelemetry (OTel) format. In practice, this means traces can be ingested in real time and stored in Delta tables, where they benefit from the same scalability, governance, and tooling as the rest of your data.

This changes how teams can use trace data:

Real-time ingestion with practical retention: Traces can be written as they’re generated at high throughput and retained long-term without the cost pressure typically associated with observability platforms.
Analyze and govern using the Lakehouse: Once traces are in tables, you can treat them like any other dataset: query them with SQL, build dashboards, run ETL pipelines, use tools like Genie, and apply governance controls such as PII masking.
Use the full MLflow evaluation stack: MLflow makes it easy to search, filter, and drill down into your traces for debugging. Persisting traces in Unity Catalog removes typical experiment constraints (such as trace caps), making it easier to run large offline evaluations, monitor production systems, and continuously improve quality as workloads grow.

SaaS vs. Lakehouse

So why not rely entirely on a SaaS observability tool?

Retention economics: Agents generate massive text payloads. Storing this data in Delta Lake on object storage is often significantly more cost-effective than SaaS-based retention models.
The PII deadlock: Sending raw prompts to third-party platforms can create InfoSec friction. Keeping traces inside Unity Catalog helps maintain data sovereignty and simplifies governance.
Analytics, not just telemetry: While SaaS tools are strong for operational metrics like latency, the Lakehouse provides an analytics engine. You can join traces with business data, such as revenue and conversions, to understand real impact and go beyond system health. Furthermore, the Lakehouse enables you to apply AI directly to your traces and to build evaluation frameworks to continuously improve system quality.

Architecture: Serverless OpenTelemetry ingestion

Databricks supports ingesting OpenTelemetry (OTel) traces, logs, and metrics directly into Unity Catalog tables, using the OTel standard to separate instrumentation from storage.

Databricks removes the operational complexity of traditional, multi-hop telemetry pipelines by providing a managed ingestion layer, transparently powered by Zerobus Ingest. Zerobus Ingest acts as a fully managed, serverless ingestion engine that natively supports standard OpenTelemetry protocols (OTLP) via gRPC for open-source collectors, while its REST API capabilities enable seamless integration with application frameworks like MLflow. Applications can easily export spans, logs, and metrics directly to Unity Catalog tables, where the data is stored in Delta format. With a “single-sink” architecture, Zerobus Ingest simplifies observability by streaming data directly to the lakehouse. Existing OLTP-compatible collectors can point directly to this endpoint via gRPC, entirely bypassing intermediate message buses like Kafka. Zerobus Ingest acts as your high-throughput telemetry pipeline, handling ingestion and durability with zero infrastructure overhead. Any OTel-compatible client can export traces to this endpoint, including popular AI agent frameworks across many programming languages.

From there, traces, logs, and metrics become first-class data in the Lakehouse, powering ad-hoc SQL analysis, dashboards, downstream analytics, and MLflow evaluation and monitoring workflows. Unifying your telemetry creates a continuous improvement flywheel where production behavior feeds evaluation and analysis, which in turn drives faster iteration and better agent performance.

Databricks Lakehouse Platform

Tutorial: Wiring Traces into the Lakehouse

Sample agent: Support manager assistant

For this blog, we’ll create a simple support manager assistant that we can use to demonstrate tracing end-to-end. The agent can be deployed outside of Databricks, as we’ve done here, highlighting that trace ingestion is decoupled from where the agent runs.

We built a LangGraph agent powered by a Databricks-hosted Claude Sonnet 4.6 model for reasoning and response generation. The agent calls a Genie Space as a tool, which you can deploy here.

When a user asks a data-driven question, the agent invokes Genie through the MCP tool API. Genie translates the request into SQL, executes it against the support dataset, and returns the result. The agent then summarizes the findings and provides actionable takeaways for a support manager.

Support manager assistant

Setting up OTel tracing with UC

Before instrumenting the agent, we first configure the tables in UC that will store OpenTelemetry traces. In this example, we use MLflow to create the underlying OpenTelemetry tables in Unity Catalog and link them to an MLflow experiment so traces can be searched, analyzed, and annotated from the UI. Start by identifying (or creating) a SQL warehouse and an MLflow experiment, then use the MLflow Python library to provision the Unity Catalog tables and associate the schema with the experiment. For full steps, follow the docs here.

This setup creates Unity Catalog tables for OpenTelemetry spans, logs, and metrics. The underlying data is stored in OpenTelemetry-compliant table formats, and the MLflow service automatically creates Databricks SQL views alongside them that transform the OpenTelemetry data into an MLflow-friendly format for easier querying and analysis. These include:

<table_prefix>_otel_spans: detailed span-level execution data for each request
<table_prefix>_otel_logs: structured log/event data captured during execution
<table_prefix>_otel_metrics: numerical telemetry captured during execution
<table_prefix>_otel_annotations: MLflow-specific trace data that is not a standard OTel signal, including metadata, tags, assessments/feedback, expectations, and run links
<table_prefix>_trace_unified: a consolidated view that assembles trace data into a single record per trace, including raw span data and trace metadata
<table_prefix>_trace_metadata: MLflow tags, metadata, and assessments grouped by trace ID; more performant than the unified view when you only need MLflow trace metadata

Unity Catalog tables for OpenTelemetry

After setting up the experiment, agent instrumentation remains the same. Any OTel-compatible instrumentation library can export traces to the configured endpoint. You can do automatic and/or manual tracing as described here. In our example, we rely on mlflow.langchain.autolog() to capture the detailed LangGraph execution (model calls and tool calls). We also wrap the entrypoint with @MLflow.trace to establish a request-level root span, allowing each invocation to be observed as a single end-to-end execution.

Inspecting a sample trace

Now that the agent is instrumented and traces are flowing into Unity Catalog, let’s look at a real execution.

For this example, we asked the Support Manager Assistant:

"Which support engineer should I put up for promotion?"

The agent evaluated the request, called the Genie space multiple times to gather supporting data, and returned a recommendation based on performance metrics.

Performance Analysis

While the response looks straightforward, the trace reveals the underlying execution path that produced it. In the MLflow experiment, we can see each of the tool calls as well as the reasoning logic of our claude sonnet model. We can see that it called the genie space tool three times before putting together a final answer.

Genie Space Tool

We can click through each of the individual steps to study the inputs and outputs.

Genie Space Tool

Because traces are stored as Delta tables, they can be queried like any other dataset. We can start with the mlflow_experiment_trace_unified view, where we will find a record that includes the request, response, trace metadata, and an array of the spans.

Genie Space Tool

Beyond Debugging: Analytics on Trace Data

Now that traces are stored in Unity Catalog, they become immediately available for both batch and streaming analytics.

Governance in Unity Catalog

Prompts and responses, however, often contain sensitive information, so treating trace data as governed data is critical. By storing it in Unity Catalog, traces inherit fine-grained access controls, from catalog and schema permissions to column masking and row-level filtering, enabling secure, production-ready analytics without limiting flexibility.

Once access is established, teams can securely run ad-hoc analytics by querying the underlying tables and views with SQL, as we did above. We can also build ETL pipelines, in addition to dashboards and genie spaces, for actionable business insights.

Dashboards

The MLflow Experiment UI now ships with native observability dashboards for traces in Unity Catalog, including views for trace volume, errors, latency, token usage, and cost. For most teams, that's enough to monitor day-to-day agent health.

Dashboard

When you need a view that goes beyond the native visuals, the trace tables are still just Delta tables in Unity Catalog. You can build a custom AI/BI Dashboard against them and write standard SQL (with help from AI) to model whatever your team cares about.

To show what custom dashboards can add on top of the native views, we built an AI Operations Center on our trace tables. Below are a couple of capabilities worth mentioning.

Custom Cost Analysis with Contract Pricing

Native cost metrics rely on standard list prices, which can be off for teams that have negotiated rates or run fine-tuned models with different pricing. Because we control the SQL, we embedded our pricing logic directly into the query. The dashboard tracks token usage by model type (for example, GPT 5.5 vs. Claude 4.6 Sonnet) and applies our contract rates to produce an Estimated Cost per Trace that reflects what we actually pay. That makes it easy to catch expensive outliers, like a single complex query that costs $0.50 because of a retrieval loop.

Custom Cost Analysis

Component-Level Performance

Native latency views show P50/P99 at the trace level. To go a layer deeper and see which tool is slow, we built a Tool Performance widget that breaks down latency (P50, P99) and error rates per individual tool in the agent (for example, retrieve_docs vs. generate_response). That tells us whether the LLM, a Genie tool call, or another step is the bottleneck, so we can pinpoint exactly where the user experience is degrading.

Component Level Performance

Genie spaces

Both business and technical stakeholders often want to explore agent behavior without writing SQL. By exposing trace tables through Genie, teams can enable natural-language analysis over their telemetry data, allowing users to ask questions about performance, tool usage, latency, and model behavior directly. In our example, this could include questions such as:

What types of requests require escalation?
Are tool retries increasing?
Which queries trigger the most complex execution paths?

Trace Table

ETL pipelines

Because traces are stored as Delta tables, they can feed downstream ETL pipelines just like any other dataset. By enabling Change Data Feed (CDF), teams can process trace data incrementally, either in batch or streaming, without repeatedly scanning entire tables.

This makes it possible to operationalize observability. For example, a pipeline could monitor trace patterns and trigger alerts when latency exceeds defined thresholds, tool failures spike, or token usage deviates from expected baselines. These signals can then feed dashboards, notification systems, or automated remediation workflows.

Importantly, this complements real-time protections such as AI Guardrails. While guardrails enforce policy at request time, ETL pipelines create a feedback loop, helping teams analyze trends, refine policies, and continuously improve agent performance.

Closing the Loop: From Production Traces to Evaluation

Once traces are available, they can power the full MLflow evaluation stack, enabling teams to measure, improve, and maintain the quality of their GenAI applications across the entire lifecycle. Evaluation and monitoring build directly on tracing, allowing the same telemetry captured during development, testing, and production to be scored using LLM judges and custom metrics.

Evaluate during development

MLflow allows us to run evaluations against an evaluation dataset, applying built-in or custom judges to score response quality. One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases.

Below, we create an evaluation dataset from recently captured traces. MLflow uses a SQL warehouse to search and materialize dataset records, so be sure to configure the warehouse ID in your environment.

With the dataset in place, we can define the judges that will score our application. MLflow provides a set of built-in judges, and also allows us to define custom guidelines tailored to our agent’s expected behavior.

And we can now see the results in the MLflow experiment.

Production monitoring

Development evaluations help us validate behavior before release, but production monitoring shows us how the application performs with real users. MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns. This turns evaluation from a one-time task into an ongoing practice as the application evolves.

Customers Running AI Observability on Databricks

Experian

The transition to MLflow tracing for our Eva virtual assistant and Latte automated email system has been seamless. With Traces in Unity Catalog, our data science team runs hundreds of thousands of traces through governed Delta tables and evaluates agent quality at scale - all without leaving Databricks. As we onboard more serious evaluation workflows, having tracing and evals in one governed platform means we're not maintaining separate tools for each stage of the agent lifecycle.—James Lin, Head of AI/ML Innovation, Experian

Superhuman (Grammarly)

We're standardizing on MLflow tracing as the observability layer for all of our AI agents at Superhuman. We prefer the broader platform integration over building and maintaining a custom or point solution - that maintenance burden was a real pain point for our teams. With MLflow Traces in Unity Catalog, we can scale to hundreds of thousands of traces per day, and our researchers can self-serve and explore agent behavior directly in the MLflow UI with no engineering support. Having tracing, evaluation, and monitoring all in one governed platform is exactly what we needed to move our agents into production with confidence.—Martin Jewell, Lead MLE AI Infrastructure, Superhuman

SmartSheet

We chose Databricks as our platform for GenAI, and MLflow is how our team builds and evaluates AI agents. During a three-day co-build with Databricks, we stood up two production agents using MLflow tracing, evaluations, custom judges, and labeling - and with traces stored in Unity Catalog, we can run tens of thousands of evaluations and iterate on quality with confidence as we scale.—Kapil Ashar, VP of Engineering, Smartsheet

The Standard

The Standard helps our customers achieve financial well-being and peace of mind. Data and AI are key to delivering that experience at scale. By embedding AI agent functionality - such as extracting key information from inbound underwriting documents and claim submittals - across important business functions, we are able to provide exceptional service to our customers and partners. With production tracing and monitoring, our teams can quickly understand how systems behave and make reliable updates. By governing traces in Unity Catalog alongside the rest of our data on the Databricks Data Intelligence Platform, we can query, monitor and iterate securely - without adding unnecessary complexity.—Porter Orr, AVP of AI and Automation, The Standard

Frequently Asked Questions (FAQ)

Q: Can I use this for agents running outside of Databricks?
A: Yes, the agent can be running anywhere. In fact the support assistant agent example that was used for this blog is deployed locally.

Q: What are the throughput and storage limits of this solution?
A: Ingestion throughput limit starts at 200 QPS. There is no limit on storage. Previous limits on traces per experiment are no longer applicable. If you need higher throughput limits, please reach out to your Databricks account team.

Q: What can I do to ensure my search queries, MLflow experiment experience, and downstream analytics remain performant?
A: With the latest product update, the tables are automatically liquid clustered to keep the data optimally organized. For larger trace volumes, however, you should create a materialized view on top of the derived views and incrementally refresh it to maintain query performance.

Q: How does this handle PII found in user prompts?
A: This feature does not apply any special handling to PII. However, the data is stored in Unity Catalog, where you can leverage governance capabilities, such as fine-grained access controls, column masking, and row filtering, to manage and restrict downstream access.

Observability for any agent, anywhere: Production-ready tracing with OpenTelemetry & Unity Catalog on Databricks

Why AI Tracing Breaks Traditional Observability

OTel Trace Ingestion

SaaS vs. Lakehouse

Architecture: Serverless OpenTelemetry ingestion

Tutorial: Wiring Traces into the Lakehouse

Sample agent: Support manager assistant

Setting up OTel tracing with UC

Inspecting a sample trace

Beyond Debugging: Analytics on Trace Data

Governance in Unity Catalog

Dashboards

Genie spaces

ETL pipelines

Closing the Loop: From Production Traces to Evaluation

Evaluate during development

Production monitoring

Customers Running AI Observability on Databricks

Frequently Asked Questions (FAQ)

Get started

Get the latest posts in your inbox

Sign up