
MLflow 3.0: Build, Evaluate, and Deploy Generative AI with Confidence


Summary

  • MLflow 3.0 unifies traditional ML, deep learning, and GenAI development in a single platform, eliminating the need for separate specialized tools
  • New GenAI capabilities include production-scale tracing, revamped quality evaluation experience, feedback collection APIs and UI, and comprehensive version tracking for prompts and applications
  • The platform enables a complete GenAI development workflow: debug with tracing, measure quality with LLM judges, improve with expert feedback, track changes with versioning, and monitor in production, all demonstrated through an e-commerce chatbot example

MLflow has become the foundation for MLOps at scale, with over 30 million monthly downloads and contributions from over 850 developers worldwide powering ML and deep learning workloads for thousands of enterprises. Today, we are thrilled to announce MLflow 3.0, a major evolution that brings that same rigor and reliability to generative AI while enhancing core capabilities for all AI workloads. These powerful new capabilities are available in both open source MLflow and as a fully managed service on Databricks, where they deliver an enterprise-grade GenAI development experience.

While generative AI introduces new challenges around observability, quality measurement, and managing rapidly evolving prompts and configurations, MLflow 3.0 addresses them without requiring you to integrate yet another specialized platform. MLflow 3.0 is a unified platform across generative AI applications, traditional machine learning, and deep learning. Whether you're building GenAI agents, training classifiers, or fine-tuning neural networks, MLflow 3.0 provides consistent workflows, standardized governance, and production-grade reliability that scales with your needs.

MLflow 3.0 at a glance:

  • Comprehensive Generative AI capabilities: Tracing, LLM judges, human feedback collection, application versioning, and prompt management designed to deliver high application quality and complete observability
  • Rapid debugging and root cause analysis: View complete traces with inputs, outputs, latency, and cost, linked to the exact prompts, data, and app versions that produced them
  • Continuous improvement from production data: Turn real-world usage and feedback into better evaluation datasets and refined applications
  • Unified platform: MLflow supports all generative AI, traditional ML, and deep learning workloads on a single platform with consistent tools for collaboration, lifecycle management, and governance
  • Enterprise scale on Databricks: Proven reliability and performance that powers production AI workloads for thousands of organizations worldwide

The GenAI Challenge: Fragmented Tools, Elusive Quality

Generative AI has changed how we think about quality. Unlike traditional ML with ground truth labels, GenAI outputs are free-form, nuanced, and varied. A single prompt can yield dozens of different responses that are all equally correct. How do you measure if a chatbot's response is "good"? How do you ensure your agent is not hallucinating? How do you debug complex chains of prompts, retrievals, and tool calls?

These questions point to three core challenges that every organization faces when building GenAI applications:

  1. Observability: Understanding what's happening inside your application, especially when things go wrong
  2. Quality Measurement: Evaluating free-form text outputs at scale without manual bottlenecks
  3. Continuous Improvement: Creating feedback loops that turn production insights into higher-quality applications

Today, organizations trying to solve these challenges face a fragmented landscape. They use separate tools for data management, observability & evaluation, and deployment. This approach creates significant gaps: debugging issues requires jumping between platforms, evaluation happens in isolation from real production data, and user feedback never makes it back to improve the application. Teams spend more time integrating tools than improving their GenAI apps. Faced with this complexity, many organizations simply give up on systematic quality assurance. They resort to unstructured manual testing, shipping to production when things seem "good enough," and hoping for the best.

Solving these GenAI challenges to ship high-quality applications requires new capabilities, but it shouldn't require juggling multiple platforms. That's why MLflow 3.0 extends our proven MLOps foundation to comprehensively support GenAI on one platform with a unified experience that includes:

  • Comprehensive tracing for 20+ GenAI libraries, providing visibility into every request in development and production, with traces linked to the exact code, data, and prompts that generated them
  • Research-backed evaluation with LLM judges that systematically measure GenAI quality and identify improvement opportunities
  • Integrated feedback collection that captures end-user and expert insights from production, regardless of where you deploy, feeding directly back to your evaluation and observability stack for continuous quality improvement
"MLflow 3.0's tracing has been essential to scaling our AI-powered security platform. It gives us end-to-end visibility into every model decision, helping us debug faster, monitor performance, and ensure our defenses evolve as threats do. With seamless LangChain integration and autologging, we get all this without added engineering overhead."
— Sam Chou, Principal Engineer at Barracuda

To demonstrate how MLflow 3.0 transforms the way organizations build, evaluate, and deploy high-quality generative AI applications, we will follow a real-world example: building an e-commerce customer support chatbot. We’ll see how MLflow addresses each of the three core GenAI challenges along the way, enabling you to move rapidly from debugging to deployment. Throughout this journey, we'll leverage the full power of Managed MLflow 3.0 on Databricks, including integrated tools like the Review App, Deployment Jobs, and Unity Catalog governance that make enterprise GenAI development practical at scale.

Step 1: Pinpoint Performance Issues with Production-Grade Tracing

Your e-commerce chatbot has gone live in beta, but testers complain about slow responses and inaccurate product recommendations. Without visibility into your GenAI application's complex chains of prompts, retrievals, and tool calls, you're debugging blind and experiencing the observability challenge firsthand.

MLflow 3.0's production-scale tracing changes everything. With just a few lines of code, you can capture detailed traces from 20+ GenAI libraries and custom business logic in any environment, from development through production. The lightweight mlflow-tracing package is optimized for performance, allowing you to quickly log as many traces as needed. Built on OpenTelemetry, it provides enterprise-scale observability with maximum portability.
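Getting started is as simple as enabling autologging for your GenAI library and decorating your own functions. Here's a minimal sketch; the retrieval helper and its body are hypothetical, and we assume the OpenAI integration, so swap in the autolog call for whichever library you use:

```python
# Minimal tracing sketch: autolog LLM calls and add spans for custom logic.
import mlflow

mlflow.openai.autolog()  # similar autolog integrations exist for 20+ GenAI libraries

@mlflow.trace(span_type="RETRIEVER")
def search_products(query: str) -> list[dict]:
    # Hypothetical retrieval step; each call shows up as a span in the trace.
    ...

@mlflow.trace
def answer_customer(question: str) -> str:
    products = search_products(question)
    # ... call the LLM with the retrieved products and return the answer ...
    return "..."
```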

After instrumenting your code with MLflow Tracing, you can navigate to the MLflow UI to see every trace captured automatically. The timeline view reveals why responses take more than 15 seconds: your app checks inventory at each warehouse separately (5 sequential calls) and retrieves the customer's entire order history (500+ orders) when it only needs recent purchases. After parallelizing warehouse checks and filtering for recent orders, response time drops by more than 50%.

Step 2: Measure and Improve Quality with LLM Judges

With latency issues resolved, we turn to quality because beta testers still complain about irrelevant product recommendations. Before we can improve quality, we need to systematically measure it. This highlights the second GenAI challenge: how do you measure quality when GenAI outputs are free-form and varied?

MLflow 3.0 makes quality evaluation simple. Create an evaluation dataset from your production traces, then run research-backed LLM judges powered by Databricks Mosaic AI Agent Evaluation:
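In code, that might look like the following sketch, which reuses the hypothetical answer_customer function from the tracing example; the mlflow.genai.evaluate call and scorer names are assumed from MLflow 3's GenAI evaluation module, so check the exact names in your installed version:

```python
# Evaluation sketch: run built-in LLM judges over a small dataset.
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, RetrievalRelevance, Safety

# Each record's "inputs" mirrors the arguments of the app under test.
eval_data = [
    {"inputs": {"question": "Wireless headphones under $200 with aptX HD support?"}},
    {"inputs": {"question": "Can I return opened electronics within 30 days?"}},
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=answer_customer,  # the hypothetical chatbot entry point
    scorers=[RelevanceToQuery(), RetrievalRelevance(), Safety()],
)
```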

These judges assess different aspects of quality for a GenAI trace and provide detailed rationales for the detected issues. Looking at the results of the evaluation reveals the problem: while safety and groundedness scores look good, the 65% retrieval relevance score confirms your retrieval system often fetches the wrong information, which results in less relevant responses.

MLflow's LLM judges are carefully tuned evaluators that match human expertise. You can create custom judges using guidelines tailored to your business requirements. Build and version evaluation datasets from real user conversations, including successful interactions, edge cases, and challenging scenarios. MLflow handles evaluation at scale, making systematic quality assessment practical for any application size.
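For example, a guidelines-based judge for your support tone might look like this sketch; the Guidelines scorer name is assumed from the same evaluation module, and the guideline text is illustrative:

```python
# Custom judge sketch: encode business-specific quality criteria as guidelines.
from mlflow.genai.scorers import Guidelines

tone_judge = Guidelines(
    name="support_tone",
    guidelines=(
        "Responses must be polite, must not blame the customer, and must not "
        "promise delivery dates or product features that are not confirmed."
    ),
)

# Pass it alongside the built-in judges when calling mlflow.genai.evaluate(...).
```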

Step 3: Use Expert Feedback to Improve Quality

The 65% retrieval relevance score points to your root cause, but fixing it requires understanding what the system should retrieve. Enter the Review App, a web interface for collecting structured expert feedback on AI outputs, now integrated with MLflow 3.0. This is the beginning of your continuous improvement journey: turning production insights into higher-quality applications.

You create labeling sessions where product specialists review traces with poor retrievals. When a customer asks for "wireless headphones under $200 with aptX HD codec support and 30+ hour battery," but gets generic headphone results, your experts annotate exactly which products match ALL requirements.

The Review App enables domain experts to review real responses and source documents through an intuitive web interface, no coding required. They mark which products were correctly retrieved and identify confusion points (like wired vs. wireless headphones). Expert annotations become training data for future improvements and help align your LLM judges with real-world quality standards.


The Review App

Step 4: Track Prompts, Code, and Configuration Changes

Armed with expert annotations, you rebuild your retrieval system. You switch from keyword matching to semantic search that understands technical specifications and update prompts to be more cautious about unconfirmed product features. But how do you track these changes and ensure they improve quality?

MLflow 3.0's Version Tracking captures your entire application as a snapshot, including application code, prompts, LLM parameters, retrieval logic, reranking algorithms, and more. Each version connects all traces and metrics generated during its use. When issues arise, you can trace any problematic response back to the exact version that produced it.
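In practice, you mark the code, prompts, and configuration you are about to run as a named version so every trace links back to it. A minimal sketch, assuming MLflow 3's set_active_model API and an illustrative version name:

```python
# Version tracking sketch: associate subsequent traces with an application version.
import mlflow

# Everything traced after this call is linked to the "ecommerce-chatbot-v2"
# version, so traces and metrics can be compared across versions.
mlflow.set_active_model(name="ecommerce-chatbot-v2")
```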

Version Tracking

Prompts require special attention: small wording changes can dramatically alter your application's behavior, making them difficult to test and prone to regressions. Fortunately, MLflow's brand new Prompt Registry brings engineering rigor specifically to prompt management. Version prompts with Git-style tracking, test different versions in production, and roll back instantly if needed. The UI shows visual diffs between versions, making it easy to see what changed and understand the performance impact. The MLflow Prompt Registry also integrates with DSPy optimizers to generate improved prompts automatically from your evaluation data.
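A registered prompt version might look like the following sketch; the API names are assumed from MLflow's prompt registry, and the prompt name, template, and version URI are illustrative:

```python
# Prompt registry sketch: version a prompt, then load a pinned version at serve time.
import mlflow

mlflow.genai.register_prompt(
    name="product-support-answer",
    template=(
        "You are a helpful e-commerce support agent. Answer using only the "
        "retrieved product data below and never claim unconfirmed features.\n\n"
        "Products: {{products}}\n\nQuestion: {{question}}"
    ),
    commit_message="Be more cautious about unconfirmed product features",
)

# Load an exact version so deployments stay reproducible.
prompt = mlflow.genai.load_prompt("prompts:/product-support-answer/2")
text = prompt.format(products="...", question="Do these support aptX HD?")
```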

With comprehensive version tracking in place, measure whether your changes actually improved quality:
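One way to do that is to re-run the same judges over the same dataset against the updated application, so the scores are directly comparable (same assumed APIs and hypothetical names as in the earlier sketches):

```python
# Re-evaluation sketch: same dataset, same judges, new application version.
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, RetrievalRelevance, Safety

results_v2 = mlflow.genai.evaluate(
    data=eval_data,              # the evaluation dataset built in Step 2
    predict_fn=answer_customer,  # now backed by semantic search and the new prompt
    scorers=[RelevanceToQuery(), RetrievalRelevance(), Safety()],
)
```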

The results confirm that your fixes work: retrieval relevance jumps from 65% to 91%, and response relevance improves to 93%.

Step 5: Deploy and Monitor in Production

With verified improvements in hand, it's time to deploy. MLflow 3.0 Deployment Jobs ensure that only validated applications satisfying your quality requirements reach production. Registering a new version of your application automatically triggers evaluation and presents results for approval, and full Unity Catalog integration provides governance and audit trails. This same model registration workflow supports traditional ML models, deep learning models, and GenAI applications.

After Deployment Jobs automatically run additional quality checks and stakeholders review the results, your improved chatbot passes all quality gates and gets approved for production. Now that you're going to serve thousands of customers, you instrument your application to collect end-user feedback:
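For example, the thumbs-up / thumbs-down handler in your chat UI can attach feedback to the trace that produced the response. A minimal sketch, assuming MLflow 3's log_feedback API and hypothetical endpoint wiring that supplies the trace and user IDs:

```python
# Feedback sketch: attach end-user feedback to the trace behind a response.
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

def record_user_feedback(trace_id: str, user_id: str, helpful: bool, comment: str = ""):
    # Called by the chat UI's thumbs-up / thumbs-down handler.
    mlflow.log_feedback(
        trace_id=trace_id,
        name="user_satisfaction",
        value=helpful,
        rationale=comment,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id=user_id,
        ),
    )
```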

Monitoring dashboards

After deploying to production, your dashboards show that satisfaction rates are strong, as customers get accurate product recommendations thanks to your improvements. The combination of automated quality monitoring from your LLM judges and real-time user feedback gives you confidence that your application is delivering value. If any issues arise, you have the traces and feedback to quickly understand and address them.

Continuous Improvement Through Data

Production data is now your roadmap for improvement. This completes the continuous improvement cycle, from production insights to development improvements and back again. Export traces with negative feedback directly into evaluation datasets. Use Version Tracking to compare deployments and identify what's working. When new issues occur, you have a systematic process: collect problematic traces, get expert annotations, update your app, and deploy with confidence. Each issue becomes a permanent test case, preventing regressions and building a stronger application over time.
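As a sketch of that loop, you can pull recent production traces back out of your experiment and fold the flagged ones into your evaluation dataset (search_traces is a standard MLflow API; the experiment ID placeholder and the filtering approach are assumptions for illustration):

```python
# Continuous-improvement sketch: pull recent production traces for review.
import mlflow

traces = mlflow.search_traces(
    experiment_ids=["<your-genai-experiment-id>"],  # placeholder
    max_results=500,
)

# Filter for traces with negative user feedback (the exact assessment fields
# depend on your MLflow version), have experts review them, and add their
# inputs plus expected outputs to the evaluation dataset used in Step 2.
```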

"MLflow 3.0 gave us the visibility we needed to debug and improve our Q&A agents with confidence. What used to take hours of guesswork can now be diagnosed in minutes, with full traceability across each retrieval, reasoning step, and tool call."
— Daisuke Hashimoto, Tech Lead at Woven by Toyota

A Unified Platform That Scales with You

MLflow 3.0 brings all these AI capabilities together in a single platform. The same tracing infrastructure that captures every detail of your GenAI applications also provides visibility into traditional ML model serving. The same deployment workflows cover both deep learning models and LLM-powered applications. The same integration with the Unity Catalog provides battle-tested governance mechanisms for all types of AI assets. This unified approach reduces complexity while ensuring consistent management across all AI initiatives.

MLflow 3.0's enhancements benefit all AI workloads. The new LoggedModel abstraction for versioning GenAI applications also simplifies tracking of deep learning checkpoints across training iterations. Just as GenAI versions link to their traces and metrics, traditional ML models and deep learning checkpoints now maintain complete lineage connecting training runs, datasets, and evaluation metrics computed across environments. Deployment Jobs ensure high-quality machine learning deployments with automated quality gates for every type of model. These are just a few examples of the improvements that MLflow 3.0 brings to classic ML and deep learning models through its unified management of all types of AI assets. 

As the foundation for MLOps and AI observability on Databricks, MLflow 3.0 seamlessly integrates with the entire Mosaic AI Platform. MLflow leverages Unity Catalog for centralized governance of models, GenAI applications, prompts, and datasets. You can even use Databricks AI/BI to build dashboards from your MLflow data, turning AI metrics into business insights.

Getting Started with MLflow 3.0

Whether you're just starting with GenAI or operating hundreds of models and agents at scale, Managed MLflow 3.0 on Databricks has the tools you need. Join the thousands of organizations already using MLflow and discover why it's become the standard for AI development.

Sign up for FREE Managed MLflow on Databricks to start using MLflow 3.0 in minutes. You'll get enterprise-grade reliability, security, and seamless integrations with the entire Databricks Lakehouse Platform.

For existing Databricks Managed MLflow users, upgrading to MLflow 3.0 gives you immediate access to powerful new capabilities. Your current experiments, models, and workflows continue working seamlessly while you gain production-grade tracing, LLM judges, online monitoring, and more for your generative AI applications, no migration required.

Next Steps
