Build High-Quality, Domain-Specific Agents at 95% Lower Cost

Introducing Token-Based Pricing for MLflow GenAI Evaluation

Published: October 15, 2025

Product · 4 min read

Summary

  • 95% lower evaluation costs: New token-based pricing in MLflow reduces daily evaluation costs without sacrificing rigor.
  • Open-sourced prompts: Access production-tested evaluation prompts spanning finance, healthcare, technical documentation, safety, and more.
  • Flexible judge options: Use built-in optimized models or bring your own LLMs to meet compliance, cost, and domain-specific needs at scale.

High‑quality GenAI agents need to be evaluated continuously. But when you scale up testing, the costs can outpace your budget. With MLflow on Databricks, teams can test agents across many metrics without cost becoming a barrier.

New Token-Based Pricing Model for Predefined Judges

As agents move from prototype to production, success relies on understanding your domain (e.g., contracts, customer support, filings), not just general benchmarks. MLflow's predefined judges help by automatically evaluating correctness, faithfulness, relevance, safety, and retrieval quality, so you don't have to hand-craft evaluation prompts for each metric.
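
To make this concrete, here is a minimal sketch of running the predefined judges over a small evaluation set, assuming MLflow 3.x's mlflow.genai API on Databricks (scorer names and data fields may differ slightly across versions):

    import mlflow
    from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

    # A few records to evaluate: the agent's input, its output, and the expected answer.
    eval_data = [
        {
            "inputs": {"question": "What is the notice period in the contract?"},
            "outputs": "The contract specifies a 30-day notice period.",
            "expectations": {"expected_response": "30 days"},
        },
    ]

    # Run the predefined judges; results are attached to MLflow traces as assessments.
    results = mlflow.genai.evaluate(
        data=eval_data,
        scorers=[Correctness(), RelevanceToQuery(), Safety()],
    )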

Customers asked us how we could improve evaluation costs at production scale. So today, we're launching token-based pricing for judges: you pay for the tokens you actually use instead of a fixed price per request.

  • $0.15 per million input tokens
  • $0.60 per million output tokens
  • On average, costs drop about 95% with no loss in accuracy

Example for 10,000 traces

Before

  • $0.0175 per judge request
  • 5,000 tokens per request
  • Result: 10,000 traces × 5 judges = 50,000 requests × $0.0175 = $875/day

Now

  • $0.15 per 1M input tokens
  • $0.60 per 1M output tokens
  • Result: 10,000 traces × 5 judges = 50,000 requests = $45/day
    • Input: 50,000 requests × 4,000 tokens × $0.15/1M = $30
    • Output: 50,000 requests × 500 tokens × $0.60/1M = $15

The token-based approach allows both a dramatic reduction in costs and complete transparency into how they are computed.
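
To show how the numbers above are computed, here is the same arithmetic as a short Python sketch (the token counts per request are the illustrative figures from this example):

    # Pricing: $0.15 per 1M input tokens, $0.60 per 1M output tokens.
    INPUT_PRICE = 0.15 / 1_000_000
    OUTPUT_PRICE = 0.60 / 1_000_000

    traces, judges = 10_000, 5
    requests = traces * judges                     # 50,000 judge requests per day

    input_cost = requests * 4_000 * INPUT_PRICE    # 200M input tokens  -> $30
    output_cost = requests * 500 * OUTPUT_PRICE    # 25M output tokens  -> $15
    old_cost = requests * 0.0175                   # fixed per-request  -> $875

    print(f"Token-based: ${input_cost + output_cost:.0f}/day vs. fixed: ${old_cost:.0f}/day")
    # Token-based: $45/day vs. fixed: $875/day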

Traces in MLflow can be automatically assessed by LLM judges, or by human annotators.

Open-Sourcing Battle-Tested Evaluation Prompts

Crafting effective evaluation prompts means balancing accuracy with token efficiency, particularly for domain-specific applications. Teams spend weeks fine-tuning them for finance, healthcare, or technical documentation, with each team repeating the same work.

To help, we’re open-sourcing the evaluation prompts behind MLflow GenAI. They’ve been refined across industry-specific contexts like finance, healthcare, technical documentation, and safety to perform well in real-world scenarios. Use them as-is or adapt them for your specific use cases.
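
If a prompt doesn't quite fit your domain, you can also express your own criteria in plain language and hand them to a judge. A minimal sketch, assuming the Guidelines scorer from mlflow.genai.scorers (the guideline text below is a purely hypothetical example):

    from mlflow.genai.scorers import Guidelines

    # Hypothetical domain-specific criterion expressed in natural language.
    compliance_judge = Guidelines(
        name="regulatory_tone",
        guidelines=(
            "The response must not provide investment advice and must direct "
            "the user to a licensed advisor for financial decisions."
        ),
    )

    # compliance_judge can be passed to mlflow.genai.evaluate(...) alongside the built-in judges.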

You can explore our production-grade prompts here.

These prompts have been validated on rigorous benchmarks including:

  • FinanceBench: Financial document question answering
  • HotpotQA: Multi-hop reasoning across documents
  • DocsQA: Technical documentation comprehension
  • RAGTruth: Retrieval-augmented generation accuracy
  • Natural Questions: Real Google search queries
  • HarmBench: LLM safety
  • Databricks customer datasets (with permission)

Beyond Built-in Judges: Bring Your Own Model

Our built‑in judges are powerful, but some organizations need full control. Now, you can plug in your own model (OpenAI, Anthropic, or your fine‑tuned model) for evaluation at no extra cost. You just pay for model usage.

This lets you:

  • Meet specific compliance requirements for model selection
  • Leverage existing enterprise agreements with LLM providers
  • Use specialized models trained on your own data
  • Control your entire evaluation pipeline
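
As a rough sketch of what that looks like, the built-in scorers can be pointed at a model of your choice via a model URI. This assumes the scorers accept a model parameter, as in recent MLflow GenAI releases; the model names below are placeholders for whatever provider or endpoint you have access to:

    from mlflow.genai.scorers import Correctness, Safety

    # Route judge calls to your own provider instead of the default judge model.
    correctness = Correctness(model="openai:/gpt-4o")      # placeholder model URI
    safety = Safety(model="anthropic:/claude-3-7-sonnet")  # placeholder model URI

    # Pass these scorers to mlflow.genai.evaluate(...) exactly as before;
    # you pay only for the underlying model usage.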

Production-Ready from Day One

Cost-effective evaluation means nothing if it can't scale with your production needs. MLflow GenAI evaluation on Databricks provides:

  • Unity Catalog integration: Govern traces and evaluation data with enterprise-grade security
  • Delta Lake storage: Store traces and evaluation data in Delta format, enabling you to build custom dashboards and data pipelines from trace and assessment data (see the query sketch after this list)
  • Full MLflow integration: View traces and evaluation results directly in MLflow
  • Serverless compute: Pay only for what you use, with no infrastructure management
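
Because traces and assessments land in Delta tables, they can be queried like any other data in your lakehouse. A minimal sketch of a daily quality query in a Databricks notebook (where spark is predefined); the table and column names are purely hypothetical placeholders for wherever your workspace stores trace and assessment data:

    # Hypothetical table and column names; substitute your Unity Catalog location.
    daily_pass_rate = spark.sql("""
        SELECT
            date(request_time)                                   AS day,
            assessment_name,
            avg(CASE WHEN result = 'pass' THEN 1.0 ELSE 0.0 END) AS pass_rate
        FROM main.agents.eval_assessments
        GROUP BY date(request_time), assessment_name
        ORDER BY day, assessment_name
    """)
    daily_pass_rate.display()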

Getting Started Today

The new pricing and open-source prompts are available immediately for all Databricks customers. Here's how to get started:

  1. For existing MLflow evaluation users: Your judges will automatically use the new pricing model—no action required
  2. For new users: Start with our quickstart guide. You can also explore our latest courses to understand how to build AI Agents on Databricks.
    1. AI Agent Fundamentals: A 90-minute introductory course on the basics of AI agents, with real-world examples of how they create value for your organization.
    2. Get started with AI Agents: In just over two hours, go from theory to building and deploying your first agent on Databricks.
  3. For MLflow OSS users: Update to MLflow 3.4.0+ to access the open-sourced prompts

A New Chapter for Evaluating GenAI Applications

By cutting evaluation costs by 95% and open-sourcing production-tested prompts, we're making high-quality evaluation accessible at scale. Whether you work in finance, healthcare, or customer experience, you can continuously monitor agent quality without breaking your budget.

Ready to transform your agent evaluation strategy? Get started for free or explore our documentation.
