High‑quality GenAI agents need to be evaluated continuously. But when you scale up testing, the costs can outpace your budget. With MLflow on Databricks, teams can test agents across many metrics without cost becoming a barrier.
As agents move from prototype to production, success depends on how well they handle your domain (e.g., contracts, customer support, filings), not just on general benchmarks. MLflow’s predefined judges help by evaluating correctness, faithfulness, relevance, safety, and retrieval automatically, so you don't have to rely on hand-built prompt engineering.
Customers asked us to reduce evaluation costs at production scale. So today, we’re launching token-based pricing for judges: you pay for the tokens you use rather than for fixed blocks.
Example for 10,000 traces: before vs. now
The token-based approach allows both a dramatic reduction in costs and complete transparency into how they are computed.
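To make the arithmetic behind token-based billing concrete, here is a minimal sketch of how such a cost could be estimated. The per-token rates, the token counts per judge call, and the names `TokenRates` and `judge_cost` are all placeholders chosen for illustration; they are not Databricks' actual prices or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """Hypothetical prices in dollars per million tokens (not real rates)."""
    input_per_million: float
    output_per_million: float


def judge_cost(traces: int, judges: int, input_tokens: int,
               output_tokens: int, rates: TokenRates) -> float:
    """Estimate the cost of running `judges` judges over `traces` traces.

    `input_tokens` and `output_tokens` are assumed averages per judge call.
    """
    calls = traces * judges
    input_cost = calls * input_tokens * rates.input_per_million / 1_000_000
    output_cost = calls * output_tokens * rates.output_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder numbers: 10,000 traces, 5 judges per trace,
# about 2,000 input and 200 output tokens per judge call.
rates = TokenRates(input_per_million=1.00, output_per_million=2.00)
cost = judge_cost(traces=10_000, judges=5, input_tokens=2_000,
                  output_tokens=200, rates=rates)
print(f"Estimated judge cost: ${cost:.2f}")
```

Because every term in the estimate is a token count times a published rate, you can rerun the same calculation with your own trace volume and judge mix to forecast spend.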
Crafting effective evaluation prompts means balancing accuracy with token efficiency, particularly for domain-specific applications. Teams spend weeks fine-tuning them for finance, healthcare, or technical documentation, with each group repeating the same work.
To help, we’re open-sourcing the evaluation prompts behind MLflow GenAI. They’ve been refined across industry-specific contexts like finance, healthcare, technical documentation, and safety to perform well in real-world scenarios. Use them as-is or adapt them for your specific use cases.
You can explore our production-grade prompts here.
These prompts have been validated on rigorous benchmarks including:
Our built‑in judges are powerful, but some organizations need full control. Now, you can plug in your own model (OpenAI, Anthropic, or your fine‑tuned model) for evaluation at no extra cost. You just pay for model usage.
This lets you:
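As a rough illustration of what plugging in your own judge model involves, the sketch below treats the model as any callable that maps a prompt string to a reply string. It does not use the MLflow API; `judge_correctness`, `Verdict`, the prompt template, and `stub_model` are hypothetical names, and the stub stands in for a real OpenAI, Anthropic, or fine-tuned model client.

```python
from dataclasses import dataclass
from typing import Callable

# Any function that takes a prompt and returns the model's text reply.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class Verdict:
    passed: bool
    rationale: str


# Illustrative template only; not one of the open-sourced MLflow prompts.
JUDGE_PROMPT = (
    "You are grading an answer for correctness.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Expected: {expected}\n"
    "Reply 'yes' or 'no' on the first line, then a one-line rationale."
)


def judge_correctness(model: ModelFn, question: str, answer: str,
                      expected: str) -> Verdict:
    """Ask the supplied model to grade an answer and parse its reply."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer,
                                 expected=expected)
    reply = model(prompt).strip()
    first_line, _, rest = reply.partition("\n")
    passed = first_line.strip().lower().startswith("yes")
    return Verdict(passed=passed, rationale=rest.strip())


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always approves the answer."""
    return "yes\nThe answer matches the expected value."


verdict = judge_correctness(stub_model, "What is 2 + 2?", "4", "4")
print(verdict)
```

Keeping the model behind a plain callable means the judging logic stays the same when you swap providers; only the function that sends the prompt changes.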
Cost-effective evaluation means nothing if it can't scale with your production needs. MLflow GenAI evaluation on Databricks provides:
The new pricing and open-source prompts are available immediately for all Databricks customers. Here's how to get started:
By cutting costs by 95% and open-sourcing production-tested prompts, we make evaluation accessible at scale. Whether in finance, healthcare, or CX, you can continuously monitor agent quality without breaking your budget.
Ready to transform your agent evaluation strategy? Get started for free or explore our documentation.