Summary: LLMs have revolutionized software development by increasing programmer productivity. However, even though off-the-shelf LLMs are trained on large amounts of code, they are not perfect. One key challenge for our Enterprise customers is the need to perform data intelligence, i.e., to adapt to and reason over their own organization’s data, including organization-specific coding concepts, knowledge, and preferences. At the same time, we want to keep latency and cost low. In this blog, we demonstrate how fine-tuning a small open-source LLM on interaction data enables state-of-the-art accuracy at low cost and minimal latency.

Figure 1: Quick Fix helps users resolve errors by suggesting code fixes in-line.
TL;DR of Result: We focus on the task of program repair, which requires fixing bugs in code. This problem has been widely studied in the literature, both without LLMs [1, 2] and more recently with LLMs [3, 4]. In industry, practical LLM agents such as Databricks Quick Fix are available. Figure 1 shows the Quick Fix agent in action in a Databricks Notebook environment. In this project, we fine-tuned the Llama 3.1 8B Instruct model on internal code written by Databricks employees for analyzing telemetry. We evaluated the fine-tuned Llama model against other LLMs via a live A/B test with internal users. As shown in Figure 2, the fine-tuned Llama achieves a 1.4x improvement in acceptance rate over GPT-4o while also achieving a 2x reduction in inference latency.


Figure 2: Fraction of proposed LLM fixes that were accepted by users (top) and inference speed of each Quick Fix LLM agent (bottom). Both numbers are normalized with respect to the GPT-4o agent (see details below). Our model (QuickFix Llama 8b Diff) achieves both the highest acceptance rate and the lowest latency. Models with the suffix "diff" generate edits to the buggy code, while those with the suffix "full" regenerate the full code.
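To make the diff-versus-full distinction concrete, here is a minimal sketch. The actual Quick Fix output formats are internal; the search/replace edit representation and all names below are illustrative assumptions. The key point is that a diff-style model emits only the changed span, which is typically far fewer tokens than regenerating the whole snippet, and the client applies the edit locally.

```python
# Illustrative only: the real Quick Fix edit format is internal.
buggy = "SELECT nme FROM users WHERE id = 1"

# "full" format: the model regenerates the entire snippet.
full_fix = "SELECT name FROM users WHERE id = 1"

# "diff" format (assumed representation): the model emits only the
# edit, here a single search/replace pair.
diff_fix = {"search": "nme", "replace": "name"}

def apply_diff(code: str, edit: dict) -> str:
    """Apply one search/replace edit to the buggy code (first match only)."""
    return code.replace(edit["search"], edit["replace"], 1)

assert apply_diff(buggy, diff_fix) == full_fix
```

Because the diff output above is a handful of tokens while the full output repeats the entire statement, generating diffs is one plausible source of the latency advantage reported for the "diff" models.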
Why does it matter? Many organizations, including many existing Databricks customers, have coding usage data that contains in-house knowledge, concepts, and preferences. Our results suggest that these organizations can fine-tune small open-source LLMs to achieve better code quality and inference speed. These models can then be hosted by the organization or a trusted third party for cost, reliability, and compliance wins.
We emphasize that training on interaction data is particularly effective for three reasons. First, it is naturally generated, so it requires no annotation effort. Second, it contains the kinds of examples encountered in practice, so it is especially useful for fine-tuning even in moderate quantities. Finally, because interaction data is continuously generated as users interact with the LLM agent, we can repeatedly fine-tune our LLM on newly collected data, leading to Never Ending Learning (NEL).
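The pipeline from interaction logs to fine-tuning data can be sketched as follows. This is a hedged illustration, not the actual internal pipeline: the log schema (`buggy_code`, `error`, `proposed_fix`, `accepted`) and the prompt template are assumptions. The idea is simply that each accepted fix yields a labeled (buggy code + error → fixed code) example for supervised fine-tuning, with no human annotation required.

```python
import json

# Hypothetical interaction-log records; the real schema is internal.
logs = [
    {"buggy_code": "print(x", "error": "SyntaxError: unexpected EOF",
     "proposed_fix": "print(x)", "accepted": True},
    {"buggy_code": "1/0", "error": "ZeroDivisionError",
     "proposed_fix": "1/1", "accepted": False},
]

def to_training_examples(records):
    """Keep accepted fixes and format them as prompt/completion pairs
    for supervised fine-tuning."""
    examples = []
    for rec in records:
        if not rec["accepted"]:
            continue  # rejected fixes are noisy labels; drop them
        prompt = (
            f"Fix the following code.\n\nCode:\n{rec['buggy_code']}\n\n"
            f"Error:\n{rec['error']}\n\nFixed code:"
        )
        examples.append({"prompt": prompt, "completion": rec["proposed_fix"]})
    return examples

# One JSONL line per accepted interaction, ready for a fine-tuning job.
train_jsonl = "\n".join(json.dumps(ex) for ex in to_training_examples(logs))
```

Because this dataset grows as the agent is used, the same conversion can be rerun periodically on new logs to support the never-ending-learning loop described above.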
What’s next? We believe that these lessons are also true for other enterprise applications. Organizations can fine-tune LLMs such as Llama for program repair or other tasks using Databricks’ fine-tuning service and serve the model in just one click. You can get started here. We are also exploring offering customers the ability to personalize Quick Fix using their own data.
Details of Our Study
A Databricks Workspace provides multiple LLM agents for enhancing productivity. These include an LLM agent for code autocomplete, an AI assistant that can engage in conversations to help users, and the Quick Fix agent for program repair. In this blog post, we focus on the Quick Fix agent.
