CUSTOMER STORY

First American Data & Analytics embraces the power of GenAI

Using Databricks Mosaic AI to enhance the accuracy and speed of data extraction

70%

Reduction in cost

75%

Less time to complete the project

10x

Increase in data elements extracted

CLOUD: Azure

First American Data & Analytics (DNA), a leading property-centric information provider to the real estate and title insurance industries, was challenged to convert a vast amount of unstructured data into digital content and match the data to related properties. To do so, they sought to implement page-level classification and strategic extraction across 100+ data fields. An early adopter of NLP and vision models, the First American Data & Analytics team wanted to leverage large language models (LLMs) to improve the accuracy of data extraction. By working with Databricks, they were able to experiment with multiple LLMs and prompting approaches. Key DNA team members brought their title policy expertise to the project, and First American was able to quickly fine-tune an LLM with Databricks Mosaic AI, improving model accuracy and reducing training time.

Facing process inefficiencies in linking and ingesting historical title policies

First American DNA wanted to leverage LLMs to enhance the accuracy and efficiency of the data extraction efforts that are central to their parent company’s broader title and transaction closing processes. This division maintains the industry’s largest data repository of publicly recorded document images, with over 8 billion unstructured documents. Most of these historical documents are stored with very limited (if any) metadata. For decades, data capture required a complex, manual data entry process of double-blind keying to ensure accuracy. More recently, First American has led the industry in bringing optical character recognition (OCR) and data extraction methodologies, including a variety of patented technologies, into this process. The DNA team has spent years enhancing data extraction to build and maintain the industry’s largest and most comprehensive public record data assets. As part of the company’s commitment to innovation and continual improvement, the DNA team sought to deploy new and additional technologies to extract information from images of documents like deeds, mortgages, assignments and foreclosures that vary widely in quality.

The title insurance process requires extensive research using public record data. One way to speed up this process is to start with a previously issued title policy, known as a starter. A starter, which is not recorded, allows a researcher to begin the examination of a property from the time of the last policy. DNA has built a Starter Xchange that provides access to exchange contributors across the country. Increasing the number of starters in the exchange benefits existing contributors and attracts new ones. First American DNA recently received a set of ~4 million historical title policy images going back to the 1930s that needed to be identified at a page level, with over 100 data elements to be extracted, linked to a specific property and ingested into the exchange. The DNA team looked to the success of its public record data manufacturing process as a model for a new process to extract the necessary data from the starters.

Because these documents are stored as images with so little metadata, the First American DNA team sought ways to more rapidly extract information from them to fuel analysis from a risk identification and decisioning perspective. This data is used to help accelerate real estate transactions, benefiting buyers, sellers, lenders and real estate professionals. Given the condition of the documents and source material for this project, capturing relevant data can sometimes require a very complex and expensive manual process to deliver the required levels of quality.
 
Prabhu Narsina, VP of Data and AI at First American Data & Analytics, explained, “Even though we already use many language models in our production pipeline, earlier models required huge training datasets and time for fine-tuning for our complex use cases. We were looking for a process that required less training data, reduced the time to train and delivered similar extraction results.”

Using LLMs for efficient data extraction and processing

The First American DNA team, already an early adopter of large language models, recognized the value of data science applications and embraced the possibilities of generative AI to streamline their data extraction processes. However, initial attempts to implement LLMs from leading commercial providers did not achieve the desired accuracy. The team experimented with a few different approaches, but the limited availability of the GPUs needed to process large volumes of data posed a substantial hurdle.
 
Already a Databricks customer, the First American DNA team turned to the Databricks Data Intelligence Platform. A key component of the solution was Delta Lake, which facilitated efficient data storage and retrieval. By moving their data into Delta Lake, First American was able to ensure data consistency and quality while upholding data security standards — all within a cost-effective and scalable platform that could support their growing data processing needs. Another benefit of the platform was the usefulness of Databricks Notebooks for team collaboration. Prabhu added, “It’s one of the features I really like and it’s very easy to integrate with our DevOps processes.”
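First American’s exact pipeline isn’t shown in the story, but a minimal sketch of this ingestion pattern on Databricks might look like the following, where the ADLS path and the `main.title_policies.ocr_raw` table are hypothetical placeholders:

```python
from pyspark.sql import functions as F

# `spark` is the ambient session in a Databricks notebook.
# The storage path and table names below are illustrative placeholders.
RAW_PATH = "abfss://documents@<storage-account>.dfs.core.windows.net/title-policies/"

# Read raw OCR output from ADLS and stamp each record with an ingestion time.
raw_df = (
    spark.read.json(RAW_PATH)
         .withColumn("ingested_at", F.current_timestamp())
)

# Append into a governed Delta table so downstream classification and
# fine-tuning jobs read a consistent, versioned copy of the data.
(
    raw_df.write.format("delta")
          .mode("append")
          .saveAsTable("main.title_policies.ocr_raw")
)
```

Landing the raw output in a Delta table gives every downstream job ACID guarantees and time travel over the same versioned copy of the data, which is what makes the consistency and quality claims above practical at this scale.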
 
Building on the foundation provided by Delta Lake, the DNA team utilized Databricks Model Training to fine-tune several open source models, including Mistral, Llama 3 and Llama 3.1, and is now testing Llama 3.2 with their enterprise data. This fine-tuning approach significantly enhanced the accuracy of data extraction while reducing costs. Prabhu highlighted, “Databricks helped us build the entire end-to-end production pipeline, starting by streaming data from ADLS to the final writing to the destination.” Working together, the Databricks and DNA teams mapped out the timing for testing new features and incorporating them into development and production. Using Managed MLflow, they streamlined the LLM lifecycle and handled potential errors more efficiently while running training workloads on approximately 50 GPUs.
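The story doesn’t detail the training configuration, but launching a fine-tuning run with Mosaic AI Model Training follows roughly this shape; the base model choice, Unity Catalog table names and training duration below are illustrative assumptions, not First American’s actual settings:

```python
from databricks.model_training import foundation_model as fm

# Kick off an instruction fine-tuning run on a hosted open source model.
# The training table is hypothetical; in practice it would hold
# prompt/response pairs curated from labeled title policy pages.
run = fm.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    train_data_path="main.title_policies.extraction_train",
    register_to="main.title_policies",   # register the result in Unity Catalog
    task_type="INSTRUCTION_FINETUNE",
    training_duration="3ep",             # three passes over the training data
)

# Metrics are tracked in MLflow automatically; run events can also be polled.
print(fm.get_events(run))
```

Because the run logs to Managed MLflow, failed or underperforming training attempts can be compared and retried from tracked artifacts rather than reconstructed by hand, which is the lifecycle efficiency the paragraph above describes.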
 
Prabhu elaborated, “First American Data & Analytics is continually seeking out new technologies for the betterment of our customers; the Databricks team helped us get early access to Mosaic AI Model Training for fine-tuning and literally had meetings almost every day with us, writing and debugging the code together. We were able to do this in a couple of weeks, testing out a couple of open source models and fine-tuning them, so they were more tailored to our needs than big commercial models. It was also more affordable than running on those big models.” In fact, the team was able to achieve a cost savings of 70% compared with a previous implementation using a different model.
 
Additionally, Databricks’ serverless compute and Model Serving eliminated previous bottlenecks associated with GPU availability and scalability. This serverless approach enabled the DNA team to deploy LLMs at scale without the need for extensive hardware management, and Model Serving significantly reduced the time and cost associated with deployment. Unity Catalog played a crucial role in maintaining stringent data security and governance standards. By registering the fine-tuned models in Unity Catalog, First American maintained full control and compliance over their data assets. This holistic integration of Delta Lake, Unity Catalog and Databricks Mosaic AI tools into an AI agent system created a robust and scalable solution that met the division’s data and AI needs. It also enabled rapid coordination and course adjustments across multiple supporting DNA teams and allowed title business leaders to direct the overall quality of ML output.
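As an illustrative sketch rather than First American’s actual deployment code: registering a fine-tuned model in Unity Catalog (when the training run didn’t already register it) and serving it behind a serverless Model Serving endpoint can be done with MLflow and the Databricks SDK. The run ID, model name and endpoint name below are placeholders:

```python
import mlflow
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

# Register the model in Unity Catalog (not the workspace registry) so access
# control and lineage are governed alongside the data itself.
mlflow.set_registry_uri("databricks-uc")
registered = mlflow.register_model(
    "runs:/<run-id>/model",                    # placeholder run ID
    "main.title_policies.starter_extractor",   # hypothetical UC model name
)

# Put the registered version behind a serverless Model Serving endpoint;
# scale-to-zero means capacity is provisioned on demand, not held idle.
w = WorkspaceClient()
w.serving_endpoints.create(
    name="starter-extractor",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.title_policies.starter_extractor",
                entity_version=registered.version,
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)
```

On-demand provisioning and scale-to-zero are the sort of levers that remove the hardware-management and GPU-availability bottlenecks described above.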

Scaling data and AI to improve the home-buying process

The Databricks Data Intelligence Platform has significantly impacted the DNA team’s data extraction efforts. Accelerating data extraction has improved efficiency and reduced costs while increasing the number of data elements extracted by 10x. With Databricks, DNA reduced the time needed to complete data extraction projects by 75%, going from several months to just weeks.
 
By moving to Databricks serverless compute for their model serving, DNA eliminated the need for extensive hardware management, reducing operational expenses. This approach not only lowered the cost associated with GPU usage but also provided scalability to handle large volumes of data. The fine-tuning of models — which previously took over two months on other platforms — was achieved in just two weeks with Databricks. This acceleration in project timelines allowed First American DNA to handle data more efficiently and meet their high accuracy benchmarks ahead of schedule. Best of all, the team has significantly optimized their data processing capabilities and successfully utilized and monetized a challenging dataset for their market. The team looks forward to leveraging continuous improvements in LLMs on the Databricks Platform for future projects.