YipitData delivers high-quality, data-driven insights to institutional investors and enterprises by turning massive volumes of unstructured data into actionable intelligence. As data volumes and expectations grew, the company needed to scale complex tagging and enrichment workflows beyond manual, regex-based systems that couldn’t keep pace with millions of companies and messy text inputs. By adopting Databricks Agent Bricks on the Databricks Data Intelligence Platform, YipitData embedded AI agents directly into its production pipelines using SQL-accessible batch inference, dramatically expanding the number of companies it can analyze, accelerating processing speed, and powering new AI-driven products — all without moving data outside its governed data platform.
Scaling judgment-driven workflows beyond manual limits
YipitData processes millions of transaction records daily from credit card data, web-scraped receipts, and other alternative data sources. To deliver timely insights to customers, the company must rapidly tag each transaction to the correct merchant or company, matching messy and ambiguous vendor records to real businesses at scale.
YipitData’s core challenge was doing this reliably for a rapidly expanding universe of companies and merchants using highly unstructured, text-heavy inputs.
Historically, analysts wrote and maintained hundreds of regular expression (regex) rules and some traditional NLP models to assign entities. This process could be accurate on known patterns but was fundamentally constrained by human throughput and became increasingly brittle as edge cases accumulated.
“We were bottlenecked by what an analyst could do in a day,” said Edward Goo, Head of Data Engineering at YipitData. “Even with hundreds of regex rules, you still miss nuance, context, and the judgment that analysts apply instinctively.”
With thousands to millions of potential companies in scope and more than 20 heterogeneous data providers, manual and regex-based approaches struggled to keep up. Analysts effectively attempted to encode their judgment into over 500 regex statements, yet still missed nuances and context in free-form descriptions.
“The real challenge was judgment,” explained Chief Architect Anup Segu. “That reasoning lived in analysts’ heads, and we had no good way to operationalize it at the scale our products required.” YipitData needed a way to capture analyst reasoning, apply it consistently across massive unstructured datasets, and keep pace with daily data refreshes.
Embedding AI agents directly into Databricks pipelines
To overcome these limitations, YipitData turned to Databricks Agent Bricks, primarily using Information Extraction to integrate reasoning-based tagging directly into its Databricks pipelines.
Instead of manually encoding logic, the team uses AI agents to interpret noisy text fields, extract key signals, and match records to the correct company with a contextual understanding that more closely mirrors human judgment.
Crucially, Agent Bricks provides SQL interfaces to agentic reasoning, enabling YipitData to invoke batch inference directly within existing ETL workflows. That let the team augment production pipelines with generative AI using familiar tools, rather than standing up separate systems or rewriting jobs.
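To make the pattern concrete, here is a minimal sketch of a batch-inference step expressed as SQL inside a PySpark job. The endpoint name, table names, and prompt are hypothetical placeholders rather than YipitData’s actual pipeline code, and the call uses Databricks’ ai_query() SQL function as one way to reach a serving endpoint from SQL.

```python
# Illustrative only: the endpoint, table names, and prompt are assumptions,
# not YipitData's production code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ai_query() calls a model serving endpoint (here, a hypothetical Agent
# Bricks Information Extraction endpoint) once per row, so the enrichment
# step slots into an existing ETL job as one more SQL statement.
tagged = spark.sql("""
    SELECT
      transaction_id,
      raw_merchant_text,
      ai_query(
        'yipit-entity-extraction',
        CONCAT('Identify the company behind this merchant record: ',
               raw_merchant_text)
      ) AS extracted_company
    FROM raw.transactions_daily
""")

# The enriched output lands back in a governed table like any other stage
# of the pipeline.
tagged.write.mode("append").saveAsTable("enriched.transactions_tagged")
```

The specifics matter less than the shape of the integration: inference is invoked where the data already lives, inside a job the team already runs.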
By contrast, many batch inference offerings require data to be exported from governed data platforms and processed in external services, introducing operational overhead and governance risk. For YipitData, keeping inference tightly coupled with its data platform was essential.
Because all core data already resides on Databricks and is governed by Unity Catalog, agents connect directly to enterprise data with consistent permissions, lineage, and governance. Batch inference jobs inherit the same governance guarantees as the underlying data and pipelines, ensuring AI enrichment remains compliant by default. For large-scale entity resolution and candidate narrowing, the team pairs Agent Bricks with Lakebase, Databricks’ serverless, fully managed PostgreSQL database.
Lakebase serves as the system of record for known companies, entities, and their relationships. For each incoming record, retrieval logic first queries Lakebase to identify a constrained set of likely candidates based on structured relationships and metadata; Vector Search and Information Extraction agents then evaluate the unstructured text and make the final tagging decision, as sketched below.
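A minimal sketch of that two-stage flow follows, assuming hypothetical Lakebase connection details, table, endpoint, and index names; the final agent decision and schema details are omitted.

```python
# Illustrative two-stage resolution sketch. The Lakebase DSN, table,
# endpoint, and index names are assumptions, not YipitData's actual code.
import psycopg2
from databricks.vector_search.client import VectorSearchClient

def shortlist_candidates(raw_text: str, provider_id: str) -> list:
    # Stage 1: narrow the candidate set with a structured lookup in Lakebase
    # (standard PostgreSQL), e.g. known aliases tied to this data provider.
    conn = psycopg2.connect("host=<lakebase-host> dbname=entities user=etl")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT company_id FROM company_aliases WHERE provider_id = %s",
            (provider_id,),
        )
        candidate_ids = {row[0] for row in cur.fetchall()}

    # Stage 2: semantic retrieval over the unstructured text ranks likely
    # matches; an Information Extraction agent would make the final call.
    index = VectorSearchClient().get_index(
        endpoint_name="entity-endpoint",                      # hypothetical
        index_name="main.enriched.company_profiles_index",    # hypothetical
    )
    hits = index.similarity_search(
        query_text=raw_text,
        columns=["company_id", "canonical_name"],
        num_results=20,
    )
    # Keep only matches that survived the structured Lakebase narrowing.
    rows = hits["result"]["data_array"]
    return [r for r in rows if r[0] in candidate_ids]
```

Narrowing against Lakebase first keeps the vector and agent workloads small even as the company universe grows, which is what lets the pattern scale.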
“Lakebase gives us the relational backbone we always wanted alongside Databricks,” said Anup. “We can write to it directly from workflows, sync tables with our curated data layers, and use it alongside agents without managing a separate database layer.”
By keeping structured entity data, retrieval, and agent reasoning within a single platform, YipitData reduced operational complexity while improving performance and scalability. The team can continuously enrich entity data, feed it back into Lakebase, and reuse it across downstream pipelines and products, including Signals, YipitData’s market intelligence product for analyzing companies, competitors, and growth trends across private markets.
After benchmarking multiple models, YipitData found that Agent Bricks Information Extraction was the best fit for its demanding scale, achieving high-quality results on its data while delivering better cost and throughput for production workloads.
Accelerating product delivery and expanding data coverage
With Agent Bricks running in production, YipitData transformed the scale and speed of its enrichment workflows. In a single quarter, the company expanded the number of companies it automatically tags and enriches from roughly 3,000 to more than 60,000 — a 20X increase that would not have been feasible with manual or regex-based approaches.
Processing 1 million records once took 24 hours; with Agent Bricks it now takes about an hour, a dramatic improvement in processing speed that is essential when handling multi-million-record influxes each day. YipitData can now run agent-powered pipelines in record time, even as volumes scale toward tens of millions of records per day.
Crucially, this speed did not come at the expense of quality. In one core use case, YipitData achieved 92–95% tagging accuracy out of the box using Agent Bricks. This quality level meets the bar required for downstream products and customer-facing insights while still leaving room for further improvement via labeling and prompt tuning. “Our analysts have a very high quality bar, and Agent Bricks met it faster than we expected,” said Edward. “It gave us confidence we could move quickly without sacrificing accuracy.”
These capabilities directly enabled the launch and rapid expansion of Signals, YipitData’s external product that helps customers understand companies, competitors, and growth trends across largely private markets. Agent Bricks plays a foundational role in enriching and maintaining the company universe behind Signals, supporting plans to scale coverage from thousands to hundreds of thousands of companies and beyond. “Without Agent Bricks, Signals wouldn’t exist at its current scale,” Edward noted. “We went from struggling to keep 3,000 companies up to date to confidently aiming for hundreds of thousands — and eventually more.”
For analysts, the impact has been equally transformative. Instead of spending time writing and maintaining brittle regex rules, teams now focus on validating outputs, improving quality, and exploring new datasets. The same agent-driven framework can be reused across additional data sources, turning AI enrichment into a repeatable capability rather than a one-off project and opening the door to future GenAI-powered experiences, such as richer search and chat-based exploration, built on top of YipitData’s intelligence.
“This wasn’t just an efficiency gain,” said Anup. “It changed what products we could realistically build. Agent Bricks gave us the confidence to scale our data coverage and move faster on entirely new offerings.”
