Session

PDF Document Ingestion Accelerator for GenAI Applications

Overview

Wednesday, June 11, 1:50 pm

Experience: In Person
Type: Breakout
Track: Data Engineering and Streaming
Industry: Financial Services
Technologies: Apache Spark, Databricks Workflows
Skill Level: Intermediate
Duration: 40 min

Databricks Financial Services customers working on GenAI share a common use case: ingesting and processing unstructured documents (PDFs and images), then performing downstream GenAI tasks such as entity extraction and RAG-based knowledge Q&A.

The main pain points for customers with these use cases are:

  • The quality of the PDF/image documents varies, since many older physical documents were scanned into electronic form
  • The complexity of the PDF/image documents varies, and many contain tables (images with embedded information) that require slower Tesseract OCR
  • Customers would like to streamline post-processing for downstream workloads

In this talk, we will present an optimized Structured Streaming workflow for complex PDF ingestion. The key techniques include Apache Spark™ optimization, multi-threading, PDF object extraction, skew handling, and automatic retry logic.
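The abstract does not show the speaker's implementation, so the following is only a minimal sketch of how two of the listed techniques, multi-threading and automatic retry, might be combined for a batch of documents. It uses the Python standard library only. The names `with_retry`, `process_batch`, and the `extract` callable are hypothetical, and the retry count and backoff values are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# (status, payload, attempts): payload is extracted text or an error description.
Result = Tuple[str, str, int]


def with_retry(fn: Callable[[str], str], max_attempts: int = 3,
               base_delay: float = 0.01) -> Callable[[str], Result]:
    """Wrap fn so failures are retried with exponential backoff (hypothetical helper)."""
    def wrapped(path: str) -> Result:
        for attempt in range(1, max_attempts + 1):
            try:
                return ("ok", fn(path), attempt)
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up and record the failure instead of failing the whole batch.
                    return ("failed", repr(exc), attempt)
                time.sleep(base_delay * 2 ** (attempt - 1))
        raise AssertionError("max_attempts must be at least 1")
    return wrapped


def process_batch(paths: List[str], extract: Callable[[str], str],
                  max_workers: int = 4) -> Dict[str, Result]:
    """Run extract over all paths in a thread pool, retrying each path independently."""
    safe_extract = with_retry(extract)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_extract, paths))
    return dict(zip(paths, results))
```

In a Structured Streaming job, a function like `process_batch` could plausibly be called on each micro-batch, with `extract` wrapping an OCR or PDF parsing call; that wiring is an assumption and is not described in the abstract. Returning a status per document rather than raising keeps one corrupt file from failing the rest of the batch.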

Session Speakers

Qian Yu

Specialist Solution Architect
Databricks