Cracking Complex Documents with Databricks Mosaic AI

Databricks · Intermediate ·🧠 Large Language Models ·12mo ago


Key Takeaways

The video demonstrates how to use Databricks Mosaic AI for contract analysis, leveraging generative AI and agentic patterns to transform complex legal and regulatory content into structured data. The speaker reports 96% faster extraction, 99% lower cost, and up to 90% accuracy compared with manual review. The solution uses the Azure framework, SharePoint, and vector databases to build a scalable pipeline for document processing and information extraction.

Full Transcript

Right. Good afternoon, everyone. I hope you can hear me. It's quite a unique system. I'm sure you can hear me in the front. All right. My name is Gavi Regunath and I'm here to talk about cracking complex documents using Databricks. This has been a huge collaboration between team QuidelOrtho and Advancing Analytics. Just a quick reminder: please complete your surveys. There's a bunch of instructions there in terms of how we do it.

Right. So, I've got a question for the entire audience. It's going to be interactive, so please shout out. I want to know what you guys think. How long would it take an expert who understands contract analysis to pull about 100 variables from a five-page contract? How many hours? How many days do you think that's going to take? You can't hear. Three hours. Three days. You can't hear anything. They can't hear. I want to speak to him. But anyway, I'll just give you a bit of context.

The way QuidelOrtho does it is they have a bunch of contracts in a bunch of different formats, and depending on the language, variables are extracted to tables. All right? So it's up to 100 variables here, and it can take up to two hours for a five-page contract. So for really simple contracts, it can take up to two hours. Now, the longer the contract is, the longer it's going to take. So in the next 20 minutes I'm going to tell you what we did, with which we achieved 96% faster extraction, 99% cheaper and 90% accurate. And we did this all within Databricks and the Azure framework. We used all the nice, cool features in Mosaic AI using generative AI, and we did a whole bunch of things, and I'm going to show you what we did. Now, I'm just here representing the entire team. It's been a huge collaboration between Advancing Analytics and team QuidelOrtho. Exactly. So yeah, it's been a massive collaboration between all of us, and I'm just going to show you what we achieved.
This is the end product. What we have is a contract analysis system that gives SMEs the ability to talk to any contract, and it's extremely cool, right? So kudos to QuidelOrtho for having the vision and being at the bleeding edge of technology. What you can do is ask it any questions, and not only that, we've built in memory with this as well, so you can ask any follow-up questions here. It's extremely cool. What we can do as well is upload certain documentation. So if you have a document that you want to analyze, this system will allow you to upload it into certain directories, and then you can process and upload all your documentation. So this is quite a cool feature. So yeah, here he's just showing you: in SharePoint you drag and drop your files, and your files will be processed.

What we have done for QuidelOrtho as well is a document queuing system, because we wanted to scale, right? So this can be used in the entire world; this can be used by up to 100 SMEs. What it gives them is the ability to understand all the documents that have been queued up. You can assess when a document was last uploaded, what the status was, whether it failed or succeeded, and you can also look at the source. We only have SharePoint at this point in time, but you can have any kind of blob storage source as well. So this is just kind of showing you what we've done.

And I think the really, really cool thing about this, besides the search functionality, is the contract analysis. Now, this is the end product really, right? Once you extract all the key variables, it gives SMEs a single pane of UI to filter down on certain variables that are important. So, for example, if you want to understand what the type of price increase is, you can filter down on that and you get the entire list of documents based on what you're filtering on. You can also click on documents.
For some of these documents we didn't have access, but that's because we're Advancing Analytics. If you're QuidelOrtho, you have access to all the documents. What you can also do then, if you wanted to analyze a particular document, is download it and have a look and analyze it using Excel or PDF. This is really cool. You've got extraction and insights in here. So for every variable that we extract, we give a reason as to why we extracted that particular variable. And this bit here is one of our novelties, really. So stay to the end and I'll show you what we did.

So this is the end product, and this is the cool architecture in terms of how we implemented this in Databricks under Azure. Now, it might look like a web of mess, but really it's quite simple. We've broken it down into four steps. Your first steps are all about content, and then how we extract the content, break it down into chunks, put it into your Unity Catalog, into your vector stores, and how we extract important information. And the very last bit is how we then serve it back to all your SMEs, how we report all your variables. So I'm going to go through step by step. It's a fairly complicated architecture, but bear with me. I've broken it down into four steps.

Your first step really is all about uploading your documents and how we're constantly polling the queue system here. All this is done within Databricks, and every two minutes we're pinging the queue system and saying, hey, if there are any new documents that have been uploaded, let's extract them, let's analyze them. You've seen this already in terms of how we upload your documents. So here it's just showing you again how you can upload your documents.
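As a rough illustration of the polling step described above, here is a minimal Python sketch. It is not the team's code: `QueuedDocument`, `poll_once`, `poll_forever`, and the `fetch_new` / `process` callables are hypothetical names invented for this example, and only the two-minute interval and the succeeded/failed status come from the talk.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QueuedDocument:
    """Hypothetical queue record; the talk mentions a status of failed or succeeded."""
    name: str
    status: str = "queued"


def poll_once(
    fetch_new: Callable[[], List[QueuedDocument]],
    process: Callable[[QueuedDocument], None],
) -> List[QueuedDocument]:
    """Fetch newly queued documents, process each one, and record the outcome."""
    handled = []
    for doc in fetch_new():
        try:
            process(doc)
            doc.status = "succeeded"
        except Exception:
            # Any processing error marks the document as failed instead of stopping the loop.
            doc.status = "failed"
        handled.append(doc)
    return handled


def poll_forever(fetch_new, process, interval_seconds: int = 120) -> None:
    """Check the queue every two minutes (the interval quoted in the talk)."""
    while True:
        poll_once(fetch_new, process)
        time.sleep(interval_seconds)
```

In a real deployment the loop would more likely be a scheduled job than a `while True`, but the shape of one polling pass is the same.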
You can upload many, many documents at one time, or you can just upload one document at a time. And once you do this, it's all about understanding information about the document that's been uploaded. This has been very, very important for SMEs because they want to understand certain things: what are the file sources, where has it come from, who's uploaded the document and what the priority levels are. This is all backed by Unity Catalog, which makes it very easy for us to understand all these variables and to store them as well.

Now, in terms of uploading and getting text from documents, we use a SharePoint connection. Unfortunately we've not used the SharePoint connector within Databricks, but what we've done is use the SharePoint connection using M365. We basically take all your documents, we extract your contents as bytes, we put it down as markdown, and we do some clever bits to make sure that all your documents are structured here.

So I'm going to show you what we do after we upload your document, which is your step two, and it's probably my most interesting step here. It's all about pre-processing. It's all about understanding how your documents can be chunked up, which essentially means how you can make the vector database understand the different chunks. For this, we added some metadata with pre-processing. We used recursive chunking. We tried a lot of different chunking methodologies, but we found recursive chunking to be the most optimized here. We also did a hell of a lot of processing in terms of understanding what the right chunk size is and how we overlap, and we spent a lot of time getting this right. But yeah, we settled on a chunk size of 48 with a 20% overlap.

So the next step then is all about embedding, right? So what do we do? Here, if anybody can speak German, you can verify it.
Whatever the language, we can translate it using ai_translate, and then what happens is we use ai_query to kind of translate everything and embed it into vector databases within Databricks.

Right, so now we're going on to step three, which is all about the extraction. The extraction is about, once you break down your document, once you chunk it, once you store it, how you pull out the right information based on the question. So here, for example, we're using similarity search. It's not very pretty, but you can see how we are using Databricks to ask the question. Here we're asking about the annual price adjustment information, and here is where you're getting the answers. So this kind of shows that no matter what the language, we're able to translate any document, embed documents into vector databases and ask very pertinent questions as well.

Right, this is my favorite bit. I think I've said that twice now, but honestly this is my favorite bit. And this goes back to the saying, "Individually we are one drop, but together we are an ocean." I'll explain this a bit more. This is based on a very, very recent paper. It's based on probabilistic consensus that boosts potential precision, and it shows that it can increase accuracy by 20%. I'm going to show it in terms of a picture here. So this is what we've done. This is all about validating the information that we extract, which is key. How do you know that the LLM has extracted the right information? Would you trust just a single LLM? QuidelOrtho wasn't too happy about that. So we came up with this novel method where we use a model ensemble technique. We use three LLMs, and you can use many different models; you can swap different LLMs in and out. And you have to get a consensus of a minimum of two. So you've got to get majority agreement here.
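The majority rule just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function name `consensus`, the lowercase/whitespace normalisation, and the route labels "auto" and "human_review" are all invented here. Only the rule itself (three model answers, at least two must agree, otherwise a human checks the value) comes from the talk.

```python
from collections import Counter
from typing import List, Optional, Tuple


def consensus(answers: List[str], min_votes: int = 2) -> Tuple[Optional[str], str]:
    """Return (value, route) for one extracted variable.

    route is "auto" when at least min_votes model answers agree,
    otherwise "human_review" and no value is accepted.
    """
    if not answers:
        return None, "human_review"
    # Assumption: answers are compared after trimming whitespace and lowercasing.
    normalised = [a.strip().lower() for a in answers]
    value, votes = Counter(normalised).most_common(1)[0]
    if votes >= min_votes:
        return value, "auto"
    return None, "human_review"


# Two of three hypothetical model outputs agree, so the value is accepted.
accepted = consensus(["3% CPI", "3% cpi ", "5%"])
# No two outputs agree, so the variable is routed to a person.
escalated = consensus(["annual", "quarterly", "none"])
```

Exact string matching is the simplest possible agreement test; free-text answers would need a looser comparison, which is outside what the talk describes.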
So if you get one out of the three, then we route it to human intervention, where they step in and validate the variable that's been extracted. If you get two out of three, that's all good news and it goes straight through processing: the variable gets extracted and it gets stored down to a table. So this is quite novel. It's all based on an agentic framework, all within Databricks as well, and it's worked really, really well.

Right. The last part is about reporting. You saw the bit at the start where we had the chat analysis, where you can go to it, you can upload your contracts, you can understand why we've pulled out certain variables. You can also drill down on certain key variables. For example, if contracts are coming to an end, you can understand that. And that's all done via the reporting dashboard for QuidelOrtho.

Right. So why is this so important? What's the business impact? We did some rough calculations, which have been validated by QuidelOrtho as well, so I'm not standing here saying nonsense, really. But based on just one country processing 16,000 documents, it will cost humans about a million dollars, not pounds. Using the system that we've built for QuidelOrtho, it's about $4,800. So the cost savings here for one country are absolutely massive, right? But if you look past the cost savings, there are also faster reviews. This is allowing all SMEs to close deals far quicker, to understand what renewals are coming up, and it also allows them to scale across different regions. So it's been extremely impactful. It's such a simple problem to solve, and no matter what industry you're in, you'll have manual documentation or variables to extract. But this is what it means for QuidelOrtho.

In terms of what's next: the way we've designed the entire system, we've got the chatbot interface, where we can leverage, you know, chatting with your contracts.
The next bit of stuff that we are working on with QuidelOrtho is about implementing an agentic framework to write emails automatically, to think automatically. So if they understand there's a contract that's coming up for renewal, you want to be notifying your customers accordingly. And the agentic framework that we've built within Databricks would allow them to scale through all of this. Whether it's writing emails or agentic contract management, it's all within reach.

Right. So just a quick glance back at the ROI, what we've achieved for QuidelOrtho. Instead of a human operator taking 120 minutes to process simple documents, we've cut that down to 5 minutes using our AI solution. We've cut down processing costs that would cost QuidelOrtho $73 per contract right down to 30 cents per contract, done all with Mosaic AI. And the retrieval accuracy as well: with the novel method that we implemented, we can achieve up to 90%, and that has been validated by humans as well. So I'm not standing up here and pulling a figure out of thin air. This has all been validated by humans, right?

So in terms of Databricks, how it's helped us: there are loads of features in terms of, you know, why use Databricks, but probably my most favorite thing is the scalability and diversity, how we use Mosaic endpoints to serve many different LLMs, which gives you the flexibility whether you want to use a Llama-based model or to hook on to OpenAI. Databricks Asset Bundles have been very useful for us as well to make sure all the environments are in sync, and then of course you've got Unity Catalog. So in terms of Databricks, it's really helped us to do tracking, monitoring, scalability, and making sure everything is in sync.

I think that's it. If you've got any questions, I'm more than happy to take them. We are in booth F633; you can come and talk to us, or you can come and talk to team QuidelOrtho as well, who are down here. I'm more than happy to take questions.
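The headline figures quoted in the talk can be cross-checked with simple arithmetic. The short Python sketch below does that; `percent_reduction` is a helper written for this example, and the manual cost is taken as $73 per contract, as best the recording can be made out. The speaker describes the numbers as rough calculations, so small mismatches between the per-contract and per-country figures are to be expected.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


# Time per simple contract: 120 minutes manually vs 5 minutes with the system.
time_saving = percent_reduction(120, 5)

# Cost per contract: assumed $73 manually vs $0.30 with the system.
cost_saving = percent_reduction(73, 0.30)

# One country, 16,000 documents: about $1,000,000 manually vs about $4,800.
manual_cost_per_doc = 1_000_000 / 16_000
system_cost_per_doc = 4_800 / 16_000

print(f"time saving:  {time_saving:.1f}%")
print(f"cost saving:  {cost_saving:.1f}%")
print(f"manual cost per document: ${manual_cost_per_doc:.2f}")
print(f"system cost per document: ${system_cost_per_doc:.2f}")
```

The time saving works out to roughly 96% and the cost saving to above 99%, in line with the claims at the start of the talk, and $4,800 over 16,000 documents matches the 30 cents per contract quoted.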

Original Description

In this session, we will share how we are transforming the way organizations process unstructured and non-standard documents using Mosaic AI and agentic patterns within the Databricks ecosystem. We have developed a scalable pipeline that turns complex legal and regulatory content into structured, tabular data. We will walk through the full architecture, which includes Unity Catalog for secure and governed data access, Databricks Vector Search for intelligent indexing and retrieval and Databricks Apps to deliver clear insights to business users. The solution supports multiple languages and formats, making it suitable for teams working across different regions. We will also discuss some of the key technical challenges we addressed, including handling parsing inconsistencies, grounding model responses and ensuring traceability across the entire process. If you are exploring how to apply GenAI and large language models, this session is for you.

Talk By: Gavi Regunath, Chief AI Officer, Advancing Analytics

Databricks Named a Leader in the 2025 Gartner® Magic Quadrant™ for Data Science and Machine Learning Platforms: https://www.databricks.com/blog/databricks-named-leader-2025-gartner-magic-quadrant-data-science-and-machine-learning
Build and deploy quality AI agent systems: https://www.databricks.com/product/artificial-intelligence
See all the product announcements from Data + AI Summit: https://www.databricks.com/events/dataaisummit-2025-announcements

Connect with us:
Website: https://databricks.com
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc
Facebook: https://www.facebook.com/databricksinc


This video teaches how to use Databricks Mosaic AI for contract analysis, leveraging generative AI and agentic patterns to transform complex legal and regulatory content into structured data. The solution achieves significant improvements in extraction speed, cost, and accuracy. By following the steps outlined in the video, viewers can develop their own scalable pipelines for document processing and information extraction.

Key Takeaways

Upload documents to Databricks using a SharePoint connection
Pre-process documents using recursive chunking with a chunk size of 48 and a 20% overlap
Embed the extracted content into a vector database
Ask questions using similarity search
Validate extracted information using a model ensemble technique (three LLMs, at least two must agree)
Route variables to human review when the models do not reach a majority
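The recursive chunking step in the list above can be sketched in plain Python as follows. This is an illustrative approximation, not the pipeline from the talk: the separator order, the function names, and the choice to count characters are assumptions. The talk does not say whether the chunk size is measured in characters or tokens, and the transcript is garbled at that number, so `chunk_size` is left as a parameter. Separators are dropped and pieces are rejoined with single spaces, which is a simplification.

```python
from typing import List

# Assumed separator order: paragraphs, lines, words, then a hard character cut.
SEPARATORS = ["\n\n", "\n", " ", ""]


def split_recursive(text: str, chunk_size: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Break text into pieces no longer than chunk_size, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at a fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces: List[str] = []
    for part in text.split(sep):
        pieces.extend(split_recursive(part, chunk_size, rest))
    return pieces


def chunk_with_overlap(text: str, chunk_size: int, overlap_ratio: float = 0.2) -> List[str]:
    """Merge pieces into chunks of at most chunk_size, repeating a short tail between chunks."""
    overlap = round(chunk_size * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []
    for piece in split_recursive(text, chunk_size):
        if current and len(" ".join(current + [piece])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing pieces forward while they fit in the overlap budget.
            tail: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > overlap:
                    break
                tail.insert(0, prev)
            # Drop the carried tail if it would push the next chunk over the limit.
            if len(" ".join(tail + [piece])) > chunk_size:
                tail = []
            current = tail
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "alpha beta gamma delta epsilon zeta eta theta"
chunks = chunk_with_overlap(sample, chunk_size=20, overlap_ratio=0.2)
```

With these toy settings the word "zeta" closes one chunk and opens the next, which is the overlap at work: context at a chunk boundary is visible from both sides.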

💡 The use of recursive chunking and embedding extracted information into vector databases enables efficient and accurate information extraction from complex documents.
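To make the retrieval idea concrete, here is a small self-contained Python sketch of similarity search over embedded chunks. It does not use the Databricks Vector Search API: the three-dimensional vectors, the chunk identifiers, and the helper names `cosine` and `top_k` are made up for illustration, and a real system would obtain the vectors from an embedding model.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunk ids whose vectors are most similar to the query vector."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings of contract chunks (invented values).
index = {
    "price_adjustment_clause": [0.9, 0.1, 0.0],
    "termination_clause": [0.1, 0.9, 0.1],
    "governing_law_clause": [0.0, 0.2, 0.9],
}

# Pretend embedding of a question about the annual price adjustment.
query = [0.8, 0.2, 0.1]
results = top_k(query, index, k=1)
```

The chunk whose vector points in nearly the same direction as the query ranks first; the retrieved text is then what an LLM would be asked to extract a variable from.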
