Cracking Complex Documents with Databricks Mosaic AI

Databricks · Intermediate · 🧠 Large Language Models · 12mo ago

Key Takeaways

The video demonstrates how to use Databricks Mosaic AI for contract analysis, using generative AI and agentic patterns to turn complex legal and regulatory content into structured data. The speaker reports 96% faster extraction, 99% lower cost and 90% accuracy. The solution runs on Databricks on Azure and uses SharePoint, Unity Catalog and vector databases to build a scalable pipeline for document processing and information extraction.

Full Transcript

Right. Good afternoon, everyone. I hope you can hear me. It's quite a unique system. I'm sure you can hear me in the front. All right. My name is Gavi Regunath and I'm here to talk about cracking complex documents using Databricks. This has been a huge collaboration between team QuidelOrtho and Advancing Analytics. Just a quick reminder: please complete your surveys. There's a bunch of instructions there in terms of how to do it.
Right. So, I've got a question for the entire audience. It's going to be interactive, so please shout out. I want to know what you think. How long would it take an expert who understands contract analysis to pull about 100 variables from a five-page contract? How many hours? How many days do you think that's going to take? You can't hear? Three hours. Three days. You can't hear anything? They can't hear. But anyway, I'll just give you a bit of context.
The way QuidelOrtho does it is they have a bunch of contracts in a bunch of different formats, and, depending on the language, variables are extracted to tables. It's up to 100 variables here, and it can take up to two hours for a five-page contract. So for really simple contracts, it can take up to two hours. The longer the contract, the longer it's going to take. So in the next 20 minutes I'm going to tell you what we did, with which we achieved 96% faster extraction, 99% lower cost and 90% accuracy. And we did this all within Databricks and the Azure framework. We used all the nice, cool features in Mosaic AI, using generative AI, and we did a whole bunch of things, and I'm going to show you what we did. Now, I'm just here representing the entire team. It's been a massive collaboration between Advancing Analytics and team QuidelOrtho, and I'm just going to show you what we achieved.
This is the end product. What we have is a contract analysis system that gives SMEs the ability to talk to any contract, and it's extremely cool. So kudos to QuidelOrtho for having the vision and being at the bleeding edge of technology. You can ask it any question, and not only that, we've built in memory as well, so you can ask follow-up questions. What you can also do is upload documentation. So if you have a document that you want to analyze, this system will allow you to upload it into certain directories and then process all your documentation. So this is quite a cool feature. Here he's just showing you: in SharePoint, you drag and drop your files and your files will be processed.
What we have also built for QuidelOrtho is a document queuing system, because we wanted to scale. This can be used across the entire world, by up to 100 SMEs, and it gives them the ability to understand all the documents that have been queued up. You can see when a document was last uploaded, what its status was, whether it failed or succeeded, and you can also look at the source. We only have SharePoint at this point in time, but you can have any kind of blob storage source as well. So this is just showing you what we've done.
And I think the really, really cool thing about this, besides the search functionality, is the contract analysis. This is the end product, really. Once you extract all the key variables, it gives SMEs a single-pane UI to filter down on the variables that are important. So for example, if you want to understand the type of price increase, you can filter down on that and you get the entire list of documents based on what you're filtering on. You can also click on documents.
For some of these documents, we didn't have access, but that's because we're Advancing Analytics. If you're QuidelOrtho, you'd have access to all the documents. And if you want to analyze a particular document, you can download it and analyze it using Excel or PDF. This is really cool: you've got extraction and insights in here. For every variable that we extract, we give a reason as to why we extracted that particular variable. And this bit here is one of our novelties, really, so stay to the end and I'll show you what we did.
So that is the end product, and this is the architecture in terms of how we implemented this in Databricks on Azure. Now, it might look like a bit of a mess, but really it's quite simple. We've broken it down into four steps. The first step is all about content; then how we extract the content, break it down into chunks, and put it into Unity Catalog and into vector stores; then how we extract the important information. And the very last bit is how we serve it back to all your SMEs, how we report all your variables. So I'm going to go through it step by step. It's a fairly complicated architecture, but bear with me.
Your first step really is all about uploading your documents and how we're constantly polling the queue system. This is all done within Databricks: every two minutes we're pinging the queue system and saying, hey, if any new documents have been uploaded, let's extract them, let's analyze them. So you've seen this already in terms of how we upload documents. Here it's just showing you again how you can upload your documents.
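The two-minute polling of the document queue described above could look roughly like the sketch below. It is plain Python, not Databricks code: the `QueuedDocument` fields, the in-memory list standing in for the queue, and the `process` callback are all hypothetical names chosen for illustration, since the talk does not show how the queue is implemented.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

POLL_INTERVAL_SECONDS = 120  # "every two minutes", as stated in the talk


@dataclass
class QueuedDocument:
    # Hypothetical queue record; the field names are illustrative only.
    path: str
    source: str = "sharepoint"
    status: str = "pending"  # pending -> succeeded | failed


def poll_once(queue: List[QueuedDocument],
              process: Callable[[QueuedDocument], None]) -> int:
    """Process every pending document once; return how many were picked up."""
    picked = 0
    for doc in queue:
        if doc.status != "pending":
            continue
        picked += 1
        try:
            process(doc)
            doc.status = "succeeded"
        except Exception:
            # Record the failure so it is visible in the queue view.
            doc.status = "failed"
    return picked


def run_forever(queue: List[QueuedDocument],
                process: Callable[[QueuedDocument], None]) -> None:
    # Not called in this sketch; a scheduled job could replace this loop.
    while True:
        poll_once(queue, process)
        time.sleep(POLL_INTERVAL_SECONDS)
```

Keeping a status on each record is what makes the "failed or succeeded" view mentioned in the talk possible: a document that raises during processing is marked rather than silently dropped.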
You can upload many documents at one time, or you can upload just one document at a time. Once you do this, it's all about understanding information about the document that's been uploaded, and this has been very important for SMEs because they want to understand certain things: what the file sources are, where the document has come from, who uploaded it and what the priority levels are. This is all backed by Unity Catalog, which makes it very easy for us to understand all these variables and to store them as well. Now, in terms of uploading and getting text from documents, we use a SharePoint connection. Unfortunately we've not used the SharePoint connector within Databricks; what we've done is use a SharePoint connection through M365. We basically take all your documents, extract the contents as bytes, write them out as markdown, and do some clever bits to make sure that all your documents are structured.
What we do after we upload your document is step two, which is probably my most interesting step here. It's all about pre-processing. It's about understanding how your documents can be chunked up, which essentially means how you make the vector database understand the different chunks. For this, we added some metadata during pre-processing and we used recursive chunking. We tried a lot of different chunking methodologies, but we found recursive chunking to be the most optimized here. We also did a hell of a lot of work on understanding what the right chunk size is and how much to overlap, and we spent a lot of time getting this right. We settled on a chunk size of 48 with a 20% overlap.
So the next step then is all about embedding. So what do we do? Here, if anybody can speak German, you can verify it.
Any language, we can translate it using ai_translate, and then we use ai_query to kind of translate everything and embed it into vector databases within Databricks.
Right, so now we're going on to step three, which is all about the extraction. The extraction is about, once you break down your document, once you chunk it, once you store it, how you pull out the right information based on the question. So here, for example, we're using similarity search. It's not very pretty, but you can see how we are using Databricks to ask the question. Here we're asking about the annual price adjustment information, and here is where you're getting the answers. So this shows that, no matter what the language, we're able to translate any document, embed documents into vector databases and ask very pertinent questions as well.
Right, this is my favorite bit. I think I've said that twice now, but honestly, this is my favorite bit. And this goes back to the saying: individually we are one drop, but together we are an ocean. I'll explain this a bit more. This is based on a very recent paper on probabilistic consensus that boosts precision, and it shows that it can increase accuracy by 20%. I'm going to show it as a picture here. So this is what we've done. This is all about validating the information that we extract, which is key. How do you know that the LLM has extracted the right information? Would you trust just a single LLM? QuidelOrtho wasn't too happy about that. So we came up with this novel method where we use a model ensemble technique. We use three LLMs, and you can swap different LLMs in and out, and you have to get a consensus of a minimum of two. So you've got to get majority agreement here.
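As a rough illustration of the majority-agreement rule, the sketch below takes the answers that several models gave for one variable and returns the value only if at least two of them match. It is an assumption-laden toy rather than the team's implementation: the `normalise` step (trimming and lower-casing before comparison) and the function names are hypothetical, and the talk does not say how answers are compared.

```python
from collections import Counter
from typing import List, Optional


def normalise(answer: str) -> str:
    # Assumption: answers are compared after collapsing whitespace
    # and lower-casing; the talk does not describe the comparison.
    return " ".join(answer.split()).lower()


def consensus(answers: List[str], min_votes: int = 2) -> Optional[str]:
    """Return the value that at least `min_votes` answers agree on.

    Returns None when no value reaches the threshold, leaving the
    decision about what to do next to the caller.
    """
    if not answers:
        return None
    counts = Counter(normalise(a) for a in answers)
    value, votes = counts.most_common(1)[0]
    return value if votes >= min_votes else None
```

With three models, `consensus(["3% per year", "3% per year ", "5%"])` yields the agreed value, while three different answers yield `None`.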
So if you get one out of the three, then we route it to human intervention, where someone steps in and validates the variable that's been extracted. If you get two out of three, that's all good news and it goes through straight-through processing: the variable gets extracted and it gets stored in a table. So this is quite novel. It's all based on an agentic framework, all within Databricks as well, and it's worked really, really well.
Right. The last part is about reporting. You saw the bit at the start where we had the chat analysis: you can upload your contracts, you can understand why we've pulled out certain variables, and you can drill down on certain key variables. For example, if contracts are coming to an end, you can see that. And that's all done via the reporting dashboard for QuidelOrtho.
Right. So why is this so important? What's the business impact? We did some rough calculations, which have been validated by QuidelOrtho as well, so I'm not standing here talking nonsense. Based on just one country processing 16,000 documents, it would cost about a million dollars (not pounds) to do by hand. Using the system that we've built for QuidelOrtho, it's about $4,800. So the cost saving for one country is absolutely massive. But if you look past the cost savings, there are also faster reviews. This is allowing all SMEs to close deals far quicker, to understand which renewals are coming up, and it also allows them to scale across different regions. So it's been extremely impactful. It's such a simple problem to solve, and no matter what industry you're in, you'll have manual documentation or variables to extract. But this is what it means for QuidelOrtho.
In terms of what's next: the way we've designed the entire system, we've got the chatbot interface where you can chat to your contracts.
The next bit of work that we're doing with QuidelOrtho is about implementing an agentic framework to write emails automatically, to think automatically. So if they see there's a contract that's coming up for renewal, they want to be notifying their customers accordingly. And the agentic framework that we've built within Databricks would allow them to scale through all of this. Whether it's writing emails or agentic contract management, it's all within reach.
Right. So just a quick glance back at the ROI, what we've achieved for QuidelOrtho. Instead of a human operator taking 120 minutes to process simple documents, we've cut that down to 5 minutes using our AI solution. We've cut processing costs from the $73 per contract it would cost QuidelOrtho right down to 30 cents per contract, all done with Mosaic AI. And the retrieval accuracy as well: with the novel method that we implemented, we can achieve up to 90%, and that has been validated by humans. So I'm not standing up here and pulling a figure out of thin air. This has all been validated by humans.
Right. So in terms of Databricks and how it's helped us: there are loads of reasons to use Databricks, but probably my favorite thing is the scalability and diversity, how we use Mosaic AI endpoints to serve many different LLMs, which gives you the flexibility whether you want to use a Llama-based model or hook on to OpenAI. Databricks Asset Bundles have been very useful for us as well to make sure all the environments are in sync. And then of course you've got Unity Catalog. So Databricks has really helped us with tracking, monitoring, scalability and making sure everything is in sync.
I think that's it. If you've got any questions, I'm more than happy to take them. We're at booth F633; you can come and talk to us, or you can come and talk to team QuidelOrtho as well, who are down here.
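The headline percentages can be cross-checked from the per-contract figures quoted in the talk (120 minutes versus 5 minutes, $73 versus 30 cents, 16,000 documents in one country). The short calculation below is only an arithmetic check; the variable names are made up for this sketch.

```python
# Figures quoted in the talk.
MANUAL_MINUTES_PER_CONTRACT = 120
AI_MINUTES_PER_CONTRACT = 5
MANUAL_COST_PER_CONTRACT = 73.00  # US dollars
AI_COST_PER_CONTRACT = 0.30       # US dollars
DOCUMENTS_IN_ONE_COUNTRY = 16_000


def percent_reduction(before: float, after: float) -> float:
    """Relative saving, as a percentage of the original value."""
    return 100.0 * (before - after) / before


time_reduction = percent_reduction(MANUAL_MINUTES_PER_CONTRACT,
                                   AI_MINUTES_PER_CONTRACT)
cost_reduction = percent_reduction(MANUAL_COST_PER_CONTRACT,
                                   AI_COST_PER_CONTRACT)
manual_total = MANUAL_COST_PER_CONTRACT * DOCUMENTS_IN_ONE_COUNTRY
ai_total = AI_COST_PER_CONTRACT * DOCUMENTS_IN_ONE_COUNTRY

print(f"Time per contract reduced by {time_reduction:.1f}%")
print(f"Cost per contract reduced by {cost_reduction:.1f}%")
print(f"Manual cost for one country: ${manual_total:,.0f}")
print(f"AI cost for one country: ${ai_total:,.0f}")
```

The time saving works out to about 95.8%, which rounds to the quoted "96% faster"; the cost saving is about 99.6%, consistent with "99% cheaper"; 30 cents across 16,000 documents is exactly the quoted $4,800; and $73 across 16,000 documents is roughly $1.17 million, in line with "about a million dollars".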

Original Description

In this session, we will share how we are transforming the way organizations process unstructured and non-standard documents using Mosaic AI and agentic patterns within the Databricks ecosystem. We have developed a scalable pipeline that turns complex legal and regulatory content into structured, tabular data. We will walk through the full architecture, which includes Unity Catalog for secure and governed data access, Databricks Vector Search for intelligent indexing and retrieval and Databricks Apps to deliver clear insights to business users. The solution supports multiple languages and formats, making it suitable for teams working across different regions. We will also discuss some of the key technical challenges we addressed, including handling parsing inconsistencies, grounding model responses and ensuring traceability across the entire process. If you are exploring how to apply GenAI and large language models, this session is for you.
Talk By: Gavi Regunath, Chief AI Officer, Advancing Analytics
Databricks Named a Leader in the 2025 Gartner® Magic Quadrant™ for Data Science and Machine Learning Platforms: https://www.databricks.com/blog/databricks-named-leader-2025-gartner-magic-quadrant-data-science-and-machine-learning
Build and deploy quality AI agent systems: https://www.databricks.com/product/artificial-intelligence
See all the product announcements from Data + AI Summit: https://www.databricks.com/events/dataaisummit-2025-announcements
Connect with us:
Website: https://databricks.com
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc
Facebook: https://www.facebook.com/databricksinc

This video teaches how to use Databricks Mosaic AI for contract analysis, leveraging generative AI and agentic patterns to transform complex legal and regulatory content into structured data. The solution achieves significant improvements in extraction speed, cost, and accuracy. By following the steps outlined in the video, viewers can develop their own scalable pipelines for document processing and information extraction.

Key Takeaways
  1. Upload documents to Databricks through a SharePoint connection (via M365)
  2. Pre-process documents using recursive chunking with a chunk size of 48 and a 20% overlap
  3. Embed the chunks into a vector database
  4. Ask questions using similarity search
  5. Validate extracted information using a three-model ensemble that requires agreement from at least two models
  6. Route variables to human review when fewer than two models agree
💡 The use of recursive chunking and embedding extracted information into vector databases enables efficient and accurate information extraction from complex documents.
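Step 2 in the list above can be illustrated with a simplified recursive character splitter. This is a sketch under stated assumptions, not the pipeline used in the talk: it measures chunk size in characters (the talk does not give the unit), the separator order and all function names are hypothetical, and a production pipeline would more likely rely on an existing text-splitting library.

```python
from typing import List, Sequence


def recursive_split(text: str, max_len: int,
                    separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split on the coarsest separator first, recursing only into
    parts that are still longer than max_len."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, *tail = separators
    pieces: List[str] = []
    for part in text.split(head):  # note: the separator itself is dropped
        pieces.extend(recursive_split(part, max_len, tuple(tail)))
    return pieces


def pack(pieces: List[str], max_len: int) -> List[str]:
    """Greedily merge adjacent pieces while they still fit in max_len."""
    packed: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            packed.append(current)
            current = piece
    if current:
        packed.append(current)
    return packed


def chunk_text(text: str, chunk_size: int = 48,
               overlap_ratio: float = 0.20) -> List[str]:
    """Chunks of at most chunk_size characters, each starting with the
    tail of the previous chunk's body as overlap."""
    overlap = int(chunk_size * overlap_ratio)
    body_len = chunk_size - overlap - 1  # leave room for a joining space
    bodies = pack(recursive_split(text, body_len), body_len)
    chunks: List[str] = []
    for i, body in enumerate(bodies):
        if i > 0 and overlap > 0:
            chunks.append(bodies[i - 1][-overlap:] + " " + body)
        else:
            chunks.append(body)
    return chunks
```

The overlap means a clause that straddles a chunk boundary still appears partly in both neighbouring chunks, which is the usual reason for overlapping chunks before embedding. Here the overlap is cut by character count, so it can begin mid-word; a token-based splitter would avoid that.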
