Docling: The Open-Source Document AI Engine

Siemens Knowledge Hub · Intermediate ·🤖 AI Agents & Automation ·1w ago


Key Takeaways

Explores Docling, an open-source document AI engine for RAG and information extraction

Full Transcript

You can do way more than flooding open source projects with MRs generated by AI. So, that's why we have Docling presentation here from Panos from IBM Research. Welcome. >> Hi, everybody. So, um uh nice to be here. It's my first time here. So, uh great to to join this nice conference. Uh my name is Panos Vagenas. I am uh one of the developers of the Docling uh library and then an advisor software engineer at IBM Research and part also of the Docling steering committee with the Linux Foundation. So, before we get started, just to get an idea from your side, who kind of does not know Docling? Good. So, good that I'm here. So, we will solve that. >> [laughter] >> That's not a problem. All right. So, what are we going to see today? Uh first, we're going to have a short intro like meeting Docling and then just seeing a bit of the functionality, what it is about, and g- taking a peek under the hood, seeing how it works. And then uh towards the end, we have a section also where we explore a bit of the journey, how we got here, what lies ahead, and a bit of the community aspects as well. And uh that would be it. So, um first of all, >> [snorts] >> the good thing about um us building Docling is that it's very, very uh helpful in that everybody uses documents. There is not one person that does not use documents in their day uh work. So, uh all types of documents, all formats, all uh typ- all constellations you can imagine. Uh many many modalities going into that. So, it's not like pure text, right? You have images, you have tables, you have different types of stuff. Plus, you also have other types of unstructured data as well besides documents. So, one thing with documents is that working with them is like very very complicated, right? So, just picking out an example, think about PDF, right? It's a notoriously evil format to work with. 
So, that makes one point, and the other important point is that the modern LLM technology we have does not natively understand a format like PDF, for example, right? What it can understand is text. So, what people mostly do nowadays is try to build bridges between these two things, but this bridging really comes with significant sacrifices. The sacrifices can be in terms of cost, in terms of quality, in terms of losing sovereignty over your data, and many other things. So, that's pretty much exactly where Docling comes into the picture, quite literally. Oops, sorry, this does not work with the clicker. Why is Docling useful? First of all, it is open source. You don't have to pay for it, and it is bound to be open source forever. Secondly, it includes our advanced AI models that we have been building and evolving for many, many years, and which are made specially for the task of document understanding. It supports many different formats, so it's a single tool to kind of rule them all, if you like. And it works with a unified representation that is rich. It's not just naive plain text, and it's LLM-friendly, as we will be seeing going forward. What's more, it all runs locally. You don't need to give your data to anybody to use Docling. And last but not least, we've made sure to put it pretty much everywhere. So, no matter where you are comfortable working, you can just pick up Docling from that environment and, you know, you're off to the races. So, where do we stand today? We stand at a situation where we are getting very good traction. We are at 60,000 GitHub stars. We have close to 7 million downloads per month on PyPI. We have been the number one repository on GitHub across all dimensions for a very long time. 
We have been the number one model and the number one dataset on Hugging Face, to the extent that the Hugging Face CEO called it the grand slam of trending. We are within the top 300 repositories ever in terms of stars, and we are part of the Linux Foundation, namely of the LF AI & Data Foundation under that umbrella. So, that's just a quick glimpse, to keep in mind where the project sits as we go through the discussion. But what does the project actually look like? How do we use it, right? So, in principle, you can use Docling in many different ways, as I outlined earlier. You can use it, for example, as a CLI. You just pip install docling, and in one command you can convert documents, and we'll see what that looks like. Or we have a very powerful SDK that essentially allows you to build any kind of workflow you want, right? But you can also use it as an MCP tool for your agents. You can use it as an API service, or as a plugin for all types of frameworks: LangChain, LlamaIndex, CrewAI, Haystack, and whatever else is out there. So, you can use the functionality in just one line. Okay, so let's have a quick look at what the CLI, for example, looks like. For that, I am sharing here a quick video where we start with a document, and you see what the document can entail, right? For example, it can have multiple columns, tables, figures, different types of captions, paragraphs, and what have you. So, in one line, we're just saying docling and, in this example, that I want to convert to HTML, and you will see that in a couple of seconds we get back the HTML, with the images included. But what we see in the HTML, importantly, is that the whole document structure is maintained, right? 
So, you have the paragraphs, you have the headings, you have what was a list, all of the different figures, and importantly, as we move towards the more, say, challenging aspects of the document, you can see that multi-column spans are retained in tables, and these kinds of intricacies are very often just overlooked by, you know, kind of naive tooling out there. All right. So, let's move on to the next one. One of the reasons why people consider Docling is also around the cost, right? What you see here is, in principle, the biggest PDF dataset, created by Hugging Face. It's called FinePDFs, and they used two technologies for building it. And by the way, the slides are already available online. It's just on GitHub, so we can share them, but you will find them also later. And you can see that the Docling technology is 50 times more cost-effective than the other VLM that was used there. Just to give you an understanding, right, of how they were able to go to like half a trillion PDFs parsed with that. Okay. Another interesting collaboration, besides what I just showed with Hugging Face, is with Nvidia. This is actually from this year, where Jensen was presenting our collaboration on the data side. This was, I believe, more on the actual structured data part of the broader team at the here at Recursely call, but we are also collaborating very strongly around the unstructured data with Docling, and with building new models, and so on and so forth. So, very, very interesting to see the, you know, fruits of this technology. So, now you can ask me, what about the other technologies that are out there, right? Why don't we just use, say, a plain, very simple, fast technology? So yes, you could use, for example, a low-level PDF parser like pypdfium or this type of tool. But what you will see is that, in principle, you get very, very low quality. You will have no structure. 
They will be incomplete. Here you can see like tables, forget about it. And different kind of modalities completely like ignored or misinterpreted. So if you want to do anything of essence with this data, yeah, you you you you get you start on the wrong foot, right? Imagine just feeding this further down. Different [snorts] types of defects here. But also if you consider very big models and I know okay, this is a bit older models here that that we are showing but this the principle still remains. Even big models, they're trained to do pretty much everything. So they're not necessarily very good at doing this one task of document understanding, right? So you can see like dozen billion parameter models messing with the titles or the structure of the table. So this just goes to tell you that But, but have to have a a sweet spot in that area. Now, what you see here is in principle, you know, the the the known story of vegetative electron microscopy. It's like a very deeply studied [clears throat] domain out there. Unfortunately, it's a non-existing one. But just rather came to be because you know, some parser parsed these two columns as one. And then dozens and dozens of researchers made papers that were peer-reviewed by other researchers and and are published. And you know, we have now many of these documents talking about this topic. Okay. >> [laughter] >> All right. Uh so, just goes to tell you that you know, if you have poor quality in your data and your document pipelines, this is really trickling down causing a lot of problems downstream. And obviously, Docling can address this nicely. Oops. Okay. What else can quality look like? Uh we all know our interactions with all of these chat environments and how we make a question and we get back an answer that always looks very very re-affirmative, you know, very confident. And often times, we just leave it at that and we have no clue whatsoever whether it actually stands, whether it's correct or not, right? 
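The two-column mishap behind "vegetative electron microscopy" described above is easy to reproduce with a toy example. The sketch below is only an illustration, not how Docling or any real parser works: the page content, `ROWS`, and both reader functions are invented for this example. It shows how reading straight across a two-column page fuses unrelated words, while reading column by column keeps each sentence intact.

```python
# Toy page: each tuple holds the text found at the same height in the
# left and the right column. The wording is invented for illustration.
ROWS = [
    ("spores can switch to the", "were examined using"),
    ("vegetative", "electron microscopy"),
    ("state when nutrients return.", "at low temperature."),
]


def naive_read(rows):
    """Read straight across the page, ignoring the gap between columns."""
    return " ".join(f"{left} {right}" for left, right in rows)


def column_aware_read(rows):
    """Read the whole left column first, then the whole right column."""
    left_column = " ".join(left for left, _ in rows)
    right_column = " ".join(right for _, right in rows)
    return f"{left_column} {right_column}"


PHRASE = "vegetative electron microscopy"
print(PHRASE in naive_read(ROWS))         # True: the bogus term appears
print(PHRASE in column_aware_read(ROWS))  # False: each sentence stays intact
```

Real layout analysis has to infer the columns from geometry rather than receive them pre-split, which is the hard part that the layout model in the pipeline takes care of.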
So, what can be a very powerful feature that Docling can power is like you make a question, you get back an answer, but crucially, you also get visual grounding. So, you know, you can know where from the answer originates in the document and can validate. So, you as a human is are actually kept in the loop. And that gives you a handle to go check and you know, this is the the elements that in my opinion can bring more trust into an AI system. And this is a very important aspect nowadays, I believe. >> [snorts] >> All right. And maybe just another anecdote if you is this other aspect of papers that were published where people would just inject making prompt injections in white text. So, you know, it's it's another here people were asking the reviewer LLM to not highlight any negative aspects of the paper. >> [snorts] >> So, it is just different ways of telling you, you know, how you build your document pipeline is very, very important. And it is becoming increasingly critical with all of the technology that's being built. All right. So, how does Docling work a bit under the hood? Um yeah, so in principle this picture summarizes things quite nicely. >> [snorts] >> What we do is we have these different pipelines that address different types of input document formats. So, we we cover for example Microsoft Office formats, we cover like pictures, audio, markdown, PDF, different types of things, right? So, the the important here the important decision important design feature here is that we are mapping all of these different formats to a single unified representation. Which is this column that you see this light blue column that's that's called Docling document. So, that's kind of the core of it is getting from whatever you started from to this unified representation because once that is in place, everybody is very happy because they don't have to care about PDF anymore. 
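To make the ideas of a unified representation and visual grounding a bit more concrete, here is a minimal sketch using only the standard library. It is not the actual Docling document model: the `BBox` and `Item` classes, their field names, and the `ground` function are hypothetical stand-ins. The point is only that once every element carries its page and bounding box, an answer can be traced back to the place in the source it came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class Item:
    """One element of a format-independent document (hypothetical model)."""
    label: str  # e.g. "heading", "paragraph", "table"
    text: str
    page: int
    bbox: BBox


def ground(items, snippet):
    """Return every item whose text contains the snippet, ignoring case."""
    needle = snippet.lower()
    return [item for item in items if needle in item.text.lower()]


# Invented sample document; it could have come from a PDF, a DOCX, or a scan.
DOC = [
    Item("heading", "Quarterly results", 1, BBox(72, 90, 300, 110)),
    Item("paragraph", "Revenue grew 12% year over year.", 1, BBox(72, 120, 520, 160)),
    Item("paragraph", "Headcount stayed flat.", 2, BBox(72, 90, 520, 120)),
]

for hit in ground(DOC, "revenue grew 12%"):
    print(f"page {hit.page}, box {hit.bbox}: {hit.text}")
```

A user interface can then draw the returned box on top of the page image, so that a human can check the answer against the source.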
They don't need to know about this document ever originating from a PDF or a Word document or what have you, right? Now, you can just work with the SDK and take that object which is a Docling document and chunk it in a document native way and can export it to different formats and can feed it to your agents and you can essentially do whatever downstream task you want from a single place without having to build custom pipelines and post-processing around everything. >> [snorts] >> So, what's interesting here is that for PDF, for example, which is kind of the challenging format in this scenario, or images equivalently, we have two options. One is the default pipeline, or the standard pipeline as we call it, which is an opinionated pipeline where you have different models catering to different aspects of this whole task like OCR, and then there's a layout analysis model, and there's a table structure model, and at the end we're putting everything together, and these are essentially the models that we have been building and iterating on for a very, very long time. And And this is more like the production-ready essentially pipeline that is super, super light. It runs on a CPU. You don't need a GPU for this. And it's very, very stable. And at the same time, what you see a bit as a second line is what we call the VLM pipeline, which is where we are more and more focused now, which is essentially the single-shot VLM pipeline where one single model takes care of all aspects and essentially all of everything is driven by the data, right? As As to how we build this. So, this is, you know, the more scalable approach. May not be 100% stable at this point as we are working to further iterations, but in principle is the one that we is is more powerful and the one the one that we're pushing more going forward. So, all [snorts] of this is configurable as you will see, and you can essentially do it the way you want. 
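The SDK workflow described above (convert, then export or chunk) can be sketched as follows. This is a hedged example rather than an authoritative reference: it assumes the `docling` package is installed and that `DocumentConverter`, `export_to_markdown`, and `HybridChunker` behave as described in the project's public documentation, which may differ between versions. The `preview` helper and the `report.pdf` path are made up for illustration.

```python
def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL with the default pipeline and export Markdown."""
    # Imported lazily so the sketch can be read without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()


def chunk_texts(source: str) -> list[str]:
    """Convert a document and split it into document-aware chunks."""
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(source).document
    return [chunk.text for chunk in HybridChunker().chunk(dl_doc=document)]


def preview(texts: list[str], width: int = 60) -> list[str]:
    """Hypothetical helper: one numbered, single-line preview per chunk."""
    return [f"{i}: {' '.join(text.split())[:width]}" for i, text in enumerate(texts)]


# Example usage (needs Docling and an input file, so it is left commented out):
# print(convert_to_markdown("report.pdf"))
# print("\n".join(preview(chunk_texts("report.pdf"))))
```

Because the functions only ever touch the converted document object, the same code applies whether the input was a PDF, a Word file, or an image, which is the benefit of the unified representation.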
Just a picture of the different components that we have in our internal ecosystem. When we say Docling, it can mean many different things. For example, what you see on the top is Docling Core, which is the data model and the application API, where I'm driving a lot of these aspects. We have the conversion, and we have different repositories that address different aspects of this whole ecosystem. Just to show you a bit of the breadth of it, right? It goes from testing, to conversion scaling, to extensions, and so on and so forth. And part of our actual contributions also goes to other repositories external to Docling, where we're working with frameworks, for example, in their own repositories, right? So, as I mentioned earlier, a lot of what we're driving is around the VLM area, where VLM stands for vision language model, where essentially you have a single model that does everything. And the specialty here, the differentiator, is that we are making tiny models. We're talking like 256 million parameters. So, this is like nothing, right? Today, we're talking about trillion-parameter models that people are building. So, a big difference there. These models are so small, but they're very, very specialized, right? So, this can give you essentially both the quality and the cost-effectiveness that you need. So, that was the first iteration, which we got with SmolDocling as a collaboration with Hugging Face. Then we moved on and made a second iteration called Granite Docling, with many more features. And essentially now, as you will see later, we're building the third, most powerful iteration of that as we're going. So, here you see a bit how that model works, and this is just a space from Hugging Face, like a demo space. 
So, it's just, you know, showing you a document on the left side and how the model is essentially generating what used to be called DocTags back then. So, essentially a language that pretty much maps one-to-one to what I mentioned, the unified representation. So, we have our own language for describing this unified representation, and that used to be called DocTags, but now we are in the process of strengthening it much, much more, and we will come back to it later. I'm very, very excited about that. So, you see different types of modalities covered, different types of languages. Here you also have some right-to-left text, which is also covered. So, we did try to put a lot of this functionality into our models, as you see. All right. So, moving on. What are the reasons that, you know, help Granite Docling stand out? In principle, one of the main reasons is going to be the sizing that I mentioned earlier, right? It's going to be the multimodal aspect, which is able to address different things, from tables, figures, and charts to all of the different text elements and so on and so forth, all of them with the bounding boxes, obviously. But also, very importantly, this unified language that we will come back to. All right. So, just exploring one more interesting feature that we have on the Docling side. Essentially, we don't want to, you know, prescribe too much around which models you should use, right? So, in principle, we do allow you to use other models as well within the Docling framework. And one of the ways that we are doing this is by having hooks where you, for example, can bring in your favorite expert model on task X to help you with part of the document. 
So, what you see here, for example, is that you configure the pipeline very easily to say, "Look, I want to use that model from Hugging Face for picture annotation," but it could just as well be any model, like a served model, "because I have this model that's very good for, you know, captioning this particular type of diagrams or what have you." So, this is absolutely possible, and this is something that we see as super important for providing the modularity and the composability that is needed for real solutions out there, because, you know, there are many, many different types of use cases that people are looking at. So, that's an important aspect. You can use other models as well. We are not prescribing only our models. And here comes essentially a completely different way of looking at these problems. What we've been looking at until now is what we call conversion: starting from an existing document and trying to represent it in a truthful way with our representation. But a different type of problem can be that you don't just start from an existing document, but you actually know very much exactly what parts of the document you're interested in. So, that's what we call information extraction, and that's a scenario where you may have, for example, your invoices or your product sheets or what have you, and you come with a predefined schema where you say, "Look, I'm interested in the invoice number. I'm interested in the total value, in this and that." And then, in principle, you will get these aspects out of the document right away. So, again, it's a bit of an orthogonal functionality to what we discussed earlier, but, you know, these two things together, I believe, build a very, very powerful plane for developing very nice applications on the document side. And as you can imagine, agents could not have been missing. 
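The difference between conversion and schema-driven information extraction can be illustrated with a small standard-library sketch. To be clear about the assumptions: this is not Docling's extraction API. The `InvoiceSchema` class, the regular expressions in `PATTERNS`, and the sample invoice text are all invented, and the regexes merely stand in for the document-understanding model that would do the real work. What the sketch does show is the shape of the task: the caller declares the fields of interest up front and gets back only those.

```python
import re
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class InvoiceSchema:
    """Predefined schema: the only fields the caller is interested in."""
    invoice_number: Optional[str] = None
    total: Optional[float] = None


# Hypothetical patterns standing in for a real extraction model.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*([\w-]+)", re.IGNORECASE
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(r"\bTotal\s*:?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract(text: str) -> InvoiceSchema:
    """Fill the schema from plain text; fields that are not found stay None."""
    result = InvoiceSchema()
    for field in fields(InvoiceSchema):
        match = PATTERNS[field.name].search(text)
        if match is None:
            continue
        raw = match.group(1)
        value = float(raw.replace(",", "")) if field.name == "total" else raw
        setattr(result, field.name, value)
    return result


SAMPLE = "ACME GmbH\nInvoice No: INV-2024-0042\nSubtotal: 1,000.00\nTotal: 1,190.00\n"
print(extract(SAMPLE))
```

Everything else on the page (addresses, line items, footers) is simply ignored, which is what makes this mode orthogonal to full conversion.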
Uh everything is around agents nowadays. So, obviously and naturally, we are uh trying to put this um as a very um core focus of what we're doing because a lot of this is used in the context of agents, right? So, that's why we have, for example, the MCP um Docling MCP out there, but also specialized agents that we provide such that you can, in principle, um you know, use Docling from different agentic environments. And you know, you also see different many different logos here like uh from Crew AI to LangFlow. Uh so, that these are different places where Docling capabilities are essentially already integrated. So, for example, LangFlow, I believe the the advanced document parser that the UI uh supports is directly Docling. Um yeah. So, with that, just let's uh look at one last um you know, snapshot of interesting things that have happened recently. One very nice aspect is that we had directly from the community an interest on the Java world, and we saw that uh you know, Docling being a very much native Python thing, uh oh, actually, these capabilities use um uh useful, obviously, also from different communities, right? So, uh we were very happy to also help spin off this other initiative around the Java Docling Java project, uh which is um using the Docling capabilities. A- As mentioned earlier, we have a very long a very nice collaboration with uh Nvidia as well. Here, you see the announcement they made uh around uh part of the model work that we did together. Uh And there are different features that we are constantly um advancing and uh sharing news on, be it on chart understanding, right? Extracting, for example, actual values from your line charts or pie charts, what have you. Uh to how you parse LaTeX and and many different things that we're working on. So, that hopefully solves a bit what uh we had in the beginning that some people did not know about Docling. So, now uh you have an idea what Docling does. 
Um what I would like to do next is uh spend a bit of time on discussing a bit the community aspect and uh you know, how we got here and what what's coming uh next. All right. So, first of all, how we got here? So, when it comes to that, I think it definitely helps that this is not our first rodeo as a team. Uh the team has been building uh document understanding capabilities for uh like decades. Uh you see like this is really not the full list of of papers that we've been doing. It's just uh a shortened version and it that goes up to 2018. I think in reality it goes further back. And uh in principle, what this tells you is that the team has a strong competence on the actual field of computer vision for documents and this is a very much needed aspect, right? You cannot just appear in the field and say, "Look, I built something with uh with an agent and it's two lines and it solves everything." So, the the team's exposure to these problems in such over such an extended period of time is part of the elements that uh have been very important, right? And uh not only that, but also we have been uh building um broader technologies on top of that. So, for example, before um before Docling came to to be, the team was working on another project was called Deep Search. And you see a bit of what it used to look back then. So, this we're talking like 2000 21 22, and we already had things like advanced multimodal rag and with different types of capabilities in there. So, we were already talking to customers and exposed to what the customers are interested in, which sort of use cases would be appealing to them, and you know, essentially this experience helps a lot. So, it was actually through a coincidence that Docling came to existence in 2024, and what we essentially did is we distilled we we took these models out of the Deep Search essentially technology. We put many more things around it as well. And but we were able to reuse a lot of this technology as well. 
Crucially though, what I believe was more important is that we were able to use our experience and our exposure to these problems and to the user needs on that side. So, that's very important. I think this aspect is key, right? Culture and it was mentioned earlier. I think it's so super important. So, if you think of the whole hierarchy here, essentially when we open source Docling back in 2024, and I'm saying back in 24 because by today's standards 2024 is a time like Stone Age. You know, it was not the most self-explanatory thing to do to go out and open source a key AI technology, right? So, you can imagine that there is dynamics at play that it be straightforward necessarily. So, we were fortunate that the standpoint of the company was strongly shifting towards open source AI strategy, right? So, that was definitely one one aspect, but uh you know, I think this is a bit of also trends that come and go in in that sense. So, sometimes you also need this bit of um of luck with aligning with with with a current trend. Um And then the team is the absolute most important aspect that we are most thankful, obviously, of having a very very strong team with mixed skill set, you know? And if we are to lead in this space of document AI, um it is just impossible to do it if you don't stick to the cutting edge to the bleeding edge every single day and really work at a breakneck speed. It is otherwise you're just irrelevant in this field with scale and speed of things happening. So, what has been key in that sense is that everybody on board has retained a strong growth mentality. Uh we never know, right? We always need to to learn something first something again, be it from like, you know, this has been largely an a research team, right? With also some experience software engineering members, but a team that has a strong research aspect, how can you drive a product that is meant to uh you know, be appealing to developers. How can you attract developers, right? 
So, it has to be developer-first experience and that is one of the things that we had to learn. We're obviously adapting to as everybody to the new agentic and work of way of coding, all things like spec-driven development, all all things like that, right? But, uh if I'm to pick up one aspect that would be really having an entrepreneurial mindset around everything. And uh what I mean by that is essentially that um we have been uh cultivating seeds uh in any opportunity possible. We have been planting seeds like crazy over a long amount of time, and you know knock on that door, to the talk to that person, pursue that opportunity, follow up on this forever, relentlessly uh from many, many people on many, many fronts. So, uh I think this is so crucial. And you know, sometimes, if you look at back at at a certain course of events, you may think, "Oh, that's that that was actually a lucky moment, right?" Or people talk about the serendipity. Oh, that was just pure fortune. Yes. And you can actually, in a way, create or support your own luck by being uh so uh relentlessly after going after opportunities. So, uh what we found super interesting, and it was also discussed earlier very nicely is that um open source has been a great place to collaborate. So, um I think if you if you look at the collaborations we have had, and it has been with many different parties at this point, uh you just see some of the logos here that in one way or the other we're collaborating with. It would have been plainly impossible beside the corporate walls. Like, or it would have been so slow that it would be completely irrelevant. So, uh you know, the the scale and the speed that you can achieve in the open is unique. So, for for us, it has This has been an enabler. Uh because the moment you're out, you can talk about your work with everybody. They can get interested in your work. They want to contribute. Everything moves faster. Uh so important. 
And then obviously, you know, us donating the project to the LF AI & Data Foundation has also been very crucial. Obviously, on the surface you get this sort of governance layer, right? Open governance. You get sort of the framework for how you are supposed to operate. That helps, right? And that gives you some structure on many aspects. And you also get some exposure to additional communities and additional, you know, opportunities that may be available through the Linux Foundation. But at the end of the day, I believe the single most important factor is the trust. The seal of trust that the Linux Foundation, or obviously also other foundations, convey through their logo, because a user who is looking at the technologies available out there and trying to pick between Docling and this other technology, or Docling and this other open source technology, will also very much account for the fact that Docling is bound to remain open source forever through the Linux Foundation. So, this is a very, very strong differentiator for many companies around what they actually go for, because they don't want to run into this sort of situation where suddenly something got re-licensed and, oops. You know, these things have happened. All right. So, how did we grow? I think at the end of the day, besides what I've discussed earlier, also by just, you know, showing up, by being there, by being present in all types of events, sessions, venues, be it online or in person. Okay, I put together here a couple of photos, but in principle, we have been very, very active. And if you ever go to the Docling workshops repo, it's on GitHub, where you can also find this presentation, you will see our schedule for this year and for last year, and it's very, very dense, right? So, this spreads the message. The message is thankfully strong. 
People uh are liking it, and that's how we are now at the 60,000 stars. And uh hopefully through your stars as well, we'll go to the duck will move a bit towards the upper and right side. Okay. Now, moving a bit to what's next. Okay, we saw how we got here. Now, what's next? Uh I will start with a bit of the overall what's next, which is the kind of elephant in the room of the growing Okay, the growing community that we are, you know, we we do appreciate, we have, but also the growing AI usage, which is just a um global uh reality for everybody right now. And you see here just some um some of the numbers that GitHub was publishing, I believe was 2 weeks ago or so, where essentially they were trying to to do their own postmortem. And uh how they are essentially also uh facing this massive um sprawl of issues, of pull requests, and every single maintainer out there uh >> [snorts] >> is is facing those. And it obviously, you know, the larger the project, the larger the kind of attention it will enjoy, and the larger numbers it will face. So, uh Uh our team, and I believe the whole community is adapting to this reality. You know, the community numbers are not the numbers that used to be like last year, let alone three or five years ago. So, if people should start thinking about scale, like sheer scale. And I I've heard of I've had very interesting conversations on that one and how essentially, you know, even things like PRs and issues are meanwhile most likely to be thought like just signals that you can cluster it, make something out of. But, yeah, it's a very interesting topic. Okay, but for us, what was also very relevant is like there is also a proliferation of different tooling agents, benchmarks out there. So many VLMs, you just see some of them up on the top right side that are doing documents, right? So, that's it. Another thing that the user is confronted with right now and is confronting with managing those, evaluating, and so on and so forth. 
At the same time though, for us, this is a massive opportunity to drive the document AI space across the whole stack, essentially going from the data model all the way to the models, to the agents, and to the evaluations and metrics. And that's what I will be shortly talking about and closing with in the next slides: essentially our agenda, our roadmap for the months to come, which is addressing these aspects. So, I'll start with the first one, which is very, very close to my heart, which is DocLang, which is essentially a standard that we're putting in place, which is going to be the AI document standard. So, a format that is going to address all of these aspects in a unified way and enable all of these AI workflows that people are building nowadays with their documents, and that the people who make models also have to address, and so on and so forth. So, we are working with the Linux Foundation to do that. We're working with partners. We're very happy that Nvidia signed on to that very recently. We're working with ABBYY, a very big OCR company. So, essentially this is going to be the standard for how people can do AI with documents. And I'm very, very happy about the work that we're doing there. This also naturally flows directly into our new models, which are going to be very strong models that can leverage these aspects. And we're putting in a lot of innovation from our scientists there as well. So, that's something that's going to improve not only the VLM models, which are essentially the core of our focus, but we are also doing iterations of the individual models of our standard pipeline. And last but not least, or maybe second to last, what we also see is that everybody who makes models nowadays also feels the need to publish some numbers, and they do. 
But one of the underlying problems with that is that even the actual metrics that people report on are not meant for documents. So, it's kind of an ill-posed problem. People are using tooling from the history of the broader computer vision community, and people tend to not do the groundwork. Let's summarize it like that. And what we are trying to do in that sense is to really do the groundwork: build the actual metrics that are native to documents, make them open source, build an evaluation suite, make it very efficient. So, that is also going to be open source very, very soon. And I'm very happy about that as well. And obviously and naturally, the company is also trying to capitalize on this having such good traction with the community. So, it is seeing the value of saying, "Look, not everybody needs to deploy their own infrastructure for this and have to maintain it." So, we are also, within IBM, making this an actual managed service where essentially you will be able to just consume it as an API very easily. So, this is, I believe, a natural step. And yeah, that pretty much concludes the talk. If you need to take some messages with you: it is very easy to use. It helps you automate a lot of your workflows around documents. It is very cost-effective, high quality, and your data stays with you. It never needs to leave your laptop. And you can build all types of applications that you want with that. And we are really committed to pushing the boundaries of standardization and innovation in this document AI space across the whole stack, as you saw. So, if you will, feel free to check out our community, go to docling.ai, give us a star. You can join us on Slack. You can join us on GitHub, on LinkedIn, and follow all of the developments of what we're doing. And you are also very much welcome to make contributions and interact with us. 
We also have office hours where you can ask questions and things like that, okay? And I also have stickers for whoever wants to take a Docling with them. So, that's it from my side. >> [applause] >> Question from the chat? >> So, thank you for the presentation. That was nice. So, we have a question from the chat, which is: is there any way for a human to somehow see side by side how some information from the source has been structured or formatted in the target? Because that would help gain user confidence. >> Yes. We have different ways of addressing this question. In principle, we have TypeScript components that people can use, and we also have our Docling serve, which comes with a small UI where you can essentially get a very nice visualization of the different elements that constitute a document, and you can see the layout of the document, the reading order, and everything. So, the whole visualization aspect in the open source space has possibly not gotten the attention that it may actually deserve. So, yeah. But I would still recommend that people have a look at our TypeScript components and how these can be useful for that. >> Okay. Thank you. >> Sure. >> Thanks a lot. Can you say something about parsing legacy documents, like scanned pages from the 1980s which are a little jagged, noisy, and so on, without proper formatting, but which still contain legacy know-how, in particular in mathematics, numerics, physics? >> Yeah. >> Yeah, that's a great question, right? And it kind of brings to the surface the complexity of the whole thing. And you can imagine, the more jagged, the more skewed, the more obscure the writing, at some point it becomes impossible also for a human. Right? So, there's a continuum to that. But we are working very strongly on these aspects on many different fronts. 
So, one of them is essentially synthetic data generation. So, we are working on producing different types of artifacts, different types of artifact modalities. You just described some of them, right? It could be coffee stains, it could be anything, right? It could also be different types of handwriting. So, it's something that you pretty much have to address on the data level, ensuring that it's part of the data mix, right? And it's always a bit opinionated which aspects you consider important. And then also in the evaluation, at the end of the day. But we definitely do see that a lot of the actual use cases relate to such environments: legacy documents, handwritten, and so forth. That's for sure. So, you have a great point. >> Great presentation. Quick question regarding DocLang. Do you see a necessity for specific languages for these use cases rather than using HTML or a more generic thing? And if so, do you find other applications would also require their own language, or are generic languages sufficient? >> Yeah. Great question. And thanks, because apparently I forgot to mention some stuff in my presentation. [laughter] So, yes, we do believe DocLang is catering to very specific, AI-specific needs. And these are going to be around aspects like LLM friendliness in terms of token efficiency, in terms of including the bounding boxes. It's going to be aspects like these that people do care about for different reasons in the whole AI space. So, for example, if you look at HTML, right? Let me just bring up the table modality, how tables are done in HTML, for example. It is not designed for token efficiency. The team developed a language which is proven to be optimal for tables, which is called OTSL, already many years ago. 
So, essentially, this was built to be optimal in terms of tokens. And it was actually on top of that language that DocTags was then produced, out of which DocLang is now essentially emerging. So, it's a sort of natural progression of this whole idea. And then also, if you think about HTML, you're not necessarily able to easily do what we mentioned around the... There is no unique rendering. So, that also makes it a bit less tangible, and things like visual grounding would be done in a different way. So, at the end of the day, it's different languages built for different purposes. Plus, HTML doesn't necessarily have the concept of what a document component is, right? It was meant for visuals in an actual browser. So, it doesn't know the concept of some of these document-specific things. Yeah, I'm very happy to also follow up offline or later. >> Hello? I missed a few minutes of your presentation, so sorry if you already answered, but were there some discussions about putting this into a commercial product? I could imagine this is not really a non-differentiating thing. >> Yes. >> [laughter] >> It is being GA'd soon. It has already been announced. It's in tech preview, principally. So, yes. It's a product that we are starting with, and I'm saying starting with because what we're putting out is essentially the first step, right? It's a super horizontal API service that everybody can convert documents with at scale. But I believe that it would make sense to build on top of that, and build more at a more integrated capability level as well. >> I mean, keeping it completely closed. >> How do you mean, keeping it completely closed? >> So, not publishing it as open source. >> We went the other way around. We started from completely closed. We already were there. And we came to open source. So, yes, this is not going back. 
This is now forever public. >> Cool. >> Perfect. Thank you very, very much, Panos, for this interesting presentation. >> [applause] >> Thank you. Very, very cool here. >> [music]
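
The token-efficiency point from the DocLang answer can be made concrete with a small sketch. This is not Docling code: both helper functions are hypothetical, the OTSL-like symbols `C` (cell) and `NL` (new line) are assumed from the published OTSL description, and "tokens" here means structural symbols rather than the output of a real LLM tokenizer. Spanning cells and cell content are left out on purpose.

```python
def html_structure_tokens(n_rows: int, n_cols: int) -> list[str]:
    """Structural tags needed to describe a plain grid table in HTML."""
    tokens = ["<table>"]
    for _ in range(n_rows):
        tokens.append("<tr>")
        for _ in range(n_cols):
            # Every cell needs both an opening and a closing tag.
            tokens += ["<td>", "</td>"]
        tokens.append("</tr>")
    tokens.append("</table>")
    return tokens


def otsl_like_tokens(n_rows: int, n_cols: int) -> list[str]:
    """OTSL-style structure (assumed symbols): one 'C' per cell, one 'NL' per row."""
    tokens = []
    for _ in range(n_rows):
        tokens += ["C"] * n_cols
        tokens.append("NL")
    return tokens


if __name__ == "__main__":
    rows, cols = 3, 4
    print("HTML structure symbols:", len(html_structure_tokens(rows, cols)))
    print("OTSL-like symbols:     ", len(otsl_like_tokens(rows, cols)))
```

For a plain grid with R rows and C columns, the HTML form needs 2 + R(2 + 2C) structural symbols while the OTSL-like form needs R(C + 1), so the gap widens as tables grow.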

Original Description

Join this session from the core Docling team to learn more about this emerging open-source AI project and go behind the scenes with insights into its journey. We will start by exploring what Docling is and how it can help address key unstructured data use cases — ranging from RAG and information extraction to broader document-related agentic workflows. Beyond current capabilities, we will also provide a sneak peek of exciting upcoming developments on our roadmap. Moreover, we will share key learnings from the Docling journey: the challenges and opportunities of open-sourcing, growing and nurturing an ecosystem, and evolving a research technology to a trusted industry standard — along with practical lessons relevant to anyone building sustainable open-source AI infrastructure. Panos Vagenas is an Advisory Engineer at IBM Research, architecting innovative technologies at the intersection of Artificial Intelligence, Information Retrieval, and Data Management. He has been co-driving the Docling open-source project from its inception as a core developer and Technical Steering Committee member within the LF AI & Data Foundation. Panos holds an MSc in Computer Science from ETH Zurich and has received multiple awards including the NASA Group Achievement Award, the IBM Outstanding Technical Achievement Award, the ACM SIGMOD Best Demonstration Award, and the ETH Zurich Excellence Scholarship.

