LlamaIndex Webinar: LLaVa Deep Dive
Key Takeaways
The webinar discusses LLaVa, a large language and vision assistant model, and its capabilities in following human intent, reasoning about the visual world, and reflecting with natural language. The model is trained on a dataset created by leveraging a text-only GPT to expand instructions and outputs from a small set of seed instructions and outputs.
Full Transcript
Hey everyone, welcome back to another episode of the LlamaIndex webinar series. Today we're excited to be joined by Haotian Liu, author of the very popular LLaVA paper, probably one of the leading open-source multimodal models out there. We've done a lot of stuff with LLMs over the past year, and as these models become augmented with vision capabilities, what are they actually capable of, and what are their use cases? So Haotian, on the LLaVA side, will present a deep dive into what these models are and what they consist of, as well as maybe some future directions. And then we have a Haotian from our side, right? We have Haotian Zhang on the LlamaIndex team, who will present some use cases with multimodal models, especially using LLaVA. So super excited to have this joint webinar, and Haotian, feel free to take it away.

Oh, thank you Jerry for the intro. Hi everyone, I'm Haotian, and today I'm going to present LLaVA, our work on large multimodal models that can follow human intent. We as humans can see and reason about the visual world, and express and interact with natural language. Doctors can read CT scans and explain their findings to patients, teachers teach students just with conversations, and we share our findings on social media and interact with others. So we would like to build a visual intelligent assistant that can actually reason about the visual world and reflect with natural language. The most closely related work along this direction is pre-trained image-to-text generative models, which take an image and output text reflecting an understanding of the image. Such works include GIT, BLIP-2, and Flamingo. They do have basic visual reasoning capabilities, but they generally lack the capability to follow complex instructions or engage in very long conversations. And back in March, OpenAI demonstrated
GPT-4 Vision, which has very strong visual reasoning capabilities, but there has been no disclosure of how it actually works, and it was also not accessible until very recently. So our question is: how can we create multimodal models that can actually follow human intent and do complex reasoning?

In NLP, researchers find that although pre-trained language models that have absorbed billions of tokens contain vast knowledge, they do not necessarily know what our intent is without further tuning, because they are trained with next-word prediction. If you ask such a model to explain the human behavior "cry", it will usually just respond in completion style, saying something like "demonstrate the feelings, communicate silently". You can see from its response that it does have some understanding of the world, and of what crying or human behavior means, but it does not necessarily know what we want it to say. This is just completion style, not instruction following. NLP researchers found that instruction tuning is the key to making a model really follow human instructions and human intent. Specifically, instruction tuning uses a small set of instruction and output pairs that regularize how the model should behave on the user's input. For example, when you ask it to explain "cry", it should say something like "there could be many reasons why people might cry" and give specific reasoning, and similarly for movie recommendation. By tuning on these samples, the model can generalize to unseen tasks at inference time, for example "suggest a movie that explores human behavior". It can leverage its pre-trained world knowledge and also the behavior it learned during instruction tuning, so that it can answer this question much better. To collect such data, one way is to have humans write high-quality instructions by hand, which can be quite
costly. A recent paper called Self-Instruct proposed leveraging a strong language model teacher like ChatGPT to create such instructions at an affordable cost. Specifically, you can just provide ChatGPT with a small set of seed instruction-output pairs as examples and let it expand them to million scale using in-context learning. In this way you can generate a lot of instructions for training the language model, and this has been leveraged to build powerful open-source language models like Alpaca.

So how do we create instruction-following multimodal models? Take a basic architecture where you have a vision encoder that encodes the image into the feature space, a cross-modal connector that makes the language model understand those visual features, and a language decoder that performs the reasoning and outputs text reflecting its understanding. Given this architecture, we want strong model components, and we also need a dataset on which we can train the model to follow multimodal instructions. So how do we create such a dataset? If we take a look at the teachers that are used to create powerful open-source language models, we find that those teachers are text-only, and there are no powerful multimodal teachers. That's why we wanted to create LLaVA. Our first step is to build such a dataset, and we leverage a text-only GPT to create visual instruction-following data. Specifically, we leverage well-annotated datasets like COCO, where humans have provided annotations for the image. If we provide the image context in a textual format, the text-only GPT can understand what's happening in the image, and it can then expand the instructions and outputs to million scale using a modified Self-Instruct pipeline. So we have the image, and COCO has the captions to provide the
image-level context. It also has a more fine-grained layout, the region-level context, where we have bounding boxes and categories. Together these provide image context information to GPT at different granularities. As a more specific example, we first have in-context examples to directly guide the model. The user provides the image context in a textual format to show what the input will look like, and we also provide a sample response showing what GPT should generate, which is a set of instruction and output pairs. The instructions are basically questions about the image, and the outputs are the desired answers. These are the in-context examples that show the model how it should respond. Then, for any image from the COCO training set, we convert it to the same text format as presented here and ask GPT to generate new instruction-output pairs following some predefined criteria. It can also refer to the previous examples for a better understanding of the task. By running this over and over again for each image in the COCO training set, we are able to obtain around 150K instructions, where we have defined three types of responses: conversation, detailed description, and complex reasoning. These three types of responses are each designed for a specific use case of the visual chatbot.

For the model, we use CLIP as the vision encoder and the instruction-tuned language model Vicuna as the language model, and for the projector we just use a single linear layer, which we find to be quite effective. Because the CLIP visual features already contain lots of great visual semantics, we can imagine them as a kind of foreign visual language that the language model can
somehow understand, and you just use a single linear layer to project them to a space that the language model can better understand. That's why it can actually work. We train the model in two stages. In the first stage, we train only this linear projector layer on an image-text pair dataset for feature alignment, so that the images are projected into a latent space that can be understood by the language model. Then, in the second stage, we fine-tune the language model and the projector end to end for visual instruction tuning, so that the model actually learns how to answer visual questions.

After training LLaVA, we do observe several interesting emerging properties. Before we dive into that, we want to quickly revisit some of the properties of our training data: it contains only common concepts, so there is a limited domain; there is no human annotation in our training, at least for LLaVA v1; and there is no explicit OCR data. We find that after training, LLaVA has great visual reasoning capabilities. It understands the unusualness of this image, that the man is actually ironing clothes on the back of a minivan, like GPT-4 Vision does. It can also understand this humorously modified version of the Mona Lisa as a dog, and it identifies it correctly. It also has an emerging OCR capability: it can not only recognize the text in the image, it can also associate that with its pre-trained language knowledge. It recognizes that CVPR is a conference related to artificial intelligence, so people who are interested in AI may be interested in this conference. It's quite amazing, as those are emerging capabilities. So, given that we have trained such an interesting model with great instruction-following capabilities, how are we
going to evaluate that? We proposed to leverage a text-only GPT to evaluate this model. We draw our inspiration from NLP, from the Vicuna team's GPT evaluation, and we just slightly modify our data creation pipeline to do this. We provide the image context, the instruction, a reference model output, and the output of the assistant that we are concerned with. Then we request GPT's feedback on the performance of these two assistants. We provide several criteria, and GPT gives a score out of 10 along with a detailed explanation.

To better facilitate this, we propose a benchmark called LLaVA-Bench (In-the-Wild), which we designed to be challenging. It requires knowledge beyond the training data, it requires multimodal understanding, and we also wanted it to require the model to understand subtle visual details. We have provided very detailed annotations for each image so that the text-only GPT can have a ground-truth understanding of the image. One example from this challenging benchmark asks for the brand of the blueberry-flavored yogurt in the image. The model should first go through all the containers present in the image, since there are multiple containers, and identify the blueberry-flavored yogurt. There is no complete logo indicating the brand in this image, so it can either recognize the partially visible Fage logo, which it may have memorized, and then extrapolate to answer, or it can read the small text on how Fage yogurt is pronounced, relate this to its pre-trained knowledge, and guess which brand it probably is. So it's an interesting and challenging benchmark, and we wanted it to
provide some insights on how the model is performing and also how the model should try to improve. For example, how does it extend to knowledge beyond its pre-trained knowledge? Does retrieval really work? We do want to test these capabilities as well.

Since the introduction of LLaVA, there have been great works from the community extending LLaVA into different domains and modalities, and also developing benchmarks for us to better understand the behavior of these vision-language models. We have also developed our own improved baselines with visual instruction tuning, LLaVA-1.5. To design an improved baseline, we want to figure out the current strengths and weaknesses of LLaVA. We find that LLaVA is very good at participating in visual conversations. In the new benchmarks designed for visual conversations, LLaVA outperforms many of the works introduced after it, some of them one or two months later, and it still ranks at the top of those benchmarks. But LLaVA does have room for improvement. For example, it does not necessarily know how to give good short answers or yes/no answers for academic benchmarks, and it does not have perfect OCR capabilities either, so we want to try to solve that.

One of the problems is yes/no questions, where we find that LLaVA will answer yes to more than 90% of the questions. The main reason is that when we generated our LLaVA instructions, we wanted to reduce hallucination, so we asked GPT to only generate questions for which it can confidently affirm or deny the presence of an object. But we find that GPT tends to generate questions where it can confidently assert the presence of an object, so when it generates a question about the presence of an object, the answer will typically be yes.
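The yes/no imbalance described here can be checked on any set of generated question-answer pairs by counting how often the answer is affirmative. The following is a minimal sketch, assuming the pairs are available as (question, answer) strings; the function name `yes_ratio` and the sample pairs are hypothetical and are not taken from the LLaVA instruction data.

```python
import re

# Matches an answer that begins with the whole word "yes" or "no".
_YES_NO = re.compile(r"\s*(yes|no)\b", re.IGNORECASE)


def yes_ratio(qa_pairs):
    """Return the fraction of yes/no answers that are 'yes'.

    qa_pairs: iterable of (question, answer) strings. Answers that do not
    begin with the word 'yes' or 'no' are ignored.
    """
    yes = no = 0
    for _question, answer in qa_pairs:
        match = _YES_NO.match(answer)
        if match is None:
            continue  # not a yes/no answer, skip it
        if match.group(1).lower() == "yes":
            yes += 1
        else:
            no += 1
    total = yes + no
    return yes / total if total else 0.0


# Hypothetical sample: three "present" questions, one "absent" question,
# and one open question that is not counted.
sample = [
    ("Is there a dog in the image?", "Yes, there is a dog."),
    ("Is there a car in the image?", "Yes."),
    ("Is there a person in the image?", "Yes, a man is standing."),
    ("Is there a boat in the image?", "No, there is no boat."),
    ("What color is the dog?", "Brown."),
]
print(yes_ratio(sample))
```

A ratio far above 0.5 on the generated data would suggest that a model fine-tuned on it may pick up the same tendency to answer yes.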
So this creates some bias. After understanding this, we want to quickly revisit some of the common explanations for why LLaVA is not good at VQA questions compared to other follow-up works like InstructBLIP. One common explanation is that LLaVA does not use a resampler: InstructBLIP uses Q-Former and Qwen-VL uses a visual resampler, so they have additional vision modules that re-encode the visual information, whereas LLaVA simply projects all the patches and provides them to the language model. We guess this could be the reason, but we do want to note that LLaVA has the full model capability, and because we do not downsample the image into resampled tokens, more information is actually preserved. So maybe that is not the key reason. Some other papers say it may be large-scale pre-training, because InstructBLIP uses hundreds of millions of images to pre-train and Qwen-VL uses billion scale, while LLaVA uses less than one million images to pre-train. We guess this may not be the reason either: the vision resampler has been pre-trained with large-scale data, but our visual encoder is CLIP, which has been trained on 400 million samples, so that may not be the reason either.

One reason that has been rarely mentioned is the academic-task-oriented data, the VQA datasets that provide very specific visual knowledge, and we did not use those in LLaVA v1. So we want to study how they can help us improve, and we perform our scaling study on LLaVA. We use three datasets, GQA, MME, and MM-Vet, each designed for a specific aspect of the vision-language model: GQA for short answers, MME for QA with a format instruction like "please answer yes or no", and MM-Vet for open-ended QA. We find that directly adding the VQAv2 data allows LLaVA to
significantly improve on the MME benchmark, where it achieves almost the same performance as InstructBLIP with its 14B model and outperforms it on MM-Vet. But we find an awkward situation after including the dataset, and InstructBLIP has the same issue. When we provide this image of a fridge and ask LLaVA "can you tell me what I can cook with these?", before that LLaVA could identify some of the ingredients in the image and provide a reasonable recipe, while after the training LLaVA will just say "yes" and nothing else. So after training with this short-answer data, the model refuses to provide natural answers. We find the reason is actually simple: some of the prompts that we use to incorporate VQA data are quite ambiguous. Usually the model will reply with a complete sentence when you ask "what's the color of the shirt?", but when we train on the VQA data, we are training it to say "yellow" instead of a full sentence, and the same question also appears in our LLaVA-Instruct data. So the model gets confused about whether it should give a complete sentence or a short answer, and as there is more and more VQA data in our dataset, the model tends to overfit to the VQA datasets. The solution is quite simple: we propose a response formatting prompt, where we explicitly instruct the model to answer in some format, like "answer the question using a single word or phrase", and it works very well. Adding the formatting prompt allows it to surpass InstructBLIP on MME and also MM-Vet. We further add additional datasets like OK-VQA and OCR datasets, so it outperforms InstructBLIP on all those validation benchmarks while LLaVA is only using a subset of its training data. We also add
additional datasets to further scale it up, and our final model, LLaVA-1.5, achieves great performance on these benchmarks. We further extend the evaluation to a wide variety of 12 benchmarks, and it outperforms previous state-of-the-art methods with great sample efficiency, as we use less than 1% of the data that they have been using in training. LLaVA-1.5 is able to generalize to different format prompts. For example, it can be instructed to generate JSON format. It can also be instructed to create prompts for Stable Diffusion, where you give it a very specific annotation on how the prompt should look, including that it should describe this cartoon image in a specific order, and LLaVA can follow such instructions. It has also learned to identify factual errors in questions, so LLaVA-1.5 can handle tricky questions as well. If you ask "what's happening in the desert?" when there is no desert in the image, LLaVA can identify that and say that there is no desert in the image, and that it's actually a beach with palm trees, a city skyline, and a large body of water. It is also starting to integrate its JSON formatting capability and its OCR capability to do some information extraction, although it still has some room for improvement compared with GPT-4 Vision. But we do have an internal version that we may release soon that supports even higher resolutions and performs much better on those OCR tasks.

I think that's all, given the time. I think this morning LLaVA was officially merged into Hugging Face Transformers, so using LLaVA will be much easier, and we'll update the instructions on our GitHub page very soon. Hope you
love LLaVA, and you can try interesting examples on our demo page. I think we will hand the floor to Haotian, and we will do the Q&A later, right Jerry?

Yeah, so this is great. First of all, thanks for a fantastic presentation. Quick note: I realized I forgot to actually announce the format to both of you as well as the audience. I figured we'd just run through two or three quick questions on LLaVA right now, and then Haotian Zhang from our side will present a little bit about some of the use cases of LLaVA. It touches on some of the stuff you mentioned at the end of the slides, like structured data extraction, captioning, even multimodal RAG. So maybe just a quick question here, and this is from the audience: you use CLIP as the vision encoder. Was there a particular reason for that, and do you see improvements?

Okay, that's a good question. The reason we use CLIP is that it has been trained on the 400 million image-text pairs dataset, so it has great concept coverage. This allows us to train the vision-language connector, and the language model, with few samples. Because the vision encoder already encodes great semantics for different concepts, we can make our training pipeline very data efficient. I think this is the main benefit of using CLIP. We have been exploring different alternatives for the vision encoder, but I think the top priority is that the vision encoder has seen lots of samples so that it already covers a lot of concepts. This ensures that we do not need to fine-tune the whole pipeline with hundreds of millions of samples, because that would be too expensive.

Yeah, makes sense. And then the second quick question is: how does LLaVA perform on OCR compared to maybe more specific OCR models
as well as GPT-4V?

Oh yeah, that's a great question. For OCR, the current LLaVA can demonstrate basic OCR capabilities. Our LLaVA-1.5 accepts 336x336 images, so if the images are reasonably within this resolution, LLaVA can do, I guess, okay in terms of this. We do have an internal version where we add more OCR data and also more instructions that require OCR capabilities, and hopefully it will be released by the end of this year, so it's going to be improved. Another important factor for OCR capability is the image resolution. We find that image resolution is the key to having great OCR capability, and we are working on a higher-resolution version so that we can enjoy more benefits of LLaVA on different tasks.

Awesome, well thanks so much. We'll cover some more questions in a joint Q&A at the end. But in the meantime, Haotian from LlamaIndex, are you ready? Just for some context, we have been playing around with multimodal models, including LLaVA, GPT-4V, and other models too. This is to help you and the audience understand some use cases with multimodal models. Obviously LlamaIndex has been focused primarily on the text-based setting over the course of the past year, with things like RAG, agents, and structured data extraction, and you can see that some of these concepts actually translate into the joint image-text setting, as do the use cases. So yeah, Haotian, feel free to take it away.

Yeah, thanks Jerry, and also thanks to the other Haotian for the great presentation. You can see my screen, right? Everyone can see my screen and hear me? Okay, cool, let me start. After the whole intro to LLaVA, I think everyone here has a basic sense of how LLaVA is built and trained, and also of the different use cases
for it. In this part we are covering three use cases using LLaVA. As Jerry said, LlamaIndex was previously mostly focused on text, so we do a lot of RAG based on text. But recently we have LLaVA, GPT-4V, those kinds of large vision models, which empower us to do more with images, like image understanding, and also to improve our RAG system. I will give three examples here.

The first example is pretty straightforward. We have a Tesla 10-K file. A 10-K, as you can see, is basically the financial report for a company every quarter or every year, so there are a lot of tables and a lot of text inside the financial report. One interesting thing is: if we have an image, can we also use our RAG to caption that image? LLaVA is a great vision model, so basically we can do the image reasoning on the image using LLaVA, get the text output from the image reasoning, and then do RAG based on that text. Here we first load the Tesla 10-K file. We're using advanced recursive retrieval to retrieve tables and text from the Tesla 10-K file, and we build an index on the Tesla file. Then for the vision part, this is the image I give as input. My question for this image is: "Which Tesla factory is shown in the image? Please give me a short answer." I found LLaVA is sometimes too verbose, so if I want a concise answer, I basically say, hey, can you give me the name of the factory. It gives the correct answer, though it just says "Giga Texas"; the full name should be the Giga Texas factory. After getting the image reasoning response from the LLaVA model, what I do is query our Tesla 10-K file. This is a giant file containing a lot of information about Tesla. I do the retrieval based on the LLaVA response, which is basically the Tesla factory, and I get a lot of nodes. There are some nodes related to tables and some nodes with
just text from the PDF file. This file is external knowledge about Tesla. From LLaVA itself, I don't think it can capture all the information about Tesla, so this external knowledge can help us better understand what the image is about. So I ask our RAG system: hey, can you do the retrieval based on the query generated from LLaVA, which is basically the Texas factory, and generate a final answer that gives more information about the Texas factory? It says the factory refers to the Gigafactory located in Texas, and that it has some equipment and so on, as shown in the response. The basic idea is that by combining the large vision model for image reasoning with our RAG system and its text-rich information, we can generate a better understanding of the query and of the user intent. So this is the first example.

The second example is very similar to what the other Haotian introduced. We also found that LLaVA is very good at outputting structured data. I can give an example, especially for the e-commerce, search, ads, and recommendation areas. A lot of times we have products, and people usually do web crawling for those products, crawling the product title, description, brand, etc. But we find that we can just directly use the image, a screenshot of the post. This is an ad post from Instagram, and it shows the Air Jordan shoes and also the brand and the price. So we basically directly ask LLaVA what information is inside this image. Here we are using an advanced structured output that we call Pydantic. Pydantic gives us a class where we can set attributes for what we want output from this image: account, brand, product, and so on. Then we ask LLaVA: hey, can you generate the class attributes for this image? This is my prompt: "Can you summarize what's in the image and return the answer in JSON format?" We send the result to this Pydantic class to fill every attribute of the class, and we can
see that by calling the LLaVA 13B model, it can generate the account, brand, product, category, everything, very accurately. So instead of crawling the data from a web page or from text, we can potentially use image reasoning directly, along with some OCR model, to understand what information and attributes are in this image.

The third example is that after we have this kind of Pydantic structured output from the image, we can build a multimodal RAG system. Multimodal RAG is basically a retrieval system where we not only have text embeddings, we also have image embeddings. We have text and we have images, and we build different vector DB stores for the text and image embeddings. Here I built a dataset using some random Wikipedia pages, because Wikipedia usually has images and also text. I crawled some data from different types of Wikipedia pages and built the multimodal vector index. You can see here we are using one vector DB, Qdrant, and we are building two separate vector stores. The text store holds all the text from Wikipedia, and the image store holds all the image records from the same Wikipedia pages. We load them into our multimodal vector DB and then do retrieval. These are all the image records from Wikipedia. Our query here comes from the second example: it is just "Air Jordan", the brand, and we want to find all the relevant information for this specific query. In the results shown here, these are the top three retrieved images most similar to the query "Air Jordan", and we also return the text nodes, which are the text most relevant to this query. After we have the images and the most relevant text, we can ask an LLM, for example GPT-4V or another LLM: according to these images and the top-k retrieved text,
what information can we summarize? My final prompt for this query is: "Can you tell me more about this brand?" The brand in the Pydantic response is Air Jordan, and according to the retrieved images and the retrieved text, the model can summarize a good response to answer this question. So I just demonstrated three simple use cases in LlamaIndex, where we are trying to leverage the LLaVA vision model to improve our current RAG system. So this is my part.

Great. I know there were three main sections there, and there's actually a lot in each of those. If you think about the first one, retrieval-augmented image captioning, the idea is that given an initial image caption, can you augment it with additional text from a knowledge base? The second piece is the structured output extraction, which both Haotians demonstrated. And the third is how you actually plug in RAG pipelines and add images in addition to text as inputs. So we have a lot of these different use cases. This one notebook, I think, is primarily focused on LLaVA, right? I know you mentioned using GPT-4V; I don't know if this is using GPT-4V or mostly LLaVA, but either way, we integrate with a variety of these models and we're constantly exploring new use cases. So if there are things that you're interested in trying out, or potential applications of multimodal models in your setting, please let us know, because we have probably ten-plus guides on multimodal stuff right now, and we're very open to exploring more here.

Cool, so I know there are some questions in the chat. By the way, the links to the notebook and the Colab are both in this Zoom chat as well as available on our docs page. In the meantime, I figured we could just ask some questions to both Haotian Liu and Haotian Zhang, both
on the research side as well as on the application side. Some of these questions might be more applicable to some of you than to others. Maybe to start with, and this is just running through questions from the audience: can LLaVa be used to answer questions about tables? Maybe Haotian Zhang, you can take that first.

I think I can take it. If you check the LlamaIndex repo, we recently added some examples parsing tables directly from PDFs, and we tried different models, mostly GPT-4V, and there's also Table Transformer, a table extraction model from Microsoft. To be frank, we found that the LLaVa 13B we are testing is not very good at extracting the table contents, especially when the table from the PDF is pretty complicated; some tables have different columns and different rows. But we found the best solution here is this: first we take a screenshot of each PDF page, then we use the Microsoft package to identify the location of the table, then we extract the table from the PDF page and send it to GPT-4V for understanding. Somehow that gives us the best results compared with the different options. And I will leave the question of LLaVa's quality for understanding tables to Haotian Liu; I think there are still a lot of unsolved problems, especially for tables in PDFs. So this is my take.

And then Haotian, do you have things to add?

Yeah, I just want to quickly add: first, parsing and extracting tables is a very important application, and we've been working on that. The main reason is that, for current multimodal models to complete a task, you either need a very large capacity for the ability to emerge if the model is not instructed to do so, or you need some specific instructions teaching the model how to solve the task. So LLaVa currently relies mainly on its emergent capabilities for
solving these tasks, because the current released version has not been fine-tuned on those tasks that much. So hopefully it will be much improved in our later update. But I think it's generally like this for most of the existing open-source models. If you find the model does not perform well on a task, first, it is very likely that it was not trained to perform that task, or has not been trained on a similar task, so it cannot easily extrapolate. Second, it may require the model to be trained at a large scale. Take OCR, for example: the model needs some base capabilities for different tasks. Extracting a table needs the OCR capability, and it also needs the language capability to output the table in a structured format. So I think we can think of it this way. Also, I do see that some community members have found that fine-tuning LLaVa on some table data directly can improve that performance, because you have now provided instructions on how the model should solve this task.

Great. And maybe a question for you, Haotian Liu. Speaking of instruction tuning, I actually had a question that might also help elucidate this for the audience. In terms of general instruction tuning data, you mentioned using a text-only model like GPT-4 to help you generate, given some inputs, more detailed responses. Could you walk a little bit more through how that process works? One thing I noticed is that in addition to the image caption itself, you actually put the bounding boxes in text form, and I'm curious if that was, in your mind, a key trick that really upgraded the quality of this instruction tuning data, and if you think there are future directions here to use text
based models with more inputs to help improve the quality of the dataset?

Yeah, that's actually a great question. So basically the key is that we want to provide as much information as possible to the text-only GPT, so that it can know what the image looks like. And even though we have GPT-4 Vision now, and let's suppose that at some point in the future it will not have a quota limit, it's still not perfect at all tasks. So if we do have some ground-truth annotations that we can leverage, accurate information that GPT can use directly, it can very much improve the quality of the generated instructions, because now the answers are more likely to be correct. So going back to your question of why we want to add the bounding boxes: the captions provide an image-level description, and the bounding boxes provide more fine-grained details on where the objects are actually located. I think one extension of this is: let's suppose that we want to create instructions for table data, and let's say that we actually have the source markdown for creating those tables. Then we can provide that markdown to GPT to tell it what the table looks like, and ask it to generate questions about this table, or to generate a question about reformatting this table. And it can, because those markdown tables are just ground-truth tables and there will not be errors, whereas if you just provide a screenshot to GPT-4 Vision, it may make some mistakes in understanding those tables. So basically, I think the key is, first, to provide it with as much information as you can for the task or samples you want to create, and second, to provide information that is as accurate as possible. If you can provide some information so that GPT-4 does not need to hallucinate or try to recognize it by itself, that will be best.

And maybe this is kind of a dumb question, but a quick follow-on: if you parse the markdown, and you're able to do that well, then a lot of users might just stick with a text model, right? So is the additional advantage of using a multimodal model, given that you're also using a text model to help generate some training data, the fact that you also have these additional general visual inputs and ground truth?

Yeah, I think that's a good point. We do want to make it clear that although we generate the instruction data using a text model, we are ultimately training a multimodal model. The reason we provide the markdown is that we want to give it accurate information on what the table looks like, so that, for example, if you ask questions about the table, it can retrieve a very accurate answer that will not contain any OCR errors. And in terms of training the model, for example if you train LLaVa for table understanding, you're still providing the screenshot of the table; then you ask it, for example, to retrieve some values from this table, and you have an answer, and those instructions are generated by a text GPT. This way, when you train this model and users are using this LLaVa variant that is fine-tuned for the table use case, the users are still taking a screenshot and asking questions, and LLaVa will refer to the image and give the answer.

Great, makes a lot of sense. The next question is for both of you, and we could start with Haotian Liu on the research side and then move to Haotian Zhang on the use case side: what kind of future directions are you excited about, on the model side as well as on the use case side?
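As a rough illustration of the data-generation idea discussed above (describing an image to a text-only model through captions and bounding boxes written out as text), here is a minimal sketch. The names `Box`, `describe_image_as_text`, and `build_instruction_prompt`, the text layout, and the sample annotations are all hypothetical; they are not taken from the LLaVa codebase, and the exact prompt format used by the authors is not stated in the webinar.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One annotated object: a label plus a pixel-space box (x0, y0, x1, y1)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def describe_image_as_text(captions, boxes, width, height):
    """Render ground-truth annotations as plain text for a text-only model.

    Box coordinates are normalized to [0, 1] so the text does not depend on
    the image size. The layout of the text is a hypothetical choice.
    """
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects (label: x0, y0, x1, y1, normalized):")
    for box in boxes:
        coords = (box.x0 / width, box.y0 / height, box.x1 / width, box.y1 / height)
        lines.append(f"- {box.label}: " + ", ".join(f"{v:.2f}" for v in coords))
    return "\n".join(lines)


def build_instruction_prompt(context, task):
    """Wrap the textual image description in a request to the text-only model."""
    return (
        "You are given a textual description of an image. "
        "Answer as if you could see the image itself.\n\n"
        f"{context}\n\nTask: {task}"
    )


# Hypothetical annotations for a 640x480 image.
context = describe_image_as_text(
    captions=["A person rides a bicycle past a parked car."],
    boxes=[Box("person", 128, 96, 256, 384), Box("bicycle", 96, 240, 320, 432)],
    width=640,
    height=480,
)
prompt = build_instruction_prompt(
    context, "Write three question-answer pairs about the scene."
)
print(prompt)
```

The same pattern would extend to the table case mentioned in the discussion: the markdown source of a table could be placed in the context in place of the captions and boxes, while the model being trained still sees only the screenshot.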
That's a great question, and one we're excited to share about. We think the current open-source models have two main weaknesses. The first is the lack of instructions for real-world applications, which is an extension of what we were talking about with the tables: for some use cases, like understanding tables, doing more complex reasoning about memes, or maybe structured data, the model has not been instructed to do so. So we're trying to create as much application-driven instruction data as possible, so that it will have more real-world application value. The second is the key limitations inside the language model and the multimodal model. For the language model, because the base models are currently 13B, they may have limited language reasoning capability, so some of the problem-solving performance is bounded by the language model. Also, the input to the language model is not large enough, so some of the details cannot be captured: for example, when you're taking a screenshot of a document, 336x336 is definitely not enough. So we're scaling up the resolution, trying to solve this from both the model side and the dataset side to make it more ready for application use. That is for images. And for multimodal, we definitely want to see what the possibilities are for a unified model that can accept images, video, and audio and process them simultaneously, and how to handle large memory, which we believe should be very relevant to RAG systems. For example, I saw someone asking about Gemini, and I guess if you really want to create the demo video that Gemini demonstrated, you will need a huge
context window, and you will also need either a good retrieval system or a good model to understand which content in the history is relevant, the content you really want the model to consider when you are answering the question. I think that's the answer from my side.

Yeah, makes sense. And then Haotian?

Yeah, I totally agree with Haotian. For me there are two things. One is horizontal. Horizontal means that we may have different types of input, like video, audio, and images, especially for retrieval, for RAG. In real industry settings we may have text, images, everything: how can we build a better retrieval and ranking system that uses all the information we need? Sometimes, for example, the image and text may not match each other, and some sources provide polluted data: how can we purify the data while also using different sources? That's one thing. The other is vertical. We have a lot of great models: Gemini is coming out, and GPT-4V is improving. Those models are better at some specific tasks, such as OCR or table parsing. Those tasks can be improved so that our retrieval system, ranking system, or whatever system can use those capabilities to improve precision and recall. That will also be very interesting. So that's my take.

Sweet. I think that's basically it in terms of the main questions. Of course, if you have more thoughts or questions, please feel free to join our Discord, or just send Haotian Liu questions about the LLaVa model itself. This was a great session; thanks to Haotian Liu and Haotian Zhang for taking part in this webinar. Congrats, by the way, on the NeurIPS oral; I know it's happening next week, but that's a great achievement. And to everyone in the audience, I definitely highly encourage you to think about and check out some of our multimodal use cases. We're pretty excited about
this. We think, especially as these models get better and faster, as LLaVa gets better, and as some of these proprietary models like Gemini come out as well, there's going to be an emerging class of use cases beyond just pure LLMs. So I definitely encourage you to check it out. Thanks again for coming, and have a great Friday.
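On the resolution point raised in the Q&A (that a 336x336 input is too small for document screenshots), a back-of-the-envelope calculation helps show why. The sketch below is only illustrative: the page size, the font size, and the 14-pixel patch size are assumptions chosen for the example and were not stated in the webinar, and the function names are hypothetical.

```python
def line_height_px(page_long_side_in, font_size_pt, image_long_side_px):
    """Approximate pixel height of one text line after the page is resized.

    Assumes the page's long side maps onto the image's long side and uses
    the typographic convention of 72 points per inch.
    """
    pixels_per_inch = image_long_side_px / page_long_side_in
    return font_size_pt / 72 * pixels_per_inch


def patch_grid(image_side_px, patch_side_px):
    """Number of non-overlapping square patches a square image is split into."""
    per_side = image_side_px // patch_side_px
    return per_side * per_side


# Hypothetical document: a US Letter page (11 in long side) set in 10 pt text.
for side in (336, 672, 1344):
    height = line_height_px(11, 10, side)
    print(
        f"{side}px input: ~{height:.1f}px per text line, "
        f"{patch_grid(side, 14)} patches at an assumed 14px patch size"
    )
```

Under these assumptions, a line of 10 pt text is only about 4 pixels tall at 336 pixels, which is too small to read reliably. Raising the resolution helps legibility, but the patch count grows with the square of the side length, which is one reason scaling up is not free.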
Original Description
In this webinar we're excited to host Haotian Liu, author of LLaVa (Large Language and Vision Assistant) - a ground-breaking series of open-source multi-modal models that are competitive with GPT-4V.
We do a deep dive into the model itself, and we also do a short presentation on multi-modal use cases with LLaVa + LlamaIndex from Haotian Zhang on the LlamaIndex team.
This is going to be an exciting webinar, don't miss out!