Mixed Attention & LLM Context | Data Brew | Episode 35

Databricks · Intermediate · 🧠 Large Language Models · 1y ago

Key Takeaways

This video explores innovative approaches in Large Language Models (LLMs) with a focus on Retrieval Augmented Generation (RAG) and mixed attention mechanisms, discussing their impact on improving efficiency and reducing operational costs.

Full Transcript

Welcome to Data Brew by Databricks with Denny and Brooke. The series allows us to explore various topics in the data and AI community. Whether we're talking about data engineering or data science, we interview subject matter experts to dive deeper into these topics while enjoying our morning brew. My name is Denny Lee. I'm a principal developer advocate here at Databricks and one half of Data Brew.

And hello, my name is Brooke Wenig. I'm the director of our machine learning practice and the other half of Data Brew. Today I'm thrilled to introduce Shashank Rajput, who is a research scientist at Mosaic and Databricks. Welcome, Shashank. How did you get into the space of LLMs? I know you have a very academic background, so perhaps we could start there before we dive into the main topic of today, which is all about mixed attention.

Sure. So before joining Databricks I was a graduate student at UW-Madison, and for the most part of my PhD I actually worked on theoretical and mathematical aspects of foundational machine learning and optimization. But then, I think when ChatGPT came out, everyone got really excited, and I looked at it and I was like, damn, I'm really interested and I'd like to see how it really works. So then I started understanding the theoretical aspects of how it works, and mathematically understanding how powerful the Transformer architecture really is. The Transformer architecture is what powers all these LLMs. So then we discovered that it's powerful enough to actually simulate a full-on computer. So I got more and more interested, and then I did an internship at DeepMind, and there I further worked on Transformers. We used Transformers to build one of the first generative recommender systems, so applying the generative AI technology to recommender systems. And then, yeah, so I was interested in this and then I joined Databricks, their pre-training LLM team, and then I've been
learning ever since. And yeah, that's how I got started in the LLM area.

I just went through the State of [AI] report, which showed like 74% of all of the LLM-based paradigms still using Transformers. So I know there's a few alternative proposals out there, but that one's definitely the main one. And speaking of Transformers, from the original paper, "Attention Is All You Need": before we get into what is mixed attention, perhaps we could just level set to make sure everybody understands what is attention.

So if you think about it, let's just take LLMs, which take text as input, and let's just say that each token is a word. Technically tokens can be different from words, but let's just assume that the LLM sees a sequence of words. The way LLMs process each word is: there's a feed-forward part, which is the part of the network that just looks at that particular token, essentially, and then there's the attention part, which looks at the other tokens. So you can think of it as a grid. The FFNs, the feed-forward networks, process each token. If this is your sequence, then the FFN processes each token individually, and then attention looks at all the other tokens in your sequence. Well, it also looks at yourself, but it also looks at every other token in the sequence. And the way it does that is, for each other word in your sequence, it creates something called a key and a value vector, and then for the current token that it's processing, it computes something called a query vector. So it's like, hey, I have this word, and this is the semantic embedding, which is a query vector for this word, and I would like to query all the other words in my sequence to say which one aligns the most with this query, so whose key vector aligns the most. And that's how it tries to judge the importance of any other word in your
sequence. So let's say you're processing a particular token and you're basically trying to output the next token. Then you'll see, in the simplest words, which other token is the most important for the current context. That it does using attention, and using these key, query, and value vectors.

And so, quick question with that. This is more of a newbie question: when you say sequence, when it comes to the tokens, is the sequence in this case a sentence, or is it a phrase, an entire paragraph? I'm curious as to how that feed-forward or that attention is applied.

Yeah, it can be as big as you want. It could be just a sentence, a paragraph. So when people say that there's this new model which supports 100,000 context length, or a million or 2 million context length, it's essentially that you can feed all that, like 1 million words, into the Transformer and it will look at all of those.

Okay, gotcha. Well then I guess that naturally makes me want to ask a question: what's the drawback of standard attention? That seems like a grandiose solution that solves everything, but obviously there's some drawbacks. So why don't you tell us a little bit about that in that case.

Right. So the main drawback is, well, it's kind of a pro and con in one. The thing is, attention looks at every other word in your context, right? So it has the capacity to look at everything in one go. That's a good thing, but it's also a bad thing, because it's spending too much time and memory looking at every other word. Whereas most of the time... let's say you have asked the LLM to write a long novel or something like that, and let's say it's at the millionth token. You don't really want to look at all the other tokens, necessarily. Most of the time the next word that you're trying to output can just be derived from the past 100 or 200 words, even if you're writing a novel, right?
And let's say you were writing Harry Potter, and it's "Harry Potter swished his," and then the next word would probably be "wand." You don't need to look at all the previous Harry Potter tokens to know that the next word is probably going to be "wand," right? So basically, the advantage is that it can look at everything; the disadvantage is that looking at everything is computationally expensive and also takes memory. And this is both during training and inference.

Got it. So then, based off of that context, I guess this leads on to talk about this idea of mixed attention, right? So what exactly is mixed attention? I believe it's experimental, but please clarify.

Yeah. So as I said, the attention mechanism looks at all the other tokens, all previous tokens in your context, but it doesn't really need to, always. In particular, this attention mechanism is present in every layer of your Transformer, so if your Transformer has 100 layers, each layer will have this attention. And there have been works which show that even something like sliding window attention works pretty well, which basically means instead of looking at all the other tokens, you're only looking at, say, the last 100 tokens, or last 200, or last thousand tokens. That's why it's called sliding window attention, because your attention window literally is just sliding: as you keep outputting more, it keeps sliding, so you're always looking at the last thousand tokens. That saves a lot of memory during inference, and it's also faster. But this mechanism did lead to some degradation in model quality. In particular, as you can see, if you're only allowed to look at the past thousand tokens, you won't be able to answer, let's say, a question which occurred 1 million tokens ago. And there's like
even theoretically, you can show that with the sliding window, the maximum context length theoretically possible is your sliding window size times your number of layers. So if you have a 10-layer network and your sliding window is 1,000 tokens, the farthest token that you can theoretically attend to in the past is 1,000 times 10, which is 10,000 tokens back. But in practice it's much less than that. But what we saw was, if you add just a few full attention layers, let's say you have 10 layers in your model and you have just two full attention layers and the rest are sliding window, we saw that you were able to recover a lot of your accuracy, a lot of your model quality on longer context. So that's one thing.

And the other thing which we were looking at was this concept of KV cache sharing. There have been some papers on this. So remember I told you that each word has its query embedding, and it tries to match it with the other tokens using their key embeddings, and then they also use value embeddings. During inference, what happens is these key and value embeddings are essentially the representation of all the other tokens. That's what you store as your context, and so you really want to minimize that. People found that if you share the KV cache, the KV representation, across layers, so that not each layer has its own key and value, you're still able to get good accuracy and good model quality. So we were playing around with not just this but also some other ideas, like adding Mamba layers in the middle, which is a completely different form of... well, it's not even attention, it's a different way of processing sequences. So we were playing around with these, and that's when a blog by Character.AI came out, which was also exploring the exact same ideas of mixing these up. But they also, they had
already, they said that they had productionized it and they were serving huge QPS with those models. So we knew that this was going to work, because some other people had already productionized that model, so whatever we were exploring was valid. That gave us a boost. We were already trying out some different configurations, but then we decided to choose just the ones that they had talked about a little bit. What they were doing was using sliding window attention layers interspersed with full attention layers, but then also sharing the KV cache, which is also something that we were trying. They had a particular configuration that they had shared, so we started off with that. And we saw that just using that naively actually did not give us good results. So we started playing around with changing the configuration, and then we found that changing the configuration even a little bit makes the model quality change drastically. So we were like, okay, let's just study this: how should we arrange the sliding window layers, what should be the sharing pattern, and all that stuff.

So thank you for all that context. It really seems like this is all just a question of trade-offs: full attention is great, but it's going to be costly for both training and inference, as well as potentially unnecessary. And so with mixed attention, it seems like you kind of get the best of both worlds: you can get improved performance without necessarily sacrificing too much on quality. Is that a decent synopsis?

Yeah, yeah. That's what our experiments showed. We saw some degradation on some metrics, but we weren't sure if that was because we just didn't train them enough, or whether there exists some other configuration which would be able to recover that quality. But for most of the other metrics, the quality metrics that we care
about, we saw we were able to retain the same quality.

And before we dive into the metrics and how you evaluated it, I'm just curious: is there a difference in the network itself whether you have the full attention at the beginning or at the end? Does it really matter where that happens in the network?

Yeah, that's a great question, and that's actually one of the things that we explored in the paper. We found that having full attention in the very first layer actually did not help at all, which was surprising, because I always assumed that in the very first layer, in the very beginning of processing, if you're able to look at every other word in your context, it would be helpful. But it turns out it's not. The way I mentally justified that was: initially you want to look at the local context, and then you figure out what you really want from your longer context. But I don't know if that's how it's really happening. But yeah, having the full attention layer in the beginning does not help, at least in our experiments, but having it in the third or fourth or fifth layer helps a lot. And also, we didn't experiment with having a full attention layer only in the last layer, where all the others are sliding and only the last layer is full attention, so I'm not sure how that would work out. But that would be a great thing to try.

It's really interesting. I definitely want to understand a little about how those models are evaluated, but what I found really interesting about it, especially when you're talking about sliding windows and caching, is that you're literally reminding me of the L1, L2, L3 GPU or CPU caches. And I'm wondering in some ways if mixed attention is analogous to that, where in some cases you're going to use sliding windows, just like in some cases you use the L2 cache, which is embedded with the chip itself, versus the L3
cache, which is around the whole system, which is analogous to full attention. I don't know, I just happened to be thinking about it that way when you were talking about it, that's all.

Yeah, I mean, there are some analogies there, because sliding window is going to be faster, but if you really need longer context then you need full attention. But I think in an L1, L2, L3 cache system, if you have a cache miss you go to a higher level, right? Here in sliding window there's no such concept. It's not like it's a dynamic window size, where you run attention and then you find out, okay, I'm missing something, and then you increase the sliding window length and try again. But after a few layers, if you have a full attention layer, that's when you'll be able to...

Yeah, gotcha. So in that sense it's not a hop from L2 to L3, it's more a matter of just keep using the sliding windows and you're fine, because later on in one of the layers you'll have full attention anyway, so you're good to go. You'll end up getting a hit. Okay, got it.

Okay, so this naturally leads me to ask a question: how do you even evaluate all these models? What's the mechanism by which you understand what's effective and what's not effective? When you build the architecture, how many layers do you want to build it with? How is this evaluated?

Yeah. So when you design the architecture, our team had these different sizes of model architecture. I think some are pretty standard. They had an architecture grid which we had predecided, which was more of a trade-off: the bigger the model, the more GPUs it will need for training, and also the more time and computation it needs for inference. So for example, for these particular experiments, I knew,
let's say, I would have a budget of X GPUs and I want to run so many experiments, so what's the size of the model which I can viably train in this much computational budget? That's how, essentially, we decided the model size for these experiments. But to evaluate the quality we have multiple metrics. The first thing you look at is just the loss, how it goes down. Say you're comparing two models, model A and B, and everything else is the same: they're trained on the same data set and you have trained them for the same amount of time. Then if one of them has higher loss than the other, then probably that's a worse model. So that's the first thing you look at. But that thing does not work super well for long-context models, because in the end loss is just about predicting the next word. The way you compute the loss is, you give it a data set and you train on it by asking it to predict the next word, and you compute the loss on that. And for the bulk of the training you're training it on just normal internet data, and as I said, most of the time you don't really need to know the entire context to predict the next word, right? So then you need to have evals, evaluation data sets. You have some long-context evals that exist out there. You have the existing short-context ones, which are the ones that people talk about, MMLU, HellaSwag, and all these, but since my focus was on long-context abilities, we were looking at long-context evals. One of the first evals that was proposed long ago was this: you have this huge piece of text and you hide a key, or needle, in there somewhere. The task is called needle in a haystack. So you hide some
particular text in there, and then at the end you ask the model, hey, what was the text I hid in this context? So it's a million-length or 100,000-length text, and somewhere in there you hid a little word, and you're asking the model to retrieve it. That was one of the first evals that was proposed for long context, but now there are even more evals. For these experiments we chose a particular eval set which is called RULER. They do this needle in a haystack, but they also have some other interesting evals. One of them is the question answering eval, where you take, let's say, 100 Wikipedia articles, you concatenate them, and then you ask a question about any one particular document in the middle. So the model has to look at the question, figure out which of all these 100 documents contains the relevant answer, and then it has to go there and figure out the answer. This is kind of similar to what would happen if you have retrieval augmented generation, where you have a bunch of documents and then you're asking the LLM to do something with them. So yeah, these are the kind of evals that I use.

Yeah, the needle in the haystack one is a fun one. I remember, I think when Claude first came out, people were putting a random fact in there, like about a squirrel or something, and it would reply back with, "Here is the answer, but it feels like you're intentionally doing this to try to trick me." The models are definitely being trained. I guess a question for you: are more and more models being trained with needle-in-a-haystack-like evaluations?

So for closed labs, I don't really know. I don't know what OpenAI or Anthropic is doing. But it seems like for the open source models and the models that we are training, we're not explicitly training on
needle in a haystack. In fact, one big part of what we're trying to do is figure out what evals are really important, let's say for our customers or for the public. Needle in a haystack was one of the preliminary evals, but it's not really very useful, right? If you really want to search for something in a huge piece of text, just search for it using grep or some other computer functionality. Transformers are a really expensive way of doing that, and also error-prone. So one really important direction of research is just figuring out what kind of evals people really care about. And then once you have the eval, you can think about how you can train a model to be good at that task.

So for example, for long context, one problem is that you don't really have a lot of high-quality long-context data, so people resort to using synthetic training data sets. For these experiments we did the same thing: we took Wikipedia articles, concatenated them, and then asked questions about the articles and then trained on the answers. This sort of thing can be done like so: you can take an existing LLM and give it one particular Wikipedia article and ask it for a question about it, and it'll create a question and an answer. Then you concatenate all the hundreds of Wikipedia articles together, give an LLM all of these, and ask it to answer the questions that the other LLM generated. So basically you can use an LLM which has a shorter context to generate questions and answers, and then synthetically just pad it with a bunch of other Wikipedia articles, and then train it on answering that question.

Very interesting.

Yeah, there are other ways of doing synthetic long-context data generation. One is just summarization: you take a huge text and create a summary. So the way you create a
summary is, you first chunk that long article up and then feed the individual smaller chunks to an LLM and ask it to summarize those, and then feed all those summaries into another LLM to summarize again, and you keep repeating it until you have a small enough summary. And then you train a model on the entire context to produce that small summary. So yeah, there are various ways of doing it.

I'm just curious: you just said use another model. Can you just use the same model to summarize, but with different prompts, or are you actually literally using a different model, just so that way you have different context? Okay, I just probably overloaded the term context right now, by the way, so my apologies.

No, so you can do it either way. There would be advantages of using the same model and there would be advantages of using a different model. The advantage of using the same model is that there wouldn't be a huge distribution shift in your generated output, so the LLM wouldn't be confused. It would focus on just gaining long-context abilities, essentially, instead of trying to learn the distribution shift. Because let's say model A is being trained by you, and you're using model B, which was trained by another company, and model B starts off its output with some interesting style. Your model will end up starting to learn that style instead of what you really wanted it to learn, which is long-context abilities. So that's one. But the other case, where it might be useful to use a different model, is if that other model is really, really good quality. Let's say you're training a smaller model, let's say Llama 8B. I think that's what they did for Llama. Actually, for Llama they trained it with 8B outputs. But let's say you have a way of generating high-quality
answers, but those answers come from a different LLM. Then you'll use that, because that's higher quality. But there could also be another case where the higher-quality model is really slow to generate outputs, so in that case you might compromise and say, okay, we'll use a faster model which does not have as much [quality]. So yeah, there are all these factors, and you might really try to figure out which one works best for you.

So I have a lot of follow-up questions for you just from that. The first one is about the distribution shift. Would you see that if you were to use the Llama models but different sizes? Because they are presumably trained on relatively similar data. So I'm curious if you'd expect to see that distribution shift within the Llama family, just different sizes, versus Llama versus any of the proprietary models.

Yeah, within the Llama family you probably wouldn't see it. In fact, there isn't much distribution shift, and that's intentional. Not only did they train on similar data, but also the smaller models were distilled on the larger models' outputs. When you distill, you're really enforcing that not only the output but the output distribution matches. LLMs don't really just output a token; they output essentially a probability distribution over what the next token could be, so you're forcing it to match that probability. When you say distilling, most of the time it just means you're trying to match the probability distribution. Distillation has other benefits as well, but one of the benefits is that you match the output distribution. And this really becomes helpful in other places as well, like if you're doing speculative decoding, where you have a larger model which is verifying your outputs but a smaller model which is actually generating the outputs. So in that case, if their distributions
are different, then the larger model might say, hey, this is a wrong output, whereas the output could be right, it's just saying it in a different way. So having a family of models which have similar distributions is really helpful, and you might train them that way to ensure that.

That totally makes sense. Thank you for the additional context. Denny, I did the same thing as you, overloading the word context. I have to remove that word from my vocab. I don't even want to say the word vocabulary.

Yeah, now we're going to overload vocab at this point.

Yeah, exactly. So in terms of using an LLM to generate question-answer pairs based off of documents, I know one of the drawbacks is you won't be able to generate a question-answer pair where it's like, "I don't have the information to be able to answer this question," because if you're generating from the document, you would have the answer, presumably. And so I'm curious, when you're training models, how do you avoid that problem?

Right, so that's a very interesting question. In fact, that's something that I haven't done in my experiments yet, but I really want to do next. I don't have much context on how people do that, but I'm pretty sure people do that, because there are evaluation benchmarks which do exactly that: they provide a paragraph and then ask a question which you can't answer based on the context, and so then the LLM's task is to say, hey, I don't have enough information from the context to answer this, instead of just hallucinating something. So there are evaluation metrics and benchmarks out there. Because we're working on a follow-up of mixed attention, I really want to train on that kind of data, where it's trained to answer, hey, I don't have enough information to answer this. But unfortunately that's the extent of my knowledge. I'm guessing one way to do that would be
to generate questions from a different document, and then provide a different document from a different source. So for example, generate the question from a document about the Milky Way, and then ask it a question about the Andromeda galaxy or some other galaxy.

Or the chocolate.

Yeah, oh yeah, that could be a really interesting thing as well, because if you ask it a question about Milky Way... there is a chocolate bar called Milky Way, right?

Yeah, that's right. There's a chocolate bar, Milky Way. That's why I brought it up.

Yeah, yeah, because for some reason I remembered there was one chocolate named after a galaxy. I didn't remember if it was Milky Way or Andromeda, but yeah, Milky Way would probably make more sense if it's a milk chocolate, right? Oh no, actually, yeah, I remember eating Milky Way. I have such a bad memory.

We're throwing the context all through the roof. Oh, I'm sorry for that one. I'm sorry for that joke.

Yeah. But yeah, if you give it a Wikipedia article about, let's say, Milky Way the galaxy, and ask it a question about Milky Way the chocolate, that would be a good training example, because you should train it that even though these look similar, the article is about a different Milky Way.

So I hope you don't mind, but you brought this up with mixed attention a little bit earlier: you brought up the Mamba architecture. I've heard about it, but I don't know enough to be able to explain it. Can you explain like I'm five: what exactly is the Mamba architecture here?

Yeah. So the Mamba architecture basically belongs to the family of machines called state space machines. Okay, so let's consider Transformers. When you're outputting the next word, you do have essentially a state, which is your KV cache, which is the way your context is stored. And the
problem with Transformers... well, the benefit and drawback of Transformers is, as we discussed, that as this context grows, your memory requirement and computation grow. In particular, the memory grows because your KV cache, which is your internal representation of your context, has a vector for each word in your sequence. So if you have a million words in your sequence, your KV cache would be proportionately long; it will be a million tokens long. So you do have essentially a state, but the state grows linearly with the input. But the state space machines like Mamba, and actually even RNNs and LSTMs, have a fixed state. They have a fixed size of memory which stores all the information about your context. The benefit is that you can have a million context length or 2 million context length and the size of that will remain the same, so your speed will remain the same no matter how long your output is or how long your input is, and the memory will also not grow. But the drawback is that, since the amount of memory is limited, it becomes really difficult for it to remember something which happened really long ago. So yeah, they are faster, but at least in our experiments, and I think in the papers that I read... I mean, there are papers which propose a new architecture and then claim, hey, we are really good at long-context abilities and all that, but in the end, when you try to verify that, you find that Transformers are better. And that's why most of these actual big models are hybrid: they have Mamba, but they also have a few layers of attention.

Oh, gotcha.

So yeah, that's essentially what these models are. They have a fixed space,
sorry, state, which is the state of the machine, which summarizes the context that it has seen so far. Okay, got it. So then, even with, you said, the larger ones, the idea is that from an architecture perspective they still have full attention. So, sort of like how we talked about, you might miss it a couple of times because, from a long context, that information isn't there, but because four or five layers later there is a full attention layer, you can always go back to that. It presumably would slow things down a little bit because you've got the full attention, but then you would actually still have a hit; you wouldn't have a miss, basically. Yeah, yeah. I mean, although pure Mamba just refers to all the layers being Mamba layers, with no attention layers, I think there are some models, I forgot the names because they're named really similarly, like Mamba, Jamba, Samba, and then Zamba. Jamba, yeah. Yeah, gotcha, Samba, right, that one. Yeah, Samba or Zamba, one of them has a hybrid architecture where you have Mamba and then full attention. Gotcha, gotcha. All right, that's really cool. And so it seems like the crux of the issue is just how much context we should be passing into an LLM. So, taking some stats from one of your other blog posts, The Great Gatsby, I think, had roughly 72,000 tokens; most of the modern models have a long enough context window to pass that in. Whether or not you should pass that in is a separate question. Yeah. But I'm just curious: do you think that this continual expansion of the context window is the way to go? Are we going to start seeing much larger context windows, or is it diminishing returns, in that it's not actually able to be as effectively leveraged and we should instead focus on ways of reducing the amount of context that's passed? Yeah, so that's a really interesting question, because that's a very active area of research, outside and within Databricks as well. So one
thing is, let's just use the use case of RAG, right, retrieval augmented generation, where you want to ask, let's say, the model a query. So what you do is you retrieve the relevant documents from your database, feed all those documents into your LLM, and then ask it the question, and based on those documents the LLM outputs some answer. Then your quality could be really improved if you have a really good retriever: if the documents which you retrieved are really relevant to the query, the output would be really good. And the other direction you could go is you could have a bad retriever, but then your LLM has a million context length, so you're just feeding maybe even your entire dataset into your context, and then your LLM just figures it out. I would much rather have the former case, where you have a really good retriever so that your LLM context window doesn't need to be that large, because in the end LLMs are expensive to run, and not just expensive but mainly slow. Also, your quality does degrade after some length. So yeah, figuring out what the trade-off is and building better retrievers, I think that would be a better direction. I mean, obviously you should continuously work on trying to improve the long-context abilities of your model, but from a practical perspective, exploring that other direction is also important: building a system around these LLMs and ensuring that the system itself is good enough, so that the LLM doesn't bear all the weight; the system also bears some of the weight. And that makes a lot of sense: it's not just the quality, it's also the cost and the latency to consider as well. Well, we just want to say thank you so much for joining us today, Shashank, and enlightening us all on mixed attention and the benefits of combining
both sliding window attention with traditional attention for kind of the optimal performance and quality trade-offs, and just the overall discussion on the direction of research in LLMs. So thank you so much for joining us on Data Brew. Thank you for inviting me, it was a pleasure.
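
The memory trade-off discussed in the episode (a Transformer KV cache that grows linearly with the input versus a fixed-size state in Mamba, RNNs, and LSTMs) can be illustrated with some rough arithmetic. The following is a minimal sketch, not a measurement of any real model: the function names and every dimension (layer count, head count, head size, state size, 2 bytes per value) are assumptions chosen only for illustration.

```python
# Illustrative sketch: all model dimensions below are assumed, not taken from any real model.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for one sequence in a Transformer."""
    # Each layer stores one key vector and one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


def fixed_state_bytes(seq_len, n_layers=32, state_values_per_layer=1_000_000, bytes_per_value=2):
    """Approximate state size for a fixed-state model (state space model, RNN, LSTM)."""
    # seq_len is accepted only to make the contrast explicit: it does not affect the result.
    return n_layers * state_values_per_layer * bytes_per_value


if __name__ == "__main__":
    for seq_len in (1_000, 100_000, 1_000_000):
        kv_gib = kv_cache_bytes(seq_len) / 2**30
        state_gib = fixed_state_bytes(seq_len) / 2**30
        print(f"{seq_len:>9} tokens  KV cache: {kv_gib:8.2f} GiB  fixed state: {state_gib:.2f} GiB")
```

Under these assumed numbers, doubling the sequence length doubles the KV cache while the fixed state stays the same size, which is the benefit described above; the drawback, that a fixed state struggles to remember things from long ago, is not something this arithmetic captures.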

Original Description

In this episode, Shashank Rajput, Research Scientist at Mosaic and Databricks, explores innovative approaches in Large Language Models (LLMs), with a focus on Retrieval Augmented Generation (RAG) and its impact on improving efficiency and reducing operational costs.

Highlights include:
- How RAG enhances LLM accuracy by incorporating relevant external documents.
- The evolution of attention mechanisms, including mixed attention strategies.
- Practical applications of Mamba architectures and their trade-offs with traditional transformers.

Connect with Shashank Rajput:
https://shashankrajput.github.io/
https://www.linkedin.com/in/shashank-rajput-51ba5372/
https://x.com/shashank_r12

This video teaches how to improve LLM accuracy and efficiency using Retrieval Augmented Generation (RAG) and mixed attention mechanisms, with a focus on practical applications and trade-offs.
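
As a rough illustration of what mixing sliding window attention with traditional (full) attention means, the sketch below builds the attention mask for each kind of layer and a layer schedule that combines them. It is a toy under stated assumptions: the helper names, the window size, and the "every fourth layer is full attention" pattern are hypothetical and are not the configuration discussed in the episode.

```python
def causal_mask(seq_len, window=None):
    """Return mask[i][j], True when token i may attend to token j.

    window=None gives full causal attention; window=w restricts each token
    to itself and the w - 1 tokens immediately before it.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def layer_windows(n_layers, window, full_every):
    """Hypothetical schedule: every `full_every`-th layer uses full attention (None)."""
    return [None if (layer + 1) % full_every == 0 else window for layer in range(n_layers)]


def attended_pairs(mask):
    """Count the (query, key) pairs a layer computes, a rough proxy for its cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    schedule = layer_windows(n_layers=8, window=2, full_every=4)
    for layer, window in enumerate(schedule):
        kind = "full" if window is None else f"sliding(window={window})"
        print(layer, kind, attended_pairs(causal_mask(6, window)))
```

The sliding window layers touch far fewer token pairs than the full attention layers, which is where the efficiency gain comes from; the occasional full attention layer is what lets information from far back in the context still be reached.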

Key Takeaways
  1. Explore RAG and its benefits for LLMs
  2. Understand mixed attention mechanisms and their evolution
  3. Understand how Mamba and other fixed-state architectures work in LLMs
  4. Evaluate trade-offs between Mamba and traditional transformers
  5. Apply RAG to improve LLM accuracy and reduce operational costs
💡 RAG can significantly enhance LLM accuracy by incorporating relevant external documents, and mixed attention mechanisms can further improve efficiency.
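
The RAG flow described in the episode (retrieve the relevant documents, feed them into the LLM's context, then ask the question) can be sketched in a few lines. This is a minimal toy, assuming a word-overlap retriever and a plain-text prompt template; real systems typically use embedding-based retrieval, and the function names, documents, and prompt wording here are all made up for illustration. No LLM is called.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, documents, top_k=1):
    """Put only the retrieved documents into the context, then ask the question."""
    context = "\n\n".join(retrieve(query, documents, top_k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "The Milky Way is the galaxy that contains the Solar System.",
        "Milky Way is also the name of a chocolate bar.",
        "Andromeda is the nearest large galaxy to the Milky Way.",
    ]
    print(build_prompt("Which galaxy contains the Solar System?", docs))
```

The point of the sketch is the design choice argued for in the episode: a better retriever means only a small, relevant slice of the data reaches the context window, so the LLM does not have to bear all the weight.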
