WizardLM: Evolving Instruction Datasets to Create a Better Model
Key Takeaways
The WizardLM project introduces a new way to distill a dataset for fine-tuning LLMs, going beyond just a new model. The paper and accompanying resources, including a Colab notebook and GitHub repository, provide a comprehensive overview of the project.
Full Transcript
All right, in this video I'm going to be looking at WizardLM. This is a new academic paper that also includes a model, which is what most people are talking about, and a dataset as well. For me, the model itself is probably not the most interesting thing about this project. It comes out of Microsoft; I think it's from Microsoft Research Asia, but I'm not 100% sure about that.
What they've done here is a really interesting idea, building on where Alpaca was going. Alpaca started out with 175 human-written examples and then distilled that up to 52,000 using GPT-3. It turned out that the dataset had lots of errors and quite a number of bad responses, which is why people went through it manually and created the cleaned Alpaca dataset. This also led to people training more of these models on distilled datasets.
A distilled dataset, if you're not familiar with the term, is one that has been generated by another large language model rather than created by humans. Usually people are using OpenAI for this, either the GPT-3 model or the ChatGPT model, and I think we're starting to see some people use GPT-4 now. There's a whole controversy around whether, or when, that is legal to do. There are also moral issues, where some people say that OpenAI scraped the internet, so whatever they've got is hard to defend legally. I don't want to get involved in that.
What I want to look at here is how we distill a better dataset than what came from Alpaca, and then how that relates to training a model that's going to be better as well. So if you're just here for the model, look at
the bottom: you can skip along to the Colab, and I'll walk you through the model in a bit.
The idea in the paper is really interesting. They've released a dataset and a model, and the current model is a 7-billion-parameter model trained on 70,000 examples of what they're calling evolved instructions. The cool thing is that they're working on another version of this with 300,000 instructions, so my guess is that pretty soon we'll see either a better WizardLM 2 at 7 billion or perhaps even a 13-billion or bigger model.
So let's look at the paper and what it actually does. With Alpaca, they took a human-written instruction and then spun variations out of it. Here they take a core dataset, a similar kind of idea, but then they evolve the instructions. They start out with a very simple instruction. One of the key things they point out about ShareGPT and Alpaca is that the instructions are, on the whole, very simple. Because you're only fine-tuning on simple instructions, you never give the model enough hard instructions, so it won't be able to handle those when it comes to inference time.
Their approach is to start with a simple instruction and then use a random process through the prompt (we'll look at the prompt in a minute) to develop that initial instruction into something much more complicated, which could go in a variety of different directions, as they show in this diagram. When they get something more complicated, they store it and keep it, but then they'll often also make a more complicated version of the complicated version, and so they raise these up
through degrees of difficulty as they go along. The idea is that they get everything from simple instructions right through to very complicated ones, and when they do the actual fine-tuning they want a nice mix of these, so that you're not just training on 80% really simple instructions and 20% really hard ones; you've got simple, medium, and so on, right up through the levels. I think they go through 10 levels of difficulty here.
The way they talk about this is as evolving, or mutation. They have checks in there so that if an instruction evolves in a way that doesn't make sense, they don't accept it. They're just going for the idea of it evolving and getting to something more complicated.
If we look at the diagram, they come up with an initial instruction and then generate some different instructions from it. I guess this is what the wizard is supposed to be doing, and we'll see in a sec that the wizard is just a prompt. They also have an instruction eliminator: if they find something wrong with an instruction, they take it out; otherwise they pass it through and store it in the instruction pool.
So let's look at how they actually do this. They say the primary purpose of in-depth evolving is to make the given instructions more complex and increase their difficulty level. They're very careful not to make an instruction too complicated too quickly: they limit each evolving step to be a bit harder and restrict it to adding a maximum of 10 to 20 words. So each instruction should be slightly harder, but not too hard, which gives them these degrees of difficulty as they go through. The prompt template is as
follows. They have this prompt template, and then they just randomly change out part of the prompt. You can see the idea: "I want you to act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., ChatGPT and GPT-4) a bit harder to handle. But the rewritten prompt must be reasonable and must be understood and responded by humans." So they've got a whole set of prompts going on here, and you can see the parts where, for certain things, they add a constraint or a requirement, and for others they have different ways to mutate and change the instruction. This idea of evolving and mutating things and then picking the best ones is interesting. I haven't seen it applied to prompts directly before, but it's certainly something that's been used in machine learning in the past.
From this they then go on to train the model. They've got all their prompts in the paper; it's a really nice paper for seeing and understanding what they've done. Like I said, they've released a dataset, so you could go and train your own model on it already. They're also working on the full set of evolved instructions, approximately 300K, and they're planning to train another model based on that, so that's something interesting to look out for.
They've also released a demo, so if you want to try this out without having to run it yourself, you can come in here and see how it goes for you. I'll just give you the TL;DR on the model: it's definitely very good. It's perhaps not as good as the StableVicuna model that just came out, but that model is almost twice as big as this one, so I wouldn't expect that this would necessarily contend with that, but
you could imagine that a version of WizardLM, a 13-billion-parameter LLaMA model trained on this 300K dataset, is going to be pretty good, and perhaps even better than the StableVicuna model with its RLHF. So this is definitely a nice, interesting alternative to using RLHF.
I've got a Colab here, set up the same as my other ones. I've put in some of the filtering code that I was using for StableVicuna, and we can come through and look at the standard outputs. On the whole, I think the standard outputs from this model are very good. When we ask it the questions about llamas and vicunas, it's on top of that. It does make mistakes, and I think a lot of them can be attributed to the model size; a bigger model would perhaps do better. Here, for example, we've asked it about GPT-4 and it replies about GPT-3, which is not a good sign. Simple questions like "What is the capital of England?" it certainly knows; there are no problems with facts like that.
Writing stories, I feel it's perhaps not as good as some of the other models that have come along, like Koala, but again, that was trained with extra datasets relating to stories and poems, so I think it benefited from that.
Unfortunately, I know a lot of people are going to find that it doesn't do a good job if you don't like the "as an AI language model" phrasing, because it tends to say that quite a bit. In this case I found that the StableVicuna model did much better in its answers; we didn't get that "as an AI language model" there. This is probably because the instructions have been distilled from ChatGPT, which says this a lot. So you could imagine that once the 300K version is out, you could filter that dataset to remove all the "as an AI language model" responses and be able to get a
better, more open dataset for fine-tuning something like this.
Its logic and reasoning were not very good here. This is the same question we asked before: you've got 23 apples, you use 20, and you buy six more, so the answer should be nine (23 - 20 + 6). Unfortunately, it says three. Again, I think this is partly due to the size of the model, although it could be that they've just not fine-tuned on this. My guess is that in the evolved instructions you're going to find quite a number of errors, just from the way they've been generated. I don't know that; I'm just guessing. But when you spin up and distill large datasets like this, you tend to find later on that there were mistakes in them, the Alpaca dataset being the classic example.
That said, there was a question that StableVicuna got wrong, which was "Can I write a haiku in a single tweet?" This model says yes, you can, and actually goes ahead and writes a haiku. I think that's a haiku; I'm not sure about the number of syllables.
When we ask it the hypothetical questions, it doesn't do a great job. With "Can Geoffrey Hinton have a conversation with George Washington?" it really doesn't get the concept of the question. It answers that it can provide information, but that it's not possible to physically bring together two individuals who are not alive. It would be interesting to see whether any of these sorts of questions have been mixed in, and what's in the dataset that's like that. This kind of dataset is a very academic dataset, so it would be interesting to see something like a WizardLM or a StableVicuna trained with some of the academic data as well as the more distilled datasets or RLHF datasets; I think those are going to do very well.
If we ask it "Can Geoffrey Hinton have dinner with Harry Potter?" it doesn't really get this at all. It says that it appears to be a hypothetical or fan-fiction type of
question that is not based in reality. That's true, I guess; in many ways its answer is accurate, but it certainly doesn't propose a nice, simple answer.
When I ask it some facts about Marcus Aurelius, which were things I tried with Vicuna, it does quite a good job. On the first one, asking for three facts, it actually gets his son right, where StableVicuna got this one wrong, and it also gets the other questions about him right. So it does a good job on both of those.
Final question: since it was WizardLM, I asked it to tell me about Harry Potter and studying at Hogwarts. We got the "As an AI assistant, I can provide you with information on the fictional world of Harry Potter" opening, and then it gives us some details, which are actually not bad.
Anyway, have a play with the model. The dataset, I think, is something very interesting. My guess is that in the not-too-distant future we're going to see a WizardLM 2, which will probably be better than this, and we may even see a 13-billion version, which is going to be a lot better. So stay tuned for those. As always, if you've got questions, please put them in the comments below. If you found the video useful, please click like and subscribe. I will talk to you in the next video. Bye for now.
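Editor's sketch of the evolving loop described in the transcript: start from simple seed instructions, rewrite each one with a randomly chosen mutation prompt, drop rewrites that fail a check, and keep every accepted level in the pool. This is only an illustration under stated assumptions, not the paper's implementation. The names `MUTATIONS`, `TEMPLATE`, `keep`, `evolve`, and `fake_llm` are hypothetical, the mutation list is a loose paraphrase, the check is a toy stand-in for the instruction eliminator, and the model call is a placeholder that would be swapped for a real LLM API.

```python
import random
from typing import Callable, Dict, List

# Hypothetical mutation directives, loosely paraphrasing the kinds of changes
# mentioned in the video (add a constraint, add a requirement, and so on).
MUTATIONS = [
    "Add one more constraint or requirement to the prompt.",
    "Replace a general concept with a more specific one.",
    "Ask for more explicit reasoning steps.",
]

# Assumed template; the paper's full prompt text is not reproduced here.
TEMPLATE = (
    "I want you to act as a Prompt Rewriter. Rewrite the given prompt into "
    "a slightly more complex version. {mutation} Add at most 10 to 20 words."
    "\n\nGiven prompt: {instruction}\nRewritten prompt:"
)


def keep(original: str, evolved: str) -> bool:
    """Toy stand-in for the instruction eliminator.

    Rejects empty, unchanged, or overly inflated rewrites.
    """
    evolved = evolved.strip()
    if not evolved or evolved == original.strip():
        return False
    added = len(evolved.split()) - len(original.split())
    return 0 < added <= 20


def evolve(seeds: List[str], llm: Callable[[str], str], rounds: int,
           rng: random.Random) -> List[Dict]:
    """Evolve each seed for `rounds` generations, keeping every accepted level."""
    pool = [{"instruction": s, "level": 0} for s in seeds]
    frontier = list(pool)
    for level in range(1, rounds + 1):
        next_frontier = []
        for item in frontier:
            prompt = TEMPLATE.format(mutation=rng.choice(MUTATIONS),
                                     instruction=item["instruction"])
            candidate = llm(prompt)
            if keep(item["instruction"], candidate):
                evolved = {"instruction": candidate.strip(), "level": level}
                pool.append(evolved)
                next_frontier.append(evolved)
            else:
                # Design choice of this sketch: a failed rewrite leaves the
                # parent in the frontier so it can be retried next round.
                next_frontier.append(item)
        frontier = next_frontier
    return pool


def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call: appends a fixed clause."""
    body = prompt.split("Given prompt: ")[1]
    instruction = body.split("\nRewritten prompt:")[0]
    return instruction + " and explain why"


if __name__ == "__main__":
    rows = evolve(["Name a fruit"], fake_llm, rounds=3, rng=random.Random(0))
    for row in rows:
        print(row["level"], row["instruction"])
```

Keeping every accepted level in the pool, rather than only the final rewrite, is what gives the mix of simple through hard instructions that the video describes. Generating a response for each instruction would be a separate step and is left out of this sketch.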
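Editor's sketch of the filtering idea mentioned in the transcript: removing the "as an AI language model" style responses from a distilled dataset before fine-tuning. The record shape (`instruction` and `output` keys) and the names `BOILERPLATE` and `drop_boilerplate` are assumptions for illustration; the released dataset's actual field names may differ, and the pattern would likely need extending for real data.

```python
import re
from typing import Dict, List

# Assumed phrasing to filter; matches case-insensitively.
BOILERPLATE = re.compile(
    r"\bas an ai (?:language model|assistant)\b", re.IGNORECASE
)


def drop_boilerplate(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the records whose output lacks the boilerplate phrasing."""
    return [r for r in records if not BOILERPLATE.search(r.get("output", ""))]


if __name__ == "__main__":
    sample = [
        {"instruction": "Who are you?",
         "output": "As an AI language model, I cannot have opinions."},
        {"instruction": "What is the capital of England?",
         "output": "London."},
    ]
    for row in drop_boilerplate(sample):
        print(row["instruction"])
```

Dropping whole records is the simplest option; an alternative would be to strip only the offending sentence, at the cost of sometimes leaving a truncated answer.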
Original Description
Colab: https://colab.research.google.com/drive/1H308Mj11PTMCUm_TxTj8DF189ujDG_1w?usp=sharing
Demo: https://261f01fdd31bfe1ca0.gradio.live/
Github: https://github.com/nlpxucan/WizardLM/tree/main
Paper: https://arxiv.org/abs/2304.12244
In this paper I look at the WizardLM project which goes beyond just a new model and introduces a new way to distill a dataset for fine tuning.
For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://twitter.com/Sam_Witteveen
My Links:
Linkedin: https://www.linkedin.com/in/samwitteveen/
Github:
https://github.com/samwit/langchain-tutorials
https://github.com/samwit/llm-tutorials
00:00 Intro
02:57 Paper
08:32 Colab Walkthrough