Generative AI for Code with StarCoder

Rob Mulla · Intermediate ·📰 AI News & Updates ·3y ago

Key Takeaways

The video demonstrates the capabilities of StarCoder, a large language model specifically designed for code generation, and compares it to other models like Copilot, highlighting its performance and features.

Full Transcript

[Music] Hello everyone out there, I hope you can hear me. This is my first time trying this new setup. I'm just trying a midday stream broadcast to see how this new setup will work, so if you can hear me, let me know in the chat. It is Tuesday, May 23rd, in the afternoon, and we're gonna take a look at some new models that have come out. If the audio is okay, let me know. I'm definitely new to trying this, so I'm not sure if it's going to sound okay. A little bit early? Yeah, it's early for me. You guys hear things? You hear the music? [Music] Someone says they hear it, sounds okay. Okay, let me show this. I'm testing out this different streaming software, so Maximus pain, thanks for saying it sounds okay. And the music is actually doing a loop, so I need to figure out how to do that. Oh, I feel so out of my element here. Oh, it's on loop. Okay, so I turned that off. All right, so today we are going to take a look at something called StarCoder. Okay, so I need to realize that this chat will keep on showing unless I do that. I also have banners here I can try. Yeah, let's do... oh look, you can check me out on Twitch, YouTube, Twitter, everything, and also join our Discord community, which is linked in my description down below, so if you aren't already part of that, please do that. I do live coding, data science, machine learning, and that sort of stuff. Okay, so people are saying the music could come down just a little bit. 'StarCoder stream 2.0.' Yes, and I kept on delaying my stream about StarCoder today because I was trying to get it running locally, and I'm going to show you exactly what the issue was. But yeah, all right, let me see if I can do this audio thing. No, this is not working as well as I hoped it would. We'll get there, we'll get there. There's a first time for everything. I don't even know where the music is. Where did it go? And now it stopped. All right, so I guess we're gonna do it without music. If I could figure this out
that would be great. But let's go ahead and load up the StarCoder announcement so we can talk about what exactly we're looking at. I don't know if any of you out there have used... 'Audio coming over is just a little hum for the music effect.' Okay, some people are gonna hum to make it sound like a music effect. Okay, I feel uncomfortable without the music, but you know, that's just me, that's just my insecurities being shown. So let's go ahead and share this screen and share this announcement. So earlier this month, in May, it was announced that they have this new large language model specifically for code, called StarCoder. So StarCoder and StarCoderBase (StarCoder is actually just StarCoderBase with some added fine-tuning for Python) were released. Now, I hadn't tried this out until someone mentioned it on my live stream on Sunday night, but I started taking a look into it and trying to install it locally, just to get a better understanding of how it worked and have it run. But we're going to just do a quick summary of what it is. It's similar to the LLaMA model, where they trained a 15 billion parameter model on one trillion tokens. They basically took all of GitHub: they downloaded Jupyter notebooks, they downloaded over 80 programming languages, they even took the Git commits and the GitHub issues, and they trained this large language model on it, and then they released it open source. So I tried to download it and run it myself. I'll show you exactly what the results of that were: not so great. It's a very large model. We tried to download it on stream the other day and thought it was about 10 gigabytes in size. It turned out that was just part one of seven, so it's about 70 gigs, a very, very large model, and that caused a bunch of issues when trying to run it locally. So how did they evaluate it? If we look at the paper, they're saying that this model is
one of the best code-based large language models out there when looking at a few different evaluation metrics. They compared it to a bunch of different existing models. A popular benchmark, they say, is HumanEval, which tests whether the model can complete functions based on their signature and docstring. So we're going to test that out today and see how well it does in practice. We're also going to see how it compares to GitHub's Copilot, which I know I personally have been using a lot, and it works really well. So we're gonna see how this does. What else are they saying? 'We found that both StarCoder and StarCoderBase outperform the largest models, including PaLM, LaMDA, and LLaMA, despite being significantly smaller. They also outperform CodeGen-16B-Mono and OpenAI's code-cushman-001.' So that's supposedly the base model that's used in Copilot. 'We also noticed that a failure case of the model was that it would produce "solution here" code, probably because that type of code is usually part of an exercise. To force the model to generate an actual solution, we added the prompt "filename solutions, here is the correct implementation of the code exercise".' I'm not quite sure what they're talking about there. 'This significantly increased the HumanEval score from 34% to over 40%.' So these are some of the existing open-source models. Has anyone out there tried any of these before? Is there anyone out there who's tried some of these open-source code-generating large language models? I'd be interested to know. They're using these metrics, and you can see code-cushman here is 33.5 on the HumanEval metric, and they're saying that they have a 40.8 score. MBPP, I'm not sure what metric that is. So they aren't in first place... well, okay, that's StarCoder versus StarCoder prompted. So they're at the top of the evaluation metrics. An interesting aspect of StarCoder
is that it's multilingual, and thus we evaluated it on MultiPL-E, which extends HumanEval to many other languages. Okay, so that's what I believe that is. 'We observed that StarCoder matches or outperforms code-cushman in many languages. On the data science benchmark called DS-1000 it clearly beats it, as well as all other open-access models.' But let's see what else the model can do. So they're showing that it has a tech assistant. I think they've trained it to be more like a chatbot, so we could see how that looks. They have also released the dataset that they used to train the model, and that dataset is called The Stack. I don't know if this is going to share the window. Let's stop sharing and present this entire window. There we go. So this is a very large dataset that they used to train the model on, and we actually looked at this on stream the other night. Nope, that's for evaluation. Let's see where their dataset is. Yeah, here's the StarCoder data. So the StarCoder data is 783 gigabytes of code, and when you think about this being just text, the fact that this text could be 783 gigabytes, that's a lot of data for this to have been trained on. Of course they're saying it's trained on over 80 programming languages, and they also included these GitHub issues. I would think these Jupyter notebooks would be kind of large, because sometimes those contain images. So we could download this dataset and sit here and wait a few months for it to finish downloading, but we're not going to do that. We're going to go back over here to some of the examples. Of course there's also this paper. If you want to just read the straight-up paper, feel free to do so. I'll put that here in the chat and you guys can take a look at that. A lot of different authors here. One of the things I also saw that was pretty interesting is that they claim to have taken several important steps towards a safe
open-access model release, including an improved PII (that's personally identifiable information) redaction pipeline and a novel tracing tool, and to make the StarCoder models publicly available under a more commercially viable version. So they're trying to do it responsibly when they create these models. Of course, if you've used any of these large language models, they probably were developed using a lot of external data, and it's hard to say without a doubt that the sources were used under whatever license they were released with. So they're trying to make sure that it's actually released under a public license. All right, so how can we test this out? We can test this out by using some of their existing demos, but I actually want to show the GitHub repo, and we can see how they say you can run this locally. If you were going to install it on your local machine and try to run this locally, what would you do? Okay, so you would need to install the requirements.txt, and I had done this, and then you use the Transformers library by Hugging Face and you import AutoModelForCausalLM and the tokenizer for this. And of course you're going to need a GPU to run this on. This is what I tried to do. Then you load the checkpoint and you give it a prompt. So they're showing here that you load the checkpoint for the tokenizer and the model, you take your prompt, and you encode it using the tokenizer. In theory you then pass that into the model, have it generate the output, and then decode that output to see what it looks like. These models take text as a string as input, and that needs to be converted into something the model can recognize and work with, so that's why it needs to be tokenized. Then the model can work with it, and the output needs to be decoded. So I tried running code-cushman a few times in the OpenAI Playground but
didn't notice a difference from standard ChatGPT. Yeah, that's the crazy thing about how well ChatGPT works when providing answers to coding questions as a chatbot. It's really strong, and it's interesting to see how well it can actually do compared to these models that are only trained on code, because maybe it has some of that extra information that could be taken, not necessarily from something like GitHub. Okay, all right, I think I can figure out how to get the audio back. [Music] Yeah, so let's go into StarCoder chat. It actually didn't work when I tried it. This link didn't work. There's a VS Code extension, and that's where we're going to test it out the most. You can also just go to this playground. They have two different ones. There's this StarCoder editor, which you can play around with, and here you could actually just write a Python function. Let's make the syntax Python. We can define a function. What's a good function example that you guys would think of? Let's define 'square the numbers', and let's do n. [Music] Pretty simple, right? The result of this should be return n ** 2. Let's see if it does that. We're hitting extend. Okay, so here's what it did: it created the correct answer for squaring the number. It also took the liberty of making a version two, a version three, and a version four. Interesting. Now, we did have the number of tokens up to 64.
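
The playground exposes a token limit and a temperature slider, and the stream keeps coming back to the temperature setting. As a minimal sketch of how temperature usually enters sampling (an assumption about typical decoders, not a confirmed description of how this playground is implemented), the model's raw scores are divided by the temperature before being turned into probabilities. The scores below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw next-token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens (not real model output).
logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)
hot = softmax_with_temperature(logits, 0.9)
print("T=0.1:", [round(p, 3) for p in cold])
print("T=0.9:", [round(p, 3) for p in hot])
```

At a low temperature almost all of the probability lands on the top-scoring token, so repeated runs tend to give the same completion. At a higher temperature the alternatives get a real chance of being picked, which is why outputs vary more.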
So that's gonna change the way this model performs. Also the temperature is going to make the results more or less verbose. So if I take the temperature all the way down to 0.1... uh, version four is broken. Yeah, it must have hit the limit of how much it could write before breaking. All right, so let's put that temperature way down, maybe keep the number of tokens at 80, and extend this. It's still going, even with the low temperature. It just wants to complete it. It wants to help us out with coding so bad. And that's kind of interesting. We also have this code completion playground, so maybe here we define a function called train_models, and we give it a bunch of models and a validation type, and then we'll put k-fold, and then we'll, yeah, [Music] 'train the model'. So let's just pretend this is a function that takes in a bunch of different models, and we want to give it a validation type. It should do k-fold cross-validation by default and train the model on this stuff. Let's see how it goes. So: 'train the model using validation type'. I'm writing this kind of as a prompt, but it's going to expect this to be more of a docstring. Let's see what the results look like. All right, so it's using MLflow, and this is kind of interesting. It's assuming, because it's called models, that there are multiple ones, that this is a list, and then it's using the length of those models and deciding how to do the training. Why is it loading JSON? Okay, so this is because it's loading MLflow. It's assuming that there's an MLflow logger going on in your training loop. So that's not too bad. 'Full-time streamer?' No, not really, just trying it during the day. Clips, welcome to the chat, how's it going? Oh, by the way, let me try some banners, all that stuff. Okay, so we tried this, but this
is running like a cloud version. Oh, we can also set the parameters here, so, advanced settings: let's turn the temperature up a bunch, and let's provide it with find_first_prime, and then we give it a bunch of numbers, and we're going to make this 'function finds the first prime number in the list of numbers' and see how this goes. Now, this is the solution with the temperature up at 0.9. So it's hot in here, it's getting hot in here with the temperature at 0.9. It enumerates over these numbers. Why is it... what is this, like, colon, 'while False in' or 'has two digits'? Oh, so it's assuming there's this other function called has_two_digits. [Music] I don't know if this would work or not. We tested this out on some LeetCode the other day and it was kind of hit or miss, but the point here is that the temperature really changes what the output is going to be. So now let me turn the temperature all the way down, let's make it colder, make it 0.05, and generate this. Ah, just about the same solution. I don't know if that actually took effect. Let's try reloading. Advanced settings, temperature 0.2, put this in here, generate the solution, and temperature really had no effect on it. We can also use StarCoderBase, the version without the extra Python fine-tuning, and see if this does better or differently. Okay, a much simpler solution, but this assumes that there is an is_prime function, which I'm not sure... 'Can you please help me? I want to learn how YOLO works, where to learn all the algorithms about it.' Check out my YouTube videos about YOLO. Hank, that's a good question. Can't help you right now because we're in the middle of something, but you can check out my YouTube videos. If you do exclamation point YouTube on Twitch, you will be linked to my YouTube videos about that. All right: 'Does it only do code, or can it also give insights on code?' It looks like they have started to make, or they have, a chat, StarCoder chat, which is
more of like a question-answering sort of bot. However, it doesn't work in this, and when I tried to run it on the Hugging Face site it said that there was not enough memory to run it. So that's the chat. They also just have this base, which is kind of like an autocomplete, basically. So now let's see what happened when I tried to do this locally. I'm gonna share my different screen. [Music] All right, so I have a pretty decent machine here. I have 64 gigs of RAM, which isn't huge, but then I have two GPUs, one that I just upgraded to a 3080. Each GPU has about 12 gigabytes of RAM. I found that when I tried to run this with just one of them, it wouldn't even load, because the model is too large. But if I actually show both of these devices, both of my GPUs, we can at least get the model to load into the GPUs' memory. [Music] And it's starting to load this checkpoint. When I first ran this, it had to download all the different checkpoints of the model's weights. Like I mentioned before, this was 70 gigabytes. [Music] I think I'm back now. Can you hear me? 'The end of the world.' Yes, my audio is breaking up. I think it might have had to do with trying to load that model, which was, you know, a lot of weights loading into my GPU on my local machine. I run into this problem a lot when I'm streaming, where I accidentally kill my stream by trying to do too much on it. Can you guys hear me now? Okay, so people are saying yes, and it is the end of the world. All right, I'm trying out this new streaming thing, and I think it's a lot of user error on my part. Now I can't hear the audio from this. [Music] Settings, audio. You guys hear the music? Okay, there we go. All right, so let's go back and present what I was
sharing before. Share screen. Just showing you guys: this is the GPU that I was running, this is my machine. When I tried to run this model locally, even given the fact that I have these decently sized GPUs and a decently sized machine, first of all it makes my stream completely stop working, and second of all it wouldn't actually run correctly. But that's fine. We can run it using some of the plugins that will log into a remote version and let us run it that way. So I'll stop sharing that and we'll go back to presenting VS Code. All right, so it was pretty easy to install. Basically I just installed this extension, StarCoder. If any of you guys have used Copilot before, I have this down here, Copilot, which we can compare it to. But StarCoder is here, and I think it's Hugging Face autocomplete. I had disabled it, but I'm going to go ahead and enable it. Then I had to put in my API token, which you can get if you have an account there, and then it'll start to look like this when you start to write code. So let's get an example up like we had before and let's define a function. Any ideas for what we could do? Oh look, you guys can see it's already autocompleting here. It's already trying its best to autocomplete a function. 'What's your local computer GPU RAM?' My local computer has a 3080 Ti and a 1080 Ti, so the 3080 is 13 gigs and the 1080 is about 12.
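
The trouble loading the model locally can be sanity-checked with a little arithmetic. The sketch below uses only figures quoted on stream (a 15 billion parameter model, two GPUs at roughly 12 GB each) and counts the weights alone, ignoring activations and framework overhead, so real requirements are higher. The `generate_locally` function is an illustration of the tokenize, generate, decode steps described from the repo README. The checkpoint name `bigcode/starcoder`, the half-precision setting, and `device_map="auto"` are assumptions on my part, not settings confirmed in the stream, and the function is deliberately not called here.

```python
def weights_size_gb(n_params, bytes_per_param):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 15e9          # "15 billion parameter model", as quoted on stream
GPU_MEMORY_GB = 12 + 12  # two cards at roughly 12 GB each, as described on stream

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    size = weights_size_gb(N_PARAMS, nbytes)
    verdict = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{label}: ~{size:.0f} GB of weights, {verdict} in {GPU_MEMORY_GB} GB")


def generate_locally(prompt, checkpoint="bigcode/starcoder", max_new_tokens=64):
    """Tokenize, generate, decode. Not called here: it downloads the full checkpoint."""
    # Imported lazily so the arithmetic above runs without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # device_map="auto" (needs the accelerate package) spreads layers over the
    # available GPUs and spills to CPU memory if they do not fit.
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint, device_map="auto", torch_dtype=torch.float16
    )
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])
```

Under these assumptions, full-precision weights come to about 60 GB, in the same ballpark as the roughly 70 GB download mentioned earlier, and even half precision is larger than the combined memory of the two cards.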
'So, API token free?' Yeah, from what I've seen, or from what I experienced, I just created an account, you had to accept a few things to make sure that you're abiding by their rules, and that's all it really took. [Music] 'Do you think it's a Linux issue? Have you tried streaming on other platforms?' Okay, so streaming, yeah, streaming is probably harder than it would be on Windows, but I'm trying to stream today using different software than I'm normally used to, and I think this is just user error. I don't think it has to do with the OS. 'What operating system do I use?' Hank, if you check out my YouTube channel, I also have a video about everything I use for my setup, but I'm using Ubuntu Linux. I love it. Even though some stuff doesn't work perfectly all the time, I think it's the best operating system that you can install out there. All right, so we can see that there's autocomplete here. So let's import pandas. I don't even know if this would work in a notebook, so we can test that. It already knows, because I imported pandas, that the function it's suggesting here is related to a data frame. So let's do parse_dates, the data frame, and see what the autocomplete comes up with. Okay: 'This is a helper function. parse_dates parses dates into the year it is in.' And if you do this, we can see that it tried to actually do another function if you wanted it to. So it resets the index, it takes what it expects to be a date column, and then, why does it drop those columns? It drops the date column and the index. I don't know why you would reset the index and then just randomly drop it, but hey, that's what they're doing. Oh, DatetimeIndex. This is not a good way to do whatever they're trying to do. What I would do if you were parsing dates... and this is the thing about having these results come from a large language model: they could be completely horrible, and you
don't want to just believe exactly what they say. So couldn't you just do year equals pd.to_datetime of the date, and then .dt.year, [Music] and then I guess you could drop the index or drop the date column like they do here and return the df? Look at this: why not do this instead of this? But it gave us a solution, it gave us something to start off on. And this is the strange thing about autocompletion for code: it throws things in here that are clearly from someone else's code, just referencing things like the path of their files. I don't know who Ted is, but apparently on his desktop there's a project with metadata in it. The code completion isn't smart. It's just thinking, you know, this is the next potential thing that the person typing the code will want to see. And whoa, it seems like this is doing a lot more of it than it would if we were running Copilot, which we'll switch to and run. 'What am I doing, Rob?' Yeah, just so everyone knows, we're looking at StarCoder, which is an open-source large language model trained on all of GitHub, so it's a code-specific one. Let's see what this question is: 'Hey Rob, how much of an emphasis is there to get a CS degree from the Ivy League in industry? I'm stuck deciding between Georgia Tech's and UPenn's program.' I think you have to consider all your options, consider how much it costs, and go from there. I think if you're dedicated and you're going to work hard and prove yourself, after a few years out of school you're hopefully gonna have established yourself in your career, and it won't matter as much where you came from. Now, coming right out of school it might help you to have a different name on your diploma, but I wouldn't worry too much about it if I were you. Hey, quasi-quaza, welcome to the chat. Look at that, I got you up on the screen. All right, so this is kind of interesting. Let's see if it works in a notebook. Well, let's
make a new StarCoder test .ipynb, make a notebook. It does work in here. So we're going to use this and import pandas as pd, [Music] and let's get some sample data. If we import Seaborn, which I don't actually think this... load a different kernel here. Let's import seaborn as sns, and then we can load just a fake dataset using Seaborn's load_dataset, and we have an iris dataset. All right, so does the autocomplete work in here? It does. All right, so let's do plot, 'create scatter plot matrix'. This is something that Seaborn already does, but we're just gonna see how the code responds when we try to autocomplete it. 'Create a matrix of scatter plots.' So, what is this? This is interesting. I'm not quite sure what it's trying to do here. It doesn't seem to really understand the context of this. Auto EDA plots: [Music] 'plot distributions of pandas data frame columns'. So it did autocomplete for that pretty well. Let's see what it comes up with for this. Oh no, there's already a comment that says 'to be added'. [Music] 'I read it as StarCraft AI.' Oh, StarCraft AI, that would have been a lot cooler, right? [Music] Quasi accidentally thought I was doing StarCraft AI on my stream. A lot of commented-out code, and it also seems like it's not autocompleting super long. So, numerical columns: it's finding the numerical columns, this is pretty good. It's doing... huh, it's concatenating a sample of 50% of the data with another sample of 50% of the data, so this would give us the same size as the input. I don't know why it would suggest this. [Music] 'for column in', [Music] come on, I'm trying to help you out, 'for column in numerical columns'. [Music] All right, so it's at least creating this plot. Let's also import... testing, and we're going to call this 'testing out StarCoder'. So this function was mostly created by StarCoder. Let's see, if we give it this data
frame, what it will do with it. Oh, this is called iris, not df. So this does not work correctly: the result of this is a data frame, not the actual column names. So if I did this, it might work, and the rest of this is junk, so we'll comment that out. [Music] Just make this df. I'm helping out StarCoder a lot. Okay, there we go. So with a little bit of help, it's plotting out the distributions of each feature in this dataset. All right, let's compare this [Music] to Copilot. Now granted, Copilot costs money, so it's not free and open source, so there's a little bit of a difference here. But they did say that when compared to Copilot, or at least the foundational base model that was used to create Copilot, it performed better on their benchmarks. So let's see how it works. All right, so I turned Copilot on. I think I need to go here to my extensions, autocomplete, turn that off. Let's actually just turn Copilot off too and see. [Music] Make this code, and yeah, so now there is no autocompletion on. There's no large language model running. And we'll go down here and turn on Copilot, enable for Python. 'Is this my first run-through?' gzt, yes, yes, it is my first run. We're doing it live. All right, so we will take... let's do this. This is all still the same. Delete this cell, go down there, and create a Copilot version. Okay, so let's do EDA. What do we give this as a prompt? 'EDA plot: plot distributions of pandas data frame columns.' Copilot, what are you gonna do? So it's actually not doing anything until I actually type this. It seems like the autocomplete functionality is a little bit different in when it kicks off, so I might need to actually type this out. Nope, I don't want that. Come on, Copilot, where are you? 'Of numerical columns in a pandas data frame.' Okay, so it did that. All right, so I think what it's doing... could it actually have just been understanding the context of this file and spitting
that back out? That's kind of interesting. So it looks like what it's done is it's taken what we already had previously in this same code and fed it back to us. Wait, Copilot is off. [Music] Oh, maybe it just wasn't... it's only enabled for Python. Yeah, Copilot does look within the file, so maybe we need to [Music] make this Copilot one its own. Maybe it'll still scan the rest of the files. Maybe the context of Copilot is going to look beyond that, and it's going to be cheating a little bit regardless. So let's select this kernel. [Music] Okay. [Music] 'Plot distribution and a correlation heat map.' Whoa, so it wants to do more. Okay, go ahead, I give you the floor, Copilot. All right, so it's doing a pair plot. It's already assuming that this is the iris dataset, so it knows that this is iris, which has a species column. A little bit of cheating there. Let's see what the results look like. All right, so these are pretty standard plots that you would see when working with the iris dataset. 'How good is Copilot now?' It's good enough for this. I've found that Copilot, or any of this autocompletion, is not as helpful for doing exploration-ish sorts of things, because it assumes so much. Like, you remember before, we were just writing some code and it gave us a suggestion for a file folder which didn't exist on our system. It was Ted's file folder. That's what Copilot and these large language models, I've found, try to do when you're exploring a dataset. Like, look at this Copilot test. Okay. It's pretty good for writing functions that you want to write a good docstring for and just get the results. 'Would you know how to build a custom LLM on a code base? I get that you store embeddings in a vector database, but since it's unsupervised, would the training data just be code split by context?' So I know there are some good tools out there for fine-tuning large language models. One of them: H2O has released something called LLM Studio, and we'll
definitely be talking about this at some point. It's sort of like a one-stop shop, let's share this, a one-stop shop for training these types of models. I want to do a stream where I can actually show this in action. It's kind of similar to what we're doing now: you can take these pre-trained models that they have and tune them on your own dataset, so there's no reason to actually do it from scratch. 'I'm trying to build a natural [Music] language inference for a SQL code base.' That sounds like a very interesting project, but I would recommend looking into something like this. It could potentially be helpful for you. [Music] All right, so going back here to where we're comparing Copilot versus StarCoder, let's go ahead and turn off Copilot again. [Music] Let's look for this extension, go back to autocomplete, enable this. Yeah, we could test it if you guys have examples that you think would be fun. 'Anyone tried to approach the ASL Kaggle challenge for long sentences of signs? I was thinking about how I would identify a word ending in a certain frame.' [Music] Yeah, I haven't started on that one. I definitely want to take a look at that competition, because almost immediately after closing the last American Sign Language competition they launched that new one. The first one was to just detect signs of single words, and in the new one they're actually spelling out phrases, addresses, those sorts of things, and you have to create a model that can detect those. 'Can we build an ML project to delete the negative comments of a given social media post of a user?' Okay, that's a good idea. You mean, try to see if this can do it? So let's call this social cleanup, and let's see if we can clean this data up. get_user_tweets: let's see if it can do this. We'll provide it an API and an author name. 'This will use the Tweepy API to find the recent tweets made by the user that's passed by the author variable and return them.'
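
The viewer's idea of deleting negative comments comes down to a score-and-filter step once the comments have been fetched. Here is a minimal sketch of that step only, using plain Python lists. The word lists and the `toy_sentiment` scorer are hypothetical stand-ins for a real sentiment model, and nothing here calls the Tweepy API or reproduces what StarCoder suggested on stream.

```python
# Hypothetical word lists, purely for illustration; a real project would use a
# trained sentiment model instead of keyword matching.
NEGATIVE_WORDS = {"hate", "awful", "terrible", "worst", "stupid"}
POSITIVE_WORDS = {"love", "great", "nice", "thanks", "helpful"}


def toy_sentiment(text):
    """Positive-word count minus negative-word count."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives


def drop_negative(comments, score=toy_sentiment, threshold=0):
    """Keep only comments whose score is at or above the threshold."""
    return [c for c in comments if score(c) >= threshold]


comments = [
    "Love this stream, thanks!",
    "This is the worst tutorial",
    "Which GPU do you use?",
]
kept = drop_negative(comments)
print(kept)
```

Passing the scorer in as an argument keeps the filtering logic separate from the model, so the keyword scorer could be swapped for a proper classifier without touching `drop_negative`.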
Hey, that's pretty good, but it didn't keep going. count equals zero, so, 'get IDs of tweets on the first page'. It's doing a try/except here, [Music] and it looks like it's extending the API because it's limited to a certain number. This is all probably coming from a specific user's code base. It's probably just taking this directly from someone who has written this. [Music] So we are having to go here and just tab through every little step of the way, and I'm not entirely sure this would actually work. So did I spell... 'author is not defined'. Not quite sure why this is complaining. Okay, so I'm gonna save this. We got multiple try/except blocks, it doesn't like that. We're gonna import time. Oh, so this is good, it knows what to import based on our code base. It knew that we had pandas and numpy before, and it added tweepy and itertools, which we won't actually use here. 'Clean bad tweets.' [Music] All right, so this is based on text characters being a certain size. 'Are you teaching basic Python?' Not right now. We're just checking out a free, open-source large language model that does autocompletion for code. 'Hey, the music is too high.' Okay, thanks for that feedback, I will bring that down. How's that? 'That could be dangerous if you're worried about licensing.' Yeah, so I know that in their paper they wrote that they have a custom pipeline to remove personally identifiable information, and also that they pulled this data taking only code with licenses that allowed for commercial use, I believe. [Music] But then in the GitHub repo there are people raising issues saying that their code is being used in it and they didn't approve it. So I don't know, it's kind of dangerous territory that these models are in. 'Remove tweets from data frame that have negative sentiment.' Let's see how it will do this. This isn't super helpful. It's really trying, though. Oh, Copilot
is on. All right, so it's assuming that it needs its analysis. Now, one thing I wasn't able to get running too is this show code attribution. So it looks like this should... and it says "no code found in the stack". Look, but it tries to find where in the data set that this model was trained on, I believe that's what it's trying to do with this show attribution, to link you to where it actually pulled the data from, which would be a really cool addition, as opposed to Copilot, which doesn't really tell you where the code came from. But if it was able to kind of show that...
"Hey Rob, what do you think about the new competition on Kaggle, ICR, Identifying Age-Related Conditions?" It looks like a really interesting one, because there are not that many competitions that have come out recently that have such a small data set. And maybe we could end on this note, so we'll be done talking about this new model and instead we'll switch over to looking very briefly at this competition, and maybe that's something that I could do a future stream about. So let's do a quick rundown of what I'm seeing here. This is the competition that you were talking about, right? It's Identifying Age-Related Conditions. Already 1,300 participants in it, and it's only been going on for about 12 days. I know that they've limited the leaderboard to one submission per day because they're worried about people overfitting. Why would they be worried about people overfitting? Because look how small this data is: 356 kilobytes, folks. Kilobytes. That's very small. So the key here is going to be not overfitting.
Which earphone am I using? Like the cheap ones from Amazon. I was answering Goku's question. So yeah, this is gonna be an interesting one to see if people can not overfit, and it's gonna be interesting to see actually how much information you can pull out from the data that's provided. So this seems to be the case a lot of the time with health-related data: you have to have people
involved in the study that sign up for it. They have to sign over all the rights to their information in order to be part of the study, and then they need to be tracked pretty closely, and usually the number of participants that they can have in a study is, by machine learning standards, fairly small. So what you end up having is data sets that don't have that many examples. So 617 unique values; it probably took a lot of work just to get that much data, but in the scheme of things it's a small data set. So doing any sort of advanced machine learning stuff on this may be overkill and may not actually work.
It reminds me of this competition that ended recently. Where is that one? This competition that asks people to try to predict the progression of Parkinson's disease. And if we look at the data, it's a little bit larger, 60 megabytes worth of data, but most of the data that they provide were blood samples taken from the patients. And unfortunately, it would have been nice if this data actually was helpful in predicting the progression of Parkinson's disease, because if that's the case, and that's sort of what the research was pointing to, that you could predict progression, then you could in theory have a better way of treating these patients. But it turned out that the leading solutions in this did not use that data at all; they more so just looked at when the patients visited the doctor.
"Any ideas what to do with anonymized features? Try some transformations?" Yeah, basically throw anything you can at the wall and see what works. A lot of times, like if you read this solution, thinking about the data in a clever way and also removing data that isn't helpful can be a bit more important than actually including more data or creating more features. So, using something like leave-one-feature-out importance to determine if the features that you do have in that data set are even helpful to start with.
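The leave-one-feature-out idea mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (not the competition data), the nearest-centroid classifier is a stand-in for whatever model you would really use, and the names `make_data`, `centroid_accuracy`, `cv_accuracy` and `lofo_importance` are made up for this example. The point is just the loop: score the model with all features, then rescore with each feature removed, and look at how much the cross-validated score drops.

```python
import random
import statistics


def make_data(n=300, seed=0):
    """Synthetic binary data: column 0 is informative, columns 1-2 are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)])
        y.append(label)
    return X, y


def centroid_accuracy(X_tr, y_tr, X_te, y_te, cols):
    """Fit a nearest-centroid classifier on `cols` and return test accuracy."""
    centroids = {}
    for label in sorted(set(y_tr)):
        rows = [x for x, t in zip(X_tr, y_tr) if t == label]
        centroids[label] = [statistics.fmean([r[c] for r in rows]) for c in cols]
    correct = 0
    for x, t in zip(X_te, y_te):
        # Squared Euclidean distance to each class centroid, restricted to `cols`.
        dist = {
            label: sum((x[c] - m) ** 2 for c, m in zip(cols, center))
            for label, center in centroids.items()
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_te)


def cv_accuracy(X, y, cols, k=5):
    """Mean k-fold cross-validated accuracy using only the columns in `cols`."""
    scores = []
    for fold in range(k):
        test = set(range(fold, len(X), k))  # every k-th row goes to this fold
        X_tr = [x for i, x in enumerate(X) if i not in test]
        y_tr = [t for i, t in enumerate(y) if i not in test]
        X_te = [x for i, x in enumerate(X) if i in test]
        y_te = [t for i, t in enumerate(y) if i in test]
        scores.append(centroid_accuracy(X_tr, y_tr, X_te, y_te, cols))
    return statistics.fmean(scores)


def lofo_importance(X, y, k=5):
    """Importance of a feature = drop in CV accuracy when that feature is left out."""
    all_cols = list(range(len(X[0])))
    baseline = cv_accuracy(X, y, all_cols, k)
    return {
        c: baseline - cv_accuracy(X, y, [f for f in all_cols if f != c], k)
        for c in all_cols
    }


if __name__ == "__main__":
    X, y = make_data()
    for col, drop in lofo_importance(X, y).items():
        print(f"feature {col}: accuracy drop when removed = {drop:+.3f}")
```

A feature whose removal barely changes (or even improves) the cross-validated score is a candidate for dropping, which matters most on small data sets like the one discussed here, where every extra noisy column makes overfitting easier.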
Two models, what do you mean by that? "I don't wanna hear it", not sure what you mean about that. Okay, so I hope you guys enjoyed this stream. It's my first time trying on this platform. We had a little bit of hiccups; I got dropped out of the stream for part of it. We tested a little bit of this new large language model to see how well it performed compared against Copilot. They do very similar things: they just try to predict what the next bit of code that you're trying to write would be. The performance felt a little bit different; maybe Copilot was a little bit more polished. And there are some promising things about this new StarCoder, like the fact that you can reference the actual code that it's pulling from when it does its autocomplete. So, pretty cool. Yeah, I hope you guys enjoyed this. And yeah, Mr Gabriel says he still prefers Copilot. I think I'm going to be sticking with that for now too, but this is open source, and the great thing about open source is anyone can contribute and try to help make it better, and yeah, definitely want to be supporting that.
All right, well, thanks everyone for hanging out today. I'm gonna end the broadcast. If you haven't already, make sure that you are subscribing on YouTube at Rob Mulla. You can join our Discord; I have a link in the description of the YouTube stream right now. And also follow on Twitch; that's the funnest way to be following. I usually stream Tuesdays, Thursdays, Sundays-ish, but I'm trying to do these streams during the day and see how they work. Usually I was doing them at night. But yeah, it was fun, everyone. Hope you have a great rest of your week, and I'll see you next time. Bye-bye.

Original Description

Live stream taking a look at the newly released open sourced StarCoder! More about StarCoder here: https://huggingface.co/blog/starcoder
Links to my stuff:
* Youtube: https://youtube.com/@robmulla?sub_confirmation=1
* Discord: https://discord.gg/HZszek7DQc
* Twitch: https://www.twitch.tv/medallionstallion_
* Twitter: https://twitter.com/Rob_Mulla
* Kaggle: https://www.kaggle.com/robikscube


This video showcases StarCoder, a powerful generative AI model for code generation, and demonstrates its capabilities in code completion, fine-tuning, and machine learning. Viewers can learn how to use StarCoder for coding tasks and improve their coding workflows with AI assistance.

Key Takeaways
  1. Install StarCoder extension in VS Code
  2. Enable Hugging Face Auto Complete
  3. Use StarCoder for code completion in Jupyter notebooks
  4. Compare StarCoder to Copilot
  5. Fine-tune StarCoder for specific coding tasks
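💡 StarCoder and Copilot do very similar things: both try to predict the next bit of code you are about to write. In this stream Copilot felt a little more polished, but StarCoder is open source, so anyone can contribute to it, and its code attribution feature is meant to link a completion back to the training data it came from.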
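For readers who want to try completion outside the editor extension, here is a minimal sketch. It was not shown in the stream and rests on several assumptions: that the Hugging Face `transformers` library and PyTorch are installed, that you have accepted the license for the `bigcode/starcoder` checkpoint and are logged in to the Hub, and that you have enough memory for a model of this size. The helper names `build_fim_prompt` and `complete` are made up for this example. The fill-in-the-middle tokens are the ones documented on the StarCoder model card.

```python
# Special tokens StarCoder uses for fill-in-the-middle prompts (per its model card).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str = "") -> str:
    """Wrap the code before and after the cursor so the model fills in the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def complete(prompt: str, checkpoint: str = "bigcode/starcoder",
             max_new_tokens: int = 32) -> str:
    """Generate a completion for `prompt` (hypothetical helper, not from the stream).

    transformers is imported lazily so that building prompts works without it.
    Loading the checkpoint downloads many gigabytes of weights.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])


if __name__ == "__main__":
    # Only builds and prints the prompt; call complete(prompt) to run the model.
    prompt = build_fim_prompt("def remove_negative_tweets(df):\n    ", "\n    return df")
    print(prompt)
```

Plain left-to-right completion works too: pass the code written so far straight to `complete` without the fill-in-the-middle wrapper.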
