CodeBERT
Key Takeaways
The video explains CodeBERT, a model that bridges the gap between natural language and programming languages by leveraging documentation and code pairs, and demonstrates its applications in code search, documentation generation, and zero-shot learning. CodeBERT uses the BERT transformer architecture and is pre-trained with masked language modeling and replaced token detection.
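As a rough illustration of the masked language modeling setup summarized above, here is a minimal sketch in plain Python. It assumes whitespace tokenization, a 15% default mask rate borrowed from the original BERT convention, and hypothetical helper names (`pack_pair`, `mask_tokens`); it is not the authors' implementation, and it skips BERT's rule of sometimes keeping or randomizing a selected token.

```python
import random

# Special tokens in the BERT-style input format.
CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"


def pack_pair(nl_tokens, code_tokens):
    """Join a natural language / code pair into one sequence.

    The natural language plays the role of "sentence A" and the code
    plays the role of "sentence B".
    """
    return [CLS] + nl_tokens + [SEP] + code_tokens + [SEP]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace non-special tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    with the targets a model would be trained to predict.
    mask_prob=0.15 is an assumption taken from the BERT convention.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in (CLS, SEP) and rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


# Toy example pair (made up for illustration).
nl = "return the maximum value".split()
code = "def f ( xs ) : return max ( xs )".split()
sequence = pack_pair(nl, code)
masked, targets = mask_tokens(sequence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Because the natural language and the code sit in one sequence, a model attending over `masked` can use code tokens to recover a masked word in the description, and the other way around.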
Full Transcript
This video explores CodeBERT, a new model for combining neural representations of natural language with programming languages by leveraging documentation of functions in languages like Python, JavaScript, and PHP. CodeBERT uses masked language modeling on the natural language programming language pairs, as well as a novel replaced token detection task, to improve pre-training with a framework similar to generative adversarial networks. CodeBERT is then fine-tuned for applications such as natural language code search and generating code documentation. I'm really excited about the potential of this kind of technology, especially to help beginners learning how to code with their natural language questions about their code. This video will explain the details behind the CodeBERT model.

This video explains the CodeBERT model, which uses the BERT transformer architecture to bridge information between natural language and programming language pairs, such as the natural language documentation of this Python function. This is useful for downstream applications like natural language code search and automatically generating code documentation.

Semantic code search has become more popular lately with the release of the CodeSearchNet challenge. This takes examples of having a natural language query, like "how do I center a div with flexbox" or maybe something like "how do I parse a regular expression with JavaScript", and then trying to find the programming code that corresponds with this natural language query. This is a really interesting application for helping programmers with these frustrating problems that you just can't seem to find an answer to on the Internet. I also think this is a really interesting kind of idea in the zero-shot setting, where you have these kind of experimental libraries, you run into problems with them, and there's really nothing to find on the Internet to help you solve your problems. I think it's really interesting to
see these tools for bridging the natural language programming language questions that people have, especially when you're first learning how to program, to facilitate this kind of learning with these natural language queries.

Before we get into the specific details of CodeBERT, we'll do a quick background on the original BERT model. So BERT is kind of like an encoder-only transformer, similar to how GPT-2 is a decoder-only transformer and the original "Attention Is All You Need" transformer, developed for neural machine translation, is the encoder-decoder architecture. The idea is that you have masked sentences: you're masking out intermediate tokens and you're predicting the masked intermediate tokens, and in this way it attends over the entire input sequence to predict the masked tokens. So say token eight will be replaced with this mask token, and then the output is predicting that masked token.

Another detail about this is the input format. You have the CLS token denoting the start of the sequence, which is also indexed at the output end of the BERT model for the respective tasks. In this case they pre-trained it with next sentence prediction, where the CLS token is predicting whether sentence B comes after sentence A. In other cases, maybe not question answering, but say natural language inference, the CLS token would be like entailment, contradiction, neutral; something like that would come out of this spot of the output. You also have the special separator token denoting the boundary between sentence A and sentence B, and in the case of CodeBERT it's going to separate the natural language sequence from the programming language sequence.

BERT is really easily extendable into multiple modalities. In deep learning, this term "bimodal" generally describes having two different types of data, so you have language and vision, or audio and vision; something like that is generally described as a bimodal data set in the deep learning literature. So some
examples of BERT combining modalities are particularly with image and language, things like ImageBERT or ViLBERT, or combining video and language, like VideoBERT. So in this case CodeBERT is going to be combining the two modalities of natural language and programming language pairs. You see in the case of ImageBERT how they take these vision tokens from the object detection network, use them as the visual sequence, and combine it with the natural language sequence, and the attention will look at this whole sequence. So this idea of self-attention is looking at the entire sequence to integrate all the information from the sequence in order to perform the respective task at the output layer.

This video shows an example of one of these natural language programming language pairs. You see how we have this function, parse memory, and in the middle of it we have this natural language description: "Parse the memory string in the format supported by Java", blah blah blah. The idea here is that we want to leverage these natural language descriptions of the code in order to facilitate downstream applications like natural language code search. So later on, if we put in a query like "how do you parse a memory string supported by Java and Python", we'd want this kind of function to be returned. So this is the idea of a natural language programming language pair: we're going to format the natural language as the sentence A that goes into the BERT model, and then we're going to format the programming language as the sentence B that goes into the BERT model.

CodeBERT is trained with two pre-training objectives. These pre-training objectives describe the self-supervised learning tasks that let these models take advantage of these kinds of massive unlabeled data sets, and help it do this pre-training of representations that are useful for transfer to downstream tasks. So first we're going to start from this raw data set of natural language programming language pairs, and then we're going to fine-tune the BERT representations on
the natural language code search problem.

The idea behind masked language modeling is you have this natural language input and the programming language input, and you're going to randomly mask out tokens and then predict the tokens that were masked out. So you're doing the attention over the natural language and the programming language in order to inform decisions at the output layer. So say you use something like, I don't know, the units in this programming language in order to inform a masked token in the natural language part of the sequence, and vice versa.

In this paper, CodeBERT, the authors also introduce a novel pre-training objective, replaced token detection, for pre-training their CodeBERT model on this massive, almost unlabeled-style text data set containing these natural language programming language pairs. The idea behind replaced token detection is you have the original natural language sequence and the original programming language sequence, and you're going to randomly mask out the tokens, but then you're going to train a generator model. It's not quite a transformer generator model like GPT-2; rather, it's just like an n-gram kind of probabilistic model: I see this context window, so I assign a high probability to this token. So it's a simpler generator model than these bigger transformers like GPT-2 or what have you. You take the generated token that comes out of the generator, and then you have this new sequence where the masked tokens have been replaced by the predictions from the generator network. So now the original BERT model is getting this sequence of tokens, and in the output layer it's predicting which ones were original and which ones were replaced. But differently from, say, generative adversarial network training, if the generator actually does predict the right token, it's going to be labeled as original rather than replaced.

Putting this together, the CodeBERT model has 125 million parameters and is a 12-layer transformer encoder similar to the
BERT model. An interesting detail about this is that it takes some 250 hours of training on the NVIDIA DGX-2 workstation using 16-bit floating point and mixed-precision kind of training. So they have these 16 interconnected NVIDIA Tesla V100 GPUs, each with 32 gigabytes of memory, and it still takes 250 hours to pre-train this model on the masked language modeling task and the replaced token detection task.

It's also interesting to see the statistics of the data set used for these two pre-training objectives, to get the initial kind of context predictions for the CodeBERT model. You see these distributions between the bimodal data, which is cases where you have the code with the natural language documentation kind of in the code, and the unimodal code, unimodal meaning just programming language, so there's no natural language corresponding to the programming language in the data set. You see these differences particularly in the JavaScript training data set, where there's a big difference between unimodal code and bimodal data, although in most of the cases it's pretty balanced, about like two times or three times as much unimodal as bimodal data. That's interesting because the unimodal data is useful for pre-training those generators and the replaced token detection task, but overall it's not that useful in the masked language modeling task, because you want to have the programming language and the natural language together so that you can attend over the entire sequence and learn this kind of joint representation.

The first application they test CodeBERT on is natural language code retrieval. The idea here is to have a high similarity between the query and the code that matches it. So you take the question or the query, such as "how do I center a div with flexbox" or "how do I open a file with JavaScript", and then you have the corresponding code pairs that you're
looking to match by similarity. The way that this works is you have the question and then you have the given code sequence, and you're going to index the CLS token in the output to score the similarity between the query and the code, and then you're going to rank the most similar codes to the query as the output to this problem. It's also interesting to look at the data set: you see you have a lot of training examples for PHP and Python, and then not as many for, say, Ruby. It's interesting to see this kind of data set from the CodeSearchNet challenge.

This table shows the results of natural language code retrieval with the CodeBERT model and other baseline models like the bidirectional RNN or the RoBERTa model only. One interesting detail of this experiment is what they find when they initialize the CodeBERT model with the RoBERTa parameters. RoBERTa is like a slight modification to the BERT model; basically, in the RoBERTa paper they look at a lot of little details in the BERT training that actually add up and provide a pretty significant boost compared to the original BERT model. When they initialize CodeBERT with RoBERTa, it has these initial parameters of the RoBERTa model, which is trained on probably something like WebText or one of these kinds of data sets, I'm not exactly sure, but they use the pre-trained representations from the RoBERTa model in CodeBERT. So it's definitely interesting to think of how you transfer from unimodal language models, like the data set that RoBERTa is trained on, into something like natural language programming language pairs for the CodeBERT model. It's also analogous to how you might do a multilingual BERT, where sentence A is English and then sentence B is French, if you want to do that kind of thing with the BERT model. So you see, in the case of using masked language modeling and replaced token detection to further pre-train the model after being initialized with the RoBERTa parameters, you get a much
higher performance than when it's initialized from scratch and you're training with the code only, or when you're doing the original case where you only do masked language modeling and you have a random initialization.

In order to get further insight into how the CodeBERT model makes predictions and how it differs from the original RoBERTa model trained on natural language only, they do this kind of zero-shot natural language programming language probing. Zero-shot describes how they're doing this on the validation and test sets, meaning that the CodeBERT model hasn't already seen this data, and that's how they're describing zero-shot transfer into this domain. The idea behind their natural language programming language probing task is that they're going to deliberately mask out tokens like min, max, mean, these kinds of tokens that from the natural language perspective are pretty similar to each other, but in the coding language setting particularly, they're very different kinds of ideas. So you see the difference between the probability distributions on the first token between the RoBERTa model and the CodeBERT model, and then you see an even more dramatic difference in the programming language part of the sequence, which shows just how fine-tuning the model on the programming language pairs gives it more insight into programming languages compared to a model like RoBERTa, trained only on natural language.

The authors then test CodeBERT on code documentation generation, attaching a new decoder part to the encoder-only model in order to produce these generations. So you see the performance achieved by doing this with CodeBERT compared to just having a Transformer or RoBERTa or a sequence-to-sequence model that encodes the programming language as the input to this decoder, which is now randomly initialized. So all these models are encoders, and then you're attaching each one to a new decoder in order to test this documentation generation task, and it's evaluated with
this BLEU score against some kind of baseline of maybe like a set of human-written documentation for the code.

Another interesting experiment they perform in this study, although it doesn't quite outperform this code2seq model using an abstract syntax tree, is zero-shot documentation generation for C#. What's interesting is that the CodeBERT model has never seen C# code before, but now it's in the setting where it's given some C# code and it's going to try to describe it in natural language, with this appended decoder part attached to the original CodeBERT encoder model of the C# code. So I think it's interesting to think about this zero-shot documentation generation and look at the results achieved by the CodeBERT model.

Thanks for watching this explanation of CodeBERT, a really interesting adaptation of the BERT model to bridge information between natural language and programming language pairs. I think that applications of this kind of natural language code search will be really useful for people who are just learning how to program, as well as people who are dealing with this almost zero-shot setting where people are constantly releasing these new libraries and frameworks, like a new release of PyTorch, a new environment like Torchmeta, or some new benchmark like the fire environment or something like that, and you're trying to figure out how to debug these problems, but it can be difficult because there aren't that many people on the Internet, say on Stack Overflow or in these GitHub issue logs, writing about these problems. So that's how these kinds of models that can bridge the natural language programming language gap automatically could be so useful for these kinds of head-scratching programming problems that you come across. Thanks for watching, and please subscribe to Henry AI Labs for more deep learning and AI videos.
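The labeling rule for replaced token detection described in the transcript can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `rtd_labels`, the whitespace tokenization, and the hard-coded generator predictions are all hypothetical stand-ins for a trained generator.

```python
def rtd_labels(original, masked_positions, generator_predictions):
    """Build the corrupted sequence and per-token labels.

    original: list of tokens
    masked_positions: positions that were masked out
    generator_predictions: dict {position: token proposed by the generator}

    A token is labeled "replaced" only if it differs from the original,
    so a generator that guesses the right token yields "original".
    """
    corrupted = list(original)
    for pos in masked_positions:
        corrupted[pos] = generator_predictions[pos]
    labels = [
        "original" if new == old else "replaced"
        for new, old in zip(corrupted, original)
    ]
    return corrupted, labels


# Toy example: positions 1 ("f") and 7 ("max") are masked.
original = "def f ( xs ) : return max ( xs )".split()
masked_positions = [1, 7]
# Hypothetical generator output: correct at position 1, wrong at position 7.
generator_predictions = {1: "f", 7: "min"}

corrupted, labels = rtd_labels(original, masked_positions, generator_predictions)
for token, label in zip(corrupted, labels):
    print(f"{token:>8}  {label}")
```

Here position 1 stays labeled as original because the generator happened to predict the true token, which is the difference from GAN-style training that the transcript points out.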
Original Description
This video explains how CodeBERT bridges information between natural language documentation and corresponding code pairs. CodeBERT is pre-trained with Masked Language Modeling and Replaced Token Detection and fine-tuned on tasks like Code Search from Natural Language and Generating Documentation. I am excited about the future of these kinds of tools, although I wish they were around when I started coding!
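To make the code search step concrete, here is a small ranking sketch. In the video, the similarity score comes from the CLS output of the fine-tuned model; the `toy_score` function below is only a word-overlap stand-in so the example runs without a trained model, and all names and candidate snippets are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word pieces."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_score(query, code):
    """Jaccard overlap between query words and code identifiers.

    This is a placeholder for a learned scorer, not CodeBERT itself.
    """
    q, c = tokenize(query), tokenize(code)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def rank(query, candidates, score=toy_score):
    """Return the candidates sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda code: score(query, code), reverse=True)


candidates = [
    "def open_file(path): return open(path).read()",
    "def parse_memory(s): return int(s[:-1])",
    "def center(div): div.style = 'flex'",
]
ranked = rank("parse a memory string", candidates)
print(ranked[0])
```

Swapping `toy_score` for a function that runs a fine-tuned model on each query and code pair would keep the ranking logic unchanged.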
Paper Link:
CodeBERT: https://arxiv.org/abs/2002.08155
Thanks for watching! Please Subscribe!