Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Aleksa Gordić - The AI Epiphany · Beginner ·📄 Research Papers Explained ·5y ago

Skills: Reading ML Papers90%Research Methods80%RAG Basics70%Vector Stores60%RAG Evaluation50%

Key Takeaways

The video discusses the Graphormer model, a new transformer model that achieved state-of-the-art results on the OGB large-scale challenge benchmark, and explores its architecture and performance in graph representation tasks, including its ability to outperform classical graph neural networks by adding structural inductive biases.

Full Transcript

what's up in this video i'm covering this new paper called do transformers really perform bad for graph representation introducing the former model by chiang chong ging chiang laikai zheng jie lua and others from uh they did this as an internship at microsoft research asia basically uh so what is cool about this paper is that this is the first time that uh transformer uh has outperformed classical graph neural networks on this important benchmark called ogb large scale challenge and the whole point of the paper the whole the whole idea is to encode structural inductive biases into the transformer model and they showed that by adding three of those encodings they can actually outperform obviously those classical gnns which is super exciting um having said that so basically we up until now we we knew that transformers are really good for sequence modeling so they they dominated the nlp world then recently we had the vision transformer entering the computer vision world uh but and outperforming like a bunch of different computer vision models but like the only uh downside there was that it required a huge amount of pre-training so like it used 300 m a million um data points to to pre-train it and a couple of days ago i have also covered this paper where the vit the vision transformer use this sharpness aware minimization objective to actually uh not need any pre-training anymore and it still outperforms a resonant bass lines which is really cool but up until now up until graph former we didn't have the same thing happening in the graph ml field where gns were dominating pretty much and so let me just briefly show you how it looks pictorially so they had three encodings they have this thing called centrality encoding they used something called degree centrality and they also add these edge encoding and spatial encoding information directly as a bias before the softmax in the self-attention layer and those just can encode the edge information as well as the structural information uh into into the reformer model and we'll see soon see the details but yeah that was that was a high level picture so now let's start digging into the paper um so the transformer architecture has become a dominant choice in many domains such as natural language processing and computer vision yet it has not achieved competitive performance on popular leaderboards of graph level prediction compared to mainstream gnn variants so just as a small digression here what graph level means is the following so if you have a graph a simple graph here you can do multiple types of tasks on on this graph you can do node wise so node level uh predictions you can do edge level predictions and or you can do graph level predictions so a graph level prediction would be if you have a molecule here if this graph represents a certain molecule so for the graph level you take into account all of the node and edge features and you try to regress maybe the solubility of this molecule on the other hand if you had edge prediction for example you could have multiple edges here and these nodes can be maybe drugs and you're trying to predict the side effect between you by using the two drugs together and for example you may regress that this edge has a probability of 0.9 that means that this side effect has a huge probability of occurring occurring on the other hand this edge here may be unplausible because the probability is only 0.001 for example and there will be a edge level prediction task but here we're focusing on graph level prediction tasks and most of the benchmarks actually contain molecules so that makes sense next up our key insight to utilizing transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model so what that means is the following so if you know how transformers work and hopefully do and i'll link the paper somewhere here so you can check it out and understand transformers uh hopefully you do but like this is this is a like a 101 basically if we have some sequence sequence of tokens like this um and what transformers do let me draw another one is every single token will be attending to each other tokens including itself it will be attending to this one it will be attending to this one it will be attending to this one and the same goes for all of the other tokens and as you can see what it implicitly assumes is that the underlying data is a fully connected forms a fully connected graph so if i i can just redraw this in a bit different fashion like this and what happens is that transformer assumes that all of these tokens nodes whatever they may be they can be image patches in the case of a vision transformer they can be like like subword tokens in the case of nlp um and so it assumes this this structure but sometimes when you have a molecule for example you know that the molecule is connected like this you know that this is the connection and you want to somehow encode this information into the transformer and that's what reformer did and we'll soon see how those exactly look like but that's the the whole point you want to ingrain the structural information into the transformer to make it uh more more like more performant okay and finally many popular genome variants could be covered as a special cases of graph former and i'll later show you this in the ape index and i'll explain how this uh comes to be okay uh having said that let me just give you uh like a brief uh introduction to gnns if you're not familiar with them i have a whole graph ml playlist you can check it out i'll link it somewhere here um but like just let me shortly explain you how how this works so what what you usually have is in in in at least this one family of graph neural networks uh called mpn so message passing neural networks you have this aggregation step and you have this combined step and so let me draw a small graph here something like this and what happens is the following so let me connect these like this so these are the direct neighbors and we have some indirect neighbors so what will happen is that for this so every single node so that's the first point every single node and every single edge will have a certain feature vector associated with it so why is that well basically assume you have a assume this is like these are atoms so this graph represents a molecule so this edge this feature vector may contain the information such as is this a single bond is this a double bond is this an aromatic bond et cetera so they can be as you can see useful information encoded directly into the edge feature vector so what you do is the following so you have this as i said aggregate step so what what that does is the following so it will take the neighbors uh so either the one hop or the multiple neighbors but let's assume we have one hub neighborhood here so that means that this i node will take into account this vector this vector this vector as well as the edges and um it will somehow aggregate that information so that may be like a mean you basically add them up and you just divide them by the number of of elements or like sum or whatever there can be different aggregation methods but once you have that so again you have the original feature vector here and you have the one you just formed by aggregation and now the the following step is this combined method as you can see so this is the aggregated one so this is this one and then you have the original one that's that's this one and you somehow combine them so they can be maybe via mlp you just pass them through a multi-layer perceptron and you form a novel representation here which will look something like this okay so that will be how graph neural networks roughly work and you then stack a bunch of these layers together finally you have some depending on the task you you just have some classification head whatever you back prop you train these weights and that's it um so that's that's that's it and now there is one more thing if you want to do graph level prediction what you usually do is you have this readout function and what it does is you can see it just takes the nodes from the set of nodes of that graph and you just kind of combine them somehow so again you have layer one and you do on the layer one you do a processing that looks something like this so you do those kind of aggregations and you you combine them and you repeat that multiple times so you have layer two all the way up until layer l whatever that number may be and then you take the node feature vectors from that layer and use for example sum them up and that will be a readout function so you have a single vector which you can then use to maybe uh regress something or whatever um and so that's how the the pipeline roughly looks like for for graph neural networks so now having said that let me just briefly tell you about transformer and as i said so i won't be delving into details here but you have a couple of procedures here so you basically first project your feature vectors into these query vectors into key vectors into value vectors then you find the similarity between the query and the keys and that will provide you with this attention scores which then you you apply just a soft max and you then use those attention attention coefficients to aggregate the value vectors so that's just a high level overview as i said you can watch my video to understand this that's not the point of this paper and now let me let's start digging into the encodings that's the meat of this paper okay first up let's see the centrality encoding so uh centrality node centrality is is a really uh abstract uh like a concept uh there can be many things and they concretely use something called degree centrality we'll see that in a moment but like the rough notion is that those centrality measures tell you how important a certain node or an edge is in that particular graph so they say it here in equation four the equation four is just a transformer equation for like accumulating value vectors uh the attention distribution is calculated based on the semantic correlation between nodes so that's the query keys dot product however node centrality which measures how important a node is in the graph is usually a strong signal for graph understanding for example celebrities who have a huge number of followers are important factors in predicting the trend of a social network as you can imagine if elon musk tweets something you can assume that a nice portion of the twitter graph will be copy pasting he's retweeting his content so you can kind of predict just because of a single note you can predict the behavior of many other nodes so that's a maybe rough intuition for why this works and as i mentioned they're using degree centrality and that's a simple concept so again assume you have a node and assume you have some edges so you have some in-going edges you have some outgoing edges here so what no centrality is it's the following so uh basically the out degree will be three here as you can see we have three edges so that will mean we have our degree three and we'll have two in-going edges so the uh degree centrality of the so of the in-degree is two so that's basically it so that's the that's the number that's the measure we want here so for an influencer node you'll have like thousands and thousands of these in-going edges which kind of signal that it's an important note and how they include that information into this graph former is the following way so as you can see this is the original feature vector in your graph and let me try and draw this so again you have a small graph here and it's connected with some other nodes and so what happens is the following you have the original so this x side that's the original feature vector okay and you add up to that feature vector these um additional embeddings so you basically form what they do is they form a table which has two columns so one column will correspond to the in degree vectors and the other will correspond to the out degree vector and now if you focus on this specific node here and yeah assuming assuming we have assuming this one is maybe in degree maybe these two maybe this two is out uh outgoing edge uh you can see that uh we can we can basically form the the the the degree numbers and we have two for the out degree and we have one for the in degree and so using that information we can basically index into this embedding table which is initially just randomly uh initialized so what we'll do is we'll have as you can see here so this will be zero slot first slot second slot we'll just take this vector here and we'll add it directly to this one so we'll just add it up to this one and because we have one in degree we'll go to the in-degree column and we'll just index into this one so this is one and we'll take this vector and we'll again add it up to this representation okay so um as you can see uh that's basically what they do and yeah this is it here where z minus and z plus are uh d dimensional real valued vectors so are learnable embedding vectors specified by the in degree and out degree respectively so what this roughly does now you can imagine if you had certain node which is an influencer node which had a bunch of these in-going edges maybe thousand and you had another node that also had a thousand of these in-degree edges okay that means that they'll be indexing the same the very same vector here will be used in both of those and that and that kind of can signal to this uh grafformer that these are somehow special um that's the rough intuition behind all of all of this and beca behind this first embedding uh embedding uh procedure so the second thing which is really important is this spatial encoding and they said here concretely for any graph g we consider a function phi between v i and v j so we have node space and we have node space just a cartesian product and we map that into real number so which measures the spatial relation between vi and vj in graph g so this is very abstract let's make it a bit more concrete so in this paper we choose five to be the distance of the shortest path so the shortest path distance between vi and vj if the two nodes are connected if not we set the output of phi to be a special value for example -1 okay let's see what this practically means so again we have this equation as you can see this is basically transformer equation we have just the dot product between so this is the query vector this is the key vector we do a dot product this is transposed and we just kind of scale it uh where d is the number of dimensions in those vectors we just take a square root okay so this is the transform part and that's not that interesting so what's interesting here is they add a bias so this is a learnable scalar as they say here to encode this information of the structure uh in the graph and they do that by you can see here this phi is used to index into the this b uh array so b will be again uh as we had the that we had this embeddings here now we'll have similarly array that we'll call b and which will be able to index using this phi where phi is the shortest path between any two points so that means the following again let me draw a simple diagram so we have maybe something like let me draw like a path diagram so we have a simple path diagram a path graph and as you can see the this node here will have so this this node here will be have a distance of one this node here will have a distance of two so that's the shortest path and finally this one will have distance of three so that means when we do when we form a query vector here and key is here what we'll do is the following so we'll be adding so for this particular node here we'll be adding let me just change the color so this is zero one two three so for this one we'll be using exactly this one we'll be using this scaler then for the second one we'll be using this scaler so uh now you can imagine that this can help us learn some interesting stuff so they say here for example if b is learned to be a decreasing function with respect to phi so as we so that what they say is as we go down this b array if this becomes like smaller and smaller so they say that uh for each now the model will likely pay more attention to the nodes near it and pay less attention to the nodes far away from it and in the extreme case imagine you have you have maybe a zero here for the first node and you have minus infinity for all of the other ones so if this is minus infinity what will happen is because you're adding those to the attention score after applying soft max all of those nodes which have minus infinity the attention coefficients will will drop to zero whereas for this one it will be non-zero it will be probably one so in this particular graph so we'll be focusing so what we'll end up doing is focusing on only this node here and ignoring the other the rest of these so if we had more neighbors more direct neighbors we'd be basically focusing only on the one hop neighborhood just because of this learnable scalar here so that's the idea there this is how they encode this um structural information of the graph by the shortest path distance um scalars okay that's the second important uh encoding and the third one and the final one is this edge encoding and they say here there are mainly two edge encoding methods used in previous works in the first method the edge features are added to the associated nodes features in the second method for each node its associated add feature will be used together with the node features in the aggregation however such ways of using edge feature only propagate the edge information to its associated nodes which may not be an effective way to leverage edge information in the representation of the whole graph so in a nutshell what this means is the following so if you have a if you have a graph again let me draw a simple graph here what they do is if you have a graph like this and when you do those aggregation and combine steps in the first layer of the gnn what we'll end up with what we'll end up doing is we'll be end up we'll end up including this information so these two edges and information from these two nodes into this novel representation that will form here and we are ignoring we're ignoring these edges here we're ignoring them in the first pass at least and what they are suggesting here is that it may be wise to kind of ingrain that information straight away and uh like during this aggregation process just kind of use that edge information from all of the other uh notes and that's how they do that is the following so again you can see simple stuff we have the transformer part we have the spatial embedding part which i just explained and we have this novel uh edge encoding part which is this fancy formula which i'm now going to try to explain so again we'll have a embedding table so we'll have a table which will correspond to these learnable edge features and like the length of this table will just be the following thing the length will be the maximum shortest path in the graph so you find all of the so between every single node in your graph you find the shortest path and you find the maximal such path and that will be the length of this table let's call it i don't know maybe e and okay so once we have that we do the following thing so let's assume we want to find so let's assume we're finding a tension score between i and j so if this is i and if this node here is j what we'll do is the following so we'll have we'll have some feature vectors which are associated with these edges right so we'll have this and this and this and we'll do the following we'll take the first edge so we'll have edges here so these are again random initialized initially so we have one two three etc so we'll take this vector here we'll do a product between that vector and this vector here so we'll do the dot product between those two because this is the first edge on the shortest path so the shortest path is this one so we have this is the shortest path between i and j so we'll first do a dot product between those two then we'll take the following vector we'll do a dot product between that vector and this vector and we'll take the third vector because this is the third uh vector on this path and we'll do a dot product and what we'll just do is we'll just sum them up as you can see here and divide them by n which is the number of of edges in the path which is three here so these are the edge features these are the learnable uh edge features and finally just sum them up and divide by n that's it and intuitively what this does is we ingrain the edge information across so between i and j along the shortest path and they show that in the ablation studies later that this helps boost the performance of the reform model okay so those are the main ideas of this paper uh these three encodings and now i'll just again so everything else remains the same as in the transform model they just ingrained these encodings everything else remains the same you have the multi-head attention sub module you have the ffn so the fully uh the fully uh the feed forward network module you have the layer normalization standard stuff so i won't go into those details and i'll skip this an interesting part for now which shows that classical genens can be considered as a special case of this graphora module but before that let me show you the experiment results so looking at the baselines here so again this is this large scale challenge data set it has 3.8 million graphs so it's huge compare that to previous uh g and popular graph ml benchmarks such as cora sites here or pubmed and you'll see that this is much bigger and so much of a much better benchmark compared to those and we compare with gcn so that's graph convolutional network with genes with uh gcns with the virtual node etc and we have this is some modification of a transformer also an attempt to apply transformers to graph level prediction tasks but not as successful as graph former as we'll see so as you can see looking at the test so this is mean absolute error we we achieve so graph formula achieves the lowest error on compared to compared to all of these other classical genets so those are some nice results and um there's one thing i want to mention here let me just see where it is um it basically uh yeah as stated in section 3.3 we further find the proposed reformer does not encounter the problem of over smoothing i the train and validate error keep going down along with the growth of depth and width of models so now that this may be strange if you're not familiar with graph neural networks but basically compared to cnns where depth more depth basically take resonance for example so the more depth you have the more expressive and powerful and performant the models are here in the gnn literature basically that doesn't necessarily help so you usually see those shallow gen ends like two or three layer deep and those work just fine so the reason for that are there are multiple reasons so one of them is maybe bottleneck so that means if you have if you take a look at the one hop neighborhood you'll have maybe 100 neighbors if you take a look at the two hop neighborhood you may end up with ten thousand neighbors if you take a look at the three hop neighborhood so that's the third layer of the gin and you'll maybe be attending to millions of nodes and just by if you try and average all of those feature vectors you'll end up basically collapsing into into a single feature vector and that's something that's called over smoothing so basically all of your feature vectors start collapsing into very same representation and thus you don't have the the discriminative power to kind of figure out whatever the task at hand is so maybe protect this predict the solubility of the molecule whatever so that's the over smoothing problem and it seems that graph former uh kind of dealt with it and i'll be really really excited i'm really excited to see like what the future research shows and explains why this is uh finally they also show some results on other ogb data sets again these are mostly molecular data sets and you can see again it just outperforms uh all of the other baselines this flag is just a simple adversarial augmentation so so just that you know what this suffix means um and yeah one zinc again uh lowest uh mean uh absolute error compared to the previous baselines so that's that's pretty much it they also did some ablations showing that all of these encodings boost up the performance so this is the edge encoding as you can see the the error drops when they added same goes for the no centrality saying goes for the spatial encoding they always have these boosts in the performance when they add those particular encodings uh finally uh as we know basically transformers have problem with with the scalability at least the the the vanilla transformer we have this quadratic dependency between the tokens because of this attention attending mechanism and so the quadratic complexity of the self-attention module restricts performance application on large graphs which is pretty much obvious so that means that even though it achieves really nice results on these benchmarks which are molecular data sets molecular graphs i can imagine that this still can't be used to any uh like like huge graphs such as maybe social media graph and so yeah i can i just expect to see people um kind of using the ideas that came from the transformer literature like linformer performer etc to kind of make this thing more efficient and then we'll see whether it can achieve some nice results even on larger graphs um having said that that's the those are the main ideas of this paper um and now i want to focus on on this on this effect one and kind of delve a bit deeper into it and explain why this is okay let's focus on this part by choosing proper weights and distance function five the graph formal layer can represent aggregate and combine steps of popular gnm modules models such as gin gcn and graph sage moreover we show further that by using our spatial encoding graph former can go beyond classic message passing genens whose expressive power is no more than the weisswala lemon one test so by the way there is a a bit of a confusion about this weiss file alignment test and the thing is uh all of these expressivity um like uh statements basically only are valid for and so-called anonymous uh graph neural networks whereby anonymous i mean that the the node feature vectors are actually um basically non-existent or they are not discriminative so that means um where well by using uh standard graph neural networks you always have those known features so these analyses are not that they can be kind of tricky i'll link this blog down in the description you can see it on the screen which explains this problematic really good and kind of explains what a gns can learn and what they can't learn and so do check it out it's really really cool having said that let's focus on this part how can we how can we treat these classical gnns as a special case okay um but let's start first with this vice versa lemon thing so um so voice file element or the color test uh basically what it does it it outputs a color histogram and if the color histograms are the same it claims that those two graphs are isomorphic and we can clearly see that these two graphs here are not isomorphic but weiswala lemon tells us that they are isomorphic because we have four red colors and we have two blue colors uh and because of that weiswala lemon would falsely claim that these are isomorphic on the other hand if you focus on the shortest path distance so let's let's see what i see here so these two graphs cannot be distinguished by vice ver layman one test but the spd sets i the spd from each node to others are different the two types of nodes in the left graph have spd sets as you can see here and let me explain let me kind of break it down for you what this means so obviously the shortest path distance between this node this this red node and itself is zero uh it has one distance until it reaches this this these nodes and two for these two and finally three and as you can see here that's zero one one two two three and for all of the red nodes we have the same spd set okay finally for this blue node again we have trivially zero shortest path to itself we have one one one here so that's three one so those are these ones here and while finally we have two two which are these two and similarly goes for this graph here and we can see that the spd sets are different thus uh this spd uh will help us determine that these graphs are not isomorphic i they are non-isomorphic so that's a like a a visual explanation and a particular example where they showed that graph former is more powerful than those uh simple uh like vice versa one test kind of architectures okay uh now let's focus on the really important part and that's these how we how can we uh basically show that uh gnns are some genens or special case of graph of this reform module so let's focus on mean aggregate function and explaining how that works so we begin by showing that self-attention module with spatial encoding can represent mean aggregation this is achieved by in equation six settings so this is equation six this is your transformer this is your basically bias using this shortest path uh function okay so by setting the the the the scalar when um phi equals one so direct neighbors to zero and to minus infinity otherwise where phi is the shortest path distance and setting these two matrices so the query projection matrix and the key projection matrix to zero matrix and the value projection matrix to identity matrix then the softmax gives the average represent of representations of the neighbors and let's let me let me try and break it down how this exactly works it's quite a simple so we have a note here we have some direct neighbors let's say we have three direct neighbors so these are connected and finally let's say we have some undirect indirect ma like neighbors okay like these two so what happens is the following so as we know these will have and we are focusing on this node uh and trying to figure out the the next layer representation so we can see that these nodes here have distance one and these nodes here have distance two and finally three okay so what i say is the following if we set these b's to zero so if we set this one to zero this one to zero and this one to zero and if we set these to minus infinity then as you can see looking at this formula here we'll have that after we apply softmax these will get to zero and these will get will have some non-zero value and now the second portion is this these matrices should be zero matrices and the value should be identity that will help us achieve the the mean aggregation of neighbors so let's see so for now we just have some arbitrary non-zero values here and we have zeros here for the attention coefficients so now let me let me do it like this so let's focus on this particular node and so we have some feature vector once we map them this feature vector using the key query and value matrices we'll get the following things so query and key so key and query will just be zero vectors because remember we have zero matrices there and this one here because we have an identity matrix will be the same just a copy based version of this vector here so this is v this is we as well so this is the value vector okay and now we do the same thing for all of the nodes in the graph and uh as you can see if we take this query vector which is all zeros and we do a dot product between any of the keys because all of the keys are also all zero vectors will always have this thing going to zero so this this term in the equation will go to zero and finally as i already explained we'll have this bias terms will be zero for the local direct neighbors and minus infinity there so that means we'll basically end up having the same coefficients here and that's going to be one over three and these will be zero so we'll only be we'll end up attending only these nodes here and we'll take a mean average of of of those three uh feature vectors okay so if you're confused by one three it's pretty simple so it's just soft maxing so you'll have e so let's focus on for example on on this note we'll have e raised to the power of zero because that's the attention that's the attention score we got and down in the denominator we'll just have a sum and that sum will be equal to three because we'll have three times e raised to the power of zero and we'll have two times e raised to the power of minus infinity which is obviously zero and that's why we achieve that's why this goes to one over three and that means that yeah basically those are the attention coefficients that this node will use to accumulate the value vectors of the direct neighbors and as you can see we literally achieved this mean aggregation so that was the first the first part now the sum aggregate part is uh pretty easy once we understand this first mean aggregate part but let me let me let me zoom in a bit here and explain it the summer variation can be realized by first performing mean aggregation and then multiply the node degrees we'll see what that means so specifically the node degrees can be extracted from centrality encoding so that's the degree encoding by an additional head and be concatenated to the representations after mean aggregation then the fully the feed forward network module in reformer can represent the function of multiplying the degree of the dimensions of average representations by the universal approximation theorem of fully of fully feed forward networks okay so let me kind of break that down it's pretty simple um and i'll just draw it over this chart so uh let's assume the first head gives us the mean representation for this vector okay so we have that part and now what they say is the following let's take another uh head so we have multiple attention heads remember this is transformer so what that have will do it will learn how to extract the degree of this node which is as you can see here just consider it's undirected graph so it's three so we'll learn how to extract maybe something like a vector which will have all zeros and only three at the bottom here okay and so how that will happen is the following so remember this so this this vector here is formed by taking the original vector and adding to it those embeddings for the degree centrality so basically we'll be adding those nodes those vectors which contain the degree information and this head can learn how to extract from this vector how to extract this number here so that have we'll learn how to do that and once we have that we do the following we basically concatenate because if you remember how transformer works every single head attention head will basically output a representation those will get concatenated then we'll have some mlps so again we'll have the following we'll have the mean representation and we'll have the degree of the node and then by this universal by this theorem of universal approximation theorem where these mlps can basically like given not enough width and in depth they can basically approximate any function so what we get here is after mapping them it can learn how to take this number let's say three and multiply it with every single feature dimension here so with testing with testing with this thing with this thing and if we do that we'll just end up with a sum aggregation instead of mean aggregation and hopefully this was clear enough uh this is how i understood it uh basically and yeah if i'm wrong feel free to comment down but yeah i think that's in a nutshell how it works so uh similar thing for max aggregate similar thing for combine and mean readout you can you can do it yourself but this is the the whole logic the whole point how this works and yeah we saw we can emulate these uh as a special case of grafformer and that's super cool uh having said that if you like this video consider sharing subscribe and until next time bye bye

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Paper: Do Transformers Really Perform Bad for Graph Representation? In this video, I cover Graphormer a new transformer model that achieved SOTA results on the OGB large-scale challenge benchmark. It achieved that by introducing 3 novel types of structural biases/encodings into the classic transformer encoder. ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ✅ Paper: https://arxiv.org/abs/2106.05234 ✅ What can GNNs learn blog: https://andreasloukas.blog/2019/12/27/what-gnn-can-and-cannot-learn/ ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ⌚️ Timetable: 00:00 Key points of the paper 02:15 Graph-level predictions and why we need structural information 05:35 GNNs basics 09:40 Centrality encoding explained in depth 14:15 Spatial encoding explained in depth 18:00 Edge encoding explained in depth 22:30 Results on OGB LSC 23:30 Graphormer handles over-smoothing 25:10 Other results and ablation study 28:20 Graphormer is more expressive than WL 30:30 Mean aggregation as a special case of Graphormer 35:00 Sum aggregation as a special case of Graphormer ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💰 BECOME A PATREON OF THE AI EPIPHANY ❤️ If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon! The AI Epiphany ► https://www.patreon.com/theaiepiphany One-time donation: https://www.paypal.com/paypalme/theaiepiphany Much love! ❤️ Huge thank you to these AI Epiphany patreons: Petar Veličković Zvonimir Sabljic ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition". ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 👋 CONNECT WITH ME ON SOCIAL LinkedIn ► https://www.linkedin.com/in/aleksagordic/ Twitter ► https://twitter.com/gordic_aleksa Instagram ► https://www.instagram.com/aiepiphany/ Facebook ► htt

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Aleksa Gordić - The AI Epiphany · Aleksa Gordić - The AI Epiphany · 47 of 60

← Previous Next →

Intro | Neural Style Transfer #1

Intro | Neural Style Transfer #1

Aleksa Gordić - The AI Epiphany

Basic Theory | Neural Style Transfer #2

Basic Theory | Neural Style Transfer #2

Aleksa Gordić - The AI Epiphany

Optimization method | Neural Style Transfer #3

Optimization method | Neural Style Transfer #3

Aleksa Gordić - The AI Epiphany

Advanced Theory | Neural Style Transfer #4

Advanced Theory | Neural Style Transfer #4

Aleksa Gordić - The AI Epiphany

Anyone can make deepfakes now!

Anyone can make deepfakes now!

Aleksa Gordić - The AI Epiphany

What is Computer Vision? | The Art of Creating Seeing Machines

What is Computer Vision? | The Art of Creating Seeing Machines

Aleksa Gordić - The AI Epiphany

Feed-forward method | Neural Style Transfer #5

Feed-forward method | Neural Style Transfer #5

Aleksa Gordić - The AI Epiphany

Alan Turing | Computing Machinery and Intelligence

Alan Turing | Computing Machinery and Intelligence

Aleksa Gordić - The AI Epiphany

Feed-forward method (training) | Neural Style Transfer #6

Feed-forward method (training) | Neural Style Transfer #6

Aleksa Gordić - The AI Epiphany

What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1

What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1

Aleksa Gordić - The AI Epiphany

Semantic Segmentation in PyTorch | Neural Style Transfer #7

Semantic Segmentation in PyTorch | Neural Style Transfer #7

Aleksa Gordić - The AI Epiphany

How to get started with Machine Learning

How to get started with Machine Learning

Aleksa Gordić - The AI Epiphany

How to learn PyTorch? (3 easy steps) | 2021

How to learn PyTorch? (3 easy steps) | 2021

Aleksa Gordić - The AI Epiphany

PyTorch or TensorFlow?

PyTorch or TensorFlow?

Aleksa Gordić - The AI Epiphany

3 Machine Learning Projects For Beginners (Highly visual) | 2021

3 Machine Learning Projects For Beginners (Highly visual) | 2021

Aleksa Gordić - The AI Epiphany

Machine Learning Projects (Intermediate level) | 2021

Machine Learning Projects (Intermediate level) | 2021

Aleksa Gordić - The AI Epiphany

Cheapest (0$) Deep Learning Hardware Options | 2021

Cheapest (0$) Deep Learning Hardware Options | 2021

Aleksa Gordić - The AI Epiphany

How to learn deep learning? (Transformers Example)

How to learn deep learning? (Transformers Example)

Aleksa Gordić - The AI Epiphany

How do transformers work? (Attention is all you need)

How do transformers work? (Attention is all you need)

Aleksa Gordić - The AI Epiphany

Developing a deep learning project (case study on transformer)

Developing a deep learning project (case study on transformer)

Aleksa Gordić - The AI Epiphany

Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

Aleksa Gordić - The AI Epiphany

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

Aleksa Gordić - The AI Epiphany

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Aleksa Gordić - The AI Epiphany

Attention Is All You Need (Transformer) | Paper Explained

Attention Is All You Need (Transformer) | Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Attention Networks (GAT) | GNN Paper Explained

Graph Attention Networks (GAT) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Convolutional Networks (GCN) | GNN Paper Explained

Graph Convolutional Networks (GCN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

Aleksa Gordić - The AI Epiphany

OpenAI CLIP - Connecting Text and Images | Paper Explained

OpenAI CLIP - Connecting Text and Images | Paper Explained

Aleksa Gordić - The AI Epiphany

Temporal Graph Networks (TGN) | GNN Paper Explained

Temporal Graph Networks (TGN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Aleksa Gordić - The AI Epiphany

Graph Attention Network Project Walkthrough

Graph Attention Network Project Walkthrough

Aleksa Gordić - The AI Epiphany

How to get started with Graph ML? (Blog walkthrough)

How to get started with Graph ML? (Blog walkthrough)

Aleksa Gordić - The AI Epiphany

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

Aleksa Gordić - The AI Epiphany

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

Aleksa Gordić - The AI Epiphany

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

Aleksa Gordić - The AI Epiphany

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

Aleksa Gordić - The AI Epiphany

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

Aleksa Gordić - The AI Epiphany

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

Aleksa Gordić - The AI Epiphany

Implementing DeepMind's DQN from scratch! | Project Update

Implementing DeepMind's DQN from scratch! | Project Update

Aleksa Gordić - The AI Epiphany

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

Aleksa Gordić - The AI Epiphany

DeepMind's Android RL Environment - AndroidEnv

DeepMind's Android RL Environment - AndroidEnv

Aleksa Gordić - The AI Epiphany

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

Aleksa Gordić - The AI Epiphany

Non-Parametric Transformers | Paper explained

Non-Parametric Transformers | Paper explained

Aleksa Gordić - The AI Epiphany

Chip Placement with Deep Reinforcement Learning | Paper Explained

Chip Placement with Deep Reinforcement Learning | Paper Explained

Aleksa Gordić - The AI Epiphany

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Aleksa Gordić - The AI Epiphany

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Aleksa Gordić - The AI Epiphany

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

Aleksa Gordić - The AI Epiphany

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

Aleksa Gordić - The AI Epiphany

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

Aleksa Gordić - The AI Epiphany

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Aleksa Gordić - The AI Epiphany

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Aleksa Gordić - The AI Epiphany

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

Aleksa Gordić - The AI Epiphany

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

Aleksa Gordić - The AI Epiphany

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

Aleksa Gordić - The AI Epiphany

DETR: End-to-End Object Detection with Transformers | Paper Explained

DETR: End-to-End Object Detection with Transformers | Paper Explained

Aleksa Gordić - The AI Epiphany

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

Aleksa Gordić - The AI Epiphany

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

Aleksa Gordić - The AI Epiphany

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Aleksa Gordić - The AI Epiphany

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Aleksa Gordić - The AI Epiphany

The Graphormer model is a new transformer model that achieves state-of-the-art results on graph representation tasks by adding structural inductive biases, and its architecture and performance are explored in this video.

Key Takeaways

Set up the Graphormer model with centrality encoding, edge encoding, and spatial encoding
Implement the transformer equation with bias to encode graph structure information
Use the Graphormer model to perform mean aggregation and extract degree information from node vectors
Evaluate the performance of the Graphormer model on graph representation tasks
Compare results of the Graphormer model with other classical graph neural networks

💡 The Graphormer model can effectively encode structural information of a graph into the transformer model, allowing it to outperform classical graph neural networks on graph representation tasks.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Reading ML Papers

View skill →

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

Claude 3.7 Sonnet API | Build a Research Assistant

Claude 3.7 Sonnet API | Build a Research Assistant

I Built An Obsidian AI Research Assistant with Oz...

I Built An Obsidian AI Research Assistant with Oz...

Related AI Lessons

I Spent Weeks Looking for a Research Gap Before I Realized I Was Searching the Wrong Way

Learn how to effectively find research gaps by changing your approach, a crucial skill for AI researchers and academics

ICMI 2026 Reviews [D]

Learn how to interpret ICMI 2026 reviews and improve your paper's acceptance chances

Reddit r/MachineLearning

Workshop submission for main conference paper under review [D]

Learn how to navigate submitting a paper to a non-archival workshop before the final decision of a main conference like ECCV

Reddit r/MachineLearning

Kept context-switching between arxiv, OpenReview, GitHub, and HuggingFace for every paper, so I built this. Chrome extension + website with everything inline, plus citation graph + SPECTER2 neighbors. 3M papers, free, feedback welcome [P]

Streamline your research with a new Chrome extension and website that integrates 3M papers from arxiv, OpenReview, GitHub, and HuggingFace, including citation graphs and SPECTER2 neighbors, and provide feedback to improve it

Reddit r/MachineLearning

Chapters (12)

Key points of the paper

2:15 Graph-level predictions and why we need structural information

5:35 GNNs basics

9:40 Centrality encoding explained in depth

14:15 Spatial encoding explained in depth

18:00 Edge encoding explained in depth

22:30 Results on OGB LSC

23:30 Graphormer handles over-smoothing

25:10 Other results and ablation study

28:20 Graphormer is more expressive than WL

30:30 Mean aggregation as a special case of Graphormer

35:00 Sum aggregation as a special case of Graphormer

Beyond Big Vendors: ERP Systems Explained #shorts

Digital Transformation with Eric Kimberling