Generating Wikipedia by Summarizing Long Sequences
Key Takeaways
The video discusses the paper 'Generating Wikipedia by Summarizing Long Sequences', which uses a transformer decoder architecture with a novel approximation to full attention, a memory-compressed attention layer, to generate summaries of source documents. The paper also constructs a dataset called WikiSum and first filters the source documents with a tf-idf extractive algorithm before passing them to a transformer model for abstractive summarization.
Full Transcript
This video explores the paper "Generating Wikipedia by Summarizing Long Sequences" from Google AI. This experiment explores a really interesting application of natural language processing: generating unique summaries from a massive set of source documents about a given topic. The dataset they deal with in this paper is really interesting. They collect two million input-output pairs for supervised learning, in which the input is all of the articles cited in the Wikipedia page as well as the top 10 search results for that topic. In order to make this problem more tractable, they filter the raw data with a tf-idf extractive algorithm, and the result is then passed into the transformer abstractive summarization model. This paper introduces a lot of interesting transformer details, like reducing the memory overhead by dropping the encoder half and using a decoder-only transformer. They also introduce a memory-compressed attention layer that approximates full attention by alternating between layers that either split the input sequence into contiguous chunks and run a separate attention on each chunk, or use strided convolutions to reduce the number of keys and values. This video explored miscellaneous details of the experiments in this paper.
This video will explore the paper "Generating Wikipedia by Summarizing Long Sequences" from researchers at Google AI. They use a transformer decoder architecture with a novel approximation to full attention in order to generate the opening sections of Wikipedia articles when given a collection of reference documents about the topic. These experiments look at abstractive summarization. Extractive summarization describes taking a massive set of documents and then exactly copying sentences from the reference documents in order to make up the summary. In contrast, abstractive summarization describes using a language model to generate the summary, so the final summary is composed of original language from the
generative model, or, so to say, summarizing it in your own words. The authors of this paper construct a really interesting dataset that they call WikiSum. WikiSum uses all the articles referenced in a Wikipedia article and the top 10 search results from Google as the input for the language model, and the output is the lead section of the Wikipedia article. It's really interesting to think about the construction of these massive datasets for natural language processing tasks. At the end of this they end up with about 2 million of these pairs of source documents and Wikipedia article. These experiments all test generating this opening section of the Wikipedia article rather than the full article, although they do show the capability of putting these together into a coherent full article in the appendix of the paper.
Table 1 shows some interesting characteristics of this dataset compared to previous work on abstractive and extractive summarization. The WikiSum dataset has a fortunate structure in the output space because most Wikipedia articles follow a style guide, but there is still massive variance in the style of the inputs because they come from all sorts of different articles around the web. Additionally, the input for the WikiSum dataset is much larger than in previously studied datasets, as is the output. The ROUGE-1 recall score is a metric signaling the proportion of output words that are also contained in the input, so this lower score indicates that less of the output is contained in the input, which also indicates that it's a harder task. Table 2 shows massive variance in the WikiSum dataset: a lot of Wikipedia entries have very few references, which is why the dataset is supplemented with Google search results as well. The input data space of all of the articles referenced in the Wikipedia article as well as the top 10 Google search results is too large to do end-to-end abstractive summarization with.
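As a rough illustration of the ROUGE-1 recall idea described for Table 1, here is a minimal sketch in Python. It is not the official ROUGE implementation: the function name `unigram_recall` is hypothetical, and it assumes lowercase whitespace tokenization with no stemming, which a real evaluation would likely handle differently.

```python
def unigram_recall(source_text, target_text):
    """Fraction of target (output) words that also appear in the source (input).

    Simplified stand-in for ROUGE-1 recall: lowercase whitespace tokens,
    no stemming or stop-word handling (assumptions, not the paper's setup).
    """
    source_words = set(source_text.lower().split())
    target_words = target_text.lower().split()
    if not target_words:
        return 0.0
    hits = sum(1 for word in target_words if word in source_words)
    return hits / len(target_words)


if __name__ == "__main__":
    source = "the cat sat on the mat"
    target = "the cat slept"
    # Two of the three target words ("the", "cat") occur in the source.
    print(unigram_recall(source, target))
```

A low value here means the summary uses many words that never appear in the input, which is the sense in which a lower score signals a more abstractive, harder task.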
So what they do is they first have a middleman extractive summarization pipeline that uses things like tf-idf, or a cheating method, to extract exactly the paragraphs from the raw data that have the most overlap with the target summary, in order to filter the data and make it more tractable for abstractive summarization. In the paper they explore five different techniques for extractive summarization as the middleman between the raw data and the abstractive summarization task, but I chose to isolate these two because I think they're the most interesting for the sake of summarizing this paper.
Term frequency-inverse document frequency basically weights the number of times a word appears in this document against how many of the documents contain that word. So say you have a word like "tensorflow" in the query: you count how many times "tensorflow" appears in this document and multiply that by the log of the total number of documents divided by the number of documents that contain "tensorflow". You use this to rank the paragraphs that have the most similarity with the query; in this case the query is the title of the Wikipedia article.
The other thing that they show is cheating. Cheating is like the oracle extractive summarization task, where you have access to the final target paragraph, so you compute the bigram overlap between each of the paragraphs that you're ranking and the target summary. Ideally, you'd imagine that paragraphs with high overlap with the output would give you a lot of key information for constructing that target paragraph.
After the extractive summarization methods like tf-idf and the cheating method, or the other methods like identity, you pass the result as the new input into the abstractive model. There are four different abstractive models they test: the sequence-to-sequence
LSTM with attention, the transformer encoder-decoder, then they present this novel transformer decoder, which is then used in GPT-2 and other transformer models that choose to abandon the encoder part of the transformer, and then they also introduce a new transformer decoder with memory-compressed attention. The idea of the transformer decoder is to abandon the encoder half and just have the inputs go right into the decoder side of the previous transformer architecture, with masking, and do a language modeling task.
The way the abstractive summarization model is trained is that it has this sequence m1 to mn, which is the ranked paragraphs from the extractive summarization model, tokenized and then truncated to a fixed length in order to fit into memory. Then you have this special delta separator token, and then you have the output, which is the tokens of the original Wikipedia opening section. This uses the input-output labeled dataset for supervised learning of abstractive summarization. The way that the language model works is that during training it predicts the input as well, autoregressively: it makes its first prediction, then shifts the mask over one, makes the next prediction, shifts the attention mask over one, and so on in order to train the model. So during training it's predicting the input, the separator token, and then the output, but later on, when the model is deployed, it just has the input, with the mask in place, and then predicts the output.
One of the trickiest problems with training these transformer language models is the bottleneck of the dot-product attention computation. When you multiply the query matrix by the transposed key matrix you get a length-by-length matrix, which is really difficult for memory constraints. What they say in the paper is that with their 16 GB GPU they're able to fit a length of 4,000 tokens, whereas using this memory-compressed
attention, they're able to get this input sequence length up to 11,000. They present two different techniques for approximating full attention. The first is to take the key and value matrices and reduce the number of keys and values by using a strided convolution: the strided convolution takes the sequence dimension of the key matrix down from the full length to some smaller number, and then does the same for the value matrix so that the dimensions of the matrix multiplication line up. Then local attention is this idea of splitting the input sequence: you take the first 256 tokens and send them into this multi-head attention layer, then you take the next 256 tokens and put them into a separate layer, so you split up the input sequence, pass the blocks through attention separately, and then merge them with something like a fully connected layer.
These are examples of different models, with different ablations of their parameters, on the task of summarizing this Wikipedia article about a law firm. You see the ground truth: this is the output from the Wikipedia article that's used to train these models. And these are some of the different summaries written by these different models with different parameters: this is the transformer encoder-decoder attending over a sequence length of 100 tokens, this is just the decoder attending over 500 tokens, and this is the one with memory-compressed attention attending over 7,000 tokens. They also add this mixture-of-experts layer to add more model capacity, to see how the summary gets better with this model. It's also good to see that the qualitative summarization gets better as the automated metric of log-perplexity goes down, which is a metric where lower is better.
These are some of the results from the ablation showing the effect of the different extraction methods and the different input corpora. The combined corpus
describes using the referenced articles in the Wikipedia article as well as the Google search results, citations-only is only the Wikipedia citations, and search-only is only the search results. You see the combined corpus has the best performance by far. With respect to the extraction, you see that the cheating method, where you compute the bigram overlap between the paragraphs of the raw combined dataset and the final output, the opening section of the Wikipedia article, has the best result by a pretty wide margin on this metric. This implies that they could further improve the extractive step by having a better algorithm than tf-idf.
This plot further shows the results of using an abstractive summarization model on top of the extractive summarization model. It shows the performance of two other extractive methods that they use that I didn't describe in the video, TextRank and SumBasic, and you can see that they perform about the same as tf-idf. But you see how, when you use the abstractive summarization layer on top of the extractive summarization, you get better performance on the overall summary of the input documents.
This ablation further shows the performance differences between different parameters on the transformer and the LSTM sequence-to-sequence architecture. You see that the sequence-to-sequence model, the LSTM with encoder-decoder attention, has the highest perplexity and is heavily underperforming compared to the transformers. First you have the transformer encoder-decoder attending over a sequence length of 500, and that has a decent perplexity, but you can see further improvements from dropping the encoder part, which allows you to attend over a longer sequence as well because you're cutting down the number of parameters in the model, and now you're attending over sequence lengths like 4,000 tokens. And then you see how you
introduce this approximation to full attention and you get attention over 11,000 tokens. Then you add in this mixture-of-experts layer, which increases the memory bottleneck, so eventually, with 256 experts in the mixture-of-experts layer, you attend over a shorter sequence but achieve better performance as a result of the higher capacity.
This table shows the results of human evaluation on these summaries, comparing the decoder-only transformer with memory-compressed attention against extractive-only summarization with the tf-idf algorithm and the LSTM encoder-decoder sequence-to-sequence model. This is evaluated on the dimensions of focus, grammar, non-redundancy, referential clarity, and structure and coherence, with the transformer with memory-compressed attention performing the best. You can also see a comparison of two models that have different automated scores and also result in different human evaluations, so this is showing the correlation between the human evaluators and these automated metrics.
Thanks for watching this video on "Generating Wikipedia by Summarizing Long Sequences". This experiment explores supervised learning of transformers with a massive dataset. Their WikiSum dataset contains two million of these input-output pairs, in which the input is all of the articles referenced in the Wikipedia page as well as the top 10 search results for the topic, and the output is the original opening section of that Wikipedia article. This paper introduces a lot of other interesting ideas, like the decoder-only transformer and the memory-compressed attention with local attention and the approximation with strided convolutions. Thanks for watching, and please subscribe to Henry AI Labs for more deep learning and AI videos.
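To make the memory-compressed attention idea from the transcript more concrete, here is a minimal single-head sketch in plain Python. It rests on several assumptions that are not from the paper: mean pooling over windows of 3 stands in for the learned strided convolution, causal masking and multiple heads are left out, and the function names (`compress`, `attention`, `memory_compressed_attention`) are hypothetical. The point is only to show that shrinking the number of keys and values shrinks the score matrix from length × length to length × (length / stride).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def compress(rows, stride=3):
    """Mean-pool each group of `stride` consecutive vectors into one vector.

    Stand-in (assumption) for the learned strided convolution; any trailing
    remainder shorter than `stride` is dropped.
    """
    pooled = []
    for start in range(0, len(rows) - stride + 1, stride):
        window = rows[start:start + stride]
        pooled.append([sum(column) / stride for column in zip(*window)])
    return pooled


def attention(queries, keys, values):
    """Unmasked scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for query in queries:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def memory_compressed_attention(queries, keys, values, stride=3):
    """Attend from every query to a shortened set of keys and values."""
    return attention(queries, compress(keys, stride), compress(values, stride))


if __name__ == "__main__":
    length, dim = 12, 4
    tokens = [[0.1 * (i + j) for j in range(dim)] for i in range(length)]
    out = memory_compressed_attention(tokens, tokens, tokens)
    # Score entries per head: full attention vs. compressed attention.
    print(length * length, length * len(compress(tokens)))
    print(len(out), len(out[0]))
```

The output keeps one vector per query position, so the layer can drop into the same place as full attention; only the intermediate score matrix gets smaller.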
Original Description
This video explores the paper "Generating Wikipedia by Summarizing Long Sequences". Natural Language Processing models that can generate summaries of source documents on a single topic such as "Generative Adversarial Networks" or "Reinforcement Learning" are one of the NLP applications that I find the most interesting! This paper is frequently cited for introducing the Transformer decoder architecture, but there are a lot more interesting details in this paper. I also think the approximations to full attention proposed in this paper are really interesting!
Paper Link: https://arxiv.org/pdf/1801.10198.pdf
Thanks for watching! Please Subscribe!