Generating Wikipedia by Summarizing Long Sequences

Connor Shorten · Beginner ·🧠 Large Language Models ·6y ago

Key Takeaways

The video discusses the paper "Generating Wikipedia by Summarizing Long Sequences", which uses a transformer decoder architecture with a novel approximation to full attention, a memory-compressed attention layer, to generate summaries of source documents. The paper also constructs a dataset called WikiSum and combines a tf-idf extractive step with a transformer abstractive summarization model.

Full Transcript

This video explores the paper "Generating Wikipedia by Summarizing Long Sequences" from Google AI. This experiment explores a really interesting application of natural language processing, which is generating unique summaries from a massive set of source documents about a given topic. The dataset they deal with in this paper is really interesting: they collect two million input-output pairs for supervised learning, in which the input is all of the articles cited in the Wikipedia page as well as the top 10 search results for that topic. In order to make this problem more tractable, they filter the raw data with a tf-idf extractive algorithm, and the result is then passed into the transformer abstractive summarization model. This paper introduces a lot of interesting transformer details, like reducing the memory overhead by dropping the encoder half and using a decoder-only transformer. They also introduce a memory-compressed attention layer that approximates full attention by alternating between layers that either split the input sequence into contiguous chunks and run a separate attention over each chunk, or use strided convolutions to reduce the size of the key and value matrices. This video also explores miscellaneous details of the experiments in this paper.

This video will explore the paper "Generating Wikipedia by Summarizing Long Sequences" from researchers at Google AI. They use a transformer decoder architecture with a novel approximation to full attention in order to generate the opening sections of Wikipedia articles when given a collection of reference documents about the topic. These experiments are looking at abstractive summarization. Extractive summarization describes taking a massive set of documents and then exactly copying sentences from the reference documents in order to make up the summary. In contrast, abstractive summarization describes using a language model to generate the summary, so the final summary is composed of original language from the
generative model, or, so to say, summarizing it in your own words.

The authors of this paper construct a really interesting dataset that they call WikiSum. WikiSum uses all the articles referenced in a Wikipedia article and the top 10 search results from Google as the input for the language model, and the output is the lead section of the Wikipedia article. It's really interesting to think about the construction of these massive datasets for natural language processing tasks. At the end of this, they end up with about 2 million of these source-documents-to-Wikipedia-article pairs. These experiments all test generating this opening section of the Wikipedia article rather than the full article, although they do show the capability of putting these together into a coherent full article in the appendix of the paper.

Table 1 shows some interesting characteristics of this dataset compared to previous work on abstractive and extractive summarization. The WikiSum dataset has a fortunate structure in the output space because most Wikipedia articles follow a style guide, but there is still massive variance in the style of the inputs because they come from all sorts of different articles around the web. Additionally, the input for the WikiSum dataset is much larger than in previously studied datasets, and so is the output. The ROUGE-1 recall score is a metric signaling the proportion of output words that are contained in the input, so a lower score indicates that less of the output is contained in the input, and it also indicates a harder task.

Table 2 shows massive variance in the WikiSum dataset. A lot of Wikipedia entries have very few references, which is why the dataset is supplemented with Google search results as well. The input space of all of the articles referenced in the Wikipedia article plus the top 10 Google search results is too large to do end-to-end abstractive summarization with,
so what they do is first have a middleman extractive summarization pipeline that uses things like tf-idf, or a cheating method that extracts exactly the paragraphs from the raw data that have the most overlap with the target summary, in order to filter the data and make it more tractable for abstractive summarization. In the paper they explore five different techniques for extractive summarization as the middleman between the raw data and the abstractive summarization task, but I chose to isolate these two because I think they are the most interesting for the sake of summarizing this paper.

Term frequency-inverse document frequency is basically weighting the number of times a word appears in this document against how many of the documents contain that word. So say you have a word like "tensorflow" and that's the query: you count how many times "tensorflow" appears in this document and multiply it by the inverse of the fraction of documents in which "tensorflow" appears. You use this in order to rank the paragraphs by similarity to the query; in this case the query is the title of the Wikipedia article.

The other thing that they show is cheating. Cheating is like an oracle for the extractive summarization task, where you have access to the final target paragraph, so you're computing the bigram overlap between each of the paragraphs that you're ranking and the target summary. Ideally, you'd imagine that paragraphs with high overlap with the output would give you a lot of key information for constructing that target paragraph.

After the extractive summarization methods, like tf-idf, the cheating method, or others like identity, you pass the result as the new input into the abstractive model. There are four different abstractive models they test: the sequence-to-sequence
LSTM with attention; the transformer encoder-decoder; the novel transformer decoder they present, which was later used in GPT-2 and other transformer models that choose to abandon the encoder part of the transformer; and a new transformer decoder with memory-compressed attention. The idea of the transformer decoder is to abandon the encoder half and feed the inputs directly into the decoder stack of the previous transformer architecture, then mask and do a language modeling task.

The way the abstractive summarization model is trained is that it has a sequence m_1 to m_n, which is the ranked order of the paragraphs from the extractive summarization model, tokenized and then truncated to a fixed length in order to fit into memory. Then you have a special separator token (delta), and then you have the output, which is the tokens of the original Wikipedia opening section. This gives the input-output labeled dataset for supervised learning of abstractive summarization. The way the language model works is that during training it predicts the input as well, autoregressively: it makes the first prediction, shifts the attention mask over by one, makes the next prediction, shifts the mask over by one again, and so on in order to train the model. So during training it's predicting the input, the separator token, and then the output, but later on, when the model is deployed, it's just given the input and the separator, and then predicts the output.

One of the trickiest problems with training these transformer language models is the bottleneck of the dot-product attention computation. When you multiply the query matrix by the transposed key matrix, you get a length-by-length matrix, which is really difficult for memory constraints. What they say in the paper is that with their 16 GB GPU they're able to handle a length of 4,000 tokens, whereas using this memory-compressed
attention they're able to get the input sequence length up to 11,000.

They present two different techniques for approximating full attention. The first is to take the key and value matrices and reduce their length by using a strided convolution. The strided convolution takes the sequence-length dimension of the key matrix down to some smaller number, and it does the same for the value matrix so that the dimensions line up for the matrix multiplication. Then local attention is the idea of splitting the input sequence: you take the first 256 tokens and send them into a multi-head attention layer, then you take the next 256 tokens and put them into a separate attention computation. So you split up the input sequence, attend within each block separately, and then merge the outputs back into one sequence.

These are examples of different models, with different ablations of their parameters, on the task of summarizing a Wikipedia article about a law firm. You see the ground truth, which is the output from the Wikipedia article that's used to train these models, and then some of the different summaries written by these models with different parameters. This is the transformer encoder-decoder attending over a sequence length of 100 tokens, this is just the decoder attending over 500 tokens, and this is the decoder with memory-compressed attention attending over 7,000 tokens. They also add a mixture-of-experts layer to add more model capacity, to see how the summary gets better with this model. It's also good to see that the qualitative summarization gets better as the automated metric of log perplexity goes down, which is a metric where lower is better.

These are some of the results from the ablation showing the effect of the different extraction methods and the different input corpora. The combined corpus
describes using the referenced articles in the Wikipedia article as well as the Google search results, citations-only is only the Wikipedia citations, and search-only is only the search results. You see the combined corpus has the best performance by far. With respect to the extraction, you see that the cheating method, where you compute bigram overlap between the paragraphs of the raw combined dataset and the final output (the opening section of the Wikipedia article), has the best result by a pretty wide margin on this metric. This implies that they could further improve the extractive step with a better algorithm than tf-idf.

This plot further shows the results of using an abstractive summarization model on top of the extractive summarization model. It shows the performance of two other extractive methods they use that I didn't describe in the video, TextRank and SumBasic, and you can see that they perform about the same as tf-idf. But you see how, when you use the abstractive summarization layer on top of the extractive summarization, you get better performance on the overall summary of the input documents.

This ablation further shows the performance differences between different parameters of the transformer and the LSTM sequence-to-sequence architecture. You see that the sequence-to-sequence model, the LSTM with encoder-decoder attention, has the highest perplexity and is heavily underperforming compared to the transformers. First you have the transformer encoder-decoder attending over a sequence length of 500, and that has a decent perplexity, but you can see further improvements from dropping the encoder part, which allows you to attend over a longer sequence as well, because you're cutting down the number of parameters in the model, and now you're attending over sequence lengths like 4,000 tokens. Then you see how you
introduce this approximation to full attention, and you get attention over 11,000 tokens. Then you add in the mixture-of-experts layer, which increases the memory bottleneck, so eventually, with 256 experts in the mixture-of-experts layer, you attend over a shorter sequence but achieve better performance as a result of the higher capacity.

This table shows the results of human evaluation of these summaries, comparing the decoder-only transformer with memory-compressed attention against using only the extractive summarization with the tf-idf algorithm, and against the LSTM encoder-decoder sequence-to-sequence model. This is evaluated on the dimensions of focus, grammar, non-redundancy, referential clarity, and structure and coherence, with the transformer with memory-compressed attention performing the best. You can also see a comparison of two models that have different automated scores and also different human evaluation results, which shows the correlation between the human evaluators and these automated metrics.

Thanks for watching this video on "Generating Wikipedia by Summarizing Long Sequences". This experiment explores supervised learning of transformers with a massive dataset. Their WikiSum dataset contains two million input-output pairs, in which the input is all of the articles referenced in the Wikipedia page as well as the top 10 search results for the topic, and the output is the original opening section of that Wikipedia article. This paper introduces a lot of other interesting ideas, like the decoder-only transformer and memory-compressed attention, with local attention and the approximation with strided convolutions. Thanks for watching, and please subscribe to Henry AI Labs for more deep learning and AI videos.
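The tf-idf ranking step described in the transcript can be illustrated with a minimal sketch. It assumes each paragraph is treated as its own document, uses a plain whitespace tokenizer, and uses a logarithmic inverse document frequency; the function names and the toy paragraphs are hypothetical and are not taken from the paper's code.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def rank_paragraphs_tfidf(query, paragraphs):
    """Return the paragraphs sorted by tf-idf score against the query, best first."""
    tokenized = [tokenize(p) for p in paragraphs]
    num_paragraphs = len(tokenized)
    query_words = set(tokenize(query))

    # For each query word, count how many paragraphs contain it.
    containing = {
        w: sum(1 for tokens in tokenized if w in tokens) for w in query_words
    }

    def score(tokens):
        counts = Counter(tokens)
        total = 0.0
        for w in query_words:
            if containing[w]:
                # Term count in this paragraph times log inverse document frequency.
                total += counts[w] * math.log(num_paragraphs / containing[w])
        return total

    order = sorted(
        range(num_paragraphs), key=lambda i: score(tokenized[i]), reverse=True
    )
    return [paragraphs[i] for i in order]


paragraphs = [
    "the weather was sunny all week",
    "tensorflow is a library and tensorflow runs on gpus",
    "a library for numerical work",
]
# The query plays the role of the Wikipedia article title.
ranked = rank_paragraphs_tfidf("tensorflow library", paragraphs)
print(ranked)
```

A word that appears in every paragraph contributes nothing to the score, which is the point of the inverse document frequency term.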
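The decoder-only training setup from the transcript (ranked paragraphs, then a separator token, then the target lead section, all predicted left to right) might look roughly like this. The `<sep>` string, the whitespace tokenization, and the helper names are assumptions made for illustration only.

```python
SEPARATOR = "<sep>"  # hypothetical stand-in for the special separator token


def build_sequence(ranked_paragraphs, target, max_input_tokens):
    """Concatenate ranked paragraphs, truncate them, then append separator and target."""
    input_tokens = []
    for paragraph in ranked_paragraphs:
        input_tokens.extend(paragraph.split())
    # Truncate only the input side so the whole target is kept.
    input_tokens = input_tokens[:max_input_tokens]
    return input_tokens + [SEPARATOR] + target.split()


def next_token_pairs(sequence):
    """List (context, next token) pairs, one per position, as in autoregressive training."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


seq = build_sequence(["a b c", "d e"], "x y", max_input_tokens=4)
pairs = next_token_pairs(seq)
for context, target_token in pairs:
    print(context, "->", target_token)
```

During training every pair contributes to the loss, including positions inside the input. At inference time the model would be given everything up to and including the separator and asked to continue from there.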
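The two attention approximations can be sketched in plain Python to show what changes in shape. This is a toy under loud assumptions: average pooling stands in for the learned strided convolution, the same compression is applied to keys and values, there is a single head, and no causal mask is applied. The stride of 3 and the block size of 4 are illustrative values chosen here.

```python
import math


def compress(vectors, stride=3):
    """Stand-in for a strided convolution: average each window of `stride` vectors."""
    compressed = []
    for start in range(0, len(vectors), stride):
        window = vectors[start:start + stride]
        dim = len(window[0])
        compressed.append(
            [sum(v[d] for v in window) / len(window) for d in range(dim)]
        )
    return compressed


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no masking)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append(
            [
                sum(w * v[d] for w, v in zip(weights, values)) / total
                for d in range(len(values[0]))
            ]
        )
    return outputs


def local_attention(vectors, block_size):
    """Attend within each contiguous block, then concatenate the block outputs."""
    outputs = []
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        outputs.extend(attention(block, block, block))
    return outputs


def memory_compressed_attention(vectors, stride=3):
    """Every position queries a shortened set of keys and values."""
    memory = compress(vectors, stride)
    return attention(vectors, memory, memory)


tokens = [[float(i), 1.0] for i in range(12)]
local_out = local_attention(tokens, block_size=4)
compressed_out = memory_compressed_attention(tokens, stride=3)
print(len(tokens), len(compress(tokens, 3)), len(local_out), len(compressed_out))
```

In both cases the output keeps one vector per input position. What shrinks is the number of query-key scores computed along the way.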

Original Description

This video explores the paper "Generating Wikipedia by Summarizing Long Sequences". Natural Language Processing models that can generate summaries of source documents on a single topic such as "Generative Adversarial Networks" or "Reinforcement Learning" are one of the NLP applications that I find the most interesting! This paper is frequently cited for introducing the Transformer decoder architecture, but there are a lot more interesting details in this paper. I also think the approximations to full attention proposed in this paper are really interesting! Paper Link: https://arxiv.org/pdf/1801.10198.pdf Thanks for watching! Please Subscribe!

This video explains how to generate summaries of source documents using a transformer decoder architecture with a novel approximation to full attention, the memory-compressed attention layer. The paper also constructs a dataset called WikiSum and uses a tf-idf extractive algorithm ahead of a transformer abstractive summarization model. By watching this video, viewers can learn how abstractive summarization models are built on a transformer decoder architecture.

Key Takeaways

Use a tf-idf extractive algorithm to filter the raw input data
Build a transformer decoder architecture with a novel approximation to full attention
Introduce a memory-compressed attention layer
Construct a dataset called WikiSum
Use a transformer abstractive summarization model on top of the extracted paragraphs
Train the model using supervised learning on input-output pairs
Use the ROUGE-1 recall score to measure how much of the output appears in the input
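The ROUGE-1 recall measure mentioned in the last takeaway can be sketched as unigram overlap with clipped counts. This is a simplified illustration with whitespace tokens and no stemming; the function name and example strings are hypothetical.

```python
from collections import Counter


def rouge1_recall(reference_tokens, candidate_tokens):
    """Fraction of reference unigrams that also occur in the candidate (clipped counts)."""
    reference = Counter(reference_tokens)
    candidate = Counter(candidate_tokens)
    overlap = sum(min(count, candidate[word]) for word, count in reference.items())
    return overlap / max(1, sum(reference.values()))


# Here the summary (output) is the reference and the source text (input) is the candidate,
# matching the "how much of the output is contained in the input" reading.
output_tokens = "the cat sat".split()
input_tokens = "the cat is on the mat".split()
recall = rouge1_recall(output_tokens, input_tokens)
print(recall)
```

A low value means the summary uses many words that never appear in the source, which is why it signals a harder, more abstractive task.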

💡 The use of a memory compressed attention layer allows the model to handle sequences up to 11,000 tokens, making it possible to generate summaries of long documents.
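A rough way to see why the approximations help is to count the query-key scores computed per attention head. The sketch below is back-of-the-envelope arithmetic under assumptions: the stride of 3 is an illustrative choice, the block size of 256 comes from the transcript, and actual memory use depends on many other factors (heads, layers, activations, number format).

```python
import math


def full_attention_scores(length):
    """Every position scores against every position."""
    return length * length


def compressed_attention_scores(length, stride):
    """Every position scores against a key sequence shortened by `stride`."""
    return length * math.ceil(length / stride)


def local_attention_scores(length, block_size):
    """Positions only score against others in the same contiguous block."""
    full_blocks, remainder = divmod(length, block_size)
    return full_blocks * block_size ** 2 + remainder ** 2


for length in (4000, 11000):
    print(
        length,
        full_attention_scores(length),
        compressed_attention_scores(length, stride=3),
        local_attention_scores(length, block_size=256),
    )
```

Full attention grows quadratically with sequence length, compression divides that cost by roughly the stride, and local attention grows only linearly once the block size is fixed.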
