Deep Learning for Symbolic Mathematics!? | Paper EXPLAINED

AI Coffee Break with Letitia · Advanced · 📄 Research Papers Explained · 5y ago

Skills: Research Methods 90% · Reading ML Papers 80% · ML Maths Basics 70% · Supervised Learning 60%

Key Takeaways

The video explains a research paper that uses deep neural networks to solve symbolic mathematics problems such as integration and ODEs. A transformer is trained on synthetically generated pairs of problems and solutions, and on the paper's test set it solves around 99 percent of the integration examples, ahead of the computer algebra systems Mathematica, Maple, and MATLAB.

Full Transcript

Hello! Ms. Coffee Bean and her team of archaeologists discovered an ancient machine learning paper, and the finding is interesting for three reasons. Firstly, the paper solves a part of symbolic mathematics with deep neural networks, surpassing even the mighty Mathematica on the test set. Secondly, it is not really self-supervised learning, and yet the amount of training data for this task is virtually infinite. Stick around to see why. And thirdly, it is one of Ms. Coffee Bean's favorite papers; I think she has a weak spot for mathematics. So welcome to this AI Coffee Break, where you will find out how artificial intelligence tackles symbolic mathematics.

Deep learning can do symbolic mathematics, or at least a part of it, as far as this paper goes. But why should anybody care about this? Well, you should be hyped for at least two reasons. First, mathematics follows rigorous rules and is exact. Perhaps you remember that e to the power of x, integrated, is again e to the power of x plus a constant, and nothing else would be the answer here. But deep learning with its neural networks is clearly statistical, which can be seen as the opposite of rule-based systems, where you can prove that you arrive at the right answer 100 percent of the time.

One more reason to care about neural networks solving symbolic mathematics is exactly the symbolic nature of this problem. Neural networks have had huge success in domains where representations are continuous in the first place, like in computer vision, where pixels take continuous values: dark spots take low values while bright pixels have large values. But while neural networks perform well on continuous problems, they still struggle with symbolic reasoning, that is, with exactly the kind of mathematical equations this paper deals with. These come in forms like x plus y, which could stand for any two numbers being added up. While a numerical expression is continuous, as there are many other numbers between two and three, x and y are the opposite of continuous. They are symbols that can stand for numbers or even for other symbols, and there is nothing in between the symbol x and the symbol y, so to say. The only relationship defined here is the one stated in the equation, but that's about it.

So far, neural networks have been applied to numeric tasks, where the network gets two numbers as input and the output is again a number, like adding or multiplying two numbers, and there they perform quite poorly. But this paper considers the symbolic type of calculation, where the input is an equation describing the relationship between symbols and the output is again in symbolic form, this time representing the solution. If you remember from mathematics classes, depicted here is integration. Another set of symbolic mathematics problems are first- and second-order ordinary differential equations, or ODEs for short. Ms. Coffee Bean thinks that it is here, in the choice of problem, that the genius of the paper is condensed. For this video we will stick with integration, since we want to stay at the beginner level and ODEs are renowned for scaring people. For everyone interested in ODEs, check out the paper linked in the video description. Disclaimer: it works very similarly to integration.

So I think we agree that both integration and ODEs are hard for humans, and now we want to solve symbolic integration with neural networks. But neural networks require training data, so pairs of functions and their integrals. Where do we get that from? We could generate some functions randomly and use systems like MATLAB and Mathematica to compute the integral. But these systems are not very fast and sometimes fail even for integrable functions. So forward generation is a good start for some training data, but it won't get us far, and we won't be better than Mathematica.

Therefore, a little out-of-the-box thinking is required. Integration is hard, but its inverse operation is differentiation, which is fast and easy to compute by following simple rules. So let's use this insight and, in addition to forward generation, use something else: generate random functions on the output side instead of the input side, and then compute their derivatives. These function pairs can then be added to the training pool. This is cool, it's called backward generation, and it can generate a lot of hard examples. But it has the following problem: it is very unlikely to randomly generate very complicated functions in the first place, and even more unlikely to have them simplify to a very simple function in the derivative. It is important to have these long-short combinations too, otherwise the network could become biased toward long equations.

To address this, the authors use another clever trick. Two random functions F and G are generated, their derivatives, small f and small g, are computed, and these pairs are added to the training set. But now, with the rule of integration by parts, one can also add combinations of these functions to the training set, using the identity that the integral of F times small g equals F times G minus the integral of small f times G. F times G is already known, because both functions are already in the training set; only the second term is unknown. We know small f and G, the integral of small g, but we do not know the integral of small f times G. But with enough luck, it might already be in the training set, if this already contains a lot of stuff generated with forward and backward generation beforehand. So every time the authors discover something like this in the training set, they generate a new data point by backward generation with integration by parts.

But Ms. Coffee Bean, wait a little. The authors now have three means of generating as much training data as they want: forward generation, backward generation, and backward generation with integration by parts. But all of this relies on randomly generating functions, which are then either integrated or differentiated. So three questions remain: how to generate functions randomly, how to feed them into a neural network, and what is this mysterious neural network anyway? Okay, we will answer these questions one at a time.

To randomly generate functions, we first need a way to represent mathematical expressions. The choice here falls on trees, where operators and functions are internal nodes, the operands are children, and constants and variables are leaves. The choice falls on trees for lots of reasons: trees represent the order in which operations are executed, associativity is clear, and parentheses become obsolete. And because of this, they are very easy to grow. On the other hand, expanding expressions in the sequential representation is very clear at both ends, but adding something in between might get a little messy. So yeah, if one wants to grow an expression randomly, one just has to randomly select functions, operators, and operands, and check that unary operators like sine or exponential get only one child, while binary operators like addition or multiplication get two children.

Then on to the second question: how to process mathematical expressions, now in the form of trees, with a neural network? Well, we don't process the trees. Wait, what? The authors linearize the trees into a sequence again by using prefix notation, where each node is written before its children, from left to right, like in this example. The advantage over the common way of writing mathematical expressions is that prefix notation does not require any parentheses, which would make the life of the neural network harder by having to remember to close them after they have been opened.

So, no trees. Neural networks working on trees were very en vogue in natural language processing (see Tree-LSTMs), but now NLP is dominated by transformers. And guess what: the neural network employed to solve symbolic integration and ODEs is a transformer. You did not expect that to happen, right? Transformers are now everywhere. Yeah, so the authors use a transformer architecture to translate from the input sequence to the output sequence, and with the awesome data generation procedure that can generate virtually infinite amounts of data, there is enough for the transformer to learn the problem. It does not really matter that symbolic integration is hard, I think we all agree here, because there is enough data, which in this case turns out to be 20 million examples. Yeah, they generated 20 million data samples, just like that.

This is why Ms. Coffee Bean thinks that this is genius. Take this equation: you want to maximize research results. You know transformers can do anything, so of course you use them. You have your problem, but you realize its limited data is too little for the transformer to learn something meaningful. But you still want to do good research, so you choose an application where you have virtually infinite amounts of data, and the recipe for maximizing success is set. Well, don't take Ms. Coffee Bean so seriously, but it seems like this is kind of the recipe for success nowadays, and this is partially what happens in self-supervised learning, because there you have almost infinite amounts of training data.

So yeah, in any case, the 20 million samples that the authors have here seem to be very effective, because their transformer network is able to solve the integration examples with around 99 percent accuracy in less than a second, for example, while Mathematica, given 30 seconds of compute time, can only solve 84 percent, Maple 67 percent, and MATLAB only 65 percent. But what about the remaining one percent in the integration performance of the transformer? Well, we remember that we are in the statistical world of neural networks. So until we meet you next time, watch out for outliers if you wander out alone in this world. And hey, do not forget to like and subscribe!
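
The tree representation and prefix linearization described in the transcript can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the operator lists, the nested-tuple encoding, and the names `random_tree`, `to_prefix`, and `from_prefix` are assumptions made for this example, and the naive sampler does not reproduce the paper's generation procedure.

```python
import random

# Hypothetical operator tables for illustration; not the paper's actual set.
UNARY = ["sin", "cos", "exp"]
BINARY = ["add", "mul"]
LEAVES = ["x", "1", "2", "3"]


def random_tree(depth, rng):
    """Grow a random expression tree: (op, child, ...) tuples with leaf strings."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES)
    if rng.random() < 0.5:
        # Unary operators get exactly one child.
        return (rng.choice(UNARY), random_tree(depth - 1, rng))
    # Binary operators get exactly two children.
    return (rng.choice(BINARY),
            random_tree(depth - 1, rng),
            random_tree(depth - 1, rng))


def to_prefix(tree):
    """Linearize a tree: each node is written before its children, left to right."""
    if isinstance(tree, str):
        return [tree]
    tokens = [tree[0]]
    for child in tree[1:]:
        tokens.extend(to_prefix(child))
    return tokens


def from_prefix(tokens):
    """Rebuild the tree from prefix tokens; operator arity alone fixes the structure."""
    def parse(pos):
        tok = tokens[pos]
        if tok in UNARY:
            child, pos = parse(pos + 1)
            return (tok, child), pos
        if tok in BINARY:
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return (tok, left, right), pos
        return tok, pos + 1

    tree, end = parse(0)
    assert end == len(tokens), "trailing tokens"
    return tree


# 2 + 3 * sin(x) as a tree, then as a parenthesis-free token sequence.
example = ("add", "2", ("mul", "3", ("sin", "x")))
print(" ".join(to_prefix(example)))  # add 2 mul 3 sin x
```

Because every operator has a fixed number of children, the token sequence can be parsed back into the same tree without any parentheses, which is the property the video highlights.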

Original Description

"Neural Nets are inexact beasts that will never solve exact problems", right? Wrong. Ms. Coffee Bean explains, draws and animates how neural networks can solve symbolic mathematics problems, e.g. integration, ODEs. It can even tackle integrals that Mathematica fails to solve. Do not worry, Mathematica, you are still awesome! Amazing work by Guillaume Lample and François Charton @AIatMeta.

➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak

📄 Lample, Guillaume, and François Charton. "Deep learning for symbolic mathematics." arXiv preprint arXiv:1912.01412 (2019). https://arxiv.org/pdf/1912.01412.pdf
📺 Ms. Coffee Bean explains the Transformer: https://youtu.be/FWFA4DGuzSc

Outline:
* 00:00 Neural networks integrate and solve ODEs. So what?
* 03:55 Generating training data
* 06:42 Representing and generating random functions
* 07:56 Symbolic equations with neural nets

Music 🎵: Pretty Boy by DJ Freedem

🔗 Links:
YouTube: https://www.youtube.com/AICoffeeBreak
Twitter: https://twitter.com/AICoffeeBreak
Reddit: https://www.reddit.com/r/AICoffeeBreak/

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research #mathematics

The video explains how deep neural networks can solve symbolic mathematics problems, such as integration and ODEs, with high accuracy, outperforming traditional computer algebra systems on the paper's test set. The paper trains a transformer on a very large, synthetically generated dataset to achieve these results. Viewers can learn how the training data is generated and fed to the network, and understand the limitations and potential of deep learning for symbolic mathematics.

Key Takeaways

Generate random functions on the output side instead of the input side and compute their derivatives
Use integration by parts to add combinations of the functions F and G to the training set
Linearize trees to a sequence using prefix notation
Grow expressions by randomly selecting functions, operators, and operands
Use a transformer architecture for symbolic integration and ODEs
Generate a large training set (20 million examples) with forward generation, backward generation, and integration by parts
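
The integration-by-parts takeaway above rests on a single identity, written out here as a worked equation. The notation is an assumption chosen to match the transcript's "small f" and "small g": F and G are the randomly generated functions, and f and g are their derivatives.

```latex
\int F(x)\, g(x)\, dx \;=\; F(x)\, G(x) \;-\; \int f(x)\, G(x)\, dx,
\qquad f = F', \quad g = G'.

% Worked instance with F(x) = x and G(x) = e^x, so f(x) = 1 and g(x) = e^x:
\int x\, e^{x}\, dx \;=\; x\, e^{x} - \int e^{x}\, dx \;=\; x\, e^{x} - e^{x} + C.
```

If the integral of f times G is already in the training set, the right-hand side is fully known, so F times g together with its integral can be added as a new example without running any integrator.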

💡 Deep neural networks can be used to solve symbolic mathematics problems with high accuracy and outperform traditional computer algebra systems, but require careful design of the training data and architecture.
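
As one illustration of that training-data design, backward generation might look as follows: sample a function F, differentiate it with simple rules, and store the derivative as the input and F as the target. Everything here (the tuple encoding, the small operator set, and the helpers `random_function`, `diff`, `simplify`, `evaluate`, and `backward_pair`) is a hypothetical sketch; the paper's generator and simplification steps are more elaborate.

```python
import math
import random

# Hypothetical operator set for illustration only.
UNARY = ["sin", "cos", "exp"]
BINARY = ["add", "mul"]
LEAVES = ["x", "1", "2", "3"]


def random_function(depth, rng):
    """Naive sampler: nested tuples (op, child, ...) with leaf strings."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES)
    if rng.random() < 0.5:
        return (rng.choice(UNARY), random_function(depth - 1, rng))
    return (rng.choice(BINARY),
            random_function(depth - 1, rng),
            random_function(depth - 1, rng))


def diff(t):
    """Differentiate expression tree t with respect to x."""
    if t == "x":
        return "1"
    if isinstance(t, str):  # numeric constant
        return "0"
    op = t[0]
    if op == "add":
        return ("add", diff(t[1]), diff(t[2]))
    if op == "mul":  # product rule
        return ("add", ("mul", diff(t[1]), t[2]), ("mul", t[1], diff(t[2])))
    if op == "sin":  # chain rule
        return ("mul", ("cos", t[1]), diff(t[1]))
    if op == "cos":
        return ("mul", ("mul", "-1", ("sin", t[1])), diff(t[1]))
    if op == "exp":
        return ("mul", t, diff(t[1]))
    raise ValueError(f"unknown operator: {op}")


def simplify(t):
    """Tiny clean-up: fold u*0, u*1 and u+0. A real pipeline would use a CAS."""
    if isinstance(t, str):
        return t
    op, args = t[0], [simplify(a) for a in t[1:]]
    if op == "mul":
        if "0" in args:
            return "0"
        if args[0] == "1":
            return args[1]
        if args[1] == "1":
            return args[0]
    if op == "add":
        if args[0] == "0":
            return args[1]
        if args[1] == "0":
            return args[0]
    return (op, *args)


def evaluate(t, x):
    """Numerically evaluate a tree at x, used only to sanity-check derivatives."""
    if t == "x":
        return x
    if isinstance(t, str):
        return float(t)
    op, args = t[0], [evaluate(a, x) for a in t[1:]]
    if op == "add":
        return args[0] + args[1]
    if op == "mul":
        return args[0] * args[1]
    return {"sin": math.sin, "cos": math.cos, "exp": math.exp}[op](args[0])


def backward_pair(rng, depth=3):
    """Sample F, differentiate it, and return (network input F', target F)."""
    F = random_function(depth, rng)
    return simplify(diff(F)), F


F = ("mul", "x", ("sin", "x"))  # F(x) = x * sin(x)
print(simplify(diff(F)))
# ('add', ('sin', 'x'), ('mul', 'x', ('cos', 'x')))
```

The pair is correct by construction, since differentiation is purely rule-based; no integrator is ever called, which is what makes this direction cheap.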
