Deep Learning for Symbolic Mathematics!? | Paper EXPLAINED
Key Takeaways
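The video explains a research paper (Lample and Charton, 2019) that trains a transformer to solve symbolic mathematics problems, such as integration and ODEs, as a sequence-to-sequence task. Training data is generated automatically in virtually unlimited amounts (about 20 million examples), although the setup is not really self-supervised learning. On the test set the model reaches around 99 percent accuracy on integration, compared with 84 percent for Mathematica, 67 percent for Maple, and 65 percent for MATLAB.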
Full Transcript
Hello! Ms. Coffee Bean and her team of archaeologists discovered an ancient machine learning paper, and the finding is interesting for three reasons. Firstly, the paper solves a part of symbolic mathematics with deep neural networks, surpassing even the mighty Mathematica on the test set. Secondly, it is not really self-supervised learning, and yet the amount of training data for this task is virtually infinite. Stick around to see why. And thirdly, it is one of Ms. Coffee Bean's favorite papers. I think she has a weak spot for mathematics. So welcome to this AI Coffee Break, where you will find out how artificial intelligence tackles symbolic mathematics.

Deep learning can do symbolic mathematics, or at least a part of it, as far as this paper goes. But why should anybody care about this? Well, you should be hyped for at least two reasons. First, mathematics follows rigorous rules and is exact. Perhaps you remember that e to the power of x, integrated, is again e to the power of x plus a constant, and nothing else would be the answer here. But deep learning with its neural networks is clearly statistical, which can be seen as the opposite of rule-based systems, where you could prove that you arrive at the right answer 100 percent of the time.

One more reason to care about neural networks solving symbolic mathematics is exactly the symbolic nature of this problem. Neural networks have had huge success in domains where representations are continuous in the first place, like in computer vision, where pixels take continuous values: dark spots take low values while bright pixels have large values. But while neural networks perform well on continuous problems, they are still struggling with symbolic reasoning, so with exactly the kind of mathematical equations this paper deals with, which come in forms like x plus y, which could stand for any two numbers being added up. While the numerical expression is continuous, as there are several other numbers between two and three, x and y are the opposite of continuous. They are symbols that can stand for numbers or even for other symbols, and there is nothing in between the symbol x and the symbol y, so to say. The only relationship defined here is stated in the equation, but that's about it.

Neural networks so far have been applied to numeric tasks, where the network gets two numbers as input and the output is again a number, like adding or multiplying two numbers, and there they perform quite poorly. But this paper considers the symbolic type of calculation, where the input is an equation describing the relationship of symbols and the output is again in symbolic form, but this time representing the solution. If you remember from mathematics classes, depicted here is integration. Another set of symbolic mathematics problems are first- and second-order ordinary differential equations, or ODEs for short. Ms. Coffee Bean thinks that it is here, in the problem choice, where the genius of the paper is condensed. For this video we will stick with integration, since we want to stay at the beginner level and ODEs are renowned for scaring people. For everyone interested in ODEs, check out the paper linked in the video description. Disclaimer: it works very similarly to integration.

So we think we agree that both integration and ODEs are hard for humans, and now we want to solve symbolic integration with neural networks. But neural networks require training data, so pairs of functions and their integrals. Where to get that from? We could generate some functions randomly and take systems like MATLAB and Mathematica to compute the integral. But these systems are not very fast and sometimes fail even for integrable functions. So forward generation is a good start for some training data, but it won't get us far, and we won't be better than Mathematica.

Therefore, a little out-of-the-box thinking is required. Integration is hard, but the backward operation to it is differentiation, which is fast and easy to compute following simple rules. So let's use this lesson and, in addition to forward generation, use something else: generate random functions on the output side instead of the input side, and then compute their derivatives. These function pairs can then be added to the training pool. This is cool, and it's called backward generation. It can generate a lot of hard examples, but it has the following problem: it is very unlikely to randomly generate very complicated examples in the first place, and even more unlikely to have them simplify to a very simple function in the derivative. It is important to have these long-short combinations too, otherwise the network could become biased towards long equations.

To address this, the authors use another clever trick. Two random functions, capital F and capital G, are generated, their derivatives, small f and small g, are computed, and these pairs are added to the training set. But now, with the rule of integration by parts, one can also add to the training set combinations of these functions, like this. F times G is already known, because both functions are already in the training set; only the second term is unknown. We know small f and the integral of small g, but we do not know the integral of small f times capital G. But with enough luck, it might already be in the training set, if this already contains a lot of stuff generated with forward and backward generation beforehand. So every time the authors discover something like this in the training set, they generate a new data point by backward generation with integration by parts.

But Ms. Coffee Bean, wait a little. The authors now have three means of generating as much training data as they want: forward generation, backward generation, and backward generation with integration by parts. But all of this relies on randomly generating functions, which are then either integrated or differentiated. So three questions remain: how to generate functions randomly, how to feed them into a neural network, and what is this mysterious neural network anyway? Okay, we will answer these questions one at a time.

To randomly generate functions, we first need a way to represent mathematical expressions. The choice here falls on trees, where operators and functions are internal nodes, the operands are children, and constants and variables are leaves. The choice falls on trees for lots of reasons: trees represent the order in which operations are executed, associativity is clear, and parentheses become obsolete. And because of this, they are very easy to grow. On the other hand, expanding expressions in the sequential representation is very clear at both ends, but adding something in between might get a little messy. So yeah, if one wants to grow an expression randomly, one just has to select functions, operators, and operands at random, and check that unary operators like sines or exponentials get only one child, while binary operators like addition or multiplication get two children.

Then on to the second question: how to process mathematical expressions, now in the form of trees, with a neural network? Well, we don't process the trees. Wait, what? The authors linearize the trees to a sequence again by using prefix notation, where each node is written before its children, from left to right, like in this example. The advantage over the common way of writing mathematical expressions is that prefix notation does not require any parentheses, which would make the life of the neural network harder by having to remember to close them after they have been opened.

So no trees. Neural networks working on trees were very much in vogue in natural language processing (see Tree-LSTMs), but now NLP is dominated by transformers. And guess what: the neural network employed to solve symbolic integration and ODEs is a transformer. You did not expect that to happen, right? Transformers are now everywhere. Yeah, so the authors use a transformer architecture to translate from the input sequence to the output sequence. And with the awesome data generation procedure that can generate virtually infinite amounts of data, there is enough for the transformer to learn the problem. It does not really matter that symbolic integration is hard (I think we all agree here), because there is enough data, which in this case turns out to be 20 million examples. Yeah, they generated 20 million data samples, just like that.

This is why Ms. Coffee Bean thinks that this is genius. Take this equation: you want to maximize research results. You know transformers can do anything, so of course you use them. You have your problem, but you realize its limited data is too little for the transformer to learn something meaningful. But you still want to do good research, so you choose an application where you have virtually infinite amounts of data, and the recipe for maximizing success is set. Well, don't take Ms. Coffee Bean so seriously, but it seems like this is kind of the recipe for success nowadays, and this is partially what happens in self-supervised learning, because there you have almost infinite amounts of training data.

So yeah, in any case, the 20 million samples that the authors have here seem to be very effective, because their transformer network is able to solve the integration examples with around 99 percent accuracy in less than a second, for example, while Mathematica, given 30 seconds of compute time, can only solve 84 percent, Maple 67 percent, and MATLAB only 65 percent. But what about the remaining one percent in the integration performance of the transformer? Well, we remember that we are in the statistical world of neural networks. So until we meet next time, watch out for outliers if you wander out alone in this world. And hey, do not forget to like and subscribe!
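As a rough illustration of the tree representation, the prefix notation, and the backward generation discussed in the transcript, here is a minimal Python sketch. The operator lists, the nested-tuple encoding, and the function names random_expr, to_prefix, diff, and backward_pair are assumptions made for this example and are not taken from the paper's code. The derivative is left unsimplified, which a real pipeline would presumably handle with a computer algebra library.

```python
import random

# Hypothetical vocabulary for this sketch (not the paper's actual operator set).
UNARY = ["sin", "cos", "exp"]
BINARY = ["add", "mul"]
LEAVES = ["x", "1", "2", "3"]


def random_expr(depth, rng):
    """Grow a random expression tree as nested tuples: (op, child, ...)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES)  # constants and variables are leaves
    if rng.random() < 0.5:
        # unary operators get exactly one child
        return (rng.choice(UNARY), random_expr(depth - 1, rng))
    # binary operators get exactly two children
    return (rng.choice(BINARY),
            random_expr(depth - 1, rng),
            random_expr(depth - 1, rng))


def to_prefix(expr):
    """Linearize a tree: each node is written before its children, left to right."""
    if isinstance(expr, str):
        return [expr]
    tokens = [expr[0]]
    for child in expr[1:]:
        tokens.extend(to_prefix(child))
    return tokens


def diff(expr):
    """Differentiate with respect to x using the standard rules (no simplification)."""
    if isinstance(expr, str):
        return "1" if expr == "x" else "0"
    op = expr[0]
    if op == "add":
        return ("add", diff(expr[1]), diff(expr[2]))
    if op == "mul":  # product rule
        return ("add",
                ("mul", diff(expr[1]), expr[2]),
                ("mul", expr[1], diff(expr[2])))
    inner = expr[1]
    if op == "sin":
        outer = ("cos", inner)
    elif op == "cos":
        outer = ("mul", "-1", ("sin", inner))
    elif op == "exp":
        outer = ("exp", inner)
    else:
        raise ValueError(f"unknown operator: {op}")
    return ("mul", outer, diff(inner))  # chain rule


def backward_pair(depth, rng):
    """Backward generation: sample F, differentiate it, return (input f, target F)."""
    big_f = random_expr(depth, rng)
    return to_prefix(diff(big_f)), to_prefix(big_f)
```

For instance, `to_prefix(("add", "x", ("sin", "x")))` yields the tokens `add x sin x`, a sequence a sequence-to-sequence model could consume without any parentheses. Calling `backward_pair(3, random.Random(0))` returns one training pair in which the derivative is the model input and the sampled function is the target.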
Original Description
"Neural Nets are inexact beasts that will never solve exact problems", right? Wrong. Ms. Coffee Bean explains, draws and animates how neural networks can solve symbolic mathematics problems, e.g. integration, ODEs. It can even tackle integrals that Mathematica fails to solve. Do not worry, Mathematica, you are still awesome!
Amazing work by Guillaume Lample and François Charton @AIatMeta .
➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
📄 Lample, Guillaume, and François Charton. "Deep learning for symbolic mathematics." arXiv preprint arXiv:1912.01412 (2019). https://arxiv.org/pdf/1912.01412.pdf
📺 Ms. Coffee Bean explains the Transformer: https://youtu.be/FWFA4DGuzSc
Outline:
* 00:00 Neural networks integrate and solve ODEs. So what?
* 03:55 Generating training data
* 06:42 Representing and generating random functions
* 07:56 Symbolic equations with neural nets
Music 🎵 : Pretty Boy by DJ Freedem
-----------------
🔗 Links:
YouTube: https://www.youtube.com/AICoffeeBreak
Twitter: https://twitter.com/AICoffeeBreak
Reddit: https://www.reddit.com/r/AICoffeeBreak/
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research #mathematics