Why Multimodal Machine Learning models do not work. Part 2/2 – The CAUSES

AI Coffee Break with Letitia · Advanced ·📐 ML Fundamentals ·5y ago

Key Takeaways

The video discusses the reasons behind the problems in integrating vision and language with deep learning methods, specifically in multimodal machine learning models, highlighting issues with data biases and neural network architectures.

Full Transcript

[Music]

Disclaimer: this video highlights the not-so-favorable aspects of integrating vision and language. For information on risks and side effects, please read your favorite publication and ask your machine learning researcher or engineer.

Hey there! Did you watch Ms. Coffee Bean's previous video? If not, check it out right now and then come back to this video. Last time, she told us about the symptoms that are a clear hint that something fishy is going on when integrating vision and language with deep learning methods. It seems that one modality is usually almost forgotten, even though the tasks are defined to require the understanding of both modalities to deliver the right answer. In this video, Ms. Coffee Bean is going to tell us what she found further on her journey through the dark paper forest.

Ms. Coffee Bean, what are the reasons for problems in vision and language fusion?

On my journey, I encountered two sources of all evil. Some say the source of problems when integrating vision and language is in the data, while others say it is in the neural-network-based models.

Let's start with the accusations against the data. The VQA dataset was released in 2015, and a lot of work was invested by researchers to get on top of the challenge leaderboard. Then, in 2017, a version 2 of the VQA dataset had to be released, since obvious biases in the data were detected. These biases were making the undesired happen: the model could focus only on the text, ignoring the image. It's in the title: "Making the V in VQA matter". For example, far too many "how many" questions have the answer "two". If the question is about sport, then usually the correct answer is "tennis". Also, questions like "Is there a [specific something] in the image?" usually have the answer "yes", since humans contributing to the dataset collection usually do not ask if there is a horse in the image if there is none to be seen. How would they even get the idea of a horse in the first place?

So these obvious biases were removed, but there are suspicions that neural networks can also rely on not-so-obvious biases, and many models still seem to ignore the image far too much.

What holds for VQA usually holds for visual dialogue too, as visual dialogue is, simply put, VQA with history. This paper shows a very simple statistical method, not even a neural network, that works without accessing the image or the sequence of the dialogue, and this method performs on par, on some metrics, with extremely complex neural models. Of course, the authors of the original visual dialogue papers did not take the critique well. If you want to read how researchers fight, check the links to these papers and the response in the description below. But whatever side you are on, I think that one must take this alarm seriously. It is always tricky to produce clean training data, because the fingerprints of the data collectors are everywhere on it, and neural networks are extremely good at exploiting every hint and bias to get to the right answer, even if for the wrong reasons.

But apart from neural networks exploiting bias in the data, there seems to be something fundamentally wrong with neural networks trying to fuse modalities. So it is time for us to discuss the problems that models might inherently have. We think this question is still very, very poorly investigated, but we have one piece of research to show you.

This paper conducts a very simple yet fundamental experiment: given an image that can be described by only one word, a neural network has to construct a word that describes the image. This is not a classification task, where the network has to choose the right index of the right class to which we humans have assigned the labels. It is rather a regression task, where the model has to come up with the word embedding that best fits the picture. Simply put, they try to translate images to single words rather than to a whole sentence, like in image captioning.

What do we expect? It should be an easy task, right? Image captioning is a much harder task and neural networks can do it, so why should it not be possible to translate an image to a single word? And you're right, it is possible. The neural network performs with an impressive accuracy. Only that the paper looks a little bit closer: it also measures how far the neural network can transform the input space to make it look like the output space. And here the results are surprising. It turns out that the neural network is really bad at that. It stays much closer to the input modality, which is the image. Even though the network is trained to fully transform the image space to the textual word embedding space, it preserves the neighborhoods of the input space rather than constructing the output space. Why? Because neural networks guarantee continuity and preserve topology.

We think that there is still very much to investigate along this line, especially the case when both the visual and the textual modality are mapped by a neural network into a common space, because this was only a special case where only the image was transformed. But even this special-case analysis is a strong alarm signal when talking about neural networks in multimodal research. As they are universal approximators given infinite training time and data, we expect them to be the solution to all our multimodal problems. But perhaps we expect too much from them in our real setting, with not so much training data or training time. Like in the example seen before: the neural network was able to transform single data points from the image to the textual modality, but was not able to sensibly transform the space in between. With infinitely many training samples, this would not have been a problem, since the network would have had a training sample for every point in the space. But even if we had these infinite resources, we have seen how problematic data and biases can be when integrating vision and language, and that we can introduce multimodal fusion difficulties when choosing a model or architecture.

We could end our video here, since this is what Ms. Coffee Bean found on her search for hints and reasons on why current vision and language models fail where they fail. But we want to spend a minute on multimodal models that are coming out right now, beating the previous state of the art, but that have not yet been more thoroughly investigated for problems. Yes, we are talking about the multimodal transformers, which are all variations on the same theme of processing images and text and combining them with transformer modules and cross-modal attention. For more details on how this architecture works, check out our previous video on ViLBERT, link below.

Multimodal transformers show, through their better accuracy on multimodal tasks and through attention visualization, that they are better at combining the visual and textual modalities. But still, a recent publication suggests that for multimodal transformers it seems again that, quote, "the textual modality plays a more important role than image in making final decisions". So here we are again at the same problem that existed even before multimodal transformers: the inequality of modalities and the curse of datasets biased towards the textual modality.

So what do we do now? More research. The relevant papers for this video are in the description below, if you are interested in investigating the root of the problem yourself. This video is, of course, not an exhaustive literature enumeration of all problems in multimodal integration. It's rather just a hint of where to look when searching for trouble. I mean, in an academic sense.

Anyway, what do you think? Let us know in the comments what your thoughts and observations are. See you next time. Okay, bye!

[Music]
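The language-prior problem described in the data part of the transcript can be made concrete with a "blind" baseline that never looks at an image. The following is only a minimal sketch under stated assumptions: the question/answer pairs are invented toy data (not taken from VQA), and the helper names `question_prefix`, `fit_prior`, and `predict` are hypothetical. It is not the method of any of the cited papers; it only illustrates how far the first words of a question can carry a model when the answers are skewed.

```python
from collections import Counter, defaultdict


def question_prefix(question, n_words=2):
    """Return the first n_words of the question, lowercased."""
    return " ".join(question.lower().split()[:n_words])


def fit_prior(train_pairs, n_words=2):
    """Map each question prefix to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question_prefix(question, n_words)][answer] += 1
    return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}


def predict(prior, question, n_words=2, fallback="yes"):
    """Answer from the question text alone; the image is never used."""
    return prior.get(question_prefix(question, n_words), fallback)


# Invented toy data that mimics the skew described in the transcript.
train_pairs = [
    ("How many dogs are there?", "2"),
    ("How many people are in the picture?", "2"),
    ("How many cars are parked?", "3"),
    ("What sport is being played?", "tennis"),
    ("What sport is this?", "tennis"),
    ("What sport is shown?", "baseball"),
    ("Is there a horse in the image?", "yes"),
    ("Is there a clock on the wall?", "yes"),
    ("Is there a dog?", "no"),
]
test_pairs = [
    ("How many cats are on the sofa?", "2"),
    ("What sport are they playing?", "tennis"),
    ("Is there a cat in the photo?", "yes"),
    ("How many birds are flying?", "4"),
]

prior = fit_prior(train_pairs)
hits = sum(predict(prior, q) == a for q, a in test_pairs)
blind_accuracy = hits / len(test_pairs)
print(f"blind accuracy: {blind_accuracy:.2f}")
```

If a baseline like this scores close to a full vision-and-language model on a real dataset, that is a hint that the dataset lets models get away with ignoring the image.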

Original Description

Do you want to know the REASONS for problems in integrating images and text with deep learning? This is the second part of a two-video series.

The first part of the series: 📺 https://youtu.be/P23EWdiPWDw, where Ms. Coffee Bean talks about the SYMPTOMS.

➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak

📺 Ms. Coffee Bean explains a Multimodal Transformer: https://youtu.be/dd7nE4nbxN0

Outline of this video:
* 00:00 Previously about symptoms
* 01:17 The data
* 03:47 The model
* 07:17 Multimodal Transformers

📄 Goyal, Yash, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904-6913. 2017. https://openaccess.thecvf.com/content_cvpr_2017/papers/Goyal_Making_the_v_CVPR_2017_paper.pdf

📄 Massiceti, Daniela, Puneet K. Dokania, N. Siddharth, and Philip HS Torr. "Visual dialogue without vision or dialogue." arXiv preprint arXiv:1812.06417 (2018). https://arxiv.org/pdf/1812.06417.pdf

📄 Das, Abhishek, Devi Parikh, and Dhruv Batra. "Response to 'Visual Dialogue without Vision or Dialogue' (Massiceti et al., 2018)." arXiv preprint arXiv:1901.05531 (2019). https://arxiv.org/pdf/1901.05531.pdf

📄 (not in the video, but relevant for the dataset bias problem): Agarwal, Shubham, et al. "History for Visual Dialog: Do we really need it?." arXiv preprint arXiv:2005.07493 (2020).

📄 Collell, G., & Moens, M. F. (2018, July). Do Neural Network Cross-Modal Mappings Really Bridge Modalities?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 462-468). https://arxiv.org/pdf/1805.07616.p


The video explores the challenges in integrating vision and language with deep learning, discussing data biases and neural network limitations, and highlighting the need for further research in multimodal learning.

Key Takeaways
  1. Investigate data biases in vision and language datasets
  2. Evaluate neural network architectures for multimodal learning
  3. Research multimodal transformers and cross-modal attention
  4. Analyze the role of textual modality in multimodal decision-making
💡 Neural networks can exploit biases in the data and may not effectively transform the input space to the output space, leading to limitations in multimodal learning.
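
The takeaway about the input space not being transformed into the output space can be checked with a neighborhood comparison: for each mapped vector, count how many of its k nearest neighbors it shares with the corresponding input vector and with the corresponding target vector. The sketch below follows that idea, with all details being assumptions: cosine similarity as the neighbor criterion, the function names `knn_indices` and `mean_neighbor_overlap`, and random stand-in arrays in place of real image features, word embeddings, and a trained mapping.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbors (cosine similarity) of each row of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mean_neighbor_overlap(A, B, k):
    """Average fraction of k nearest neighbors shared between spaces A and B.

    Row i of A and row i of B must describe the same item.
    Returns a value in [0, 1]; 1 means identical neighborhoods.
    """
    nn_a = knn_indices(A, k)
    nn_b = knn_indices(B, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k


rng = np.random.default_rng(0)
image_feats = rng.normal(size=(50, 16))  # stand-in for input (image) vectors
word_embs = rng.normal(size=(50, 8))     # stand-in for target (word) vectors
# Stand-in for mapped vectors; a fixed random linear map, NOT a trained network.
predicted = image_feats @ rng.normal(size=(16, 8))

overlap_with_input = mean_neighbor_overlap(predicted, image_feats, k=5)
overlap_with_target = mean_neighbor_overlap(predicted, word_embs, k=5)
print(overlap_with_input, overlap_with_target)
```

With a real trained mapping in place of `predicted`, a higher overlap with the input than with the target would be the kind of signal the takeaway describes: the mapped vectors still "look like" images in terms of who their neighbors are.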
