Why Multimodal Machine Learning models do not work. Part 2/2 – The CAUSES
Key Takeaways
The video discusses the reasons behind the problems in integrating vision and language with deep learning methods, specifically in multimodal machine learning models, highlighting issues with data biases and neural network architectures.
Full Transcript
Disclaimer: This video highlights the not-so-favorable aspects of integrating vision and language. For information on risks and side effects, please read your favorite publication and ask your machine learning researcher or engineer.
Hey there! Did you watch Ms. Coffee Bean's previous video? If not, check it out right now and then come back to this video. Last time, she told us about the symptoms that are a clear hint that something fishy is going on when integrating vision and language with deep learning methods. It seems that one modality is usually almost forgotten, even though the tasks are defined to require the understanding of both modalities to deliver the right answer. In this video, Ms. Coffee Bean is going to tell us what she found further on her journey through the dark paper forest.
Ms. Coffee Bean, what are the reasons for problems in vision and language fusion?
On my journey, I encountered two sources of all evil. Some say the source of problems when integrating vision and language is in the data, while others say it is in the neural-network-based models.
Let's start with the accusations against the data. The VQA dataset was released in 2015, and a lot of work was invested by researchers to get on top of the challenge leaderboard. Then, in 2017, a version 2 of the VQA dataset had to be released, since obvious biases in the data were detected. These biases were making the undesired happen: the model could focus only on the text, ignoring the image. It's in the title: "Making the V in VQA Matter". For example, far too many "how many" questions have the answer "two". If the question is about sport, then usually the correct answer is "tennis". Also, questions like "Is there a [special something] in the image?" bear the answer "yes", since humans contributing to the dataset collection usually do not ask if there is a horse in the image if there is none to be seen. How would they even get the idea of a horse in the first place?
So these obvious biases were removed, but there are suspicions that neural networks can also rely on not-so-obvious biases, and many models still seem to ignore the image far too much. What holds for VQA usually holds for visual dialogue too, as visual dialogue is, simply put, VQA with history. This paper (Massiceti et al., 2018) shows a very simple statistical method, not even a neural network, that works without accessing the image or the sequence of the dialogue, and this method performs on par on some metrics with extremely complex neural models. Of course, the authors of the original visual dialogue papers did not take the critique well. If you want to read how researchers fight, check the links to these papers and the response in the description below. But whatever side you are on, I think that one must take this alarm seriously. It is always tricky to produce clean training data, because the fingerprints of the data collectors are everywhere on it, and neural networks are extremely good at exploiting every hint and bias to get to the right answer, even if for the wrong reasons.
But apart from neural networks exploiting bias in the data, there seems to be something fundamentally wrong with neural networks trying to fuse modalities. So it is time for us to discuss the problems that models might inherently have. We think this question is still very, very poorly investigated, but we have one research product to show you. This paper (Collell and Moens, 2018) conducts a very simple yet fundamental experiment: given an image that can be described by only one word, a neural network has to construct a word that describes the image. This is not the classification task, where the network has to choose the index of the right class to which we humans have assigned the labels. It is rather a regression task, where the model has to come up with the word embedding that best fits the picture. Simply put, they try to translate images to single words rather than to a whole sentence, like in image captioning.
What do we expect? It should be an easy task, right? Image captioning is a much harder task and neural networks can do it, so why should it not be possible to translate an image to a single word? And you're right, it is possible. The neural network performs with an impressive accuracy. Only that the paper looks a little bit closer: it also measures how far the neural network can transform the input space to make it look like the output space. And here the results are surprising. It turns out that the neural network is really bad at that. It stays much closer to the input modality, which is the image. Even though the network is trained to fully transform the image space into the textual word embedding space, it preserves the neighborhoods of the input space rather than constructing the output space. Why? Because neural nets guarantee continuity and preserved topology.
We think that there is still very much to investigate along this line, especially the case when both visual and textual modalities are mapped by a neural network into a common space, because this was only a special case where only the image was transformed. But even this special-case analysis is a strong alarm signal when talking about neural networks in multimodal research. As they are universal approximators given infinite training time and data, we expect them to be the solution to all our multimodal problems. But perhaps we expect too much from them in our real setting, with not so much training data or training time. Like in the example seen before: the neural network was able to transform single data points from the image to the textual modality, but was not able to sensibly transform the space in between. With infinitely many training samples, this would not have been a problem, since the network would have had a training sample for every point in the space. But even if we had these infinite resources, we have seen how problematic data and biases can be when integrating vision and language, and that we can introduce multimodal fusion difficulties when choosing a model or architecture.
We could end our video here, since this is what Ms. Coffee Bean found on her search for hints and reasons on why current vision and language models fail where they fail. But we want to spend a minute on multimodal models that are coming out right now, beating the previous state of the art, but that have not yet been more thoroughly investigated for problems. Yes, we are talking about the multimodal transformers, which are all variations on the same theme of processing images and text and combining them with transformer modules and cross-modal attention. For more details on how this architecture works, check out our previous video on ViLBERT, linked below. Multimodal transformers show, through their better accuracy on multimodal tasks and through attention visualization, that they are better at combining the visual and textual modalities. But still, a recent publication suggests that for multimodal transformers it seems again that, quote, "the textual modality plays a more important role than image in making final decisions." So here we are again at the same problem that existed even before multimodal transformers: the inequality of modalities and the curse of datasets biased towards the textual modality.
So what do we do now? More research. The relevant papers for this video are in the description below, if you are interested in investigating the root of the problem yourself. This video is of course not an exhaustive literature enumeration of all problems in multimodal integration. It's rather just a hint of where to look when searching for trouble. I mean, in an academic sense. Anyway, what do you think? Let us know in the comments what your thoughts and observations are. See you next time. Okay, bye!
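The neighborhood comparison described in the model part of the transcript (does a mapped image keep the neighbors it had in image space, or does it acquire the neighbors of its target word?) can be made concrete with a small sketch. This is not the paper's code: the function names, the choice of cosine similarity, and all vectors below are assumptions made for illustration. In particular, `predicted` is a hand-written stand-in for the outputs of a trained network, chosen so that the two possible outcomes are easy to tell apart.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(space, i, k):
    """Indices of the k items most similar to item i within one space."""
    others = [j for j in range(len(space)) if j != i]
    others.sort(key=lambda j: cosine(space[i], space[j]), reverse=True)
    return set(others[:k])


def neighborhood_overlap(space_a, space_b, k=1):
    """Mean fraction of k-nearest neighbors that item i shares in both spaces.

    The two spaces must hold the same items in the same order, but they may
    have different dimensionality, because neighbors are computed per space.
    """
    if len(space_a) != len(space_b):
        raise ValueError("both spaces must contain the same number of items")
    total = 0.0
    for i in range(len(space_a)):
        shared = nearest_neighbors(space_a, i, k) & nearest_neighbors(space_b, i, k)
        total += len(shared) / k
    return total / len(space_a)


# Hypothetical toy data. Item order: orange, basketball, lemon, tennis ball.
# Made-up image features: items that look alike are close to each other.
image_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Made-up word embeddings: fruits are close to fruits, balls to balls.
word_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1], [0.9, 0.0, 0.1], [0.1, 1.0, 0.0]]
# Hand-written stand-in for network outputs in the word-embedding space.
predicted = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.2], [0.0, 1.0, 0.2], [0.2, 0.8, 0.2]]

overlap_with_input = neighborhood_overlap(predicted, image_features, k=1)
overlap_with_target = neighborhood_overlap(predicted, word_embeddings, k=1)
print("overlap with image space:", overlap_with_input)   # 1.0
print("overlap with word space:", overlap_with_target)   # 0.0
```

In this toy setup the stand-in predictions share all of their nearest neighbors with the image features and none with the word embeddings, which is the pattern the transcript reports: the outputs live in the target space but keep the neighborhood structure of the input. With real data, one would replace the three lists with actual image features, word embeddings, and model outputs, and use a larger `k`.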
Original Description
Do you want to know the REASONS for problems in integrating images and text with deep learning? This is the second part of a two-video series.
The first part of the series: 📺 https://youtu.be/P23EWdiPWDw, where Ms. Coffee Bean talks about the SYMPTOMS.
➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, buy us a coffee to boost our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
📺 Ms. Coffee Bean explains a Multimodal Transformer: https://youtu.be/dd7nE4nbxN0
Outline of this video:
* 00:00 Previously about symptoms
* 01:17 The data
* 03:47 The model
* 07:17 Multimodal Transformers
📄 Goyal, Yash, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904-6913. 2017. https://openaccess.thecvf.com/content_cvpr_2017/papers/Goyal_Making_the_v_CVPR_2017_paper.pdf
📄 Massiceti, Daniela, Puneet K. Dokania, N. Siddharth, and Philip HS Torr. "Visual dialogue without vision or dialogue." arXiv preprint arXiv:1812.06417 (2018). https://arxiv.org/pdf/1812.06417.pdf
📄 Das, Abhishek, Devi Parikh, and Dhruv Batra. "Response to 'Visual Dialogue without Vision or Dialogue' (Massiceti et al., 2018)." arXiv preprint arXiv:1901.05531 (2019). https://arxiv.org/pdf/1901.05531.pdf
📄 (not in the video, but relevant for the dataset bias problem): Agarwal, Shubham, et al. "History for Visual Dialog: Do we really need it?." arXiv preprint arXiv:2005.07493 (2020).
📄 Collell, G., & Moens, M. F. (2018, July). Do Neural Network Cross-Modal Mappings Really Bridge Modalities?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 462-468). https://arxiv.org/pdf/1805.07616.p