Why Multimodal Machine Learning models do not work. Part 2/2 – The CAUSES

AI Coffee Break with Letitia · Advanced · 📐 ML Fundamentals · 5y ago

Skills: Reading ML Papers (90%), Multimodal LLMs (80%)

Key Takeaways

The video discusses the reasons behind the problems in integrating vision and language with deep learning methods, specifically in multimodal machine learning models, highlighting issues with data biases and neural network architectures.

Full Transcript

[Music]

Disclaimer: this video highlights the not-so-favorable aspects of integrating vision and language. For information on risks and side effects, please read your favorite publication and ask your machine learning researcher or engineer.

Hey there! Did you watch Ms. Coffee Bean's previous video? If not, check it out right now and then come back to this video. Last time she told us about the symptoms that are a clear hint that something fishy is going on when integrating vision and language with deep learning methods. It seems that one modality is usually almost forgotten, even though the tasks are defined to require the understanding of both modalities to deliver the right answer. In this video, Ms. Coffee Bean is going to tell us what she found further on her journey through the dark paper forest.

Ms. Coffee Bean, what are the reasons for problems in vision and language fusion?

On my journey I encountered two sources of all evil. Some say the source of problems when integrating vision and language is in the data, while others say it is in the neural-network-based models.

Let's start with the accusations against the data. The VQA dataset was released in 2015, and a lot of work was invested by researchers to get on top of the challenge leaderboard. Then, in 2017, a version 2 of the VQA dataset had to be released, since obvious biases in the data were detected. These biases were making the undesired happen: the model could focus only on the text, ignoring the image. It's in the title: "Making the V in VQA Matter". For example, far too many "how many" questions have the answer "two". If the question is about sport, then usually the correct answer is "tennis". Also, questions like "Is there a special something in the image?" bear the answer "yes", since humans contributing to the dataset collection usually do not ask if there is a horse in the image if there is none to be seen. How would they even get the idea of a horse in the first place?

So these obvious biases were removed, but there are suspicions that neural networks can also rely on not-so-obvious biases, and many models still seem to ignore the image far too much.

What holds for VQA usually holds for visual dialogue too, as visual dialogue is, simply put, VQA with history. This paper shows a very simple statistical method, not even a neural network, that works without accessing the image or the sequence of the dialogue, and this method performs on par, on some metrics, with extremely complex neural models. Of course, the authors of the original visual dialogue papers did not take the critique well. If you want to read how researchers fight, check the links to these papers and the response in the description below. But whatever side you are on, I think that one must take this alarm seriously. It is always tricky to produce clean training data, because the fingerprints of the data collectors are everywhere on it, and neural networks are extremely good at exploiting every hint and bias to get to the right answer, even if for the wrong reasons.

But apart from neural networks exploiting bias in the data, there seems to be something fundamentally wrong with neural networks trying to fuse modalities. So it is time for us to discuss the problems that models might inherently have. We think this question is still very, very poorly investigated, but we have one research product to show you.

This paper conducts a very simple yet fundamental experiment. Given an image that can be described by only one word, a neural network has to construct a word that describes the image. This is not the classification task, where the network has to choose the right index of the right class to which we humans have assigned the labels. It is rather a regression task, where the model has to come up with the word embedding that best fits the picture. Simply put, they try to translate images to single words rather than to a whole sentence, as in image captioning.

What do we expect? It should be an easy task, right? Image captioning is a much harder task and neural networks can do it, so why should it not be possible to translate an image to a single word? And you're right, it is possible: the neural network performs with an impressive accuracy. Only, the paper looks a little bit closer. It also measures how far the neural network can transform the input space to make it look like the output space, and here the results are surprising. It turns out that the neural network is really bad at that. It stays much closer to the input modality, which is the image. Even though the network is trained to fully transform the image space into the textual word embedding space, it preserves the neighborhoods of the input space rather than constructing the output space. Why? Because neural nets guarantee continuity and preserved topology.

We think that there is still very much to investigate along this line, especially the case when both the visual and the textual modality are mapped by a neural network into a common space, because this was only a special case where only the image was transformed. But even this special-case analysis is a strong alarm signal when talking about neural networks in multimodal research. As they are universal approximators given infinite training time and data, we expect them to be the solution to all our multimodal problems. But perhaps we expect too much from them in our real setting, with not so much training data or training time. As in the example seen before, the neural network was able to transform single data points from the image to the textual modality, but was not able to sensibly transform the space in between. With infinitely many training samples this would not have been a problem, since the network would have had a training sample for every point in the space. But even if we had these infinite resources, we have seen how problematic data and biases can be when integrating vision and language, and that we can introduce multimodal fusion difficulties when choosing a model or architecture.

We could end our video here, since this is what Ms. Coffee Bean found on her search for hints and reasons why current vision and language models fail where they fail. But we want to spend a minute on multimodal models that are coming out right now, beating the previous state of the art, but that have not yet been thoroughly investigated for problems. Yes, we are talking about the multimodal transformers, which are all variations on the same theme of processing images and text and combining them with transformer modules and cross-modal attention. For more details on how this architecture works, check out our previous video on ViLBERT, linked below.

Multimodal transformers show, through their better accuracy on multimodal tasks and through attention visualization, that they are better at combining the visual and textual modalities. But still, a recent publication suggests that for multimodal transformers it seems again that, quote, "the textual modality plays a more important role than image in making final decisions". So here we are again at the same problem that existed even before multimodal transformers: the inequality of modalities and the curse of datasets biased towards the textual modality.

So what do we do now? More research. The relevant papers for this video are in the description below, if you are interested in investigating the root of the problem yourself. This video is of course not an exhaustive literature enumeration of all problems in multimodal integration. It's rather just a hint of where to look when searching for trouble. I mean, in an academic sense.

Anyway, what do you think? Let us know in the comments what your thoughts and observations are. See you next time. Okay, bye!

[Music]
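Editor's note: the transcript describes measuring whether a mapping's outputs keep the neighborhoods of the input space or take on those of the target space. One way to make that idea concrete is a nearest-neighbor overlap score, sketched below in plain Python. This is only an illustration and may differ in detail from the measure used in the paper. The function names (k_nearest, neighbor_overlap) and the toy vectors are assumptions, and the numbers are contrived to exercise the metric, not to reproduce any reported result.

```python
import math


def k_nearest(vectors, index, k):
    """Return the indices of the k vectors closest to vectors[index]."""
    dists = [
        (math.dist(vectors[index], other), j)
        for j, other in enumerate(vectors)
        if j != index
    ]
    dists.sort()
    return {j for _, j in dists[:k]}


def neighbor_overlap(space_a, space_b, k):
    """Mean fraction of k nearest neighbors an item shares across two spaces."""
    n = len(space_a)
    shared = sum(
        len(k_nearest(space_a, i, k) & k_nearest(space_b, i, k))
        for i in range(n)
    )
    return shared / (k * n)


# Contrived toy data: four items as 2-D "image" features, their "word"
# embedding targets, and the outputs of a hypothetical trained mapping.
image_feats = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
word_targets = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (6.0, 0.0)]
predicted = [(0.0, 0.1), (1.2, 0.0), (4.8, 0.0), (6.0, 0.1)]

# How much do the predictions resemble each space, neighborhood-wise?
overlap_with_input = neighbor_overlap(predicted, image_feats, k=1)
overlap_with_target = neighbor_overlap(predicted, word_targets, k=1)
print(overlap_with_input, overlap_with_target)
```

A high overlap with the input space combined with a low overlap with the target space would correspond to the behavior described in the transcript: the mapped vectors still "look like" image features in terms of who their neighbors are.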

Original Description

Do you want to know the REASONS for problems in integrating images and text with deep learning? This is the second part of a two-video series. The first part of the series, where Ms. Coffee Bean talks about the SYMPTOMS: 📺 https://youtu.be/P23EWdiPWDw

➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak

📺 Ms. Coffee Bean explains a Multimodal Transformer: https://youtu.be/dd7nE4nbxN0

Outline of this video:
* 00:00 Previously about symptoms
* 01:17 The data
* 03:47 The model
* 07:17 Multimodal Transformers

📄 Goyal, Yash, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904-6913. 2017. https://openaccess.thecvf.com/content_cvpr_2017/papers/Goyal_Making_the_v_CVPR_2017_paper.pdf
📄 Massiceti, Daniela, Puneet K. Dokania, N. Siddharth, and Philip HS Torr. "Visual dialogue without vision or dialogue." arXiv preprint arXiv:1812.06417 (2018). https://arxiv.org/pdf/1812.06417.pdf
📄 Das, Abhishek, Devi Parikh, and Dhruv Batra. "Response to 'Visual Dialogue without Vision or Dialogue' (Massiceti et al., 2018)." arXiv preprint arXiv:1901.05531 (2019). https://arxiv.org/pdf/1901.05531.pdf
📄 (not in the video, but relevant for the dataset bias problem): Agarwal, Shubham, et al. "History for Visual Dialog: Do we really need it?" arXiv preprint arXiv:2005.07493 (2020).
📄 Collell, Guillem, and Marie-Francine Moens. "Do Neural Network Cross-Modal Mappings Really Bridge Modalities?" In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 462-468. 2018. https://arxiv.org/pdf/1805.07616.p

The video explores the challenges in integrating vision and language with deep learning, discussing data biases and neural network limitations, and highlighting the need for further research in multimodal learning.

Key Takeaways

Investigate data biases in vision and language datasets
Evaluate neural network architectures for multimodal learning
Research multimodal transformers and cross-modal attention
Analyze the role of textual modality in multimodal decision-making
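
As a possible starting point for the first takeaway, the sketch below builds a "blind" baseline that answers from the question text alone and never looks at an image. If such a baseline scores well on a dataset, that hints at the kind of answer priors the video describes ("how many" leading to "two", sport leading to "tennis"). Everything here is an assumption made for illustration: the helper names (question_type, fit_prior, blind_accuracy), the two-word question-type heuristic, and the made-up question/answer pairs, which are not taken from VQA.

```python
from collections import Counter, defaultdict


def question_type(question, n_words=2):
    """Use the first n words of a question as a crude question-type key."""
    return " ".join(question.lower().split()[:n_words])


def fit_prior(train_pairs):
    """Map each question type to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question_type(question)][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}


def blind_accuracy(prior, test_pairs, fallback="yes"):
    """Accuracy when answering from the question text only, never the image."""
    hits = sum(
        prior.get(question_type(q), fallback) == a for q, a in test_pairs
    )
    return hits / len(test_pairs)


# Made-up toy data, for illustration only.
train = [
    ("How many dogs are there?", "two"),
    ("How many cups are on the table?", "two"),
    ("How many people are skiing?", "three"),
    ("Is there a horse in the image?", "yes"),
    ("Is there a clock on the wall?", "yes"),
    ("What sport is being played?", "tennis"),
]
test = [
    ("How many birds are flying?", "two"),
    ("Is there a cat on the sofa?", "yes"),
    ("What sport is shown?", "tennis"),
    ("How many cars are parked?", "four"),
]

prior = fit_prior(train)
accuracy = blind_accuracy(prior, test)
print(prior, accuracy)
```

Comparing a full vision-and-language model against a blind baseline like this on the same split gives a rough sense of how much of the score actually depends on the image.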

💡 Neural networks can exploit biases in the data and may not effectively transform the input space to the output space, leading to limitations in multimodal learning.
