Vision-Language Model Architectures - CLIP | Flamingo | VisualBERT | VisualGPT | SimVLM | ViLD
➡️ Contrastive Learning
▸ This approach trains models to distinguish matching from non-matching image-text pairs by computing similarity scores. The objective is to minimize the distance between related pairs and maximize it for unrelated ones, producing a shared semantic space in which similar concepts are closely aligned (the sketch below illustrates the idea).
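A minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective, assuming PyTorch; the embedding tensors and temperature value are placeholders rather than CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    image_emb, text_emb: (batch, dim) outputs of separate image/text encoders.
    Diagonal entries of the similarity matrix are the matching pairs.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy pulls matching pairs together and pushes mismatches apart,
    # applied in both image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

With a batch of paired embeddings, the loss is simply `contrastive_loss(image_encoder(images), text_encoder(captions))`; the diagonal of the similarity matrix carries the positive pairs, every off-diagonal entry acts as a negative.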
➡️ Prefix Language Modeling (PrefixLM)
▸ Images are treated as a prefix to the textual input, conditioning subsequent text generation. Vision Transformers (ViTs) split each image into a sequence of patches, so the model predicts text tokens given the visual context; a sketch of this setup follows below.
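A minimal PrefixLM sketch, assuming PyTorch; the module names, dimensions, and mask construction are illustrative placeholders, not SimVLM's actual code. The key point is the attention mask: the image prefix is fully visible to every text position, while text attends causally to itself.

```python
import torch
import torch.nn as nn

class PrefixLMSketch(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, patch_dim=768):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)   # flattened ViT patches -> model dim
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patches: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        # Image patches form the prefix; text tokens follow.
        prefix = self.patch_embed(patches)             # (B, P, dim)
        tokens = self.token_embed(text_ids)            # (B, T, dim)
        x = torch.cat([prefix, tokens], dim=1)         # (B, P+T, dim)

        P, T = prefix.size(1), tokens.size(1)
        L = P + T
        mask = torch.zeros(L, L, device=x.device)
        mask[:P, P:] = float("-inf")                   # prefix never attends to text
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        mask[P:, P:].masked_fill_(causal, float("-inf"))  # text is causal among itself

        h = self.backbone(x, mask=mask)
        return self.lm_head(h[:, P:])                  # logits for the text positions only
```

Training then amounts to a standard next-token cross-entropy over the text positions, with the gradient flowing through both the patch projection and the text embeddings.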
➡️ Fro…