Foundations

Computer Vision

Object detection, segmentation, YOLO, CLIP, and vision-language models

2,353 lessons
Skills in this topic
CV Basics (beginner): Classify images with a pre-trained CNN
Modern CV Models (intermediate): Run YOLO for real-time object detection
Generative CV (advanced): Build a Stable Diffusion inference pipeline
All Reads (1,208) · Articles (385) · Blog Posts (260) · Tutorials (78) · Research Papers (469) · News (16)
Reddit r/deeplearning 👁️ Computer Vision ⚡ AI Lesson 1mo ago
MediVigil: Hospital Patient Facial Monitoring System
MediVigil is a real-time hospital bedside monitoring system. It fuses multi-modal facial dynamics and kinematics to track patient well-being, detecting distress
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
A World Model of Radiologist Reading for Medical Image Representation Learning
arXiv:2605.23992v1 Announce Type: cross Abstract: Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence du
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
MASt3R-Nav: WayPixel Navigation in Relative 3D Maps
arXiv:2605.24111v1 Announce Type: cross Abstract: Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer
arXiv:2605.24243v1 Announce Type: cross Abstract: In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic g
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction
arXiv:2605.24562v1 Announce Type: cross Abstract: Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving syst
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection
arXiv:2605.24965v1 Announce Type: cross Abstract: The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposin
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors
arXiv:2605.25046v1 Announce Type: cross Abstract: YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from effic
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
K-U-KAN: Koopman-Enhanced U-KAN for 3D Dental Reconstruction from a Single Panoramic X-ray Radiograph
arXiv:2605.25163v1 Announce Type: cross Abstract: A panoramic X-ray compresses a 3D jaw into a 2D strip; we aim to recover the missing depth cleanly and fast. E
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images
arXiv:2605.25861v1 Announce Type: cross Abstract: 3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been stu
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution
arXiv:2605.26032v1 Announce Type: cross Abstract: Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolu
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models
arXiv:2605.26038v1 Announce Type: cross Abstract: Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in den
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval
arXiv:2209.11572v3 Announce Type: replace-cross Abstract: As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
arXiv:2303.07863v3 Announce Type: replace-cross Abstract: Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semanticall
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
What Happens Next? Anticipating Future Motion by Generating Point Trajectories
arXiv:2509.21592v2 Announce Type: replace-cross Abstract: We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the
Medium · Python 👁️ Computer Vision ⚡ AI Lesson 1mo ago
How Vehicle Speed Detection Using Image Processing Technology in Intelligent Transportation Systems…
Can a traffic camera tell you how many km/h a vehicle was doing as it passed? Not without a software layer. In this article, that software layer…
Medium · Programming 👁️ Computer Vision ⚡ AI Lesson 1mo ago
C Programming — Double Pointers and Function Pointers
This article covers more advanced uses of pointers, including double pointers and function pointers. It also covers when and how to use them.
Medium · Deep Learning 👁️ Computer Vision ⚡ AI Lesson 1mo ago
How I deployed YOLOv8 on Raspberry Pi for real-time blind assistance
Most computer vision projects work well on powerful GPUs and cloud servers, but deploying them on small low-power devices is a completely…
Medium · Machine Learning 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Powering Sports Analytics with High-Quality Image Annotation
The world of sports is rapidly transforming through the power of artificial intelligence and computer vision. From player tracking and…
Dev.to · Silicon Signals 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Why Choose a Camera Design Engineering Company for Your Project
Most camera systems deployed in the field today were not designed with deployment in mind. They were...
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations
arXiv:2605.22904v1 Announce Type: cross Abstract: Understanding and monitoring human behavior in metro stations play an important role in supporting suicide pre
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
The TIME Machine: On The Power of Motion for Efficient Perception
arXiv:2605.23045v1 Announce Type: cross Abstract: Video representation learning has seen tremendous progress in recent years. This has been driven by many facto
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Weierstrass Positional Encoding for Vision Transformers
arXiv:2605.23719v1 Announce Type: cross Abstract: Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
arXiv:2605.23892v1 Announce Type: cross Abstract: Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joi
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
A drone-based framework for coral habitat mapping via weakly supervised segmentation
arXiv:2508.18958v2 Announce Type: replace-cross Abstract: Obtaining pixel-level annotations over large spatial extents remains a major bottleneck for deploying
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance
arXiv:2603.17879v2 Announce Type: replace-cross Abstract: This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE)
Medium · Programming 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Shot detection is the cheap feature everyone underestimates
A friend of mine spent two months trying to add a “smart preview” feature to a video product, the kind of thing you see on every modern…
Medium · Python 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Computer Vision Journey, Day 7: Gesture Mapping and Smoothing Systems with OpenCV and MediaPipe
In computer vision projects, hand tracking alone is often not enough. What matters in real systems is that the obtained…
Dev.to · Pasquale Molinaro 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Real-time video classification with PaliGemma: architecture patterns for low-latency VLM inference
In a previous article, we benchmarked three open-source Vision-Language Models on zero-shot object...
Medium · LLM 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Apple Research Releases LiTo: An Image to 3D Generator
LiTo is a Surface Light Field Tokenization model that generates 3D geometry and viewpoints from a 2D image Continue reading on Mac O’Clock »
Dev.to · Devanshu Biswas 👁️ Computer Vision ⚡ AI Lesson 1mo ago
I Built a Text-to-Image Search Engine That Runs Entirely in the Browser
Day 38 of TechFromZero. CLIP, the model behind half of modern computer vision, runs in your browser today. No server, no API key, no upload. Type a phrase, find
Medium · AI 👁️ Computer Vision ⚡ AI Lesson 1mo ago
cv3 — make OpenCV pythonic again
TL;DR cv3 is a Pythonic wrapper for OpenCV that simplifies computer vision tasks by providing more intuitive interfaces and eliminating…
Reddit r/MachineLearning 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Per-pixel bounding-box regression + DBSCAN for handwritten word detection - visual walkthrough of WordDetectorNet [P]
Dev.to · somyabhalani 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Tile Extractor
Parsing the Unparsable: Building a Layout-Aware Computer Vision Pipeline for 50,000+ Stone...
Medium · Python 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Build a Poker Hand Scanner: Card Recognition API Guide
Integrating a dedicated card recognition API into your workflow empowers software teams to inject production-ready computer vision into…
Medium · AI 👁️ Computer Vision ⚡ AI Lesson 1mo ago
SentinelML
A modular, open-source framework for real-time firearm detection and alerting using YOLOv8 and cloud-native infrastructure.
Medium · Python 👁️ Computer Vision ⚡ AI Lesson 1mo ago
sen2p: Download Sentinel-2 Imagery Without API Keys or Extra Setup
A lightweight Python library that makes Sentinel-2 imagery easier to search and download. Continue reading on GeoAI »
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing
arXiv:2605.22090v1 Announce Type: new Abstract: The detection of non-cooperative unmanned aerial vehicles (UAVs) presents significant challenges for Integrated
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos
arXiv:2605.22066v1 Announce Type: cross Abstract: Reconstructing 4D (3D+t) cardiac geometry from sparse 2D echocardiography is highly desirable yet fundamentall
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction
arXiv:2605.22420v1 Announce Type: cross Abstract: Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving develo
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light
arXiv:2605.22455v1 Announce Type: cross Abstract: Real-world deployment of AI vision models is both fueled and limited by the data available for training and te
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
SceneAligner: 3D-Grounded Floorplan Localization in the Wild
arXiv:2605.22581v1 Announce Type: cross Abstract: Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. F
ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 1mo ago
Swift Sampling: Selecting Temporal Surprises via Taylor Series
arXiv:2605.22678v1 Announce Type: cross Abstract: While most frames in long-form video are redundant, the critical information resides in temporal surprises: mo
Dev.to · Alex U 👁️ Computer Vision ⚡ AI Lesson 1mo ago
OpenLiDARViewer: A Browser-Based LiDAR and Point-Cloud Viewer
Rendering LiDAR Scans in the Browser Without Uploading Anything Most point-cloud workflows...
Dev.to · Pasquale Molinaro 👁️ Computer Vision ⚡ AI Lesson 1mo ago
Stop retraining YOLO: a developer’s guide to zero-shot object detection with generative VLMs
If you have ever maintained a computer vision pipeline in a factory, warehouse, or construction site,...