📰 Dev.to · Ingero Team
13 articles · Updated every 3 hours

Dev.to · Ingero Team
1d ago
From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End
TL;DR A GPU that reports 97% utilization can still be the slowest part of a training step,...

Dev.to · Ingero Team
3d ago
AllReduce Stalls Are Network Stalls. Most Tools See Neither.
A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, four ms before the...

Dev.to · Ingero Team
5d ago
What GitHub Uses eBPF For (and the Layer They Have Not Ported Yet)
Three eBPF patterns hyperscalers run in production today, mapped to the equivalent patterns on the...

Dev.to · Ingero Team
1w ago
GPU Observability for Workloads That Cannot Phone Home
For an air-gapped GPU host, the trace is only useful if collection, storage, and query all happen...

Dev.to · Ingero Team
1w ago
One Kernel, Zero Sidecars: Tracing AI Workloads Without an Agent on Every Host
Per-host overhead multiplied across N hosts, vs. one kernel-level instrumentation per host. The...

Dev.to · Ingero Team
2w ago
Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm
libhip.so is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls...

Dev.to · Ingero Team
2w ago
What Inference-Platform Benchmark Posts Leave Out
DCGM stops at host-level GPU counters. Kernel-side eBPF adds the per-rank, per-tenant signals...

Dev.to · Ingero Team
2w ago
MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.
MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool...

Dev.to · Ingero Team
3w ago
MCP Tools Are New API Surfaces. eBPF Sees What They Actually Touch.
An MCP tool call is a tiny line of agent code that fans out to syscalls, library calls, and kernel...

Dev.to · Ingero Team
3w ago
A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.
Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms...

Dev.to · Ingero Team
1mo ago
26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200
TL;DR Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two...

Dev.to · Ingero Team
1mo ago
124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level
TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU...

Dev.to · Ingero Team
2mo ago
Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization
TL;DR: A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization...