📰 Dev.to · Ingero Team
16 articles · Updated every 3 hours

Dev.to · Ingero Team
2d ago
Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look
A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE...

Dev.to · Ingero Team
1w ago
GPU Tracing With cgroup Awareness: Per-Tenant Investigation on Shared Hosts
On a shared GPU host, the kernel knows which container generated which event. The trace stays...

Dev.to · Ingero Team
2w ago
Auto-Generated CUDA Kernels Need Kernel-Level Validation
An LLM-written kernel benchmarked 38% faster on a microbench. Here is what kernel-level validation...

Dev.to · Ingero Team
2w ago
From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End
TL;DR A GPU that reports 97% utilization can still be the slowest part of a training step,...

Dev.to · Ingero Team
2w ago
AllReduce Stalls Are Network Stalls. Most Tools See Neither.
A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, four ms before the...

Dev.to · Ingero Team
3w ago
What GitHub Uses eBPF For (and the Layer They Have Not Ported Yet)
Three eBPF patterns hyperscalers run in production today, mapped to the equivalent patterns on the...

Dev.to · Ingero Team
3w ago
GPU Observability for Workloads That Cannot Phone Home
For an air-gapped GPU host, the trace is only useful if collection, storage, and query all happen...

Dev.to · Ingero Team
1mo ago
One Kernel, Zero Sidecars: Tracing AI Workloads Without an Agent on Every Host
Per-host overhead multiplied across N hosts, vs. one kernel-level instrumentation per host. The...

Dev.to · Ingero Team
1mo ago
Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm
libhip.so is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls...

Dev.to · Ingero Team
1mo ago
What Inference-Platform Benchmark Posts Leave Out
DCGM stops at host-level GPU counters. Kernel-side eBPF adds the per-rank, per-tenant signals...

Dev.to · Ingero Team
1mo ago
MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.
MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool...

Dev.to · Ingero Team
1mo ago
MCP Tools Are New API Surfaces. eBPF Sees What They Actually Touch.
An MCP tool call is a tiny line of agent code that fans out to syscalls, library calls, and kernel...

Dev.to · Ingero Team
1mo ago
A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.
Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms...

Dev.to · Ingero Team
1mo ago
26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200
TL;DR Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two...

Dev.to · Ingero Team
2mo ago
124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level
TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU...

Dev.to · Ingero Team
2mo ago
Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization
TL;DR: A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization...