📰 Dev.to · Ingero Team

13 articles · Updated every 3 hours

One Kernel, Zero Sidecars: Tracing AI Workloads Without an Agent on Every Host
Dev.to · Ingero Team 1w ago
Per-host overhead multiplied across N hosts, vs. one kernel-level instrumentation per host. The...
Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm
Dev.to · Ingero Team 2w ago
libhip.so is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls...
What Inference-Platform Benchmark Posts Leave Out
Dev.to · Ingero Team 2w ago
DCGM stops at host-level GPU counters. Kernel-side eBPF adds the per-rank, per-tenant signals...
MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.
Dev.to · Ingero Team ☁️ DevOps & Cloud ⚡ AI Lesson 2w ago
MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool...
MCP Tools Are New API Surfaces. eBPF Sees What They Actually Touch.
Dev.to · Ingero Team 3w ago
An MCP tool call is a tiny line of agent code that fans out to syscalls, library calls, and kernel...
A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.
Dev.to · Ingero Team 3w ago
Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms...
26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200
Dev.to · Ingero Team 1mo ago
TL;DR Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two...
124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level
Dev.to · Ingero Team 1mo ago
TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU...
Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization
Dev.to · Ingero Team 2mo ago
TL;DR: A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization...