📰 Dev.to · Ingero Team

16 articles · Updated every 3 hours

Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look

Dev.to · Ingero Team 2d ago

A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE...

GPU Tracing With cgroup Awareness: Per-Tenant Investigation on Shared Hosts

Dev.to · Ingero Team 1w ago

On a shared GPU host, the kernel knows which container generated which event. The trace stays...

Auto-Generated CUDA Kernels Need Kernel-Level Validation

Dev.to · Ingero Team 2w ago

An LLM-written kernel benchmarked 38% faster on a microbench. Here is what kernel-level validation...

From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End

Dev.to · Ingero Team 2w ago

TL;DR: A GPU that reports 97% utilization can still be the slowest part of a training step,...

AllReduce Stalls Are Network Stalls. Most Tools See Neither.

Dev.to · Ingero Team 2w ago

A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, four ms before the...

What GitHub Uses eBPF For (and the Layer They Have Not Ported Yet)

Dev.to · Ingero Team 3w ago

Three eBPF patterns hyperscalers run in production today, mapped to the equivalent patterns on the...

GPU Observability for Workloads That Cannot Phone Home

Dev.to · Ingero Team 3w ago

For an air-gapped GPU host, the trace is only useful if collection, storage, and query all happen...

One Kernel, Zero Sidecars: Tracing AI Workloads Without an Agent on Every Host

Dev.to · Ingero Team 1mo ago

Per-host overhead multiplied across N hosts, vs. one kernel-level instrumentation per host. The...

Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm

Dev.to · Ingero Team 1mo ago

libhip.so is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls...

What Inference-Platform Benchmark Posts Leave Out

Dev.to · Ingero Team 1mo ago

DCGM stops at host-level GPU counters. Kernel-side eBPF adds the per-rank, per-tenant signals...

MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.

Dev.to · Ingero Team ☁️ DevOps & Cloud ⚡ AI Lesson 1mo ago

MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool...

MCP Tools Are New API Surfaces. eBPF Sees What They Actually Touch.

Dev.to · Ingero Team 1mo ago

An MCP tool call is a tiny line of agent code that fans out to syscalls, library calls, and kernel...

A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.

Dev.to · Ingero Team 1mo ago

Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms...

26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200

Dev.to · Ingero Team 1mo ago

TL;DR: Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two...

124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level

Dev.to · Ingero Team 2mo ago

TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU...

Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization

Dev.to · Ingero Team 2mo ago

TL;DR: A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization...