Patch-Effect Graph Kernels for LLM Interpretability

arXiv cs.AI

arXiv:2605.06480v1 (Announce Type: new)

Abstract: Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles a
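The excerpt cuts off before it says how patching profiles become graphs, so the following is only a minimal sketch under assumptions of our own, not the authors' method. It assumes a normalized per-component patch effect (1.0 means patching that component fully restores the clean metric), bins each effect into a node label, links every (layer, head) component to every component in the next layer, and compares two prompts' graphs with the standard Weisfeiler-Lehman (WL) subtree kernel. The helper names `patch_effect`, `bin_effect`, `build_graph`, `wl_kernel`, and `normalized_wl_kernel`, the bin thresholds, and the edge rule are all hypothetical.

```python
from collections import Counter


# Hypothetical normalization (an assumption, not taken from the paper):
# 1.0 means patching one component fully restores the clean metric,
# 0.0 means it leaves the corrupted metric unchanged.
def patch_effect(clean_metric, corrupted_metric, patched_metric):
    denom = clean_metric - corrupted_metric
    if denom == 0:
        return 0.0
    return (patched_metric - corrupted_metric) / denom


# Hypothetical discretization of effects into categorical node labels.
def bin_effect(effect, low=0.2, high=0.6):
    if effect >= high:
        return "high"
    if effect >= low:
        return "mid"
    return "low"


# Hypothetical graph: nodes are (layer, head) components labeled by binned
# effect; each component is linked to every component in the next layer.
def build_graph(effects):
    labels = {node: bin_effect(e) for node, e in effects.items()}
    adj = {node: [] for node in effects}
    for (l1, h1) in effects:
        for (l2, h2) in effects:
            if l2 == l1 + 1:
                adj[(l1, h1)].append((l2, h2))
                adj[(l2, h2)].append((l1, h1))
    return labels, adj


# Weisfeiler-Lehman subtree kernel: at each iteration, add the dot product of
# the two graphs' label-count histograms, then refine every node's label with
# the sorted multiset of its neighbors' labels.
def wl_kernel(g1, g2, iterations=2):
    (lab1, adj1), (lab2, adj2) = g1, g2
    compress = {}  # shared so identical signatures map to identical labels

    def relabel(lab, adj):
        new = {}
        for v in lab:
            sig = (lab[v], tuple(sorted(lab[u] for u in adj[v])))
            if sig not in compress:
                compress[sig] = f"wl{len(compress)}"
            new[v] = compress[sig]
        return new

    total = 0
    for it in range(iterations + 1):
        c1, c2 = Counter(lab1.values()), Counter(lab2.values())
        total += sum(count * c2[label] for label, count in c1.items())
        if it < iterations:
            lab1, lab2 = relabel(lab1, adj1), relabel(lab2, adj2)
    return total


# Cosine-style normalization so self-similarity is 1.0.
def normalized_wl_kernel(g1, g2, iterations=2):
    k12 = wl_kernel(g1, g2, iterations)
    k11 = wl_kernel(g1, g1, iterations)
    k22 = wl_kernel(g2, g2, iterations)
    return k12 / (k11 * k22) ** 0.5
```

For example, with two-layer, two-head profiles `effects_a = {(0, 0): 0.05, (0, 1): 0.7, (1, 0): 0.9, (1, 1): 0.1}` and `effects_b = {(0, 0): 0.1, (0, 1): 0.65, (1, 0): 0.3, (1, 1): 0.0}`, `wl_kernel` gives 24 for `a` against itself, 14 for `b` against itself, and 8 between them. WL is used here only because it is a common, simple kernel for node-labeled graphs; the paper may use a different kernel and a different graph construction.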

Published 9 May 2026