We built a pre-generation LLM guardrail that blocks prompt injection at the residual-stream level, before the model outputs anything [Mistral 7B, 0% false positives, 100% detection]
📰 Reddit r/deeplearning
Most LLM monitors work like this: the model generates a response, you check whether it's bad, you log it. By the time you alert, the output already exists.

We built something different. Arc Sentry hooks into the residual stream of open-source LLMs and scores the model's internal decision state before calling `generate()`. Injections get blocked before a single token is produced.

How it works:

1. Compute the layer delta Δh = h[30] − h[29] at
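The post is cut off before the full recipe, but the core idea it describes (score a delta between adjacent residual-stream layers before generation) can be sketched roughly as follows. This is a minimal illustration, not Arc Sentry's actual code: the linear probe, the function names (`layer_delta`, `guard`), and the last-token pooling are all assumptions on my part.

```python
import numpy as np

def layer_delta(hidden_states, layer=30):
    # hidden_states: list of (seq_len, d_model) arrays, one per layer
    # (e.g. from a transformer run with hidden states captured).
    # Δh = h[layer] − h[layer−1]: how much this layer updates the residual stream.
    return hidden_states[layer] - hidden_states[layer - 1]

def injection_score(delta, probe_w, probe_b):
    # Assumed detector: a linear probe on the last token's delta,
    # squashed through a sigmoid into an "injection probability".
    logit = float(delta[-1] @ probe_w + probe_b)
    return 1.0 / (1.0 + np.exp(-logit))

def guard(hidden_states, probe_w, probe_b, threshold=0.5):
    # Decide before generate(): block if the internal state looks injected.
    score = injection_score(layer_delta(hidden_states), probe_w, probe_b)
    return ("block" if score >= threshold else "allow"), score
```

In a real pipeline the hidden states would come from a forward pass over the prompt only (no decoding), so the check adds one prefill's worth of latency rather than a full generation.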