Test-Time Safety Alignment
📰 ArXiv cs.AI
arXiv:2604.26167v1 Announce Type: cross Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models,
DeepCamp AI