Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization [R]
📰 Reddit r/MachineLearning
Paper: https://arxiv.org/abs/2603.21676

I found this interesting as another iteration of the TRM approach:

- Shows decent OOD generalization in 2/3 tasks (but why does it fail at >2x? And why is unstructured text so much worse?)
- Explains why intermediate-step supervision can hurt generalization.

This makes
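For anyone unfamiliar with the depth-recurrent idea in the title: instead of stacking N distinct layers, you reuse one weight-tied block and choose how many times to iterate it at inference, so "depth" becomes a runtime knob. A minimal sketch below; the specific block (tanh + residual update) and dimensions are my own illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# One shared block, reused at every depth step (weight tying across depth).
W = rng.normal(scale=0.1, size=(d, d))

def depth_recurrent_forward(x, n_steps):
    """Apply the same block n_steps times.

    Depth is a runtime argument, not a fixed stack of distinct layers,
    so harder inputs can simply get more iterations at test time.
    """
    h = x
    for _ in range(n_steps):
        h = np.tanh(h @ W) + h  # residual update keeps iterates stable
    return h

x = rng.normal(size=d)
shallow = depth_recurrent_forward(x, n_steps=4)
deep = depth_recurrent_forward(x, n_steps=16)  # "think deeper" on the same weights
```

The point of the sketch is just that `shallow` and `deep` come from the exact same parameters `W`; only the iteration count differs, which is what lets this family of models trade compute for depth without retraining.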