Large Byte Model: Teaching Language Models About Compiled Code

📰 ArXiv cs.AI

arXiv:2606.02834v1 Announce Type: cross Abstract: Malware analysis starts with the raw bytes of an executable program, and tools to "lift" these to higher-level representations, such as assembly, are expensive and subject to error. Large Language Models (LLMs) cannot process raw byte representations and answer questions about them. To this end, we present the first byte-native LLM. Based on a vocabulary expansion technique using a bespoke byte tokenizer, such a model is capable of responding to

Published 3 Jun 2026

Read full paper → ← Back to Reads