3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

📰 arXiv cs.AI

3DCity-LLM is a framework that extends multi-modality large language models to vision-language perception and understanding at 3D city scale.

Advanced · Published 25 Mar 2026
Action Steps
  1. Employ a coarse-to-fine feature encoding strategy
  2. Use three parallel branches to encode the target object, inter-object relationships, and global context (see the sketch after this list)
  3. Integrate multi-modality large language models for 3D city-scale perception and understanding
  4. Evaluate the framework's performance on various 3D city-scale tasks and datasets
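
The three-branch design in step 2 can be pictured as a module that produces object-level, relational, and scene-level tokens for the LLM. The following is a minimal PyTorch sketch, not the paper's implementation: `ThreeBranchCityEncoder`, the 256-d point features, the 4096-d LLM embedding size, and the attention-based fusion are all illustrative assumptions.

```python
# Hypothetical sketch of a coarse-to-fine, three-branch encoder.
# All module names, dimensions, and the fusion strategy are assumptions,
# not the architecture described in the paper.
import torch
import torch.nn as nn

class ThreeBranchCityEncoder(nn.Module):
    """Encodes a 3D city scene via three parallel branches:
    target object, inter-object relationships, and global context."""

    def __init__(self, point_dim=256, llm_dim=4096):
        super().__init__()
        # Fine branch: features of the target object itself.
        self.object_branch = nn.Sequential(
            nn.Linear(point_dim, point_dim), nn.GELU(),
            nn.Linear(point_dim, llm_dim),
        )
        # Relational branch: target object attends to nearby objects.
        self.relation_attn = nn.MultiheadAttention(
            point_dim, num_heads=8, batch_first=True)
        self.relation_proj = nn.Linear(point_dim, llm_dim)
        # Coarse branch: pooled scene-level (global context) features.
        self.context_branch = nn.Linear(point_dim, llm_dim)

    def forward(self, target_feat, neighbor_feats, scene_feats):
        # target_feat:    (B, 1, D) features of the queried object
        # neighbor_feats: (B, N, D) features of surrounding objects
        # scene_feats:    (B, M, D) coarse features of the whole scene
        obj_tok = self.object_branch(target_feat)            # (B, 1, llm_dim)
        rel, _ = self.relation_attn(
            target_feat, neighbor_feats, neighbor_feats)
        rel_tok = self.relation_proj(rel)                     # (B, 1, llm_dim)
        ctx_tok = self.context_branch(
            scene_feats.mean(dim=1, keepdim=True))            # (B, 1, llm_dim)
        # Concatenate along the token axis; the result would be prepended
        # to the LLM's text embeddings as a visual prompt prefix.
        return torch.cat([obj_tok, rel_tok, ctx_tok], dim=1)  # (B, 3, llm_dim)


if __name__ == "__main__":
    enc = ThreeBranchCityEncoder()
    tokens = enc(torch.randn(2, 1, 256),
                 torch.randn(2, 32, 256),
                 torch.randn(2, 64, 256))
    print(tokens.shape)  # torch.Size([2, 3, 4096])
```

One design point the sketch illustrates: keeping the branches parallel means each scale (object, relation, scene) contributes its own token, letting the LLM weigh fine and coarse evidence independently rather than receiving a single pooled feature.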
Who Needs to Know This

AI engineers and researchers working on computer vision and natural language processing can use this framework to scale their models to 3D city-scale environments. Product managers can leverage the same capability to develop new applications built on city-scale scene understanding.

Key Insight

💡 3DCity-LLM bridges the gap between multi-modality large language models and 3D city-scale environments

Share This
🌆 3DCity-LLM: A unified framework for 3D city-scale vision-language perception and understanding #LLMs #ComputerVision