PlanVLM-bench

Multimodal Multi-image Understanding for Evaluating Multimodal Large Language Models

Minxin Chen†,1,2 , He Zhu†,1,2 , Junyou Su1,2 , Wen Wang1,2 , Yijie Deng1,2 , Wenjia Zhang*,1,3

1Behavioral and Spatial AI Lab, Tongji University 2Behavioral and Spatial AI Lab, Peking University 3College of Architecture and Urban Planning, Tongji University

*Corresponding Author: wenjiazhang@tongji.edu.cn
†Equal Contribution

🔔News

[2025-05-20] The arXiv article, dataset, and code will be released soon.

Introduction

Territorial and spatial planning maps present the concepts, objectives, strategies, and specific measures of a plan in map form, and they guide and coordinate development, protection, and utilization activities across territorial space. They are a critical basis for planning decisions and an essential tool for public participation and for oversight of plan implementation. Because planning work is complex and specialized, fully understanding a planning map requires both recognizing fine-grained elements (such as symbols, legends, and geographic features) and performing comprehensive analysis and judgment in conjunction with relevant policies. This makes planning maps particularly challenging to interpret.

With the rapid advancement of multimodal large language models (MLLMs), we have established a benchmark for territorial and spatial planning maps to assess MLLMs’ map-understanding capabilities. Our contributions are as follows:

(1) Data: We constructed the Spatial Planning Map Database (SPMD), an expert-annotated repository characterized by diverse map content and high-quality annotations provided by planning domain specialists.
(2) Framework: We proposed a comprehensive, planning-discipline–based evaluation standard that measures MLLMs’ planning-map comprehension from four perspectives—perception, reasoning, association, and application—comprising eight fine-grained subcategories.
(3) Experiments: By designing question–answer tasks grounded in authoritative question banks (specifically, the practice exam questions for the Chinese Registered Urban Planner qualification), we significantly reduced the incidence of hallucinated normative references by the models.
(4) Results: All models exhibited their weakest performance in the application dimension, while Qwen2.5-VL-32B-Instruct achieved the highest overall score across all four evaluated dimensions.

Overview

[Figure: pipeline]

We propose a conceptual framework tailored to the domain of urban planning visualization. This framework comprises the following 4 dimensions and 8 categories:

Perception

The perception subset consists of 2 categories:
Element Recognition evaluates a model's ability to identify layout configurations, textual annotations, basic geographic features, and drawing elements in planning maps. It helps establish semantic alignment between image content and natural language.
Caption, in this study, refers to extracting as many details from the image as possible. The model generates a description based solely on the image itself, rather than identifying elements in response to a specific question. The quality of the caption reflects whether the model has truly "seen" the image clearly.
Example Question: "Please describe this planning map in detail."

Reasoning

This dimension encompasses classification, spatial‐relation reasoning, and domain‐specific reasoning.
Classification focuses on the ability to recognize different types of planning maps. Based on China's five-level, three-category planning system, the maps are categorized into master plans, detailed plans (including regulatory and site plans), and specialized planning maps.
Spatial Relationship Reasoning, as shown in Table 2, assesses the understanding of spatial relationships between geographic elements in planning maps, including: topological spatial relations, sequential spatial relations, and metric spatial relations.
Professional Reasoning measures mastery of planning knowledge, such as layout morphology, functional organization, transportation systems, and environmental ecology, as distinct from general logical inference.
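To make the three kinds of spatial relations named above concrete, here is a minimal Python sketch of one check per kind. It is only an illustration: the feature names, coordinates, and helper functions are hypothetical, and "sequential" relations are interpreted here as directional (order) relations, which is an assumption rather than the benchmark's definition.

```python
import math

# Hypothetical map features in planar coordinates (km); values are illustrative only.
wetland = {"xmin": 2.0, "ymin": 1.0, "xmax": 6.0, "ymax": 4.0}  # axis-aligned bounding box
town = (3.0, 2.0)
interchange = (7.0, 5.0)


def contains(box, point):
    """Topological relation: is the point inside (or on the edge of) the box?"""
    x, y = point
    return box["xmin"] <= x <= box["xmax"] and box["ymin"] <= y <= box["ymax"]


def direction(origin, target):
    """Directional relation: compass sector (8-way) of target as seen from origin."""
    dx, dy = target[0] - origin[0], target[1] - origin[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360  # 0 = east, counter-clockwise
    sectors = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]
    return sectors[int((angle + 22.5) // 45) % 8]


def distance(a, b):
    """Metric relation: straight-line distance between two points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


print(contains(wetland, town))       # topological
print(direction(town, interchange))  # directional
print(distance(town, interchange))   # metric
```

A benchmark question in this category asks the model to infer such relations from the map image alone; the sketch only shows what each relation type means once coordinates are known.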

Association

This dimension assesses the ability to collect and relate background policies and contextual documents relevant to planning maps. At a fine scale, it examines policy, regulations, and planning indicators.

Implementation

This dimension addresses the capacity for comparing, critiquing, and optimizing planning proposals. Given the highly integrative and wide‐ranging nature of urban planning, the ability to identify critical issues and emphasize key priorities is paramount.
The questions of the Implementation subset consist of 3 categories as follows:
Task Abstraction: This assesses the ability to identify and extract key information from the question text. In the Certified Urban-Rural Planner Qualification Examination, practical questions often include a long background passage containing irrelevant constraints that need to be filtered and summarized.
Example Question: “A county currently has a population of 980,000, including 520,000 urban residents. By 2035, the plan aims to increase the urbanization rate to 80%. Three central towns—A, B, and C—and the old district of the county seat will undergo upgrading and renovation. Sixty percent of the population will be relocated to newly built residential areas on the outskirts of the county seat. The plan also includes the development of a food industrial park and rural tourism, as shown in the figure. What are the main issues presented in the map? Please extract the key information from the question.”
Example Answer: The key points in this question are urbanization rate, central town, population migration, and food industrial park.
Task-Oriented Image Summarization assesses the ability to identify and extract key information from an image. The image often contains rich information, and some parts are highly relevant to the correct answer and need to be summarized accordingly.
Example Question: “Identify the main issues shown in the image.”
Example Answer: “The main issues in the image include the spatial relationship between Central Town A, the food industrial park, and the floodplain; the topological relationship between the expressway and the nationally protected wetland; the distance between the planned interchange on the west side of the expressway and the existing one; and the overall distribution of central towns.”
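For readers who want to tag questions by dimension and category, the framework described above can be written down as a small lookup table. This is a sketch, not the benchmark's actual data format: the dictionary layout and the `dimension_of` helper are hypothetical, it lists only the categories named in this section, and Association is entered as a single category because the text names no separate subcategories for it.

```python
# Dimension and category names come from the text above; the layout is illustrative.
TAXONOMY = {
    "Perception": ["Element Recognition", "Caption"],
    "Reasoning": [
        "Classification",
        "Spatial Relationship Reasoning",
        "Professional Reasoning",
    ],
    # Assumption: the text names no subcategory here, so the dimension is
    # listed as its own single category.
    "Association": ["Association"],
    "Implementation": ["Task Abstraction", "Task-Oriented Image Summarization"],
}


def dimension_of(category):
    """Return the dimension that a category belongs to."""
    for dimension, categories in TAXONOMY.items():
        if category in categories:
            return dimension
    raise KeyError(f"unknown category: {category}")


total_categories = sum(len(cats) for cats in TAXONOMY.values())
print(len(TAXONOMY), "dimensions,", total_categories, "categories")
print(dimension_of("Task Abstraction"))
```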

Performance

[Figure: radar chart of model performance]
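A radar chart of this kind needs one score per axis. The sketch below shows one way per-dimension scores could be aggregated, assuming each answer has already been graded as correct or incorrect. The scoring protocol is not described here, so the record fields (`dimension`, `correct`) and the use of plain accuracy are assumptions, and the toy records are made up purely to exercise the function.

```python
from collections import defaultdict


def scores_by_dimension(records):
    """Compute accuracy per dimension from graded records.

    Each record is a dict with hypothetical keys:
      "dimension": str, "correct": bool
    """
    totals = defaultdict(lambda: [0, 0])  # dimension -> [num_correct, num_seen]
    for record in records:
        counts = totals[record["dimension"]]
        counts[0] += int(record["correct"])
        counts[1] += 1
    return {dim: correct / seen for dim, (correct, seen) in totals.items()}


# Made-up records for illustration only; these are not benchmark results.
toy_records = [
    {"dimension": "Perception", "correct": True},
    {"dimension": "Perception", "correct": False},
    {"dimension": "Reasoning", "correct": True},
    {"dimension": "Implementation", "correct": False},
]

print(scores_by_dimension(toy_records))
```

Open-ended tasks such as captioning would need a graded or rubric-based score instead of a boolean; the same aggregation applies if `correct` is replaced by a numeric score.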

Acknowledgements

VQA Data Annotators: Siqi Zha, Yeyang Fu, Chuang Deng, Fenghong An, Hanying Li, Jiayi Fan

Internal Testers: Jialu Yu, Yingqi Guo


Note: This content is part of a manuscript under submission. Please do not cite it until it is officially published. The arXiv article, dataset, and code will be released soon.