PlanVLM-bench

🔔News

[2025-05-20] The arXiv paper, dataset, and code will be released soon.

Introduction

Territorial and spatial planning maps present the concepts, objectives, strategies, and specific measures of a plan in map form, and they guide and coordinate development, protection, and utilization activities across territorial space. They are a critical basis for planning decisions and an essential tool for public participation and for oversight of plan implementation. Because planning work is complex and specialized, understanding a planning map requires both grasping fine-grained elements (such as symbols, legends, and geographic features) and performing comprehensive analysis and judgment in light of relevant policies. This makes planning maps particularly difficult to interpret.

With the rapid advancement of multimodal large language models (MLLMs), we have established a benchmark for territorial and spatial planning maps to assess MLLMs’ map-understanding capabilities. Our contributions are as follows:

(1) Data: We constructed the Spatial Planning Map Database (SPMD), an expert-annotated repository characterized by diverse map content and high-quality annotations provided by planning domain specialists.
(2) Framework: We proposed a comprehensive evaluation standard grounded in the planning discipline that measures MLLMs’ planning-map comprehension from four perspectives (perception, reasoning, association, and application), comprising eight fine-grained subcategories.
(3) Experiments: By designing question–answer tasks grounded in authoritative question banks (specifically, the practice exam questions for the Chinese Registered Urban Planner qualification), we significantly reduced the incidence of hallucinated normative references by the models.
(4) Results: All models exhibited their weakest performance in the application dimension, while Qwen2.5-VL-32B-Instruct achieved the highest overall score across all four evaluated dimensions.

Overview

We propose a conceptual framework tailored to the domain of urban planning visualization. This framework comprises the following 4 dimensions and 8 categories:

Perception

The perception subset consists of the following 2 categories:
Element Recognition evaluates a model's ability to identify layout configurations, textual annotations, basic geographic features, and drawing elements in planning maps. It helps establish semantic alignment between image content and natural language.
Caption, in this study, refers to extracting as many details from the image as possible. The model generates a description based solely on the image itself, rather than identifying elements in response to a specific question. The quality of the caption reflects whether the model has truly "seen" the image clearly.
Example Question: "Please describe this planning map in detail."

Reasoning

This dimension encompasses classification, spatial relationship reasoning, and professional (domain-specific) reasoning.
Classification focuses on the ability to recognize different types of planning maps. Based on China's five-level, three-category planning system, the maps are categorized into master plans, detailed plans (including regulatory and site plans), and specialized planning maps.
Spatial Relationship Reasoning, as shown in Table 2, assesses the understanding of spatial relationships between geographic elements in planning maps, including topological, sequential, and metric spatial relations.
Professional Reasoning measures mastery of planning knowledge such as layout morphology, functional organization, transportation systems, and environmental ecology, as distinct from general logical inference.
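
The three kinds of spatial relations named under Spatial Relationship Reasoning can be illustrated with a toy example. The sketch below is not part of the benchmark: it assumes map elements are reduced to axis-aligned bounding boxes, and the `Box` class, the three functions, and the element names and coordinates are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box standing in for a map element (toy geometry)."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def center(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def topological_relation(a, b):
    """Topological relation of a to b: 'disjoint', 'within', 'contains', or 'overlaps'."""
    if a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin:
        return "disjoint"
    if a.xmin >= b.xmin and a.xmax <= b.xmax and a.ymin >= b.ymin and a.ymax <= b.ymax:
        return "within"
    if b.xmin >= a.xmin and b.xmax <= a.xmax and b.ymin >= a.ymin and b.ymax <= a.ymax:
        return "contains"
    return "overlaps"


def order_relation(a, b):
    """Sequential (directional) relation: compass direction of a's center from b's center."""
    dx = a.center[0] - b.center[0]
    dy = a.center[1] - b.center[1]
    north_south = "north" if dy > 0 else "south" if dy < 0 else ""
    east_west = "east" if dx > 0 else "west" if dx < 0 else ""
    return (north_south + east_west) or "same position"


def metric_relation(a, b):
    """Metric relation: Euclidean distance between centers, in map units."""
    return math.dist(a.center, b.center)


# Hypothetical elements; coordinates are made up for illustration.
wetland = Box("wetland", 0, 0, 10, 10)
park = Box("food industrial park", 2, 2, 4, 4)
town = Box("Central Town A", 12, 3, 14, 7)

print(topological_relation(park, wetland))  # within
print(order_relation(town, wetland))        # east
print(metric_relation(town, wetland))       # 8.0
```

A benchmark question on spatial relations asks the model to read these relations off the map image itself; the code only makes explicit what each relation type refers to.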

Association

This dimension assesses the ability to gather and relate the background policies and contextual documents relevant to a planning map. At a fine-grained level, it examines policies, regulations, and planning indicators.

Application

This dimension addresses the capacity for comparing, critiquing, and optimizing planning proposals. Given the highly integrative and wide-ranging nature of urban planning, the ability to identify critical issues and emphasize key priorities is paramount.
The questions of the Application subset consist of 3 categories as follows:
Task Abstraction assesses the ability to identify and extract key information from the question text. In the Registered Urban Planner qualification examination, practical questions often include a long background passage containing irrelevant constraints that need to be filtered and summarized.

Example Question: “A county currently has a population of 980,000, including 520,000 urban residents. By 2035, the plan aims to increase the urbanization rate to 80%. Three central towns—A, B, and C—and the old district of the county seat will undergo upgrading and renovation. Sixty percent of the population will be relocated to newly built residential areas on the outskirts of the county seat. The plan also includes the development of a food industrial park and rural tourism, as shown in the figure. What are the main issues presented in the map? Please extract the key information from the question.”
Example Answer: The key points in this question are urbanization rate, central town, population migration, and food industrial park.
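
One simple way to compare a model response against a reference answer like the one above is to check how many reference key points it mentions. The sketch below is an assumption for illustration only, not the scoring method used by the benchmark: `key_point_coverage` is a hypothetical helper, the model response is invented, and plain substring matching is a crude stand-in for expert or model-based grading.

```python
def key_point_coverage(response, key_points):
    """Return (fraction of key points mentioned in the response, list of missed points)."""
    if not key_points:
        raise ValueError("key_points must not be empty")
    text = response.lower()
    # Case-insensitive substring match; paraphrases will not be detected.
    missed = [kp for kp in key_points if kp.lower() not in text]
    covered = len(key_points) - len(missed)
    return covered / len(key_points), missed


# Key points taken from the example answer above.
reference_points = [
    "urbanization rate",
    "central town",
    "population migration",
    "food industrial park",
]

# Hypothetical model output, written for this illustration.
model_response = (
    "The question centers on the urbanization rate target, "
    "the three central towns, and the food industrial park."
)

score, missed = key_point_coverage(model_response, reference_points)
print(score)   # 0.75
print(missed)  # ['population migration']
```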

Task-Oriented Image Summarization assesses the ability to identify and extract key information from an image. Planning images often contain rich information, of which only some parts are highly relevant to the correct answer; these parts must be identified and summarized.

Example Question: “Identify the main issues shown in the image.”
Example Answer: “The main issues in the image include the spatial relationship between Central Town A, the food industrial park, and the floodplain; the topological relationship between the expressway and the nationally protected wetland; the distance between the planned interchange on the west side of the expressway and the existing one; and the overall distribution of central towns.”
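
The four dimensions can serve as tags on individual benchmark items so that scores are reported per dimension. The following is a minimal sketch under stated assumptions: the record layout (`dimension`, `category`, `correct`), the `per_dimension_accuracy` helper, and the use of plain accuracy as the score are all hypothetical, and the toy records are not results from the benchmark.

```python
from collections import defaultdict

# Dimension names as used in this README.
DIMENSIONS = ("perception", "reasoning", "association", "application")


def per_dimension_accuracy(records):
    """Map each dimension that has records to the share of its records marked correct."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        dim = rec["dimension"]
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim!r}")
        totals[dim] += 1
        hits[dim] += int(rec["correct"])
    return {dim: hits[dim] / totals[dim] for dim in DIMENSIONS if totals[dim]}


# Hypothetical toy records for illustration only.
toy_records = [
    {"dimension": "perception", "category": "element recognition", "correct": True},
    {"dimension": "reasoning", "category": "classification", "correct": True},
    {"dimension": "reasoning", "category": "spatial relationship reasoning", "correct": False},
    {"dimension": "application", "category": "task abstraction", "correct": False},
]

print(per_dimension_accuracy(toy_records))
# {'perception': 1.0, 'reasoning': 0.5, 'application': 0.0}
```

Open-ended categories such as Caption would need a graded score rather than a correct/incorrect flag; the boolean field here is a simplification.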


Performance

Note: This content is part of a manuscript under submission. Please do not cite it until it is officially published. The arXiv paper, dataset, and code will be released soon.