Urban Planning Bench

A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models

Yijie Deng†,1,2, He Zhu†,1,2, Wen Wang1,2, Minxin Chen1,2, Junyou Su1,2, Wenjia Zhang*,1,3

1 Behavioral and Spatial AI Lab, Tongji University; 2 Behavioral and Spatial AI Lab, Peking University; 3 College of Architecture and Urban Planning, Tongji University

†Equal Contribution  *Corresponding Author: wenjiazhang@tongji.edu.cn

🔔News

🔥[2025-05-20] The arXiv paper, dataset, and code will be released soon.

Introduction

Urban planning, as a highly interdisciplinary and practice-oriented field, requires evaluation that goes beyond mere factual recall: it involves complex situational judgment, policy interpretation, spatial reasoning, and value assessment. Planning texts are characterized by dense terminology, complex structure, and long reasoning chains. A dedicated benchmark helps assess and improve large models' domain-specific capabilities in the following respects:

  • Command of disciplinary terminology and regulatory systems (e.g., the "Urban and Rural Planning Law", control planning indicators)
  • Grasp of multi-level spatial governance logic (Nation - City - Community)
  • Contextualized policy judgment and solution generation (e.g., site selection, land use allocation, industrial recommendations)

Text-based benchmarks serve as the linguistic foundation for "multimodal urban intelligence." In subsequent integration with maps, charts, and spatial models, strong textual understanding is the core enabler of coordinated "text-visual-policy" intelligence.

Framework Design

We began by reviewing urban planning curricula from leading institutions in both China and the United States, including Peking University, Tongji University, MIT, and Harvard GSD. This analysis identified core knowledge domains and learning objectives in contemporary planning education.

1. Syllabus Reference: We further examined professional qualification examinations for urban planners across multiple countries, drawing from China’s Registered Urban Planner Exam, the U.S. AICP (American Institute of Certified Planners) certification, the UK’s RTPI (Royal Town Planning Institute) accreditation, Australia’s PIA (Planning Institute of Australia), and Canada’s CIP (Canadian Institute of Planners) to establish a foundational disciplinary framework.

2. Knowledge Classification: By integrating disciplinary knowledge systems and exam syllabi, we classified the dataset into 4 major categories, 24 intermediate classes, and 81 subcategories. Using the Content Validity Index (CVI), with a Scale-Level CVI (S-CVI) of 1.0, we confirmed strong alignment between our classification system and international planner certification frameworks.
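For reference, the item-level CVI is typically computed as the proportion of experts rating an item as relevant, and the scale-level S-CVI/Ave as the mean of the item-level CVIs. The sketch below illustrates this standard computation on hypothetical expert ratings; the actual panel size and rating scale used in our validation are reported in the paper.

```python
# Minimal sketch of Content Validity Index (CVI) computation on hypothetical
# expert relevance ratings (1-4 scale; ratings of 3 or 4 count as "relevant").
ratings = {
    "land_use_planning":       [4, 4, 3, 4, 4],  # one rating per expert
    "transportation_planning": [3, 4, 4, 4, 3],
    "urban_design":            [4, 3, 4, 4, 4],
}

def item_cvi(scores, relevant=3):
    """Item-level CVI: share of experts rating the item as relevant."""
    return sum(s >= relevant for s in scores) / len(scores)

item_cvis = {name: item_cvi(scores) for name, scores in ratings.items()}
s_cvi_ave = sum(item_cvis.values()) / len(item_cvis)  # S-CVI/Ave: mean of item-level CVIs
print(item_cvis, f"S-CVI/Ave = {s_cvi_ave:.2f}")
```

With ratings like these, every item-level CVI equals 1.0 and S-CVI/Ave is 1.0, matching the value reported above.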

3. Competency Dimensions: Adopting Bloom’s revised taxonomy, we developed a structured assessment matrix covering five cognitive levels of planning competencies. Each knowledge point was mapped to specific cognitive tasks:

  • Remember: Recalling facts, terminology, and fundamental concepts.
  • Understand: Interpreting, paraphrasing, and explaining key information.
  • Apply: Applying concepts and methods to contextualized planning scenarios.
  • Analyze: Deconstructing text structures, identifying implicit assumptions, and diagnosing issues.
  • Evaluate: Comparing standards, assessing judgments, and critiquing solutions.

Higher-order tasks also incorporated preliminary Create-level challenges, such as generating actionable recommendations for urban development issues.
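As an illustration of how such an assessment matrix can be represented, the sketch below tags a few hypothetical knowledge subcategories with the cognitive levels at which they are assessed; the subcategory names and level assignments are placeholders, not our released taxonomy.

```python
# Minimal sketch of a knowledge-point-by-cognitive-level assessment matrix.
# Subcategory names and level assignments are illustrative, not the released taxonomy.
COGNITIVE_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate"]

assessment_matrix = {
    # subcategory -> cognitive levels at which it is assessed
    "control_planning_indicators": {"Remember", "Understand", "Apply"},
    "land_use_allocation":         {"Apply", "Analyze", "Evaluate"},
    "planning_law_interpretation": {"Remember", "Understand", "Analyze"},
}

def coverage_by_level(matrix):
    """Count how many knowledge points are assessed at each cognitive level."""
    return {level: sum(level in levels for levels in matrix.values())
            for level in COGNITIVE_LEVELS}

print(coverage_by_level(assessment_matrix))
```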

4. Reasoning Design: To enhance model performance evaluation, particularly for complex reasoning tasks, we systematically embedded Chain-of-Thought (CoT) principles into question design. Scenarios included contextual preconditions, guided prompts, and deliberate logical fallacies, accompanied by discipline-specific analytical pathways to support CoT-match scoring.
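The exact CoT-match metric is specified in the paper; as a rough illustration only, the sketch below scores a reasoning chain by the fraction of reference analytical steps it covers, using simple keyword matching as a stand-in for the real matching rule (the step texts and example reasoning are hypothetical).

```python
# Rough sketch of a CoT-match style score: the fraction of reference analytical
# steps covered by a model's reasoning chain. Keyword matching here is a
# stand-in for the stricter matching procedure described in the paper.
def cot_match_score(model_reasoning, reference_steps):
    text = model_reasoning.lower()
    covered = sum(all(kw in text for kw in step) for step in reference_steps)
    return covered / len(reference_steps)

# Illustrative analytical pathway for a hypothetical site-selection item;
# each step is a list of keywords that must all appear in the reasoning.
reference_steps = [
    ["floor area ratio", "limit"],   # step 1: check FAR against the control plan
    ["setback"],                     # step 2: verify setback requirements
    ["public participation"],        # step 3: note the required participation procedure
]
reasoning = "The proposal exceeds the floor area ratio limit, and the setback is insufficient."
print(f"CoT-match: {cot_match_score(reasoning, reference_steps):.2f}")  # 2 of 3 steps covered
```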

5. Assessment Procedure: For validation, we convened a panel of expert urban planners from academia and practice to workshop core knowledge domains and review assessment items. Each question was paired with detailed scoring rubrics specifying evaluation criteria and acceptable response elements. We employed a dual human-machine scoring system, with algorithmic metrics calibrated against human-annotated benchmarks to ensure reliability.
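As a sketch of what such a calibration check can look like, the snippet below computes the Pearson correlation between hypothetical machine-assigned and human-assigned scores; the agreement statistics and thresholds actually used are reported in the paper.

```python
# Minimal sketch of calibrating automatic scores against human-annotated scores;
# the score values below are made up for illustration.
from statistics import mean

machine_scores = [0.8, 0.6, 1.0, 0.4, 0.7]
human_scores   = [0.9, 0.5, 1.0, 0.5, 0.6]

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"Machine-human score correlation: {pearson(machine_scores, human_scores):.2f}")
```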


Figure 1: PlanBench-Text Architecture.

Dataset

The benchmark items span key categories of urban planning knowledge, including land use planning, transportation planning, urban design, housing and community development, environmental planning, and infrastructure planning. Each TextBench item consists of a query, context information (when applicable), and evaluation criteria. The queries assess different cognitive levels based on Bloom’s revised taxonomy, with balanced distribution across Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating levels.
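As an illustration of this item structure, the following hypothetical record shows one possible layout of a benchmark item; the field names and example content are assumptions rather than the released schema.

```python
# Hypothetical benchmark item illustrating the query / context / criteria structure;
# field names and content are assumptions, not the released dataset schema.
example_item = {
    "id": "UPB-LAND-0001",
    "category": "Land Use Planning",        # one of the 4 / 24 / 81 taxonomy levels
    "cognitive_level": "Apply",             # tag from Bloom's revised taxonomy
    "context": "A district control plan caps the floor area ratio at 2.5 ...",
    "query": "Does the proposed residential project comply with the control plan? Explain why.",
    "reference_pathway": [                  # analytical steps used for CoT-match scoring
        "Identify the applicable control planning indicators",
        "Compare the proposed FAR with the permitted maximum",
        "State the compliance conclusion and any required adjustment",
    ],
    "evaluation_criteria": "Full credit only if both the FAR comparison and the conclusion are correct.",
}
```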


Figure 2: Task Flow Diagram.


Figure 3: Subject System Diagram.


Figure 4: National Distribution Map Based on Subject Syllabi.

Leaderboard

Quantitative results for the evaluated LLMs are summarized in the table below. Accuracy (%) is the metric, reported for each of the five cognitive levels; the overall Score corresponds to the unweighted mean of the five level scores. Full definitions of the evaluated tasks can be found in the paper. A minimal sketch of this aggregation follows the table.

| Family | Model | Remember | Understand | Apply | Analyze | Evaluate | Score |
|---|---|---|---|---|---|---|---|
| DeepSeek Family | DeepSeek-R1-Distill-Llama-8B | 93.8 | 64.2 | 75.3 | 78.8 | 28.4 | 68.1 |
| DeepSeek Family | DeepSeek-R1-Distill-Qwen-7B | 96.3 | 69.1 | 77.8 | 73.4 | 23.5 | 68.0 |
| LLaMA Family | Meta-Llama-3-8B-Instruct | 95.1 | 58.0 | 72.8 | 78.8 | 48.1 | 70.6 |
| LLaMA Family | Llama-3.1-Tulu-3-8B | 60.5 | 56.8 | 30.9 | 80.8 | 16.0 | 49.0 |
| Qwen Family | Qwen3-32B | 97.5 | 86.4 | 95.1 | 86.1 | 39.5 | 80.9 |
| Qwen Family | Qwen3-14B | 97.5 | 77.8 | 92.6 | 86.8 | 48.1 | 80.6 |
| Qwen Family | QwQ-32B | 95.1 | 85.2 | 91.4 | 91.9 | 38.3 | 80.4 |
| Qwen Family | Qwen3-8B | 93.8 | 80.2 | 90.1 | 90.4 | 45.7 | 80.0 |
| Qwen Family | Qwen3-4B | 95.1 | 72.8 | 90.1 | 89.3 | 46.9 | 78.8 |
| Qwen Family | Qwen3-30B-A3B | 97.5 | 79.0 | 88.9 | 89.5 | 37.0 | 78.4 |
| Qwen Family | Qwen3-1.7B | 95.1 | 79.0 | 76.5 | 85.1 | 34.6 | 74.1 |
| Qwen Family | Qwen2.5-3B-Instruct | 98.8 | 66.7 | 92.6 | 64.0 | 29.6 | 70.3 |
| Qwen Family | Qwen2.5-7B-Instruct | 98.8 | 70.4 | 81.5 | 65.9 | 30.9 | 69.5 |
| Qwen Family | Qwen2-VL-7B-Instruct | 93.8 | 65.4 | 76.5 | 65.7 | 39.5 | 68.2 |
| Qwen Family | Qwen3-0.6B | 90.1 | 55.6 | 46.9 | 74.8 | 12.3 | 55.9 |
| Qwen Family | Qwen2.5-0.5B-Instruct | 65.4 | 21.0 | 25.9 | 69.4 | 14.8 | 39.3 |
| Other Open-source LLMs | glm-4-9b-chat | 91.4 | 72.8 | 84.0 | 79.9 | 38.3 | 73.3 |
| Other Open-source LLMs | Gemma-2-9B-it | 96.3 | 75.3 | 90.1 | 67.3 | 33.3 | 72.5 |
| Other Open-source LLMs | Yi-6B-Chat | 93.8 | 48.1 | 75.3 | 85.6 | 26.2 | 65.8 |
| Other Open-source LLMs | Gemma-2-2B-it | 87.7 | 44.4 | 75.3 | 69.0 | 28.4 | 61.0 |
| Other Open-source LLMs | chatglm3-6b | 80.2 | 37.5 | 44.4 | 58.3 | 21.0 | 48.3 |
| Other Open-source LLMs | Gemma-7B-it | 33.3 | 6.2 | 33.3 | 70.8 | 6.2 | 30.0 |
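
Below is a minimal sketch of how per-level accuracies aggregate into the overall Score shown above; it is illustrative only and not the released evaluation code.

```python
# Minimal sketch of the leaderboard aggregation: per-level accuracy (%) and an
# overall Score taken as the unweighted mean of the five cognitive-level scores,
# consistent with the values in the table above. Not the released evaluation code.
LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate"]

def level_accuracy(results, level):
    """Accuracy (%) over all graded items tagged with the given cognitive level."""
    graded = [r["correct"] for r in results if r["level"] == level]
    return 100.0 * sum(graded) / len(graded)

def overall_score(results):
    return sum(level_accuracy(results, lvl) for lvl in LEVELS) / len(LEVELS)

# Made-up per-item grading results for illustration:
results = [
    {"level": "Remember", "correct": True},
    {"level": "Understand", "correct": False},
    {"level": "Apply", "correct": True},
    {"level": "Analyze", "correct": True},
    {"level": "Evaluate", "correct": False},
]
print(f"Overall Score: {overall_score(results):.1f}")  # -> 60.0
```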

Performance


Figure 5: Radar Chart of Model Error Type Distribution.

Acknowledgements

Data Annotators: Yijie Deng, Siqi Zha, Fenghong An, Hanying Li, Chuang Deng

Data Test Engineers: He Zhu, Wen Wang, Junyou Su


Note: This content is part of a manuscript under submission. Please do not cite it until it is officially published. The arXiv paper, dataset, and code will be released soon.

BibTeX


@misc{deng2025urban,
    title = {Urban Planning Bench: A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models},
    author = {Yijie Deng and He Zhu and Wen Wang and Minxin Chen and Junyou Su and Wenjia Zhang},
    year = {2025},
    institution = {Behavioral and Spatial AI Lab, Tongji University and Peking University; College of Architecture and Urban Planning, Tongji University},
    note = {†Equal contribution. *Corresponding author: wenjiazhang@tongji.edu.cn},
}