Urban Planning Bench

A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models

Yijie Deng†,1,2, He Zhu†,1,2, Wen Wang1,2, Minxin Chen1,2, Junyou Su1,2, Wenjia Zhang*,1,3

1 Behavioral and Spatial AI Lab, Tongji University; 2 Behavioral and Spatial AI Lab, Peking University; 3 College of Architecture and Urban Planning, Tongji University

†Equal Contribution  *Corresponding Author: wenjiazhang@tongji.edu.cn

🔔News

🔥[2025-05-20] The arXiv paper, dataset, and code will be released soon.

Introduction

Urban planning, as a highly interdisciplinary and practice-oriented field, requires evaluation that goes beyond mere factual recall: it involves complex situational judgment, policy interpretation, spatial reasoning, and value assessment. Planning texts are characterized by dense terminology, complex structure, and long reasoning chains. A dedicated benchmark helps assess and improve large models' domain-specific capabilities in the following respects:

  • Command of disciplinary terminology and regulatory systems (e.g., the "Urban and Rural Planning Law", control planning indicators)
  • Grasp of multi-level spatial governance logic (Nation - City - Community)
  • Contextualized policy judgment and solution generation (e.g., site selection, land use allocation, industrial recommendations)

Text-based benchmarks serve as the linguistic foundation for "multimodal urban intelligence." In subsequent integration with maps, charts, and spatial models, strong textual understanding is the core enabler of coordinated "text-visual-policy" intelligence.

Framework Design

We began by reviewing urban planning curricula from leading institutions in both China and the United States, including Peking University, Tongji University, MIT, and Harvard GSD. This analysis identified core knowledge domains and learning objectives in contemporary planning education.

1. Syllabus Reference: We further examined professional qualification examinations for urban planners across multiple countries, drawing from China’s Registered Urban Planner Exam, the U.S. AICP (American Institute of Certified Planners) certification, the UK’s RTPI (Royal Town Planning Institute) accreditation, Australia’s PIA (Planning Institute of Australia), and Canada’s CIP (Canadian Institute of Planners) to establish a foundational disciplinary framework.

2. Knowledge Classification: By integrating disciplinary knowledge systems and exam syllabi, we classified the dataset into 4 major categories, 24 intermediate classes, and 81 subcategories. Using the Content Validity Index (CVI), with a Scale-Level CVI (S-CVI) of 1.0, we confirmed strong alignment between our classification system and international planner certification frameworks.
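For reference, the item-level CVI is typically computed as the proportion of experts rating an item as relevant, and the scale-level S-CVI/Ave as the mean of the item-level CVIs. The sketch below illustrates this standard computation on hypothetical expert ratings; the actual panel size and rating scale used in our validation are reported in the paper.

```python
# Minimal sketch of Content Validity Index (CVI) computation on hypothetical
# expert relevance ratings (1-4 scale; ratings of 3 or 4 count as "relevant").
ratings = {
    "land_use_planning":       [4, 4, 3, 4, 4],  # one rating per expert
    "transportation_planning": [3, 4, 4, 4, 3],
    "urban_design":            [4, 3, 4, 4, 4],
}

def item_cvi(scores, relevant=3):
    """Item-level CVI: share of experts rating the item as relevant."""
    return sum(s >= relevant for s in scores) / len(scores)

item_cvis = {name: item_cvi(scores) for name, scores in ratings.items()}
s_cvi_ave = sum(item_cvis.values()) / len(item_cvis)  # S-CVI/Ave: mean of item-level CVIs
print(item_cvis, f"S-CVI/Ave = {s_cvi_ave:.2f}")
```

With ratings like these, every item-level CVI equals 1.0 and S-CVI/Ave is 1.0, matching the value reported above.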

3. Competency Dimensions: Adopting Bloom’s revised taxonomy, we developed a structured assessment matrix covering five cognitive levels of planning competencies. Each knowledge point was mapped to specific cognitive tasks:

  • Remember: Recalling facts, terminology, and fundamental concepts.
  • Understand: Interpreting, paraphrasing, and explaining key information.
  • Apply: Applying concepts and methods to contextualized planning scenarios.
  • Analyze: Deconstructing text structures, identifying implicit assumptions, and diagnosing issues.
  • Evaluate: Comparing standards, assessing judgments, and critiquing solutions.

Higher-order tasks also incorporated preliminary Create-level challenges, such as generating actionable recommendations for urban development issues.
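As an illustration of how such an assessment matrix can be represented, the sketch below tags a few hypothetical knowledge subcategories with the cognitive levels at which they are assessed; the subcategory names and level assignments are placeholders, not our released taxonomy.

```python
# Minimal sketch of a knowledge-point-by-cognitive-level assessment matrix.
# Subcategory names and level assignments are illustrative, not the released taxonomy.
COGNITIVE_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate"]

assessment_matrix = {
    # subcategory -> cognitive levels at which it is assessed
    "control_planning_indicators": {"Remember", "Understand", "Apply"},
    "land_use_allocation":         {"Apply", "Analyze", "Evaluate"},
    "planning_law_interpretation": {"Remember", "Understand", "Analyze"},
}

def coverage_by_level(matrix):
    """Count how many knowledge points are assessed at each cognitive level."""
    return {level: sum(level in levels for levels in matrix.values())
            for level in COGNITIVE_LEVELS}

print(coverage_by_level(assessment_matrix))
```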

4. Reasoning Design: To enhance model performance evaluation, particularly for complex reasoning tasks, we systematically embedded Chain-of-Thought (CoT) principles into question design. Scenarios included contextual preconditions, guided prompts, and deliberate logical fallacies, accompanied by discipline-specific analytical pathways to support CoT-match scoring.
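The exact CoT-match metric is specified in the paper; as a rough illustration only, the sketch below scores a reasoning chain by the fraction of reference analytical steps it covers, using simple keyword matching as a stand-in for the real matching rule (the step texts and example reasoning are hypothetical).

```python
# Rough sketch of a CoT-match style score: the fraction of reference analytical
# steps covered by a model's reasoning chain. Keyword matching here is a
# stand-in for the stricter matching procedure described in the paper.
def cot_match_score(model_reasoning, reference_steps):
    text = model_reasoning.lower()
    covered = sum(all(kw in text for kw in step) for step in reference_steps)
    return covered / len(reference_steps)

# Illustrative analytical pathway for a hypothetical site-selection item;
# each step is a list of keywords that must all appear in the reasoning.
reference_steps = [
    ["floor area ratio", "limit"],   # step 1: check FAR against the control plan
    ["setback"],                     # step 2: verify setback requirements
    ["public participation"],        # step 3: note the required participation procedure
]
reasoning = "The proposal exceeds the floor area ratio limit, and the setback is insufficient."
print(f"CoT-match: {cot_match_score(reasoning, reference_steps):.2f}")  # 2 of 3 steps covered
```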

5. Assessment Procedure: For validation, we convened a panel of expert urban planners from academia and practice to workshop core knowledge domains and review assessment items. Each question was paired with detailed scoring rubrics specifying evaluation criteria and acceptable response elements. We employed a dual human-machine scoring system, with algorithmic metrics calibrated against human-annotated benchmarks to ensure reliability.
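As a sketch of what such a calibration check can look like, the snippet below computes the Pearson correlation between hypothetical machine-assigned and human-assigned scores; the agreement statistics and thresholds actually used are reported in the paper.

```python
# Minimal sketch of calibrating automatic scores against human-annotated scores;
# the score values below are made up for illustration.
from statistics import mean

machine_scores = [0.8, 0.6, 1.0, 0.4, 0.7]
human_scores   = [0.9, 0.5, 1.0, 0.5, 0.6]

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"Machine-human score correlation: {pearson(machine_scores, human_scores):.2f}")
```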


Figure 1: PlanBench-Text Architecture.

Dataset

The benchmark items span key categories of urban planning knowledge, including land use planning, transportation planning, urban design, housing and community development, environmental planning, and infrastructure planning. Each TextBench item consists of a query, context information (when applicable), and evaluation criteria. The queries assess different cognitive levels based on Bloom’s revised taxonomy, with balanced distribution across Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating levels.
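As an illustration of this item structure, the following hypothetical record shows one possible layout of a benchmark item; the field names and example content are assumptions rather than the released schema.

```python
# Hypothetical benchmark item illustrating the query / context / criteria structure;
# field names and content are assumptions, not the released dataset schema.
example_item = {
    "id": "UPB-LAND-0001",
    "category": "Land Use Planning",        # one of the 4 / 24 / 81 taxonomy levels
    "cognitive_level": "Apply",             # tag from Bloom's revised taxonomy
    "context": "A district control plan caps the floor area ratio at 2.5 ...",
    "query": "Does the proposed residential project comply with the control plan? Explain why.",
    "reference_pathway": [                  # analytical steps used for CoT-match scoring
        "Identify the applicable control planning indicators",
        "Compare the proposed FAR with the permitted maximum",
        "State the compliance conclusion and any required adjustment",
    ],
    "evaluation_criteria": "Full credit only if both the FAR comparison and the conclusion are correct.",
}
```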


Figure 2: Task Flow Diagram.


Figure 3: Subject System Diagram.


Figure 4: National Distribution Map Based on Subject Syllabi.

Leaderboard

Quantitative results for the evaluated LLMs are summarized in the table below. Accuracy (%) is the metric, reported for each of the five cognitive levels; the overall Score corresponds to the unweighted mean of the five level scores. Full definitions of the evaluated tasks can be found in the paper. A minimal sketch of this aggregation follows the table.

| Family | Model | Remember | Understand | Apply | Analyze | Evaluate | Score |
|---|---|---|---|---|---|---|---|
| DeepSeek Family | DeepSeek-R1-Distill-Llama-8B | 93.8 | 64.2 | 75.3 | 78.8 | 28.4 | 68.1 |
| DeepSeek Family | DeepSeek-R1-Distill-Qwen-7B | 96.3 | 69.1 | 77.8 | 73.4 | 23.5 | 68.0 |
| LLaMA Family | Meta-Llama-3-8B-Instruct | 95.1 | 58.0 | 72.8 | 78.8 | 48.1 | 70.6 |
| LLaMA Family | Llama-3.1-Tulu-3-8B | 60.5 | 56.8 | 30.9 | 80.8 | 16.0 | 49.0 |
| Qwen Family | Qwen3-32B | 97.5 | 86.4 | 95.1 | 86.1 | 39.5 | 80.9 |
| Qwen Family | Qwen3-14B | 97.5 | 77.8 | 92.6 | 86.8 | 48.1 | 80.6 |
| Qwen Family | QwQ-32B | 95.1 | 85.2 | 91.4 | 91.9 | 38.3 | 80.4 |
| Qwen Family | Qwen3-8B | 93.8 | 80.2 | 90.1 | 90.4 | 45.7 | 80.0 |
| Qwen Family | Qwen3-4B | 95.1 | 72.8 | 90.1 | 89.3 | 46.9 | 78.8 |
| Qwen Family | Qwen3-30B-A3B | 97.5 | 79.0 | 88.9 | 89.5 | 37.0 | 78.4 |
| Qwen Family | Qwen3-1.7B | 95.1 | 79.0 | 76.5 | 85.1 | 34.6 | 74.1 |
| Qwen Family | Qwen2.5-3B-Instruct | 98.8 | 66.7 | 92.6 | 64.0 | 29.6 | 70.3 |
| Qwen Family | Qwen2.5-7B-Instruct | 98.8 | 70.4 | 81.5 | 65.9 | 30.9 | 69.5 |
| Qwen Family | Qwen2-VL-7B-Instruct | 93.8 | 65.4 | 76.5 | 65.7 | 39.5 | 68.2 |
| Qwen Family | Qwen3-0.6B | 90.1 | 55.6 | 46.9 | 74.8 | 12.3 | 55.9 |
| Qwen Family | Qwen2.5-0.5B-Instruct | 65.4 | 21.0 | 25.9 | 69.4 | 14.8 | 39.3 |
| Other Open-source LLMs | glm-4-9b-chat | 91.4 | 72.8 | 84.0 | 79.9 | 38.3 | 73.3 |
| Other Open-source LLMs | Gemma-2-9B-it | 96.3 | 75.3 | 90.1 | 67.3 | 33.3 | 72.5 |
| Other Open-source LLMs | Yi-6B-Chat | 93.8 | 48.1 | 75.3 | 85.6 | 26.2 | 65.8 |
| Other Open-source LLMs | Gemma-2-2B-it | 87.7 | 44.4 | 75.3 | 69.0 | 28.4 | 61.0 |
| Other Open-source LLMs | chatglm3-6b | 80.2 | 37.5 | 44.4 | 58.3 | 21.0 | 48.3 |
| Other Open-source LLMs | Gemma-7B-it | 33.3 | 6.2 | 33.3 | 70.8 | 6.2 | 30.0 |
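
Below is a minimal sketch of how per-level accuracies aggregate into the overall Score shown above; it is illustrative only and not the released evaluation code.

```python
# Minimal sketch of the leaderboard aggregation: per-level accuracy (%) and an
# overall Score taken as the unweighted mean of the five cognitive-level scores,
# consistent with the values in the table above. Not the released evaluation code.
LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate"]

def level_accuracy(results, level):
    """Accuracy (%) over all graded items tagged with the given cognitive level."""
    graded = [r["correct"] for r in results if r["level"] == level]
    return 100.0 * sum(graded) / len(graded)

def overall_score(results):
    return sum(level_accuracy(results, lvl) for lvl in LEVELS) / len(LEVELS)

# Made-up per-item grading results for illustration:
results = [
    {"level": "Remember", "correct": True},
    {"level": "Understand", "correct": False},
    {"level": "Apply", "correct": True},
    {"level": "Analyze", "correct": True},
    {"level": "Evaluate", "correct": False},
]
print(f"Overall Score: {overall_score(results):.1f}")  # -> 60.0
```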

Performance


Figure 5: Radar Chart of Model Error Type Distribution.

Acknowledgements

Data Annotators: Yijie Deng, Siqi Zha, Fenghong An, Hanying Li, Chuang Deng

Data Test Engineers: He Zhu, Wen Wang, Junyou Su


Note: This content is part of a manuscript under submission. Please do not cite it until it is officially published. The arXiv paper, dataset, and code will be released soon.

BibTeX


@misc{deng2025urban,
    title = {Urban Planning Bench: A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models},
    author = {Yijie Deng and He Zhu and Wen Wang and Minxin Chen and Junyou Su and Wenjia Zhang},
    year = {2025},
    institution = {Behavioral and Spatial AI Lab, Tongji University and Peking University; College of Architecture and Urban Planning, Tongji University},
    note = {†Equal contribution. *Corresponding author: wenjiazhang@tongji.edu.cn},
}