VisionEV: multimodal large language models for spatially aware electric vehicle charging demand prediction using satellite imagery

Research output: Contribution to journal › Article › peer-review

Abstract

This manuscript presents VisionEV, a multimodal large language model (LLM) framework designed to predict electric vehicle (EV) charging demand by leveraging satellite imagery and structured textual data. Spatial context—including parking availability, land use, and nearby amenities—is critical for accurate demand estimation. However, site selection in EV infrastructure planning remains both labor-intensive and inconsistent, requiring human experts to conduct in-person audits and manually define spatial features. To overcome these limitations, VisionEV introduces an automated spatial reasoning pipeline that integrates satellite imagery of candidate locations as visual inputs, allowing the model to learn nuanced spatial patterns directly from imagery, without relying on predefined descriptors. Complementary station-level attributes, including traffic flow and temporal indicators, are embedded into domain-informed textual prompts to simulate planner reasoning. A core technical challenge lies in enabling coherent reasoning across semantically distinct inputs—structured textual data and perceptual visual context. VisionEV addresses this by reformulating the task as multimodal text generation, aligning both modalities within a shared embedding space through vision-informed prompting and lightweight domain-adaptive fine-tuning. We evaluate VisionEV using a real-world dataset of 22,852 training samples and 2,858 test samples collected from 189 public EV charging stations in Kansas City, Missouri. In the full-shot setting, VisionEV achieves superior accuracy (RMSE: 2.87, MAE: 1.98), outperforming the strongest baseline, LightGBM, by 1.0% and 5.3%, respectively. Few-shot, within-city zero-shot and cross-region spatial hold-out experiments demonstrate VisionEV's ability to generalize across unseen scenarios, and ablation studies confirm the contributions of visual input, prompt design, and fine-tuning. 
These results underscore the promise of multimodal LLMs in supporting scalable, data-driven EV infrastructure planning through automated spatial understanding.
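For readers unfamiliar with the reported error metrics (RMSE: 2.87, MAE: 1.98), the following is a minimal sketch of how they are computed; the variable names and demand values are illustrative, not taken from the paper's dataset:

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large deviations more heavily
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the prediction errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical charging-demand values for illustration only
actual    = [4.0, 7.0, 2.0, 5.0]
predicted = [3.0, 8.0, 2.5, 4.0]
print(rmse(actual, predicted))  # ~0.901
print(mae(actual, predicted))   # 0.875
```

Lower values indicate better fit on both metrics; RMSE weights large errors more than MAE, which is why the two can rank models differently.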

Original language: English (US)
Article number: 105069
Journal: Transportation Research Part D: Transport and Environment
Volume: 150
DOIs
State: Published - Jan 2026

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 15 - Life on Land

All Science Journal Classification (ASJC) codes

  • Civil and Structural Engineering
  • Transportation
  • General Environmental Science

