As urbanization advances, the energy-intensive built environment in cities accounts for a growing share of energy consumption and greenhouse gas emissions in the United States. As a result, considerable effort has gone into developing tools and methodologies to forecast urban building energy consumption across space and time. However, existing physics-based and data-driven models do not efficiently account for the impacts of building interdependencies and microclimates, which can significantly affect model utility and accuracy. Because of the configurations and characteristics of modern cities, the interdependencies among buildings, e.g., heat transfer between buildings and solar impacts, are most often non-linear, high-dimensional, and highly dynamic, which makes them difficult to model. To address these challenges, we propose a novel urban building energy model (UBEM) based on the spatio-temporal graph convolutional network (STGCN) algorithm to predict urban-scale building energy consumption over time and to better understand the interactions between buildings. We use a campus in Atlanta, Georgia, as a case study to validate the accuracy of the proposed UBEM. The results indicate that the proposed UBEM significantly improves simulation accuracy and model interpretability compared with physics-based models and purely data-driven models.