Increasing urbanization across the globe has coincided with greater access to urban data; this enables researchers and city administrators with better tools to understand urban dynamics, such as crime, traffic, and living standards. In this paper, we study the Learning an Embedding Space for Regions (LESR) problem, wherein we aim to produce vector representations of discrete regions. Recent studies have shown that embedding geospatial regions in a latent vector space can be useful in a variety of urban computing tasks. However, previous studies do not consider regions across multiple modalities in an end-to-end framework. We argue that doing so facilitates the learning of greater semantic relationships among regions. We propose a novel method, RegionEncoder, that jointly learns region representations from satellite image, point-of-interest, human mobility, and spatial graph data. We demonstrate that these region embeddings are useful as features in two regression tasks and across two distinct urban environments. Additionally, we perform an ablation study that evaluates each major architectural component. Finally, we qualitatively explore the learned embedding space, and show that semantic relationships are discovered across modalities.