The ever-growing complexity and popularity of machine learning and deep learning applications have created an urgent need for effective and efficient support for these applications on contemporary computing systems. In this paper, we thoroughly analyze several DNN algorithms on three widely used architectures (CPU, GPU, and Xeon Phi). The DNN algorithms we choose for evaluation are i) Unet, for biomedical image segmentation, based on a Convolutional Neural Network (CNN); ii) NMT, for neural machine translation, based on a Recurrent Neural Network (RNN); and iii) ResNet-50 and iv) DenseNet, both for image processing and based on CNNs. The ultimate goal of this paper is to answer four fundamental questions: i) do different DNNs exhibit similar behavior on a given execution platform? ii) does a given DNN behave differently across different platforms? iii) for the same execution platform and the same DNN, do different execution phases behave differently? and iv) are the current major general-purpose platforms tuned sufficiently well for different DNN algorithms? Motivated by these questions, we conduct an in-depth investigation of running DNN applications on modern systems. Specifically, we first identify the most time-consuming functions (hotspot functions) across different networks and platforms. Next, we characterize the performance bottlenecks and discuss them in detail. Finally, we port selected hotspot functions to a cycle-accurate simulator and use the results to guide architectural optimizations that better support DNN applications.