Chip multiprocessors are becoming increasingly popular in the embedded domain because they offer important advantages over their single-core counterparts in parallelism, power efficiency, validation, and verification. However, extracting maximum performance from these multiprocessors requires compiler support in the form of effective code parallelization. This paper presents and experimentally evaluates a locality-aware dynamic loop scheduling strategy that combines locality-aware distribution of loop iterations across parallel processors with dynamic load balancing at runtime. This hybrid scheme has been implemented and tested alongside four previously proposed loop scheduling schemes, including a locality-aware one. Our experimental analysis reveals that the proposed approach generates better results than all of the other scheduling schemes tested, whether static or dynamic. Our results also show that these improvements are consistent across experiments with different values of our major simulation parameters, such as the number of processors and the cache size per processor.