We address the task of recommending the next likely flight destination to customers of a major international airline. We compare performance measured on historical flight data with performance in a live user evaluation. On two years of historical data comprising tens of millions of flights, an ensemble approach and a collaborative filtering approach achieved accuracies of 47% and 20%, respectively, on a test set of 100,000 customers, highlighting the difficulty of the domain. We then evaluated our recommendations on 10,000 actual customers, split 45-45-10 among the ensemble, collaborative filtering, and control groups. With real users, the overall predictive power was 23%: 19% for the ensemble method and 30% for collaborative filtering, the reverse of the ordering observed on historical data. These results indicate that, in complex and shifting domains such as this one, one cannot rely solely on historical data to evaluate the impact of user recommendations. We discuss implications for recommendation systems and for future research in this and related domains.