Context: Software Engineering (SE) community has empirically investigated software defect prediction as a proxy to benchmark it as a process improvement activity to assure software quality. In the domain of software fault prediction, the performance of classification algorithms is highly provoked with the residual effects attributed to feature irrelevance and data redundancy issues. Problem: The meta-learning-based ensemble methods are usually carried out to mitigate these noise effects and boost the software fault prediction performance. However, there is a need to benchmark the performance of meta-learning ensemble methods (as fault predictor) to assure software quality control and aid developers in their decision making. Method: We conduct an empirical and comparative study to evaluate and benchmark the improvement in the fault prediction performance via meta-learning ensemble methods as compared to their component base-level fault predictors. In this study, we perform a series of experiments with four well-known meta-level ensemble methods Vote, StackingC (i.e., Stacking), MultiScheme, and Grading. We also use five high-performance fault predictors Logistic (i.e., Logistic Regression), J48 (i.e., Decision Tree), IBK (i.e. k-nearest neighbor), NaiveBayes, and Decision Table (DT). Subsequently, we performed these experiments on public defect datasets with k-fold (k=10) cross-validation. We used F-measure and ROC-AUC (Receiver Operating Characteristic-Area Under Curve) performance measures and applied the four non-parametric tests to benchmark the fault prediction performance results of meta-learning ensemble methods. Results and Conclusion: we conclude that meta-learning ensemble methods, especially Vote could outperform the base-level fault predictors to tackle the feature irrelevance and redundancy issues in the domain of software fault prediction. Having said that, their performance is highly related to the number of base-level classifiers and the set of software fault prediction metrics.