TY - JOUR
T1 - Species trees from gene trees
T2 - Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions
AU - Liu, Liang
AU - Pearl, Dennis K.
N1 - Funding Information:
ACKNOWLEDGEMENTS We would like to thank Scott Edwards, Bryan Jennings, and Anthony Tosi for providing the sequence data. Special thanks to Scott Edwards for his insightful comments and suggestions for improving the model. We thank Thomas Buckley, Lacey Knowles, and Roderic Page for their constructive comments on an earlier version of this paper. This work was partially supported by grant NSF DMS-0112050 to the Mathematical Biosciences Institute.
PY - 2007/5
Y1 - 2007/5
AB - The desire to infer the evolutionary history of a group of species should be more viable now that a considerable amount of multilocus molecular data is available. However, the current molecular phylogenetic paradigm still reconstructs gene trees to represent the species tree. Further, commonly used methods of combining data, such as the concatenation method, are known to be inconsistent in some circumstances. In this paper, we propose a Bayesian hierarchical model to estimate the phylogeny of a group of species using multiple estimated gene tree distributions, such as those that arise in a Bayesian analysis of DNA sequence data. Our model employs substitution models used in traditional phylogenetics but also uses coalescent theory to explain genealogical signals from species trees to gene trees and from gene trees to sequence data, thereby forming a complete stochastic model to estimate gene trees, species trees, ancestral population sizes, and species divergence times simultaneously. Our model is founded on the assumption that gene trees, even of unlinked loci, are correlated due to being derived from a single species tree and therefore should be estimated jointly. We apply the method to two multilocus data sets of DNA sequences. The estimates of the species tree topology and divergence times appear to be robust to the prior of the population size, whereas the estimates of effective population sizes are sensitive to the prior used in the analysis. These analyses also suggest that the model is superior to the concatenation method in fitting these data sets and thus provides a more realistic assessment of the variability in the distribution of the species tree that may have produced the molecular information at hand. Future improvements of our model and algorithm should include consideration of other factors that can cause discordance of gene trees and species trees, such as horizontal transfer or gene duplication.
UR - http://www.scopus.com/inward/record.url?scp=34548575126&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34548575126&partnerID=8YFLogxK
U2 - 10.1080/10635150701429982
DO - 10.1080/10635150701429982
M3 - Article
C2 - 17562474
AN - SCOPUS:34548575126
SN - 1063-5157
VL - 56
SP - 504
EP - 514
JO - Systematic Biology
JF - Systematic Biology
IS - 3
ER -