TY - JOUR
T1 - EST clustering error evaluation and correction
AU - Wang, Ji Ping Z.
AU - Lindsay, Bruce G.
AU - Leebens-Mack, James
AU - Cui, Liying
AU - Wall, Kerr
AU - Miller, Webb C.
AU - dePamphilis, Claude W.
N1 - Funding Information:
The authors thank three anonymous reviewers for suggestions to help clarify important concepts; Drs Hong Ma and Francesca Chiaromonte for helpful comments and suggestions; Dr Xiangqiu Huang for providing the CAP3 program and Drs John Quackenbush and Geo Pertea for help in EST cleaning. The research was jointly supported by NSF Grant DMS0104443 to B.G.L. and NSF Grant DBI0115684 to C.W.D. at the Pennsylvania State University. This is paper #19 from The Floral Genome Project.
PY - 2004/11/22
Y1 - 2004/11/22
N2 - Motivation: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in such analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated. Results: We identify and quantify two types of error, Type I and Type II, in EST clustering using the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5′ and 3′ EST clustering, the Type I error rate in the 5′ EST case is ∼10 times higher than in the 3′ EST case (30% versus 3%). An over-stringent identity rule, e.g. P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that ∼80% of the Type I error in 5′ EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and provide more accurate estimates of the true gene cluster profile.
AB - Motivation: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in such analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated. Results: We identify and quantify two types of error, Type I and Type II, in EST clustering using the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5′ and 3′ EST clustering, the Type I error rate in the 5′ EST case is ∼10 times higher than in the 3′ EST case (30% versus 3%). An over-stringent identity rule, e.g. P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that ∼80% of the Type I error in 5′ EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and provide more accurate estimates of the true gene cluster profile.
UR - http://www.scopus.com/inward/record.url?scp=10244224129&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=10244224129&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bth342
DO - 10.1093/bioinformatics/bth342
M3 - Article
C2 - 15189818
AN - SCOPUS:10244224129
SN - 1367-4803
VL - 20
SP - 2973
EP - 2984
JO - Bioinformatics
JF - Bioinformatics
IS - 17
ER -