TY - JOUR
T1 - Highly accurate assembly polishing with DeepPolisher
AU - Human Pangenome Reference Consortium
AU - Mastoras, Mira
AU - Asri, Mobin
AU - Brambrink, Lucas
AU - Hebbar, Prajna
AU - Kolesnikov, Alexey
AU - Cook, Daniel E.
AU - Nattestad, Maria
AU - Lucas, Julian
AU - Won, Taylor S.
AU - Chang, Pi Chuan
AU - Carroll, Andrew
AU - Paten, Benedict
AU - Shafin, Kishwar
AU - Tayoun, Ahmad Abou
AU - Albracht, Derek
AU - Allen, Jamie
AU - Alsheikh-Ali, Alawi A.
AU - Andrews, Casey
AU - Antipov, Dmitry
AU - Antonacci-Fulton, Lucinda
AU - Ayllon, Marcelo
AU - Balacco, Jennifer R.
AU - Belter, Edward A.
AU - Bender, Halle D.
AU - Blair, Andrew P.
AU - Buonaiuto, Silvia
AU - Bolognini, Davide
AU - Bonini, Katherine E.
AU - Boucher, Christina
AU - Bourque, Guillaume
AU - Cao, Shuo
AU - Mc Cartney, Ann M.
AU - Cechova, Monika
AU - Chang, Xian
AU - Cheema, Jitender
AU - Cheng, Haoyu
AU - Ciofi, Claudio
AU - Cody, Sarah
AU - Colonna, Vincenza
AU - Conwell, Holland C.
AU - Cook-Deegan, Robert
AU - Diekhans, Mark
AU - Diroma, Maria Angela
AU - Doerr, Daniel
AU - Dong, Zheng
AU - Durbin, Richard
AU - Makova, Kateryna D.
N1 - Publisher Copyright: © 2025 Mastoras et al.
PY - 2025/7
Y1 - 2025/7
N2 - Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
UR - https://www.scopus.com/pages/publications/105010352128
U2 - 10.1101/gr.280149.124
DO - 10.1101/gr.280149.124
M3 - Article
C2 - 40389286
AN - SCOPUS:105010352128
SN - 1088-9051
VL - 35
SP - 1595
EP - 1608
JO - Genome Research
JF - Genome Research
IS - 7
ER -