The Human Genome Project, the largest biomedical project undertaken to date, greatly accelerated the advancement of DNA sequencing technologies. Three generations of DNA sequencing technologies have been developed in the last three decades, and we are now at the crossroads of the second and third generations. The move from the currently prevalent second-generation technology to third-generation single-molecule sequencing is expected to further lower sequencing costs and expand applications in biomedical research and biotechnology development. Nevertheless, arguably the biggest roadblock to this transition has been the computational problem of genome assembly. Specifically, the error detection and correction ‘curse’ returns when we pursue high-throughput long reads, which are a major selling point of the third-generation technology.
A recently released software package, dubbed DBG2OLC (http://sites.google.com/site/dbg2olc/), was developed by a team of scientists including Prof. Sam Ma at the Computational Biology and Medical Ecology Lab of the Chinese Academy of Sciences and Profs. Chengxi Ye, James Yorke, and Aleksey Zimin from the University of Maryland. It implements a novel de novo assembly algorithm that was demonstrated to be ultra-efficient, in both computational time and memory, at assembling the highly erroneous long reads produced by third-generation DNA sequencers. DBG2OLC converts the de novo genome assembly problem from the de Bruijn graph (DBG) framework to the overlap-layout-consensus (OLC) framework. For each sequence read, DBG2OLC compresses the regions that lie inside de Bruijn graph contigs, which greatly lowers the complexity of the assembly problem. The compression turns previously prohibitive tasks, such as pair-wise alignment, into jobs that take little time. A compressed overlap graph that preserves all necessary information is then constructed from the compressed reads to enable the final-stage assembly. Experiments with third-generation sequencing data produced by PacBio and Oxford Nanopore technologies show that DBG2OLC assembled large genomes two orders of magnitude more efficiently than existing third-generation genome assemblers in terms of computational time and memory usage. The final assemblies are also two orders of magnitude more contiguous than those obtained with the prevalent second-generation Illumina sequencing technology. For example, on a large PacBio human genome dataset, DBG2OLC took only 6 CPU hours to calculate the pair-wise alignment of 54x erroneous long reads and 2,000 CPU hours to complete the final assembly on a desktop PC, compared with the 405,000 CPU hours previously reported by Pacific Biosciences on a Google cluster.
On a Nanopore dataset, DBG2OLC obtained high-quality results (an identity rate of 99.5%) even though the sequencing error rate was over 30%.
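The read-compression idea described above can be illustrated with a toy sketch. It is not the DBG2OLC implementation: the function names, the tiny sequences, and the `min_hits` threshold are all assumptions made for illustration, and the sketch ignores reverse-complement strands, contig orientation, and consensus. It only shows why comparing short lists of contig identifiers is much cheaper than aligning long, error-prone base sequences.

```python
from collections import Counter

def build_kmer_index(contigs, k):
    """Map each k-mer found in exactly one contig to that contig's ID."""
    index, ambiguous = {}, set()
    for contig_id, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in ambiguous:
                continue
            if kmer in index and index[kmer] != contig_id:
                # Shared between contigs: useless as an anchor, so drop it.
                del index[kmer]
                ambiguous.add(kmer)
            else:
                index[kmer] = contig_id
    return index

def compress_read(read, index, k, min_hits=2):
    """Replace a long read by the ordered list of contig IDs it touches."""
    hits = [index[read[i:i + k]]
            for i in range(len(read) - k + 1)
            if read[i:i + k] in index]
    counts = Counter(hits)
    compressed = []
    for contig_id in hits:
        # Ignore contigs supported by too few k-mers (likely spurious hits)
        # and collapse runs of the same contig into a single entry.
        if counts[contig_id] >= min_hits and (
                not compressed or compressed[-1] != contig_id):
            compressed.append(contig_id)
    return compressed

def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of list a that equals a prefix of list b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

# Hypothetical toy data: three short "contigs" and two "long reads".
contigs = {"c1": "ACGTACGGTC", "c2": "TTGACCAGTA", "c3": "GGCATTCGAA"}
k = 4
index = build_kmer_index(contigs, k)

read_a = "ACTTACGGTC" + "TTGACCAGTA"  # spans c1 (one substitution error) and c2
read_b = "TTGACCAGTA" + "GGCATTCGAA"  # spans c2 and c3

comp_a = compress_read(read_a, index, k)
comp_b = compress_read(read_b, index, k)
print(comp_a, comp_b, suffix_prefix_overlap(comp_a, comp_b))
# prints: ['c1', 'c2'] ['c2', 'c3'] 1
```

In this sketch the sequencing error in `read_a` only removes a few k-mer hits and leaves the compressed form unchanged, and the overlap between the two reads is found by comparing two-element lists instead of two 20-base sequences.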
With its powerful error detection and correction capabilities, far more parsimonious resource consumption (a two-orders-of-magnitude improvement over existing techniques), and a low to moderate sequencing coverage requirement (it achieved decent assembly quality with only 10x coverage), DBG2OLC makes it possible to assemble large genomes efficiently on an office workstation rather than on expensive supercomputers or clusters. This breakthrough should significantly accelerate the adoption of third-generation sequencing technologies in large-scale genomic research and biotechnology development.