Gap statistics for whole genome shotgun DNA sequencing projects.

Bioinformatics. 2004 Feb 12 [Epub ahead of print]


MOTIVATION: Investigators utilize gap estimates for DNA sequencing projects. Standard theories assume sequence is independently and identically distributed, leading to appreciable under-prediction of gaps.
RESULTS: Using a statistical scaling factor and data from 20 representative whole genome shotgun projects, we construct regression equations which relate coverage to a normalized gap measure. For prokaryotic genomes, the gap measure does not correlate with sequence coverage, while eukaryotes show strong correlation if chaff is ignored. Gaps decrease at an exponential rate of only about one-third of that predicted via theory alone. Case studies suggest that departure from theory can largely be attributed to assembly difficulties for repeat-rich genomes, but bias and coverage anomalies are also important when repeats are sparse. Such factors cannot be readily characterized a priori, suggesting upper limits on the accuracy of gap prediction. We also find that the diminishing coverage probability discussed in other studies is a theoretical artifact that does not arise for the typical project.
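
ILLUSTRATION: The difference in exponential rate can be made concrete with a minimal sketch. It assumes the "standard theory" is a Lander-Waterman-style model, in which N reads of length L from a genome of length G give coverage c = NL/G and roughly N*exp(-c) expected gaps when the minimum overlap length is neglected. The second function simply multiplies the exponent by one-third, which is one reading of the rate statement above. It is not the regression equation from the paper, whose scaling factor and coefficients are not given in this abstract. All function and parameter names are hypothetical.

```python
import math


def coverage(num_reads: int, read_len: float, genome_len: float) -> float:
    """Sequence coverage c = N * L / G."""
    return num_reads * read_len / genome_len


def gaps_iid_theory(num_reads: int, read_len: float, genome_len: float) -> float:
    """Expected gap count under an assumed Lander-Waterman-style i.i.d. model.

    Uses N * exp(-c) and neglects the minimum detectable overlap.
    """
    c = coverage(num_reads, read_len, genome_len)
    return num_reads * math.exp(-c)


def gaps_reduced_rate(num_reads: int, read_len: float, genome_len: float,
                      rate_factor: float = 1.0 / 3.0) -> float:
    """Same functional form with the exponential rate scaled by rate_factor.

    rate_factor = 1/3 is an illustrative assumption taken from the
    "about one-third" statement; it is not a fitted coefficient.
    """
    c = coverage(num_reads, read_len, genome_len)
    return num_reads * math.exp(-rate_factor * c)


if __name__ == "__main__":
    # Hypothetical project: 1 Mb genome, 500 bp reads, increasing read counts.
    genome_len, read_len = 1_000_000, 500
    for num_reads in (8_000, 12_000, 16_000):
        c = coverage(num_reads, read_len, genome_len)
        theory = gaps_iid_theory(num_reads, read_len, genome_len)
        reduced = gaps_reduced_rate(num_reads, read_len, genome_len)
        print(f"coverage {c:4.1f}x  theory {theory:10.1f}  reduced rate {reduced:10.1f}")
```

Under these assumptions the two curves diverge quickly as coverage grows, which is consistent with the under-prediction of gaps described in the MOTIVATION.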


Wendl MC, Yang SP.

Institute Authors

Michael Wendl, Ph.D.