Since the completion of the Human Genome Project, the cost of sequencing has decreased dramatically and throughput has increased exponentially. Despite this, de novo genome assembly remains a significant challenge. Whenever new human genomes are sequenced, they are aligned and compared to the reference, a step that is critical to successful data analysis. The reference is used as a scaffold and is vital to genomics researchers worldwide. As more clinical decisions are made using genomic data, gaps and errors in the reference make these data harder to interpret.
Several published reports identify regions of the reference assembly that are recalcitrant to analysis due to the limitations of the available sequencing technologies, complex genome architecture, missing sequences, and various errors in the assembly or underlying sequence data. “With trillions of bases of DNA sequence being compared to the human genome reference on a daily basis, it is critical to have the best possible representation of that reference available to ensure accurate interpretation of those comparisons,” says MGI’s Bob Fulton, a leader on the project.
Since its inception, MGI has been a major advocate for a high-quality reference. The Genome Reference Consortium (GRC), made up of MGI, the WT Sanger Institute, NCBI, and EBI, took ownership of the reference assembly in 2007 and has made substantial improvements in data quality, management, and centralization. The GRC provides public access to the bulk of these reference data via APIs, FTP reports, and a website (Genome Reference). The GRC has also developed an assembly model that allows for the representation of allelic diversity; this model is used in the current human reference assembly, GRCh38.
The reference has often been interpreted as a linear representation of the genome, and it includes very little allelic diversity. The reference genome is largely derived from people of central European and African descent and thus does not capture the diversity of humans. Fulton describes the problem of gaps in the reference caused by allelic diversity this way: “Like a puzzle, but the pieces you are trying to assemble are from different puzzles.” MGI will add substantial allelic diversity to the reference assembly by completely finishing (“platinum”) at least two genomes (CHM1 and CHM13) and performing targeted finishing (“gold”) in additional genomes.
Additionally, MGI plans to engage the bioinformatics community to ensure that the next generation of aligners, variant callers, annotation pipelines, and new bioinformatics tools will be capable of interacting with a multi-allelic reference genome. This collaborative effort will ensure that a pan-reference genome is readily usable in medical discovery analysis pipelines. A further aim of the project is to communicate that the reference is much more than a linear chromosomal assembly through the development and deployment of community outreach and education programs.
Fulton adds, “As we gain a better understanding of the implications of the diversity in individuals' genomes, it is clear that representing those differences in the reference is key to the accurate interpretation of genomic data.”
For more information regarding the Genome Reference Consortium, please visit the GRC website.