Project

Reference Genomes Improvement

MGI's commitment to enhancing and diversifying human reference genomes.

Reference Genomes Improvement Details

The Human Genome Project (HGP) produced the human reference genome assembly, a database of DNA sequence that represents an example of a full human genome. When researchers sequence human genomes, they compare, or “align”, their results to this reference. Although this assembly is one of the most frequently used resources in biomedical research, de novo genome assembly remains a significant challenge despite the increase in throughput and the decrease in sequencing cost over the past decade. Aligning human sequence reads to the reference assembly is critical to successful data analysis, and several published reports identify regions of the reference assembly that were previously impossible to analyze because of the limitations of the available sequencing technologies, complex genome architecture, missing sequences, and various errors in the assembly or the underlying sequence data.

GRCh38 Primary Assembly: % Bases by Library

At the time of publication, the ‘finished’ human genome (NCBI35; GCF_000001405.11) contained 288 assembly gaps. As work to close these gaps progressed, many were determined to lie in regions containing structurally variant alleles. Because the reference genome is assembled from the sequence of many donors, it represents a haploid mosaic. For example, in the MAPT region, the allele represented in the assembly was not likely present in any individual human, as it had been constructed by mixing the direct and inverted haplotypes present in the RP11 donor.

We can now correct many of the deficiencies in the current human reference by applying the latest advances in sequencing and mapping technologies. One key advance has been the sequencing and finishing of single haplotype human genomes (e.g., CHM1 and CHM13). These data sets can be used to completely resolve both alleles of several genome sequences and improve the GRCh38 reference assembly.

MGI Method for Improving Reference Genomes

The explosion of clinical genome sequencing requires a human reference genome resource that accurately represents the diversity of the human population, thereby facilitating the identification and characterization of disease-associated variants and somatic events. Although the 1000 Genomes (1KG) project provided a valuable foundation, it is now necessary to select additional representative human genomes, or regions of human genomes, for deep sequencing, assembly, and finishing to high quality and contiguity. These new assemblies should be accessible to the scientific community in the context of the existing reference genome, with improved bioinformatics tools that provide intuitive access to alternate paths, alleles, and haplotypes. In addition, the program will provide greater outreach and educational opportunities aimed at empowering users, especially those who may be relatively new to sequencing technology and analysis, to make better use of the human reference genome resource in discovery projects and clinical sequencing applications.

Sequencing Plan

We will sequence and assemble at least five diploid genomes from individuals selected to maximize human genetic diversity (see table below). All sources chosen thus far have BAC libraries available and, whenever possible, we will use samples from a trio (two parents and a child). We will sequence the parents within the trio at a lower depth of coverage to enable haplotype phasing of the proband sequence. The samples selected at this time are one Yoruban (NA19240), one Han Chinese from Beijing (HG00514), one CEPH European (NA12878), one Puerto Rican (HG00733), one Luhya from Webuye, Kenya (NA19434), one Colombian (HG01352), one Gambian (HG02818), and one Kinh from Vietnam (HG02059). Other independent efforts to sequence and assemble new reference genomes include two Japanese genomes, one Malaysian genome, one Han Chinese genome, and an Ashkenazim trio (as part of the Genome in a Bottle effort).
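To illustrate why parental data helps phase the proband, here is a minimal sketch of trio-based phasing at a single biallelic site. It is not the project's pipeline: the function name `phase_child_site` and the tuple-based genotype representation are assumptions made for this example, and real phasing tools also handle genotype uncertainty, read-backed evidence, and linkage between sites.

```python
from typing import Optional, Tuple

# Hypothetical representation: a genotype is an unordered pair of alleles.
Genotype = Tuple[str, str]


def phase_child_site(
    child: Genotype, mother: Genotype, father: Genotype
) -> Optional[Tuple[str, str]]:
    """Return (maternal_allele, paternal_allele) for the child at one site.

    Returns None when the parental genotypes cannot decide the phase
    (for example, all three individuals are heterozygous) or when the
    genotypes are inconsistent with Mendelian inheritance.
    """
    a, b = child
    if a == b:
        # Homozygous sites carry the same allele on both haplotypes.
        return (a, b)

    # Try both assignments of the child's alleles to the two parents and
    # keep those consistent with what each parent could have transmitted.
    candidates = set()
    for maternal, paternal in ((a, b), (b, a)):
        if maternal in mother and paternal in father:
            candidates.add((maternal, paternal))

    # The phase is resolved only if exactly one assignment is possible.
    if len(candidates) == 1:
        return candidates.pop()
    return None
```

For instance, a proband with genotype A/G whose mother is A/A and whose father is A/G must have inherited A from the mother and G from the father. When both parents and the proband are heterozygous, the site stays unphased in this sketch.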

Improving Reference Genome Quality and Diversity

Specific Aims

We plan to identify and resolve issues (misassemblies, sequence errors, and gaps) within the current reference, GRCh38. We will add substantial allelic diversity to the reference to facilitate effective analysis of biomedically important regions across the genome. We will accomplish this by completely finishing two genomes (CHM1 and CHM13) to “platinum” quality and performing targeted finishing to “gold” quality in additional genomes. The two quality levels are defined as follows.

Gold Genome: A high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.

Platinum Genome: A contiguous, haplotype-resolved representation of the entire genome.

We will engage the bioinformatics community to ensure that the next generation of aligners, variant callers, annotation pipelines, and bioinformatics tools will be capable of interacting with a multi-allelic reference genome. We will facilitate more effective use of the reference for biomedical discovery by providing detailed tutorials of the required complex tool chains. Finally, through the development and deployment of community outreach and education programs, we will convey the importance of the reference as much more than a linear chromosomal assembly.

Platinum, Gold, and Silver Genome Specifications

Assembly and Analysis Details

After long reads are generated on the PacBio platform, we assemble them with the Falcon algorithm and then perform error correction with Quiver. The output of this step is a FASTA file of unordered and unoriented contigs. We then align the BioNano genomic map generated from the same individual, along with clone end sequences (if available), to check for global misassemblies. Where these data support it, we break contigs, and we output ordered and oriented contigs based on the map alignments. Contigs that do not align are written to a separate file. We then use NCBI’s assembly-assembly alignment and chromosome contig generating software to further QC the assembly.
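The ordering and orienting step can be sketched roughly as follows. This is an illustration only, not the BioNano or NCBI software: the `MapPlacement` record, its fields, and the `order_and_orient` function are hypothetical names introduced here, and the sketch assumes each contig has at most one placement on the genomic map and that misassembled contigs have already been broken.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class MapPlacement:
    """Hypothetical record: where one contig aligns on the genomic map."""
    contig: str
    chrom: str
    start: int
    strand: str  # "+" or "-"


def order_and_orient(
    contigs: Dict[str, str], placements: List[MapPlacement]
) -> Tuple[Dict[str, List[Tuple[str, str]]], List[str]]:
    """Order contigs along each chromosome and flip minus-strand contigs.

    Returns a pair: a dict mapping chromosome name to a list of
    (contig_name, oriented_sequence) in map order, and a list of the
    names of contigs that had no placement.
    """
    placed: Dict[str, List[Tuple[str, str]]] = {}
    seen = set()
    for p in sorted(placements, key=lambda p: (p.chrom, p.start)):
        # Skip placements for unknown contigs and repeated placements.
        if p.contig not in contigs or p.contig in seen:
            continue
        seq = contigs[p.contig]
        if p.strand == "-":
            seq = reverse_complement(seq)
        placed.setdefault(p.chrom, []).append((p.contig, seq))
        seen.add(p.contig)

    # Contigs with no map alignment are reported separately.
    unaligned = [name for name in contigs if name not in seen]
    return placed, unaligned
```

Keeping the unplaced contigs in their own list mirrors the separate file of unaligned contigs described above, so no assembled sequence is silently dropped.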

Once the assembly is in ordered and oriented chromosome contigs, we run the NCBI RefSeq gene annotation pipeline and further annotate the assembly with RepeatMasker and segmental duplication annotations. After annotation, we can integrate other data, such as Illumina alignments and variant calls, clone-based resources, and data from newer technologies such as Dovetail and GemCode, to improve the assembly and assess its quality. See http://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/ for more information regarding NCBI pipelines.

Genomes to Be Sequenced and Assembled

Data Source  Origin of Samples  Quality   Status/Links
CHM1         NA                 Platinum  Assembly QC
CHM13        NA                 Platinum  Assembly QC
HG00514      Han Chinese        Gold      Assembly QC
HG00733      Puerto Rican       Gold      Assembly QC
NA12878      European           Gold      Assembly QC
NA19240      Yoruban            Gold      Manuscript Preparation
NA19434      Luhya              Gold      Not Started
HG01352      Colombian          Gold      Assembly QC
HG02818      Gambian            Gold      Data Generation
HG02059      Vietnamese         Gold      Not Started