Reference Genome Improvement
The Human Genome Project (HGP) produced the human reference genome assembly, a database of DNA sequence that represents an example of a full human genome. When researchers sequence human genomes, they compare, or “align,” their results to this reference. Although this assembly is one of the most frequently used resources in biomedical research, de novo genome assembly remains a significant challenge despite the increases in throughput and decreases in sequencing cost over the past decade.
Aligning human sequence reads to the reference assembly is a critical step in successful data analysis. Several published reports identify regions of the reference assembly that were previously impossible to analyze because of the limitations of the available sequencing technologies, complex genome architecture, missing sequences, and various errors in the assembly or the underlying sequence data.
Specific aims
We plan to identify and resolve issues (misassemblies, sequence errors and gaps) within the current reference, GRCh38. We will also add substantial allelic diversity to the reference to facilitate effective analysis of biomedically important regions across the genome. We will accomplish this by completely finishing two genomes (CHM1 and CHM13) to “platinum” standard and performing targeted finishing to “gold” standard in additional genomes. Gold and platinum genomes are defined as follows:
Gold Genome: A high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.
Platinum Genome: A contiguous, haplotype-resolved representation of the entire genome.
We will engage the bioinformatics community to ensure that the next generation of aligners, variant callers, annotation pipelines and other bioinformatics tools can work with a multi-allelic reference genome. We will facilitate more effective use of the reference for biomedical discovery by providing detailed tutorials on the complex tool chains required. Finally, through the development and deployment of community outreach and education programs, we will convey the importance of the reference as much more than a linear chromosomal assembly.
Assembly and analysis details
After long reads are generated on the PacBio platform, we assemble them using the Falcon algorithm, followed by error correction using Quiver. The output of this step is a FASTA file of unordered and unoriented contigs. We then align the BioNano genomic map generated from the same individual, along with clone end sequences (if available), to check for global misassemblies. We make breaks where possible based on these data and output ordered and oriented contigs based on the map alignments, along with a separate file of unaligned contigs. We then use NCBI’s assembly-assembly alignment and chromosome contig generating software to further QC the assembly.
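The ordering and orienting step can be illustrated with a minimal Python sketch. It is not the project's pipeline: the `MapAlignment` record, its fields, and the `order_and_orient` function are hypothetical names chosen for illustration, and the sketch assumes each contig has at most one map alignment giving a chromosome, a start coordinate on the map, and a strand. Contigs without an alignment are returned separately, mirroring the file of unaligned contigs.

```python
from dataclasses import dataclass

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class MapAlignment:
    """Hypothetical record: where one contig aligns on a genome map."""
    contig: str     # contig name, as in the FASTA header
    chrom: str      # chromosome the map anchors the contig to
    map_start: int  # start coordinate of the alignment on the map
    strand: str     # "+" or "-"


def order_and_orient(contigs, alignments):
    """Order contigs by map position and flip minus-strand contigs.

    contigs: dict mapping contig name to sequence.
    alignments: iterable of MapAlignment (first alignment per contig wins).
    Returns (placed, unaligned): placed maps each chromosome to a list of
    (name, oriented sequence) in map order; unaligned holds the rest.
    """
    placed = {}
    used = set()
    for aln in sorted(alignments, key=lambda a: (a.chrom, a.map_start)):
        if aln.contig not in contigs or aln.contig in used:
            continue
        seq = contigs[aln.contig]
        if aln.strand == "-":
            seq = reverse_complement(seq)
        placed.setdefault(aln.chrom, []).append((aln.contig, seq))
        used.add(aln.contig)
    unaligned = {n: s for n, s in contigs.items() if n not in used}
    return placed, unaligned
```

A real pipeline would also have to handle contigs with several or conflicting alignments, which is where the misassembly checks and breaks described above come in.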
Once the assembly is in ordered and oriented chromosome contigs, we run the NCBI RefSeq gene annotation pipeline and add RepeatMasker and segmental duplication annotations. After annotation, we can integrate other data, such as Illumina alignments and variant calls, clone-based resources, and data from newer technologies such as Dovetail and GemCode, to improve the assembly and assess its quality.
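One common way to summarize the contiguity of an assembly is the N50 contig length: the length L such that contigs of length L or longer contain at least half of the assembled bases. The text above does not say which quality metrics the project reports, so the following Python sketch is only an illustration; `fasta_lengths` and `n50` are hypothetical helper names.

```python
def fasta_lengths(lines):
    """Return {contig name: sequence length} from FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        raise ValueError("n50 requires at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if running * 2 >= total:
            return length
```

For a FASTA file on disk, the lengths could be collected with `with open(path) as handle: lengths = fasta_lengths(handle)` and then passed to `n50(lengths.values())`.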
Source | Origin | Assembly Accession | Bioproject |
NA19240 | Yoruban | GCA_001524155.2 | PRJNA288807 |
HG00514 | Han Chinese | GCA_002180035.1 | PRJNA300843 |
NA12878 | European | GCA_002077035.1 | PRJNA323611 |
HG00733 | Puerto Rican | GCA_002208065.1 | PRJNA300840 |
HG01352 | Colombian | GCA_002209525.1 | PRJNA339719 |
NA19434 | Luhya | GCA_002872155.1 | PRJNA385272 |
HG02059 | Kinh-Vietnamese | GCA_003070785.1 | PRJNA339726 |
HG03486 | Mende | GCA_003086635.1 | PRJNA438669 |
HG02818 | Gambian | GCA_003574075.1 | PRJNA339722 |
HG03807 | Bengali | GCA_003601015.1 | PRJNA490190 |