Reference Genome Improvement

The Human Genome Project (HGP) produced the human reference genome assembly, a database of DNA sequence that represents an example of a full human genome. When researchers sequence human genomes, they compare, or “align,” their results to this reference. Although this assembly is one of the most frequently used resources in biomedical research, de novo genome assembly remains a significant challenge despite the increase in throughput and the decrease in sequencing cost over the past decade.

Aligning human sequence reads to the reference assembly is critical to successful data analysis. Several published reports identify regions of the reference assembly that were previously impossible to analyze because of the limitations of the available sequencing technologies, complex genome architecture, missing sequences, and various errors in the assembly or the underlying sequence data.

Specific aims

We plan to identify and resolve issues (misassemblies, sequence errors, and gaps) in the current reference, GRCh38. We will add substantial allelic diversity to the reference to facilitate effective analysis of biomedically important regions across the genome. We will accomplish this by completely finishing two genomes (CHM1 and CHM13) to the “platinum” standard and performing targeted finishing to the “gold” standard in additional genomes. Both standards are defined below.

Gold Genome: A high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.

Platinum Genome: A contiguous, haplotype-resolved representation of the entire genome.

We will engage the bioinformatics community to ensure that the next generation of aligners, variant callers, annotation pipelines and bioinformatics tools will be capable of interacting with a multi-allelic reference genome. We will facilitate more effective use of the reference for biomedical discovery by providing detailed tutorials of the required complex tool chains. Finally, through the development and deployment of community outreach and education programs, we will convey the importance of the reference as much more than a linear chromosomal assembly.

Assembly and analysis details

After long reads are generated on the PacBio platform, we assemble them using the Falcon algorithm, followed by error correction with Quiver. The output of this step is a FASTA file of unordered and unoriented contigs. We then align the BioNano genomic map generated from the same individual, along with clone end sequences (if available), to check for global misassemblies. We make breaks where possible based on these data and output ordered and oriented contigs based on the map alignments. Contigs that do not align are written to a separate file. We then use NCBI’s assembly-assembly alignment and chromosome contig generating software for further quality control (QC) of the assembly.
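The ordering and orienting step can be illustrated with a minimal Python sketch. It is not the project's software: the `MapAlignment` record, the `order_and_orient` function, and the toy sequences are hypothetical, and the sketch assumes each contig has at most one map alignment. Breaking misassembled contigs is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class MapAlignment:
    """Hypothetical record: where one contig aligns on a genomic map."""
    contig: str
    chrom: str
    map_start: int
    strand: str  # "+" or "-"


def order_and_orient(
    contigs: Dict[str, str], alignments: List[MapAlignment]
) -> Tuple[Dict[str, List[Tuple[str, str]]], List[str]]:
    """Place contigs along each chromosome by map position.

    Assumes at most one alignment per contig. Contigs aligned to the
    minus strand are reverse-complemented. Returns the placed contigs
    per chromosome and the names of contigs with no alignment.
    """
    placed: Dict[str, List[Tuple[str, str]]] = {}
    aligned_names = set()
    for aln in sorted(alignments, key=lambda a: (a.chrom, a.map_start)):
        seq = contigs[aln.contig]
        if aln.strand == "-":
            seq = reverse_complement(seq)
        placed.setdefault(aln.chrom, []).append((aln.contig, seq))
        aligned_names.add(aln.contig)
    unaligned = [name for name in contigs if name not in aligned_names]
    return placed, unaligned


# Toy input: three contigs, two of which align to the map.
contigs = {"ctg1": "AACG", "ctg2": "TTGA", "ctg3": "GGGC"}
alignments = [
    MapAlignment("ctg1", "chr1", 5000, "+"),
    MapAlignment("ctg2", "chr1", 100, "-"),
]
placed, unaligned = order_and_orient(contigs, alignments)
```

In this toy case `ctg2` is placed first and reverse-complemented, `ctg1` follows in its original orientation, and `ctg3` goes to the unaligned list, mirroring the separate file of unaligned contigs described above.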

Once the assembly is in ordered and oriented chromosome contigs, we annotate it with the NCBI RefSeq gene annotation pipeline and add RepeatMasker and segmental duplication annotations. After annotation, we can integrate other data, such as Illumina alignments and variant calls, clone-based resources, and data from newer technologies such as Dovetail and GemCode, to improve the assembly and assess its quality.
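One common way to summarize how contiguous an assembly is, alongside the data integration described above, is the contig N50: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembled length. The text does not say which statistics the project reports, so the sketch below is only an illustration; the function names `contig_lengths` and `n50` are hypothetical.

```python
from typing import Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of each record in FASTA-formatted lines."""
    lengths: List[int] = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)  # start a new contig
        elif lengths:
            lengths[-1] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Return the N50 of a non-empty list of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy FASTA content with two contigs of lengths 6 and 2.
example = [">ctg1", "ACGT", "AC", ">ctg2", "GG"]
example_lengths = contig_lengths(example)
example_n50 = n50(example_lengths)
```

A higher N50 indicates that more of the assembly sits in long contigs. It says nothing about correctness, which is why checks such as the map alignments and orthogonal data described above remain necessary.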

Source    Origin            Assembly Accession   Bioproject
NA19240   Yoruban           GCA_001524155.2      PRJNA288807
HG00514   Han Chinese       GCA_002180035.1      PRJNA300843
NA12878   European          GCA_002077035.1      PRJNA323611
HG00733   Puerto Rican      GCA_002208065.1      PRJNA300840
HG01352   Colombian         GCA_002209525.1      PRJNA339719
NA19434   Luhya             GCA_002872155.1      PRJNA385272
HG02059   Kinh-Vietnamese   GCA_003070785.1      PRJNA339726
HG03486   Mende             GCA_003086635.1      PRJNA438669
HG02818   Gambian           GCA_003574075.1      PRJNA339722
HG03807   Bengali           GCA_003601015.1      PRJNA490190
