Below is a list of selected posters from The Genome Institute presented at this year's meeting.
The Genome Modeling System: A Turnkey Genomics Analysis Platform
David J. Dooling, Scott Smith, Ben Oberkfell, Justin Lolofie, Matt Callaway, Nathan Nutter, Brian Derickson, Tom Mooney, Joshua McMichael, James Eldred, Jason Walker, David Larson, Nathan Dees, Chris Harris, Dan Koboldt, William Schierding, Chris Miller, Cyriac Kandoth, George M. Weinstock, Elaine R. Mardis, and Richard K. Wilson.
As the number and complexity of subjects in medical genomics projects continue to increase, managing sample and project information is becoming just as crucial as executing the analysis pipeline. We have developed an integrated analysis information management system, called the Genome Modeling System (GMS), for managing subject data, analysis execution, and results visualization for genomics research projects. Using web-based entry methods, investigators can enter and track individuals and their associated tissue samples, sequencing libraries, sequencing instrument data, and analysis progress and results. All data, including ad hoc user annotation, are indexed into a full-text search engine for easy lookup and retrieval. The system also supports viewing the data in tabular format on the web and exporting to spreadsheet formats for sharing internally or with collaborators.
GMS is distributed as a virtual machine image based on Ubuntu Linux, allowing an investigator to immediately begin working with the tools with minimal system administration and bioinformatics expertise. The virtual machine image includes popular genomics software, much of it never officially packaged for the Ubuntu platform, including BWA, VarScan, SAMtools, Picard, Bio::DB::Sam, BreakDancer, and MuSiC. All software is packaged using the native Ubuntu package management system and is served from The Genome Institute’s package repository, allowing facile, efficient upgrades as new versions of tools and the framework are released. Documentation and installation instructions for GMS are available at http://gmt.genome.wustl.edu/.
Variant Validation, Extension, and Interpretation Methods at The Genome Institute at Washington University
Robert Fulton, Ryan Demeter, Vincent Magrini, Michael McLellan, Daniel Koboldt, Li Ding, Todd Wylie, Michelle O’Laughlin, Rachel Maupin, Elaine R. Mardis, and Richard K. Wilson.
With the ever-increasing throughput of next generation sequencing, variant validation is increasingly critical to understanding the mutational spectrum of the sequenced genomes. Validation provides confirmation of putative variant calls, thus helping to improve variant calling algorithms. In addition to confirming putative calls, the validation process provides a deeper understanding of variant frequency and helps with interpreting the impact of the variation. For somatic mutations, variant frequencies provide clues to tumor purity and clonality, and help identify likely driver events, that is, variants critical to the progression or metastasis of the disease.
Validation methods not only confirm putative variants, but can also be used to extend them across other samples to identify commonly mutated genes across sample panels. This presentation will outline the validation/extension methods and decision processes used for large- and small-scale variant confirmation.
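As a rough illustration of how a variant frequency can hint at tumor purity, the sketch below computes a variant allele frequency (VAF) from read counts and converts it to a purity estimate. It is not the authors' method. The function names are hypothetical, and the purity step assumes a clonal, heterozygous somatic mutation in a copy-neutral diploid region, where the expected VAF is half the tumor fraction.

```python
def variant_allele_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the variant allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no coverage at this site")
    return alt_reads / depth


def purity_from_clonal_het_vaf(vaf: float) -> float:
    """Estimate tumor purity from one VAF.

    Assumption (not from the abstract): the mutation is clonal and
    heterozygous in a diploid, copy-neutral region, so VAF = purity / 2.
    The result is capped at 1.0 because sampling noise can push VAF above 0.5.
    """
    return min(1.0, 2.0 * vaf)


if __name__ == "__main__":
    vaf = variant_allele_frequency(ref_reads=70, alt_reads=30)
    print(f"VAF = {vaf:.2f}, estimated purity = {purity_from_clonal_het_vaf(vaf):.2f}")
```

Under the same assumptions, a validated variant whose VAF sits well below that of the clonal variants is a candidate subclonal event, which is one way frequencies relate to clonality.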
Automated Profiling of Small RNA Molecules in Acute Myeloid Leukemia Using High-Throughput Next Generation Sequencing
Jasreet Hundal (1,3), Todd Wylie (1,3), Vincent Magrini (1), Jason Walker (1), Maria Trissal (2), Sean D. McGrath (1), Jessica Silva (1), Giridharan Ramsingh (2), Todd A. Fehniger (2), Daniel Link (2), Timothy J. Ley (1,2), Richard K. Wilson (1), and Elaine R. Mardis (1).
(1) The Genome Institute, (2) Department of Medicine, Division of Oncology, Washington University School of Medicine, St. Louis, MO 63108, USA, (3) These authors contributed equally to this work.
Small non-coding RNAs (sncRNAs)—e.g., miRNAs, snoRNAs, piRNAs—can have large-scale and diverse effects on cellular processes by regulating gene expression, protein translation, and genomic organization. There is accumulating evidence that alterations in expression of sncRNAs contribute to human disease. Next generation sequencing (NGS) provides a high-throughput platform for exploring sncRNA populations in samples derived from healthy and diseased individuals.
We have developed an in-house automated pipeline designed to profile and compare reads derived from NGS sncRNA libraries. Our pipeline focuses on three main areas: 1) identification/abundance of previously known sncRNAs; 2) discovery/abundance of putatively novel sncRNAs; 3) tracking of differential expression of sncRNAs between multiple library types, tissues, and/or states. The pipeline locates areas of contiguous alignment in the genome, forming ab initio "clusters" representing sncRNAs. Cluster candidates undergo adaptor trimming, quality filtering, annotation interrogation, coverage modeling, normalized expression calculation, and sncRNA species fractionation into bins based on associated read lengths.
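The clustering, normalization, and length-binning steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline itself: the interval format, the `max_gap` parameter, the reads-per-million normalization, and the bin edges are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    chrom: str
    start: int
    end: int  # half-open: the cluster covers [start, end)
    read_lengths: list = field(default_factory=list)


def cluster_alignments(alignments, max_gap=0):
    """Merge overlapping or adjacent alignments into clusters.

    alignments: iterable of (chrom, start, end) half-open intervals.
    max_gap: hypothetical tolerance, in bases, for joining nearby reads.
    """
    clusters = []
    for chrom, start, end in sorted(alignments):
        last = clusters[-1] if clusters else None
        if last and last.chrom == chrom and start <= last.end + max_gap:
            last.end = max(last.end, end)
            last.read_lengths.append(end - start)
        else:
            clusters.append(Cluster(chrom, start, end, [end - start]))
    return clusters


def reads_per_million(cluster, total_reads):
    """One simple normalization: reads in the cluster per million mapped reads."""
    return len(cluster.read_lengths) / total_reads * 1_000_000


def length_bin(length, edges=(17, 26, 36, 76)):
    """Assign a read length to a size bin; the edges are illustrative only."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= length < hi:
            return f"{lo}-{hi - 1}"
    return "other"


if __name__ == "__main__":
    reads = [("chr1", 100, 122), ("chr1", 110, 132),
             ("chr1", 500, 570), ("chr2", 100, 121)]
    for c in cluster_alignments(reads):
        bins = sorted({length_bin(n) for n in c.read_lengths})
        print(c.chrom, c.start, c.end, reads_per_million(c, len(reads)), bins)
```

A real pipeline would read alignments from BAM files and would handle strand, multi-mapped reads, and annotation lookup, all of which are omitted here.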
As trial applications of our pipeline, we defined the microRNAomes in a patient with acute myeloid leukemia (AML) and in natural killer (NK) cells of Mus musculus. Our current focus expands assessment of sncRNAs beyond, but inclusive of, miRNA lengths in healthy and neoplastic human tissues. As such, we characterized the small RNA transcriptome in leukemic blasts from 22 patients with AML and CD34+ bone marrow cells from 8 healthy individuals [updated]. RNA species ranging from 17 to 75 nt were identified. In both AML and normal CD34+ cells, snoRNAs were the most abundant sncRNAs identified, followed by miRNAs. However, a large fraction of sequence reads (30%) mapped to unannotated regions of the genome; size fractionation of these reads suggests that most of the novel sncRNAs are not miRNAs. We further identified 16 significantly expressed, differentially regulated miRNAs and 38 differentially regulated snoRNAs when comparing control CD34+ cells to AML samples [current as of 02/08/2012].
Integrative Genomic Analysis Methods for Large-Scale Cancer Sequencing Studies
Daniel C. Koboldt, Dong Shen, Mike McLellan, Li Ding, Elaine R. Mardis, Richard K. Wilson, and The Cancer Genome Atlas Network.
Identification of recurrent genetic events driving tumor development and progression is a key goal of cancer genomics. We have developed robust methods for the detection of somatic mutations, germline variants, copy number alterations, and loss of heterozygosity (LOH) events in WGS and exome data. Here, we apply our methods to 507 invasive breast carcinomas that we have characterized as part of The Cancer Genome Atlas (TCGA). We detected over 30,000 somatic coding mutations (~60 per tumor), as well as extensive LOH and copy number changes. Integrating mutation, copy number and clinical data revealed striking differences in the landscape of somatic alterations between the five major expression subtypes of breast cancer.
Combinatorial Data Sets: Pragmatic Applications Derived from Multiple Sequencing Technologies
Vincent Magrini, Jason Walker, Todd Wylie, Sean McGrath, Amy Ly, Jasreet Hundal, Ryan Demeter, Laura Gottschalk, Khaing Soe, Nathan Sander, Lisa Cook, Erica Sodergren, Wes Warren, George Weinstock, Richard K. Wilson, and Elaine R. Mardis.
To explore the benefits of combining next generation sequencing data sets, we constructed various libraries from the bacterium Enterococcus faecalis str. TX0309B and generated a suite of read types. In particular, we used Illumina-based paired-end, mate-pair, and overlapping reads, Ion Torrent and 454 FLX+ WGS reads, and Pacific Biosciences circular consensus sequencing (CCS) reads to support error correction of Pacific Biosciences Continuous Long Reads (CLRs). In combination, the corrected CLRs provide long-range linking information for an E. faecalis assembly generated from fragment and/or paired-end reads.