This file describes the contents of an assembly provided by the Genome
Sequencing Center at Washington University School of Medicine in St. Louis.
It describes all possible files and directories in an assembly; not all
assemblies will have all the files.

If you have any questions about the assembly itself, please see the ASSEMBLY
file in the top-level assembly directory. That file contains information
specific to the assembly.

For answers to your questions regarding this assembly or project, or any
other GSC genome project, please visit our Genome Groups web page
(http://genome.wustl.edu/genome_group_index.cgi), locate the genome of
interest, and email the designated contact person.

Usage Information
-----------------

The data in the assembly are made freely available before scientific
publication with the following understanding:

* The data may be freely downloaded, used in analysis, and repackaged in
  databases.

* Users are free to use the data in scientific papers analyzing particular
  genes and regions if the providers of these data are properly
  acknowledged. See for credit information.

* The centers producing the data reserve the right to publish the initial
  large-scale analysis of the data set, including large-scale
  identification of regions of evolutionary conservation and large-scale
  genomic assembly. Large-scale refers to regions with size on the order of
  a chromosome.

* This is in accordance with the understandings reached at the Fort
  Lauderdale meeting discussing Community Resource Projects (see ) and the
  resulting NHGRI policy statement.

* Any redistribution of the data should carry this notice.

Data Description
----------------

Each assembly directory may contain the following files and directories:

ASSEMBLY    describes the assembly in detail
README      describes the files and directories in the assembly directory
ace         consed file directory
db          annotation database directory
genbank     directory of files submitted to GenBank
input       input FASTA sequence and quality file directory
output      output FASTA sequence and quality file directory

The contents of each of the above directories are described below.

Most files in an assembly are compressed using GZIP. You can download
programs that decompress the files from the GZIP web page.

1. output

The output directory should be present in all assemblies. It contains the
actual assembly output data, namely FASTA quality and sequence files, AGP
files, and read files. Below is a description of each of the files found in
the output directory.

1.1 reads.placed.gz

The reads.placed file provides the locations of reads which were placed in
the assembly. Each line of the file contains the following information
about a read.

Column  Description
------  ----------------------------------------------------------
1       NCBI ti number for read (or *, if none known)
2       read name
3       left trimmed position on the original read
4       number of bases in trimmed read
5       orientation on contig (0 = forward, 1 = reverse)
6       contig name
7       supercontig name
8       approximate start position of the trimmed read in the contig
9       approximate start position of the trimmed read on supercontig

1.2 reads.unplaced.gz

The reads.unplaced file provides the names of reads which were not placed
in the assembly, and a short explanation.

1.3 supercontigs.gz

The supercontigs.gz file contains information on how supercontigs are
constructed from the contigs. The format of the file is below.

supercontig supercontig_name
contig contig_name
gap gap_length gap_length_deviation * number_of_read_pairs
contig contig_name
gap gap_length gap_length_deviation * number_of_read_pairs
contig contig_name
...
supercontig supercontig_name
contig contig_name
gap gap_length gap_length_deviation * number_of_read_pairs
contig contig_name
...

The contigs before and after the gap are linked by read pairs. From the
linking read pairs, the average and standard deviation of the gap lengths
are estimated.

1.4 FASTA Sequence and Quality

The following FASTA sequence and quality files are provided.

reads.unplaced.fa.gz
contigs.fa.gz
contigs.fa.qual.gz
supercontigs.fa.gz
ultracontigs.fa.gz
chromosomes.fa.gz
consensus.fa.gz

1.5 AGP

The following AGP files are provided.

supercontigs.agp.gz
ultracontigs.agp.gz
chromosomes.agp.gz

The AGP files describe the scaffolding of the contigs. See for details on
the file format.

1.6 removed_data.fa.gz

This optional file contains contigs removed from the assembly prior to
release, either because the contig originated from a suspected contaminant
or because very small contigs were removed from the assembly.

1.7 contig_vector_coordinates.gz

This optional file lists the contig coordinates of sequence likely to be
unclipped vector sequence.

2. ace

The ace directory contains assembly files in ace format for viewing in the
Consed program. These files are usually only provided for assemblies of
smaller genomes. For viewing, use the -nophd option. See .

The following ace files are provided.

scaffold.ace.gz     tagged scaffolds
singleton.ace.gz    singleton (unplaced) reads

3. input

This optional directory contains the assembly input FASTA sequence and
quality files. This data is usually not provided for large genomes, since
for large genomes the input files are large and the data is already
available at GenBank.

4. db

This directory is optional and contains annotation databases.
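
Example: reading reads.placed.gz

As an illustration of the nine-column layout described in section 1.1, the
following Python sketch reads a reads.placed.gz file one record at a time.
It is not part of the assembly distribution. It assumes that the columns are
separated by whitespace and that the position and length columns are
integers; this README does not state either point, so check a real file
before relying on it. The names PlacedRead, parse_placed_line, and
read_placed are hypothetical and chosen only for this example.

```python
import gzip
from typing import Iterator, NamedTuple, Optional


class PlacedRead(NamedTuple):
    """One line of reads.placed, following the column table in section 1.1."""
    ti: Optional[str]        # column 1: NCBI ti number, None when given as "*"
    read_name: str           # column 2
    left_trim: int           # column 3: left trimmed position on original read
    trimmed_length: int      # column 4: number of bases in trimmed read
    reverse: bool            # column 5: 0 = forward, 1 = reverse
    contig: str              # column 6
    supercontig: str         # column 7
    contig_start: int        # column 8: approximate start in the contig
    supercontig_start: int   # column 9: approximate start on the supercontig


def parse_placed_line(line: str) -> PlacedRead:
    """Parse one record. Assumes whitespace-separated columns (an assumption)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}: {line!r}")
    ti, name, left, length, orient, contig, supercontig, cstart, sstart = fields
    if orient not in ("0", "1"):
        raise ValueError(f"orientation must be 0 or 1, found {orient!r}")
    return PlacedRead(
        ti=None if ti == "*" else ti,
        read_name=name,
        left_trim=int(left),
        trimmed_length=int(length),
        reverse=(orient == "1"),
        contig=contig,
        supercontig=supercontig,
        contig_start=int(cstart),
        supercontig_start=int(sstart),
    )


def read_placed(path: str) -> Iterator[PlacedRead]:
    """Yield records from a gzip-compressed reads.placed file, skipping blanks."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_placed_line(line)
```

For example, iterating over read_placed("output/reads.placed.gz") and
grouping the records by their supercontig field gives the set of reads
placed on each supercontig. The path shown is only an example.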