This file describes all possible files, annotations and directories provided
by The Genome Institute at Washington University School of Medicine in
St. Louis. Not all annotations will have all the files.

If you have any questions about the annotation itself, please see the
ANNOTATION file in the top-level annotation directory. That file contains
information specific to the annotation.

This README combines the READMEs that have accompanied annotations over many
years. Please refer to the Data Description section whose release date
matches the files you are viewing for the correct information.

For answers to your questions regarding this annotation or project, or any
other genome project, please visit our Genome Groups web page
(http://genome.wustl.edu/genomes/), locate the genome of interest, and email
the designated contact person.

Usage information

The data in the annotation are made freely available before scientific
publication with the following understanding:

* The data may be freely downloaded, used in analysis, and repackaged in
  databases.
* Users are free to use the data in scientific papers analyzing particular
  genes and regions if the providers of these data are properly acknowledged.
* The centers producing the data reserve the right to publish the initial
  large-scale analysis of the data set, including large-scale identification
  of regions of evolutionary conservation and large-scale genomic assembly.
  Large-scale refers to regions with size on the order of a chromosome.
* This is in accordance with the understandings reached at the Fort
  Lauderdale meeting discussing Community Resource Projects and the
  resulting NHGRI policy statement.
* Any redistribution of the data should carry this notice.

Data Description for genomes processed before 02-26-10
------------------------------------------------------

Each annotation directory may contain the following files and directories:

ANNOTATION         describes the annotation in detail
README             describes the files and directories in the annotation
                   directory
dna.gz             contains DNA sequences of all the protein coding genes
pep.gz             contains protein translations of all the coding genes
rna.gz             contains gene sequences of all the non-coding RNA genes.
                   Some of the RNA genes are named as misc_feature genes.
                   See the criteria for naming genes below.
sctg.xls.gz        contains gene data in super contig co-ordinates
ctg.xls.gz         contains gene data in contig co-ordinates
blastp.tar.gz      contains BLASTP sequence alignment results of each
                   protein sequence against the NR database. After
                   Feb 26th, 2010, BLASTP is no longer run separately in our
                   annotation pipeline, because BLASTP is run in BER to pick
                   the product name. BER_product_name is added to the
                   ctg.xls file.
interpro.raw.gz    contains the results of running InterPro on the protein
                   sequences, which contain information about the possible
                   function of the gene
kegg_report.ks.gz  contains the results of running KEGG on the protein
                   sequences. Lists every possible pathway for the gene.

Most files in an annotation are compressed using GZIP. You can download
programs that decompress the files from the GZIP web page.

Criteria for naming genes:

There are 2 types of gene names:

1) With locus_tag: These genes have a locus_tag qualifier that matches the
   GenBank entry. All coding genes and most of the non-coding RNA genes fall
   under this category.

2) Without a locus_tag: These genes do not get a locus_tag qualifier and are
   named as misc_feature genes.
   A given RNA gene is a misc_feature if it is a riboswitch, leader, alpha
   operon ribosome binding site, RNase E 5' UTR element, DNAX ribosomal
   frameshift element, group I catalytic intron or group II catalytic intron.

The contents of the above Excel files are described below.

sctg.xls.gz

Each line of this file contains the following information about the gene:

Column  Description
------- ------------------------------------------------------------------
1       Name of the gene prediction in WU format
2       Gene name in GenBank format
3       Start of the gene with respect to super contig co-ordinates
4       End of the gene with respect to super contig co-ordinates
5       Strand (+/-)
6       Cellular localization of the gene using psort-b
7       Best hit to COG database
8       Best hit to KEGG database
9       Best blast hit against bacterial NR
10      HMMPfam hits
11      Gene name based on the "Gene naming" criteria below

ctg.xls.gz

Each line of this file contains the following information about the gene:

Column  Description
------- ------------------------------------------------------------------
1       Name of the gene prediction in WU format
2       Gene name in GenBank format
3       Start of the gene with respect to contig co-ordinates
4       End of the gene with respect to contig co-ordinates
5       Strand (+/-)
6       Cellular localization of the gene using psort-b
7       Best hit to COG database
8       Best hit to KEGG database
9       Best blast hit against bacterial NR
10      HMMPfam hits
11      Gene name based on the "Gene naming" criteria below

Gene naming:

There are currently 5 types of gene names that fall into 3 categories.

1. NAME, or hypothetical protein similar to NAME, or conserved hypothetical
   protein

   Assigned to gene predictions where there is excellent homology to a known
   NR protein. The criteria for this category are:

   * At least one BLASTP hit to a known NR protein (complexity filtering
     off, -F F, expect <= 1e-10), with
   * >= 80% identity and >= 80% coverage of both the query and subject
     sequence.

   The name will follow one of these three formats:

   * conserved hypothetical protein
     if the homologous protein NAME contains a word indicating the name has
     not been verified: {fragment, homolog, hypothetical, like, predicted,
     probable, putative, related, similar, synthetic, unknown, unnamed},
     otherwise
   * NAME
     if the homologous protein is from the curated Swiss-Prot gene set,
     otherwise
   * hypothetical protein similar to NAME

   Where there is more than one suitable name for a BLAST hit, we prefer
   Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct
   BLAST hits, we choose the one with the highest average identity x the
   amount of overlap to the target gene. In all cases we take the NR protein
   name and filter out the species name, GIs, parenthetical comments, extra
   whitespace, etc.

2. Hypothetical protein

   Assigned to gene predictions that show significant BLASTP homology to a
   protein in NCBI's protein set NR. The criteria for this category are:

   * BLASTP hit to NR (complexity filtering off, -F F, expect <= 1e-10)

3. Predicted protein

   Assigned to gene predictions that do not show significant BLASTP homology
   to any proteins in NCBI's non-redundant set of proteins (NR) at the time
   that the complete BLASTP analysis was performed on the gene set.

Data Description for genomes processed after 02-26-10
-----------------------------------------------------

Each annotation directory may contain the following files and directories:

ANNOTATION         describes the annotation in detail
README             describes the files and directories in the annotation
                   directory
dna.gz             contains DNA sequences of all the protein coding genes
pep.gz             contains protein translations of all the coding genes
rna.gz             contains gene sequences of all the non-coding RNA genes.
                   Some of the RNA genes are named as misc_feature genes.
                   See the criteria for naming genes below.
ctg.xls.gz         contains gene data in contig co-ordinates
interpro.raw.gz    contains the results of running InterPro on the protein
                   sequences, which contain information about the possible
                   function of the gene
kegg_report.ks.gz  contains the results of running KEGG on the protein
                   sequences. Lists every possible pathway for the gene.

Most files in an annotation are compressed using GZIP. You can download
programs that decompress the files from the GZIP web page.

Criteria for naming genes:

There are 2 types of gene names:

1) With locus_tag: These genes have a locus_tag qualifier that matches the
   GenBank entry. All coding genes and most of the non-coding RNA genes fall
   under this category.

2) Without a locus_tag: These genes do not get a locus_tag qualifier and are
   named as misc_feature genes. A given RNA gene is a misc_feature if it is
   a riboswitch, leader, alpha operon ribosome binding site, RNase E 5' UTR
   element, DNAX ribosomal frameshift element, group I catalytic intron or
   group II catalytic intron.

The contents of the above Excel files are described below.

ctg.xls.gz

Each line of this file contains the following information about the gene:

Column  Description
------- ------------------------------------------------------------------
1       Name of the gene prediction in WU format
2       Gene name in GenBank format
3       Start of the gene with respect to contig co-ordinates
4       End of the gene with respect to contig co-ordinates
5       Strand (+/-)
6       Cellular localization of the gene using psort-b
7       Best hit to KEGG database
8       HMMPfam hits
9       Gene name based on JCVI BER product name pipeline
        (http://sourceforge.net/projects/ber/files/ber/praze-2005-08-09/)
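
Example: reading ctg.xls.gz

To illustrate the ctg.xls.gz layout for genomes processed after 02-26-10,
the Python sketch below reads the nine columns into named fields. It assumes
the file is gzip-compressed, tab-delimited text with one gene per line, which
this README does not state explicitly. The names GeneRecord, parse_ctg_rows
and read_ctg_file, and all field names, are hypothetical and are not part of
the annotation pipeline.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class GeneRecord(NamedTuple):
    """One gene line; field names are illustrative, not official."""
    wu_name: str        # column 1: gene prediction name in WU format
    genbank_name: str   # column 2: gene name in GenBank format
    start: int          # column 3: start in contig co-ordinates
    end: int            # column 4: end in contig co-ordinates
    strand: str         # column 5: "+" or "-"
    localization: str   # column 6: psort-b cellular localization
    kegg_hit: str       # column 7: best hit to KEGG database
    pfam_hits: str      # column 8: HMMPfam hits
    product_name: str   # column 9: BER product name


def parse_ctg_rows(lines: Iterable[str]) -> Iterator[GeneRecord]:
    """Yield a GeneRecord for each 9-column, tab-delimited line.

    Assumption: columns are separated by tabs. Lines that do not have
    exactly nine fields, or whose start/end fields are not plain digits
    (for example a header row or a blank line), are skipped.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 9:
            continue
        if not (fields[2].isdigit() and fields[3].isdigit()):
            continue
        yield GeneRecord(
            fields[0], fields[1], int(fields[2]), int(fields[3]),
            fields[4], fields[5], fields[6], fields[7], fields[8],
        )


def read_ctg_file(path: str) -> Iterator[GeneRecord]:
    """Decompress a ctg.xls.gz file on the fly and parse its lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_ctg_rows(handle)
```

Calling read_ctg_file on a downloaded ctg.xls.gz would then give one record
per gene, so that, for instance, genes on the minus strand can be selected by
testing the strand field. Files from genomes processed before 02-26-10 have
11 columns and would need a different field list.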