The genome sequence of the free-living nematode Caenorhabditis elegans is complete and represents the first genome of a multicellular organism to be sequenced in its entirety. The genome is approximately 100 Mb in size and encodes over 25,000 proteins. The sequencing project was a collaborative effort between The Genome Institute in St. Louis and the Sanger Centre in Hinxton, England. Analysis of the genome sequence is ongoing, but access to the sequence has already revolutionized C. elegans biology.
In the 1980s, Sulston, Coulson, Waterston and colleagues undertook construction of a clone-based physical map of the C. elegans genome. The map was originally based on cosmid clones, using a fingerprinting technique devised by Sulston and Coulson. Later, yeast artificial chromosomes (YACs) were used to bridge gaps in the cosmid map, and supplied coverage for the 20% of the genome not represented in the cosmid libraries. By 1990, the physical map consisted of fewer than 20 contigs. The excellent cooperation of the entire worm community led to the alignment of the physical map with the genetic map.
In 1989, with a nearly complete physical map available, the next obvious step was to sequence the entire C. elegans genome. In 1990, with the enthusiastic support of Dr. James Watson, the proposal to sequence the genome as a model system and as part of the Human Genome Project was funded jointly by the NIH and MRC. The award was a three-year pilot project whose objective was to ramp up to a capacity of one megabase of finished sequence per year by the end of the grant period. The two sequencing centers initially chose to concentrate on chromosome III: Washington University proceeded to the left from the central region and the Sanger Institute proceeded to the right. By May 1993, the two groups had produced 1 Mb of finished C. elegans genomic sequence. By August 1993 the total had risen to over 2 Mb, and by December 1994 over 10 Mb had been completed. Although some of the success can be attributed to the implementation of high-throughput devices, semi-automated methods for DNA purification and sequencing, problem solving in "finishing," and software development, major components of the success were organization and planning on both sides of the Atlantic.
With additional funding beginning in 1994 to complete the genome sequence of C. elegans, the 50 Mb milestone was passed in August 1996. Small gaps were filled with long-range PCR or fosmid clones. YAC DNA purified by pulsed-field gel electrophoresis (PFGE) was sequenced to cover chromosomal areas not represented in cosmids. The final genome sequence of the worm is a composite from cosmids, fosmids, YACs and PCR products. Some tandem repeats in the larger YACs are of unknown size, and there will be no effort to resolve them further, except for population studies, since they are difficult to clone and likely to be variable.
Analysis has revealed almost 20,500 protein-coding genes, each with an average of five introns. Local clusters of genes appear to be more abundant on the chromosome arms. Exons comprise 27% of the genome. Approximately 42% of predicted protein products match those of organisms in other phyla, providing putative functional information. There are indications of a bias toward conservation of "housekeeping" genes. The non-coding RNAs include widely dispersed transfer RNA genes, tRNA-derived pseudogenes, spliceosomal RNA genes, and ribosomal RNA genes. Tandem repeats and inverted repeats are more frequent on the autosomes. There are also simple sequence repeats, and simple duplications ranging from a few hundred bases to 108 kb. The chromosomes have a GC content of 36% and have no localized centromeres. Gene density is generally high across the chromosomes, with some differences between the centers of the autosomes, the autosome arms, and the X chromosome. For a more detailed analysis, see the publication in the journal Science, Dec. 11, 1998, 282:2012-2018.
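As a rough illustration of how summary figures like those above relate to one another, the following Python sketch computes the GC content of a sequence and derives two back-of-the-envelope averages from the rounded numbers quoted in the text (100 Mb genome, about 20,500 protein-coding genes, 27% exonic). The function names are hypothetical, chosen only for this example, and the derived averages are simple arithmetic on the rounded figures, not values reported by the sequencing consortium.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


# Rounded figures taken from the text above (assumed exact only for this sketch).
GENOME_SIZE_BP = 100_000_000
PROTEIN_CODING_GENES = 20_500
EXON_FRACTION = 0.27


def average_gene_spacing_bp(genome_bp: int, genes: int) -> float:
    """Average number of genomic bases per protein-coding gene."""
    return genome_bp / genes


def average_exonic_bp_per_gene(genome_bp: int, exon_fraction: float, genes: int) -> float:
    """Average number of exonic bases per protein-coding gene."""
    return genome_bp * exon_fraction / genes


if __name__ == "__main__":
    print(f"GC content of a toy sequence: {gc_content('ATGCGCATTA'):.2f}")
    spacing = average_gene_spacing_bp(GENOME_SIZE_BP, PROTEIN_CODING_GENES)
    exonic = average_exonic_bp_per_gene(GENOME_SIZE_BP, EXON_FRACTION, PROTEIN_CODING_GENES)
    print(f"Average spacing: about {spacing / 1000:.1f} kb per gene")
    print(f"Average exonic sequence: about {exonic / 1000:.1f} kb per gene")
```

With these rounded inputs, the averages come out to roughly one gene per 4.9 kb and about 1.3 kb of exonic sequence per gene. Applying `gc_content` to the actual chromosome sequences would be the way to check the 36% figure, but that needs the sequence files, which this sketch does not supply.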