C. elegans Single Nucleotide Polymorphism Data

Details of the analysis of this data and the mapping methodology have been described in a recent publication: Stephen R. Wicks, Raymond T. Yeh, Warren R. Gish, Robert H. Waterston, Ronald H.A. Plasterk. Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nature Genetics 28:160-164.

Accessing the Data

SNP data is organized by chromosome. Each chromosome is divided into 20 segments. Each chromosome view consists of 3 regions:

  • Region 1: Clicking on a numbered segment brings up the SNPs for that segment in the second frame. The tick marks to the right of the chromosome indicate the relative SNP density in that region. To the right of the chromosome are the positions along the chromosome in base pairs.

  • Region 2: In the view, the clones are labelled to the right of the chromosome. Each SNP is identified by the clone and the position within that clone. Polymorphisms are labelled in the following format:

    • Substitutions: S=AG One strain has the 'A' allele and the other strain has the 'G' allele.

    • Insertions: I2=CT CB4856 has inserted bases relative to N2 (i.e., N2 has deleted bases). In this case the length of the insertion is 2 bases and the inserted bases are CT.

    • Deletions: D3=AAA CB4856 has deleted bases relative to N2 (i.e., N2 has inserted bases). In this case the length of the deletion is 3 bases and the deleted bases are AAA.

  • Region 3: Probability of the SNP, CB4856 read name, position in the read, and PHRED quality for the nucleotide in the read are given.

    • Restriction digest information: The first column is the strain for which the digest is performed. The second column is the recognition site. The third column lists all the isoschizomers that recognize the pattern in column two. The fourth column gives the positions cut by the enzyme (relative to the sequence below).

    • Sequence around the SNP with the polymorphism marked in red. N2 allele is always displayed first within the brackets.
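The polymorphism label format described under Region 2 is regular enough to parse mechanically. The following is a minimal sketch in Python and is not part of the original site. The names `Polymorphism` and `parse_label` are hypothetical, and the sketch assumes that labels contain only the bases A, C, G and T and that the number after I or D always equals the count of listed bases, as in the examples above.

```python
import re
from typing import NamedTuple


class Polymorphism(NamedTuple):
    kind: str    # "substitution", "insertion" or "deletion"
    length: int  # number of bases affected
    bases: str   # the two alleles for S; the inserted or deleted bases for I and D


# Labels look like S=AG, I2=CT or D3=AAA:
# a type letter, an optional length, '=', then the bases.
_LABEL = re.compile(r"([SID])(\d*)=([ACGT]+)")
_KINDS = {"S": "substitution", "I": "insertion", "D": "deletion"}


def parse_label(label: str) -> Polymorphism:
    """Parse one polymorphism label (hypothetical helper, not the site's own code)."""
    match = _LABEL.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"unrecognized polymorphism label: {label!r}")
    code, digits, bases = match.groups()
    if code == "S":
        # Assumption: a substitution lists exactly two alleles and no length.
        if digits or len(bases) != 2:
            raise ValueError(f"malformed substitution label: {label!r}")
        return Polymorphism("substitution", 1, bases)
    # Assumption: the stated indel length matches the number of listed bases.
    if not digits or int(digits) != len(bases):
        raise ValueError(f"indel length does not match bases: {label!r}")
    return Polymorphism(_KINDS[code], int(digits), bases)
```

For example, `parse_label("D3=AAA")` yields a deletion of length 3 with bases AAA, meaning that CB4856 lacks AAA relative to N2.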

Release Notes

Update: October 4, 2001

The CB4856 reads have been re-analyzed using WormBase Release WS53. Here is a summary of the changes:

  • New SNPs are found in regions of the genome which were not finished at the time of the previous analysis.

  • Coordinates of a number of SNPs have changed due to changes in the length of cosmid sequences. Here is a list of the SNPs that have changed coordinates.

  • A few SNPs have now been discarded as false positives, mainly because of better repeat masking.

  • Verified SNPs have been marked in RED. Primer and restriction digestion information is also included (if available).

  • The complete list of verified SNPs can also be obtained at Stephen Wicks’ page, which also contains useful tips for mapping.

Brief Description of the Analysis Pipeline

  • Repeats in the N2 cosmid sequences from WormBase release WS53 were masked using RepeatMasker with the MaskerAid enhancement.

  • Sequencing traces from CB4856 were called with PHRED.

  • CB4856 sequences were aligned to N2 cosmid sequences using WU-BLAST.

  • To filter out any paralogous sequences the following procedure was used:

    • For each CB4856 read, all the hits to the N2 genome were collected and the top scoring hit was considered to be the corresponding N2 locus for the CB4856 sequence.

    • For CB4856 reads that aligned to two or more locations in N2 with the same BLAST score:

      • If the reads fell in a cosmid overlap region, one cosmid was chosen for subsequent analysis.

      • If not, we assume that one or more recent duplications have occurred in the N2 genome and that there is not enough sequence divergence to tell the paralogs apart. The CB4856 read is NOT used for further analysis.

  • All the reads that have been assigned to a given N2 locus are collected.

  • Using the N2 sequence as anchor, all the reads are multiply aligned using the POLYBAYES anchor alignment.

  • The POLYBAYES SNP probability (Psnp) is calculated for each column of the multiple alignment that contains a discrepancy.

  • N2 sequences are assumed to be of PHRED quality 40 (error rate 1 in 10,000).

  • CB4856 sequences have quality values from the PHRED base calls of the traces.

  • To calculate the probability for an insertion or deletion, the lower of the two quality scores of the bases flanking the inserted or deleted sequence was used, since there is no direct measurement of quality for missing sequence.

  • Any column with a Psnp value greater than 0.4 was flagged as a SNP. This corresponds to a PHRED quality of 26 for a CB4856 read in a pair-wise alignment to the N2 genome.

  • 500 bp flanking the SNP are extracted and the sequence is digested to reveal any difference in restriction digest patterns due to the SNP (a “snip-SNP”). The list of enzymes used and their recognition sites can be found here.
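Two of the steps above can be illustrated with a short Python sketch: the quality-score conventions (PHRED 40 corresponding to an error rate of 1 in 10,000, and the lower flanking quality being used for indels) and the snip-SNP check. This is a simplified illustration under stated assumptions, not the pipeline's actual code. All function names are hypothetical, the three-enzyme table is an illustrative stand-in for the real enzyme list, and comparing the number of recognition sites between the two alleles is a simplification of a full digest comparison.

```python
import re


def phred_to_error(quality: float) -> float:
    """Convert a PHRED quality score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def indel_quality(left_quality: int, right_quality: int) -> int:
    """Quality assigned to an indel: the lower of the two flanking base qualities."""
    return min(left_quality, right_quality)


# Illustrative subset only (assumption): three common enzymes with
# palindromic recognition sites, not the list used by the pipeline.
ENZYMES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "DraI": "TTTAAA"}


def site_positions(sequence: str, site: str) -> list[int]:
    """0-based start positions of every occurrence of a recognition site."""
    # The lookahead lets overlapping occurrences be found as well.
    return [m.start() for m in re.finditer(f"(?={site})", sequence.upper())]


def snip_snp_enzymes(n2_seq: str, cb4856_seq: str, enzymes=ENZYMES) -> dict:
    """Enzymes whose number of recognition sites differs between the alleles.

    Both inputs are the flanking sequence carrying the N2 or CB4856 allele.
    Returns {enzyme: (N2 site positions, CB4856 site positions)}.
    """
    differing = {}
    for name, site in enzymes.items():
        n2_sites = site_positions(n2_seq, site)
        cb_sites = site_positions(cb4856_seq, site)
        if len(n2_sites) != len(cb_sites):
            differing[name] = (n2_sites, cb_sites)
    return differing
```

Because the sites in this illustrative table are palindromic, scanning one strand is sufficient; a non-palindromic site would also need to be searched as its reverse complement.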