PolyScan-3.0 Usage

Updates:

PolyScan-3.0 now supports homozygous indel detection. Heterozygous indel and SNP identification remain unchanged from polyscan-2.2.
Homozygous indels are found as gaps in the high-quality region of the reads. A set of criteria reduces the false positive rate by filtering on phred quality and on sequence identity in the regions flanking the gaps. For 1bp indels, the score increases if multiple reads support the indel and decreases otherwise. Our tests indicate that polyscan-3.0 detects homozygous indels with around 80% specificity and 80% sensitivity.

New Options:

-hiq <minimum avg phred quality score for homozygous indels >
Specify the cutoff for 1bp indels. The cutoff for longer indels is lowered by 0.9 per bp, because longer gaps are less likely to occur by chance. default: 35
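As a rough illustration of the length-dependent cutoff, the sketch below assumes that the 0.9 reduction applies to each base beyond the first, so a 1bp indel uses the -hiq value itself. That interpretation and the function name hiq_cutoff are assumptions made for illustration, not PolyScan's actual code.

```python
def hiq_cutoff(indel_size, hiq=35.0, step=0.9):
    """Assumed cutoff: `hiq` for a 1 bp indel, lowered by `step` per extra bp."""
    if indel_size < 1:
        raise ValueError("indel size must be at least 1 bp")
    return hiq - step * (indel_size - 1)


# Print the assumed cutoff for a few indel sizes.
for size in (1, 2, 5):
    print(size, round(hiq_cutoff(size), 1))
```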
-him <fraction of flanking bases that match the reference sequence>
default: 0.95, which requires at least 95% of the flanking bases to match the reference.
-nc <a linking table file>
A linking table can be used to specify detailed information about the reads and the complex relationships among them. With a linking table, arbitrary read names can be used. This version supports a simple .csv or tab-delimited file with four columns:
  1. Read name, which can be arbitrary non-redundant strings
  2. Sample ID, a string describing the DNA origin of the reads, which affects the statistical test in SNP identification.  This is a generalization of the "-source" option used for SNP detection
  3. IndelGroup ID, a string describing how reads should be initially grouped together for joint heterozygous indel detection.  This is a generalization of the "-indelsource" option used for het indel detection
  4. Amplicon ID, a string denoting the amplicon that the current read was sequenced from
Example:

H_P00001PCR3029_029a.g1, P00001, 1, 029a
H_P00002PCR3029_032a.b1, P00002, 1, 032a
This linking table will group the two reads together for joint het indel analysis although they are sequenced from two different amplicons (029a vs 032a), and from two different samples (P00001 vs P00002).
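To make the four-column layout concrete, here is a minimal sketch that reads the comma-delimited example above and collects reads by IndelGroup ID. The function name read_linking_table is hypothetical, and the sketch handles only the comma-delimited variant; it is not how PolyScan itself parses the file.

```python
import csv
import io
from collections import defaultdict

# The two example rows from the linking table above.
LINKING_TABLE = """\
H_P00001PCR3029_029a.g1, P00001, 1, 029a
H_P00002PCR3029_032a.b1, P00002, 1, 032a
"""


def read_linking_table(handle):
    """Yield (read, sample, indel_group, amplicon) from a comma-delimited table."""
    for row in csv.reader(handle, skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        read, sample, indel_group, amplicon = (field.strip() for field in row)
        yield read, sample, indel_group, amplicon


# Collect read names under their IndelGroup ID.
groups = defaultdict(list)
for read, sample, indel_group, amplicon in read_linking_table(io.StringIO(LINKING_TABLE)):
    groups[indel_group].append(read)

print(dict(groups))
```

Both reads land under IndelGroup "1", which matches the joint het indel analysis described above.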

Usage

1. Required flags:

-pd <project directory>
Specify the path of the project to be analyzed. PolyScan assumes that the project directory is organized in the following structure:
project_dir/edit_dir, is the directory for analysis.
project_dir/chromat_dir, contains the chromatogram files in scf format
project_dir/phd_dir, contains the phd files created by Phred
project_dir/poly_dir, contains the poly files created by Phred
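Before launching a run, it can be handy to confirm that a project directory has the layout listed above. The following is a small sketch of such a check; the helper name missing_subdirs is hypothetical and the script is not part of PolyScan.

```python
import tempfile
from pathlib import Path

# Subdirectories that PolyScan expects inside the project directory.
EXPECTED_SUBDIRS = ("edit_dir", "chromat_dir", "phd_dir", "poly_dir")


def missing_subdirs(project_dir):
    """Return the expected subdirectories that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Demonstration on a throwaway directory holding only two of the four subdirectories.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "edit_dir").mkdir()
    (Path(tmp) / "phd_dir").mkdir()
    missing = missing_subdirs(tmp)

print(missing)
```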
-ace <input ace file>
Specify the Consed ace file that contains the reference sequence, the alignment of reads to the reference, and the links to the phd files. The ace file must follow the format described in the Consed documentation: http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt
Please name your reference sequence *.c1 in the ace file so polyscan can recognize it.
-refseq <refseq id>
Specify which read in the ace file represents the reference sequence, usually obtained from sequence databases such as NCBI GenBank.  Reading of the scf, phd, and poly files associated with the refseq is skipped automatically.

Example usage:

polyscan -pd /human/EGFR -ace /human/EGFR/edit_dir/EGFR.c1.ace.1 -refseq EGFR.c1
Or equivalently if you are running polyscan in the /human/EGFR/edit_dir,
polyscan -pd .. -ace EGFR.c1.ace.1 -refseq EGFR.c1
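When polyscan is driven from a script, the required flags can be assembled programmatically. The sketch below only builds and prints the command line from the first example; the helper name polyscan_command is hypothetical, and nothing is executed.

```python
import shlex


def polyscan_command(project_dir, ace_file, refseq, extra=()):
    """Build an argument list from the three required flags plus optional extras."""
    return ["polyscan", "-pd", project_dir, "-ace", ace_file, "-refseq", refseq, *extra]


cmd = polyscan_command("/human/EGFR", "/human/EGFR/edit_dir/EGFR.c1.ace.1", "EGFR.c1")
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a machine where polyscan is installed.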

2. Optional flags:

2.1 Flags that affect both indel and SNP identification

-o <output file>
Specify where the analysis result is stored. If omitted, a random file name will be assigned.
-contig <input contig name>
Choose a single contig to analyze. If omitted, all contigs in the ace file will be analyzed.
-pp <input polyphred file>
Analyze the consensus SNP sites listed in the polyphred file. This enables PolyScan to be used as a filtering (rescoring) tool for PolyPhred
-sd <input snpdetector file>
Analyze the consensus SNP sites listed in the SNPdetector file. This enables PolyScan to be used as a filtering (rescoring) tool for SNPdetector v2.0
Example usage:
polyscan -pd .. -ace EGFR.c1.ace.1 -refseq EGFR.c1 -pp day1.polyphred.out -sd day1.goodSNPgenotype-a5.csv 
This command forces polyscan to genotype consensus SNP sites listed in day1.polyphred.out and day1.goodSNPgenotype-a5.csv
-ofp <output fpoly directory>
Dump the fpoly files into a specified directory. "fpoly" files are formatted similarly to phred "poly" files. Each row represents a called base position and contains 24 columns representing:
1. the polyscan called primary base
2. its pixel position in the trace
3. the polyscan called secondary base
4. its pixel position in the trace
5-9. Channel A, peak position, amplitude, sharpness, regularity
10-14. Channel C, peak position, amplitude, sharpness, regularity
15-19. Channel G, peak position, amplitude, sharpness, regularity
20-24. Channel T, peak position, amplitude, sharpness, regularity

The definitions of sharpness and regularity can be found in our paper.
Example:
A 717  G 720  A 717 804.00 0.757839 0.990878  C 717 0.00 -0.000000 0.522925
G 720 66.00 0.242392 0.904314 T 717 4.00 0.370325 0.833227
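The 24-column layout can be unpacked with a short parser. The sketch below assumes that the two example lines above are a single row wrapped for display, that the first field of each five-column channel group is the channel letter, and that peak positions are integers as in the example. The names ChannelPeak and parse_fpoly_row are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChannelPeak:
    channel: str
    position: int
    amplitude: float
    sharpness: float
    regularity: float


def parse_fpoly_row(line):
    """Split one fpoly row into the base calls and the four channel records."""
    fields = line.split()
    if len(fields) != 24:
        raise ValueError(f"expected 24 columns, got {len(fields)}")
    call = {
        "primary_base": fields[0],
        "primary_position": int(fields[1]),
        "secondary_base": fields[2],
        "secondary_position": int(fields[3]),
    }
    channels = {}
    # Columns 5-24: four groups of five fields, one group per channel.
    for start in range(4, 24, 5):
        label, pos, amp, sharp, reg = fields[start:start + 5]
        channels[label] = ChannelPeak(label, int(pos), float(amp), float(sharp), float(reg))
    return call, channels


# The example row from above, joined back into one line.
row = ("A 717  G 720  A 717 804.00 0.757839 0.990878  C 717 0.00 -0.000000 0.522925 "
       "G 720 66.00 0.242392 0.904314 T 717 4.00 0.370325 0.833227")
call, channels = parse_fpoly_row(row)
print(call["primary_base"], call["secondary_base"], channels["G"].amplitude)
```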
-ifp <input fpoly directory>
Read the fpoly files from a specified directory. This saves the time of recalculating the fpoly files and is especially useful when running polyscan multiple times on the same project.
-pr <secondary/primary peak ratio>
range [0.0,1.0], default: 0.15
Specify a cutoff for the secondary/primary peak amplitude ratio. Peaks whose amplitude ratio falls below this cutoff are forced to have an allele quality of 0 and are therefore not called by polyscan. We found that reducing this ratio to 0.10 significantly increases the sensitivity in detecting low-level signals frequently seen in tumor-derived samples without significantly reducing specificity.
-aq <minimal allele quality>
range [0.0,1.0], default: 0.2
Eliminate low quality peaks (alleles) from analysis. Allele quality scores are calculated for peaks in each of the four fluorescence channels based on 7 features:
  • max/min peak spacing ratio in a 7 base window centered around the current peak
  • average sharpness in 7 bases
  • average regularity in 7 bases
  • ratio of distances between the current peak to the left primary peak and to the right primary peak
  • sharpness of the current peak
  • regularity of the current peak

Some features are similar to phred. Others are unique to PolyScan. An artificial neural network is trained using these 7 features and the corresponding phred quality scores using the phred called peaks. It is then used to estimate allele quality scores for all identified peaks in the four channels. The scale of the allele quality scores is normalized to between 0 and 1.0 with 1.0 corresponding to a phred quality score of 54.
-mpa <minimal peak amplitude>
range [0,100], default: 50
Force peaks with an amplitude smaller than the specified value to have an allele quality of zero, so that they are excluded from analysis.
-v display version
-h display usage

2.2 Flags that affect indel detection

Indel Signature Detection:

When there is a frame-shift between the two alleles of a diploid DNA, one could observe a sequence of consecutively overlapping peaks in the sequencing trace. We call such sequences of overlapping peaks Indel Signatures.

Example:
An indel signature of a 5bp AAGTT deletion in both sequencing directions.

-im <percent identity match between secondary seq and refseq>
default: 0.85
Indel signatures are detected through multiple subsequence alignment with the reference sequence (see details in our paper). This option sets a threshold that requires the subsequence to be similar to the reference, with some mismatches allowed to account for base-calling errors. Subsequence alignments with a percent-identity-match smaller than the specified value are not regarded as credible evidence for a putative indel signature.
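To illustrate what the threshold means, the sketch below computes the fraction of matching bases between a subsequence and an equally long reference segment and compares it with the default of 0.85. It is a simplified, ungapped comparison with a hypothetical function name and made-up sequences; the alignment procedure PolyScan actually uses is described in the paper.

```python
def identity_fraction(subseq, ref_segment):
    """Fraction of positions at which two equal-length sequences agree."""
    if not subseq or len(subseq) != len(ref_segment):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(subseq.upper(), ref_segment.upper()))
    return matches / len(subseq)


# Illustrative 20-base sequences (not real data); the subsequence has 2 mismatches.
ref = "ACGTACGTACGTACGTACGT"
sub = "ACGTACGTTCGTACGAACGT"
fraction = identity_fraction(sub, ref)
print(fraction, fraction >= 0.85)
```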
-qs <length of the secondary seq for indel match>
default: 20
Adjust the length of the subsequences for the segmented multiple alignment.
-mis <minimum indel signature size>
default: 20
Specify the minimum size for a valid indel signature.


Indel Detection:

To detect a heterozygous indel, PolyScan summarizes indel signatures from multiple traces (in both directions) and automatically computes a confidence score that reflects the strength of the existing evidence.
-indel
Turn on indel detection
-mms <n bp>
discard 1bp het indels downstream of a <n bp> homopolymer, default: 8
Indel signatures detected downstream of poly tracts of <n bp> are frequently sequencing artifacts. This option allows users to specify their preferred <n>.
-maxindelsize <maximum indel size to detect>
default: 100 bp
Only report indels with a size smaller than the specified value.
-indelsource <pos1> <pos2>
used in conjunction with the -indelgroup option, default: 1 25
Reads having identical substring between <pos1> and <pos2> in their read names are grouped together for indel identification. A clustering algorithm is used to automatically segregate this user-defined group into smaller subgroups, each containing identical frame-shifted patterns within a small genomic region (defined by -indelgroup). Those remaining in the same subgroup after clustering are jointly analyzed as a population. We found that grouping reads together effectively enhanced the accuracy of indel detection.
Example usage:
 For the following reads, where x and y represent the sequencing direction:
H_sample1_amplicon1.x.gz
H_sample1_amplicon1.y.gz
H_sample1_amplicon2.x.gz
H_sample1_amplicon2.y.gz
H_sample2_amplicon1.x.gz
H_sample2_amplicon1.y.gz
H_sample2_amplicon2.x.gz
H_sample2_amplicon2.y.gz

-indelsource 1 8, group all reads into the same group
-indelsource 11 19, split reads into two groups: amplicon1 and amplicon2
-indelsource 1 9, split reads into two groups: H_sample1 and H_sample2
-indelsource 1 19, group paired reads
-indelsource 1 24, each read in its own group
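The grouping rule in these examples can be reproduced with a few lines of code. The sketch below treats <pos1> and <pos2> as 1-based, inclusive character positions in the read name, which is an assumption inferred from the examples above. It covers only the initial name-based grouping, not the subsequent clustering step, and the function name group_by_source is hypothetical.

```python
from collections import defaultdict

READS = [
    "H_sample1_amplicon1.x.gz",
    "H_sample1_amplicon1.y.gz",
    "H_sample1_amplicon2.x.gz",
    "H_sample1_amplicon2.y.gz",
    "H_sample2_amplicon1.x.gz",
    "H_sample2_amplicon1.y.gz",
    "H_sample2_amplicon2.x.gz",
    "H_sample2_amplicon2.y.gz",
]


def group_by_source(read_names, pos1, pos2):
    """Group reads by the substring at 1-based, inclusive positions pos1..pos2."""
    groups = defaultdict(list)
    for name in read_names:
        groups[name[pos1 - 1:pos2]].append(name)
    return dict(groups)


# Print the number of groups and the group keys for each example setting.
for pos1, pos2 in [(1, 8), (11, 19), (1, 9), (1, 19), (1, 24)]:
    groups = group_by_source(READS, pos1, pos2)
    print(pos1, pos2, len(groups), sorted(groups))
```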
-indelgroup <n bp>
used in conjunction with the -indelsource option, default: 50bp
Reads from the same indelsource and having indel signatures within <n bp> are jointly analyzed.
-indelscore <indel confidence score threshold>
range [0,100], default: 0
Report only indels with a confidence score greater than the threshold.

2.3 Flags that affect SNP detection

-genotype 0
Output the genotypes at all SNP sites. This is the default.
-genotype -1
Turn off SNP detection.
-gtscore <genotype confidence score threshold>
range [0,100], default: 0
Report genotypes whose confidence scores are greater than the threshold.
-quality <minimum average phred quality for analysis>
range [10, 30], default: 20
Genotype only positions where the arithmetic average phred quality score is greater than the threshold. The average is currently estimated from an 11 base pair window.
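A windowed average of this kind might look like the sketch below. It assumes that the 11-base window is centered on the position and truncated at the ends of the read; neither detail is stated here, and the function name window_average and the quality values are made up for illustration.

```python
def window_average(quals, index, window=11):
    """Arithmetic mean of phred scores in a window centered on `index`."""
    half = window // 2
    lo = max(0, index - half)                # truncate at the start of the read
    hi = min(len(quals), index + half + 1)   # truncate at the end of the read
    return sum(quals[lo:hi]) / (hi - lo)


# Made-up qualities: one low-quality base surrounded by high-quality bases.
quals = [30] * 5 + [10] + [30] * 5
center_avg = window_average(quals, 5)
print(round(center_avg, 2), center_avg > 20)
```

In this example the low-quality base at the center still passes the default threshold of 20 because its neighbors raise the average.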
-hh <density of het in a 20bp window>
default: 0.3
Only detect SNPs in regions where the density of heterozygous genotypes is smaller than the threshold.
-source <pos1> <pos2>
default: 1 25, group reads for snp detection
Similar to -indelsource but applied to SNP detection
-nr <noise reduction factor>
range [0.0,1.0], default: 0.0
Specify noise reduction factor. This option helps remove the background noise due to peak spilling and paralog amplification. For details please read the sections of horizontal and vertical scans in our paper.

All options that take input arguments have pre-selected default values based on our experience. The users are encouraged to find their favorite parameter settings in their environment. The default parameters normally lead to a reasonable but not necessarily optimal outcome. 

Output

The output of PolyScan is in a concise and self-explanatory format. The <PARAMETER> section records the parameters used in this experiment. Indels and SNPs are reported in the <CONTIG> sections, each of which contains an <INDEL> section and a <SNP> section. Each heterozygous indel is reported in a row of 8 columns: reference position, read position, read name, indel size, indel type, indel sequence, confidence score, and the average ratio of the heights of the secondary peaks versus the primary peaks. Homozygous indels are reported in a similar format, except that the last column is an auxiliary score that can be used to further assess the confidence. Each genotype is reported in a row of 5 or 6 columns: reference position, read position, read name, genotype, confidence score, and comments.

Example:
<BEGIN_PARAMETERS>
Global Parameters:
Project dir: ..
Fpoly dir: ../fpoly_dir
Ace file: Gene.c1.ace.begin
Refseq: Gene.c1
Program: polyscan 3.0
-pr, minimum secondary/primary peak ratio: 0.1500
-aq, minimum allele probability: 0.2000
-mpa, minimum peak amplitude: 100
Parameters for Indel analysis:
-hiq, homo indel phred quality cutoff: 35
-him, homo indel local identity match ratio: 0.95
-maxindelsize, maximum indel size: 100
-mis, minimum indel signature size: 20
-qs, indel subsequence size: 20
-im, minimum subsequence identity match: 0.85
-mms, discard 1bp het indels downstream of 8 bp homopolymer
-indelscore, confidence score cutoff for het indel: 0
-indelgroupsize, indel group size: 30
Parameters for SNP analysis:
-quality, minimum avg Phred qual: 25.00
-hh, maximum allowed SNP density: 0.30
-gtscore, score cutoff for genotyping: 0
SNP Filtering:
<END_PARAMETERS>

<BEGIN CONTIG>
Gene1.c1-Contig

<BEGIN_INDEL>
74148 150 H_BS-3970tPCR0004869_066a.g1 4 hetDel TCTG 72 0.38
74148 146 H_BS-3975tPCR0004869_066a.g1 4 hetDel TCTG 76 0.52
74151 413 H_BS-4002tPCR0004869_030a.g1 4 hetDel GGCA 57 0.38
74151 140 H_BS-4003tPCR0004869_066a.g1 4 hetDel GGCA 53 0.36
74152 139 H_BS-3948tPCR0004869_066a.g1 3 homoIns CTG 43 0.71
74152 141 H_BS-3958tPCR0004869_066a.g1 4 homoIns TCTG 54 0.72
<END_INDEL>

<BEGIN_SNP>
74060 74060 Gene1.c1 CC
74060 29 H_BS-3921t_030a.b1 CC 99
74060 31 H_BS-3928t_030a.b1 CC 98
74060 31 H_BS-3929t_030a.b1 CC 99
74060 31 H_BS-3932t_030a.b1 CC 84
74060 30 H_BS-3943t_030a.b1 CC 99
74060 31 H_BS-3945t_030a.b1 CC 99
74060 29 H_BS-3946t_030a.b1 CC 97
74060 28 H_BS-3947t_030a.b1 CC 98
74060 31 H_BS-3948t_030a.b1 CC 97
74060 32 H_BS-3953t_030a.b1 CG 99 heterozygous
74060 32 H_BS-3954t_030a.b1 CC 99
74060 31 H_BS-3958t_030a.b1 GG 98 homozygous_rare
74060 31 H_BS-3959t_030a.b1 CC 96
<END_SNP>
<END_CONTIG>
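For downstream processing, the indel rows can be pulled out of the report with a small parser. The sketch below relies only on the <BEGIN_INDEL> and <END_INDEL> markers and the eight whitespace-separated columns shown in the example above. The function name and dictionary keys are hypothetical, and the last column is kept under a neutral name because its meaning differs between het and homozygous indels.

```python
SAMPLE = """\
<BEGIN_INDEL>
74148 150 H_BS-3970tPCR0004869_066a.g1 4 hetDel TCTG 72 0.38
74152 139 H_BS-3948tPCR0004869_066a.g1 3 homoIns CTG 43 0.71
<END_INDEL>
"""


def parse_indel_rows(text):
    """Return one dict per row found between <BEGIN_INDEL> and <END_INDEL>."""
    rows, inside = [], False
    for line in text.splitlines():
        line = line.strip()
        if line == "<BEGIN_INDEL>":
            inside = True
        elif line == "<END_INDEL>":
            inside = False
        elif inside and line:
            ref_pos, read_pos, read, size, kind, seq, score, last = line.split()
            rows.append({
                "ref_pos": int(ref_pos),
                "read_pos": int(read_pos),
                "read": read,
                "size": int(size),
                "type": kind,
                "sequence": seq,
                "score": int(score),
                # Peak-height ratio for het indels, auxiliary score for homozygous ones.
                "last_column": float(last),
            })
    return rows


rows = parse_indel_rows(SAMPLE)
for row in rows:
    print(row["ref_pos"], row["type"], row["sequence"], row["score"])
```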