PolyScan-2.2 Usage

1. Required flags:

-pd <project directory>
Specify the path of the project to be analyzed. PolyScan assumes that the project directory is organized in the following structure:
project_dir/edit_dir, the directory in which the analysis is run
project_dir/chromat_dir, contains the chromatogram files in scf format
project_dir/phd_dir, contains the phd files created by Phred
project_dir/poly_dir, contains the poly files created by Phred
-ace <input ace file>
Specify the Consed ace file that contains the reference sequence, the alignment of reads to the reference, and the links to the phd files. The ace file must follow the format described in the Consed documentation: http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt
Please name your reference sequence *.c1 in the ace file so polyscan can recognize it.

Example usage:

polyscan -pd /human/EGFR -ace /human/EGFR/edit_dir/EGFR.c1.ace.1
Or equivalently if you are running polyscan in the /human/EGFR/edit_dir,
polyscan -pd .. -ace EGFR.c1.ace.1
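
Before launching a run, it can help to confirm that the project directory has the layout described above. The following is a minimal Python sketch, not part of PolyScan: the helper names missing_subdirs and build_command are hypothetical, and the sketch only checks that the four subdirectories exist and assembles the two required flags.

```python
from pathlib import Path

# Subdirectories that PolyScan expects inside the project directory.
EXPECTED_SUBDIRS = ("edit_dir", "chromat_dir", "phd_dir", "poly_dir")


def missing_subdirs(project_dir):
    """Return the expected subdirectories that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


def build_command(project_dir, ace_file):
    """Build the argument list for the two required flags (-pd and -ace)."""
    return ["polyscan", "-pd", str(project_dir), "-ace", str(ace_file)]
```

If missing_subdirs("/human/EGFR") returns an empty list, the list returned by build_command can be handed to subprocess.run to start the analysis.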

2. Optional flags:

2.1 Flags that affect both indel and SNP identification

-o <output file>
Specify where the analysis result is stored. If omitted, a random file name will be assigned.
-contig <input contig name>
Choose a single contig to analyze. If omitted, all contigs in the ace file will be analyzed.
-pp <input polyphred file>
Analyze the consensus SNP sites listed in the polyphred file. This enables PolyScan to be used as a filtering (rescoring) tool for PolyPhred.
-sd <input snpdetector file>
Analyze the consensus SNP sites listed in the SNPdetector file. This enables PolyScan to be used as a filtering (rescoring) tool for SNPdetector v2.0.
Example usage:
polyscan -pd .. -ace EGFR.c1.ace.1 -pp day1.polyphred.out -sd day1.goodSNPgenotype-a5.csv
This command forces polyscan to genotype the consensus SNP sites listed in day1.polyphred.out and day1.goodSNPgenotype-a5.csv.
-ofp <output fpoly directory>
Dump the fpoly files into a specified directory. "fpoly" files are formatted like phred "poly" files. Each row represents a called base position and contains 24 columns representing:
1. the polyscan called primary base
2. its pixel position in the trace
3. the polyscan called secondary base
4. its pixel position in the trace
5-9. Channel A, peak position, amplitude, sharpness, regularity
10-14. Channel C, peak position, amplitude, sharpness, regularity
15-19. Channel G, peak position, amplitude, sharpness, regularity
20-24. Channel T, peak position, amplitude, sharpness, regularity

The definitions of sharpness and regularity can be found in our paper.
Example:
A 717 G 720 A 717 804.00 0.757839 0.990878 C 717 0.00 -0.000000 0.522925 G 720 66.00 0.242392 0.904314 T 717 4.00 0.370325 0.833227
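
The column layout above can be read with a few lines of Python. This is a minimal sketch, assuming each fpoly row is a single line of 24 whitespace-separated fields in the order listed; the names ChannelPeak, FpolyRow and parse_fpoly_row are illustrative and not part of PolyScan.

```python
from typing import NamedTuple


class ChannelPeak(NamedTuple):
    base: str          # channel label: A, C, G or T
    position: int      # peak position in the trace
    amplitude: float
    sharpness: float
    regularity: float


class FpolyRow(NamedTuple):
    primary_base: str
    primary_position: int
    secondary_base: str
    secondary_position: int
    channels: dict     # maps channel label to ChannelPeak


def parse_fpoly_row(line):
    """Parse one fpoly row (assumed: 24 whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != 24:
        raise ValueError(f"expected 24 fields, found {len(fields)}")
    channels = {}
    # Columns 5-24 hold four groups of five fields, one group per channel.
    for start in range(4, 24, 5):
        base, pos, amp, sharp, reg = fields[start:start + 5]
        channels[base] = ChannelPeak(base, int(pos), float(amp),
                                     float(sharp), float(reg))
    return FpolyRow(fields[0], int(fields[1]), fields[2], int(fields[3]),
                    channels)
```

Applied to the example row, the primary call is A at pixel 717 and the secondary call is G at pixel 720, with channel G holding an amplitude of 66.00.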
-ifp <input fpoly directory>
Read the fpoly files from a specified directory. This saves the time needed to recalculate the fpoly files and is especially useful when running polyscan multiple times on the same project.
-refseq <refseq id>
Specify which reference sequence to use if there are multiple reference sequences in the assembly.
-pr <secondary/primary peak ratio>
range [0.0,1.0], default: 0.15
Specify a cutoff for the secondary/primary peak amplitude ratio. Peaks whose amplitude relative to the primary peak is lower than this ratio are forced to have an allele quality of 0 and are therefore not called by polyscan. We found that reducing this ratio to 0.10 significantly increases the sensitivity in detecting low-level signals frequently seen in tumor-derived samples without significantly reducing specificity.
-aq <minimal allele quality>
range [0.0,1.0], default: 0.2
Eliminate low quality peaks (alleles) from analysis. Allele quality scores are calculated for peaks in each of the four fluorescence channels based on 7 features:
  • max/min peak spacing ratio in a 7 base window centered around the current peak
  • average sharpness in 7 bases
  • average regularity in 7 bases
  • ratio of distances between the current peak to the left primary peak and to the right primary peak
  • sharpness of the current peak
  • regularity of the current peak

Some features are similar to those used by phred; others are unique to PolyScan. An artificial neural network is trained on these 7 features and the corresponding phred quality scores of the peaks called by phred. It is then used to estimate allele quality scores for all identified peaks in the four channels. The allele quality scores are normalized to the range 0 to 1.0, with 1.0 corresponding to a phred quality score of 54.
-mpa <minimal peak amplitude>
range [0,100], default: 50
Force peaks whose amplitude is smaller than the specified value to have a zero allele quality, which excludes them from analysis.
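
The three peak-level cutoffs (-pr, -aq and -mpa) can be pictured as a single filter. The Python sketch below is only an illustration under stated assumptions: the function name allele_is_kept is hypothetical, the order of the checks and the treatment of values exactly at a threshold are guesses, and PolyScan's internal logic may differ.

```python
def allele_is_kept(amplitude, primary_amplitude, allele_quality,
                   pr=0.15, mpa=50, aq=0.2):
    """Return True if a peak would survive the -mpa, -pr and -aq cutoffs.

    Assumed behavior (not confirmed by the PolyScan documentation):
    a peak failing -mpa or -pr gets allele quality 0 and is dropped,
    and a peak with allele quality below -aq is dropped.
    """
    # -mpa: peaks below the minimal amplitude are excluded.
    if amplitude < mpa:
        return False
    # -pr: peaks too small relative to the primary peak are excluded.
    if primary_amplitude > 0 and amplitude / primary_amplitude < pr:
        return False
    # -aq: remaining peaks must reach the minimal allele quality.
    return allele_quality >= aq
```

For instance, a secondary peak of amplitude 66 next to a primary peak of amplitude 804 has a ratio of about 0.08, so under these assumptions it is dropped at the default -pr of 0.15 and also at 0.10.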
-v display version
-h display usage

2.2 Flags that affect indel detection

Indel Signature Detection:

When there is a frame-shift between the two alleles of a diploid DNA, one could observe a sequence of consecutively overlapping peaks in the sequencing trace. We call such sequences of overlapping peaks Indel Signatures.

Example:
An indel signature of a 5 bp AAGTT deletion in both sequencing directions.

-im <percent identity match between secondary seq and refseq>
default: 0.85
Indel signatures are detected through multiple subsequence alignment with the reference sequence (see details in our paper). This option sets a threshold on how similar a subsequence must be to the reference, with some mismatches allowed to account for base-calling errors. Subsequence alignments with a percent identity match smaller than the specified value are not regarded as credible evidence for a putative indel signature.
-qs <length of the secondary seq for indel match>
default: 20
Adjust the length of the subsequences for the segmented multiple alignment.
-mis <minimum indel signature size>
default: 20
Specify the minimum size for a valid indel signature.


Indel Detection:

Detecting a heterozygous indel requires summarizing indel signatures from multiple traces (in both directions). PolyScan does this automatically and computes a confidence score that reflects the strength of the existing evidence.
-indel
Turn on indel detection
-mms <n bp>
Discard 1 bp heterozygous indels downstream of an <n bp> homopolymer; default: 8
Indel signatures detected downstream of homopolymer tracts of <n bp> are frequently sequencing artifacts. This option lets users choose their preferred <n>.
-maxindelsize <maximum indel size to detect>
default: 100 bp
Only report indels smaller than the specified size.
-indelsource <pos1> <pos2>
used in conjunction with the -indelgroup option, default: 1 25
Reads having identical substring between <pos1> and <pos2> in their read names are grouped together for indel identification. A clustering algorithm is used to automatically segregate this user-defined group into smaller subgroups, each containing identical frame-shifted patterns within a small genomic region (defined by -indelgroup). Those remaining in the same subgroup after clustering are jointly analyzed as a population. We found that grouping reads together effectively enhanced the accuracy of indel detection.
Example usage:
For the following reads, where x and y represent the sequencing direction:
H_sample1_amplicon1.x.gz
H_sample1_amplicon1.y.gz
H_sample1_amplicon2.x.gz
H_sample1_amplicon2.y.gz
H_sample2_amplicon1.x.gz
H_sample2_amplicon1.y.gz
H_sample2_amplicon2.x.gz
H_sample2_amplicon2.y.gz

-indelsource 1 8, group all reads into the same group
-indelsource 11 19, split reads into two groups: amplicon1 and amplicon2
-indelsource 1 9, split reads into two groups: H_sample1 and H_sample2
-indelsource 1 19, group paired reads
-indelsource 1 24, each read in its own group
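
The grouping in these examples can be reproduced with a short Python sketch. It assumes that <pos1> and <pos2> are 1-based, inclusive character positions in the read name, which is consistent with the examples above; the function name group_reads is hypothetical, and the subsequent clustering step is not modeled.

```python
from collections import defaultdict


def group_reads(read_names, pos1, pos2):
    """Group read names by the substring between pos1 and pos2.

    Positions are assumed to be 1-based and inclusive.
    """
    groups = defaultdict(list)
    for name in read_names:
        key = name[pos1 - 1:pos2]  # convert to a 0-based Python slice
        groups[key].append(name)
    return dict(groups)


reads = [
    "H_sample1_amplicon1.x.gz", "H_sample1_amplicon1.y.gz",
    "H_sample1_amplicon2.x.gz", "H_sample1_amplicon2.y.gz",
    "H_sample2_amplicon1.x.gz", "H_sample2_amplicon1.y.gz",
    "H_sample2_amplicon2.x.gz", "H_sample2_amplicon2.y.gz",
]

for pos1, pos2 in [(1, 8), (11, 19), (1, 9), (1, 19), (1, 24)]:
    print(pos1, pos2, sorted(group_reads(reads, pos1, pos2)))
```

Running the loop shows one key for 1 8, two keys each for 11 19 and 1 9, four keys for 1 19, and eight keys for 1 24.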
-indelgroup <n bp>
used in conjunction with the -indelsource option, default: 50bp
Reads from the same indelsource and having indel signatures within <n bp> are jointly analyzed.
-indelscore <indel confidence score threshold>
range [0,100], default: 0
Report only indels with a confidence score greater than the threshold.

2.3 Flags that affect SNP detection

-genotype 0
Output the genotypes at all SNP sites. This is the default.
-genotype -1
Turn off SNP detection.
-gtscore <genotype confidence score threshold>
range [0,100], default: 0
Report genotypes whose confidence scores are greater than the threshold.
-quality <minimum average phred quality for analysis>
range [10, 30], default: 20
Genotype only positions where the arithmetic average of the phred quality scores is greater than the threshold. The average is currently computed over an 11 base pair window.
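
As an illustration of the -quality cutoff, the Python sketch below averages phred scores over an 11 base window. It assumes the window is centered on the position and truncated at the ends of the read; the source does not state either detail, and the function names are hypothetical.

```python
def windowed_mean_quality(qualities, index, window=11):
    """Arithmetic mean of phred scores in a window around index.

    Assumption: the window is centered on index and clipped at read ends.
    """
    half = window // 2
    start = max(0, index - half)
    end = min(len(qualities), index + half + 1)
    segment = qualities[start:end]
    return sum(segment) / len(segment)


def passes_quality(qualities, index, threshold=20, window=11):
    """True if the windowed mean quality is greater than the threshold."""
    return windowed_mean_quality(qualities, index, window) > threshold
```

With this definition a single low-quality base surrounded by high-quality bases can still be genotyped, because the decision depends on the window average rather than on the base itself.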
-hh <density of het in a 20bp window>
default: 0.3
Only detect SNPs in regions where the density of heterozygous genotypes is smaller than the threshold.
-source <pos1> <pos2>
default: 1 25
Group reads for SNP detection. This works like -indelsource but applies to SNP detection.
-nr <noise reduction factor>
range [0.0,1.0], default: 0.0
Specify noise reduction factor. This option helps remove the background noise due to peak spilling and paralog amplification. For details please read the sections of horizontal and vertical scans in our paper.

All options that take input arguments have default values pre-selected based on our experience. The defaults normally lead to a reasonable but not necessarily optimal outcome, so users are encouraged to tune the parameter settings for their own environment.

Output

The output of PolyScan is concise and largely self-explanatory. The <PARAMETER> section records the parameters used in the run. Indels and SNPs are reported in the <CONTIG> sections, each of which contains an <INDEL> section and a <SNP> section. Each indel is reported in a row of 8 columns: reference position, read position, read name, indel size, indel type, indel sequence, confidence score, and the average ratio of the heights of the secondary peaks versus the primary peaks. Each genotype is reported in a row of 5 or 6 columns: reference position, read position, read name, genotype, confidence score, and comments.

Example:
<BEGIN_PARAMETERS>
Global Parameters:
Project dir: ..
Fpoly dir: ../fpoly_dir/
Ace file: Gene1.c1.ace.begin
Program: polyscan 2.2
-pr, minimum secondary/primary peak ratio: 0.1500
Parameters for Indel analysis:
-maxindelsize, maximum indel size: 100
-indelgroupsize, indel group size: 100
-qs, indel subsequence size: 20
-im, minimum indelscan identity match: 0.85
-mis, minimum indelscan sigature size: 20
-indelsource, 8, 10
-indelscore, confidence score cutoff for het indel: 0
Parameters for SNP analysis:
-quality, minimum avg Phred qual: 25.00
-hh, maximum allowed SNP density: 0.30
-gtscore, score cutoff for genotyping: 0
SNP Filtering:
<END_PARAMETERS>

<BEGIN CONTIG>
Gene1.c1-Contig

<BEGIN_INDEL>
74338 225 H_BS-4000t_030a.g1 1 deletion C 43 0.66
74151 413 H_BS-4002t_030a.g1 4 deletion TGGC 15 0.38
74151 140 H_BS-4003t_066a.g1 4 insertion GTCG 23 0.34
74338 225 H_BS-4004t_030a.g1 1 deletion C 43 0.81
74144 146 H_BS-4004t_066a.g1 4 deletion AGAT 28 0.77
74106 78 H_BS-4005t_030a.b1 3 deletion TTA 5 0.30
74467 100 H_BS-4005t_030a.g1 1 deletion A 38 0.62
74340 224 H_BS-4006t_030a.g1 1 insertion C 58 0.47
<END_INDEL>

<BEGIN_SNP>
74060 74060 Gene1.c1 CC
74060 29 H_BS-3921t_030a.b1 CC 99
74060 31 H_BS-3928t_030a.b1 CC 98
74060 31 H_BS-3929t_030a.b1 CC 99
74060 31 H_BS-3932t_030a.b1 CC 84
74060 30 H_BS-3943t_030a.b1 CC 99
74060 31 H_BS-3945t_030a.b1 CC 99
74060 29 H_BS-3946t_030a.b1 CC 97
74060 28 H_BS-3947t_030a.b1 CC 98
74060 31 H_BS-3948t_030a.b1 CC 97
74060 32 H_BS-3953t_030a.b1 CG 99 heterozygous
74060 32 H_BS-3954t_030a.b1 CC 99
74060 31 H_BS-3958t_030a.b1 GG 98 homozygous_rare
74060 31 H_BS-3959t_030a.b1 CC 96
<END_SNP>
<END_CONTIG>
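
The indel and genotype rows can be collected with a small Python reader. This is a minimal sketch based only on the example above: the function name parse_polyscan_output and the dictionary keys are hypothetical, and the 4-column reference row that opens each SNP block is skipped.

```python
def parse_polyscan_output(text):
    """Collect indel and genotype rows from PolyScan output text."""
    indels, genotypes = [], []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "<BEGIN_INDEL>":
            section = "indel"
        elif line == "<BEGIN_SNP>":
            section = "snp"
        elif line in ("<END_INDEL>", "<END_SNP>"):
            section = None
        elif section == "indel":
            f = line.split()
            if len(f) == 8:  # the 8 indel columns described above
                indels.append({
                    "ref_pos": int(f[0]), "read_pos": int(f[1]),
                    "read": f[2], "size": int(f[3]), "type": f[4],
                    "sequence": f[5], "score": float(f[6]),
                    "peak_ratio": float(f[7]),
                })
        elif section == "snp":
            f = line.split()
            if len(f) >= 5:  # rows with fewer columns (reference row) are skipped
                genotypes.append({
                    "ref_pos": int(f[0]), "read_pos": int(f[1]),
                    "read": f[2], "genotype": f[3], "score": float(f[4]),
                    "comment": f[5] if len(f) > 5 else "",
                })
    return indels, genotypes
```

The two returned lists can then be filtered, for example to keep only genotypes whose comment is heterozygous or indels above a chosen confidence score.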