PolyScan-2.0 Usage
1. Required flags:
- -pd <project directory>
- Specify the path of the project to be analyzed. PolyScan
assumes that the project directory is organized in the following
structure:
project_dir/edit_dir, the working directory for the analysis
project_dir/chromat_dir, contains the chromatogram files in scf format
project_dir/phd_dir, contains the phd files created by Phred
project_dir/poly_dir, contains the poly files created by Phred
- -ace <input ace file>
- Specify the Consed ace file that contains the reference
sequence, the alignment of reads to the reference, and the links to the
phd files. The ace file must follow the format described in the Consed
documentation: http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt
Please name your reference sequence *.c1 in the ace file so that
polyscan can recognize it.
Example usage:
polyscan -pd /human/EGFR -ace /human/EGFR/edit_dir/EGFR.c1.ace.1
Or equivalently if you are running polyscan in the
/human/EGFR/edit_dir,
polyscan -pd .. -ace EGFR.c1.ace.1
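Before running polyscan it can help to confirm that the project directory has the layout described above. The following is a minimal sketch, not part of PolyScan: the helper name missing_subdirs is hypothetical, and only the four subdirectory names come from this page.

```python
from pathlib import Path

# Subdirectories that this page says PolyScan expects under the project directory.
EXPECTED_SUBDIRS = ("edit_dir", "chromat_dir", "phd_dir", "poly_dir")


def missing_subdirs(project_dir):
    """Return the expected subdirectories that are absent from project_dir.

    Hypothetical helper for illustration; PolyScan itself does not provide it.
    """
    root = Path(project_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Example call using the project path from the usage example above.
# An empty list means all four subdirectories are present.
print(missing_subdirs("/human/EGFR"))
```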
2. Optional flags:
2.1 Flags that affect both indel and SNP identification
- -o <output file>
- Specify where the analysis result is stored. If omitted, a
random file
name will be assigned.
- -contig <input contig name>
- Choose a single contig to analyze. If omitted, all contigs
in the ace
file will be analyzed.
- -pp <input polyphred file>
- Analyze the consensus SNP sites listed in the polyphred
file. This
enables PolyScan to be used as a filtering (rescoring) tool for
PolyPhred
- -sd <input snpdetector file>
- Analyze the consensus SNP sites listed in the SNPdetector
file. This
enables PolyScan to be used as a filtering (rescoring) tool for
SNPdetector v2.0
Example usage:
polyscan -pd .. -ace EGFR.c1.ace.1 -pp day1.polyphred.out -sd day1.goodSNPgenotype-a5.csv
This command forces polyscan to genotype consensus SNP sites listed in
day1.polyphred.out and day1.goodSNPgenotype-a5.csv
- -ofp <output fpoly directory>
- Dump the fpoly files into a specified directory. "fpoly"
files are formatted similarly to phred "poly" files.
Each row represents a called base position and contains 24 columns
representing:
1. the polyscan called primary base
2. its pixel position in the trace
3. the polyscan called secondary base
4. its pixel position in the trace
5-9. Channel A, peak position, amplitude, sharpness, regularity
10-14. Channel C, peak position, amplitude, sharpness, regularity
15-19. Channel G, peak position, amplitude, sharpness, regularity
20-24. Channel T, peak position, amplitude, sharpness, regularity
The definitions of sharpness and regularity can be found in our paper.
Example:
A 717 G 720 A 717 804.00 0.757839 0.990878 C 717 0.00 -0.000000 0.522925 G 720 66.00 0.242392 0.904314 T 717 4.00 0.370325 0.833227
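The 24-column layout above can be read with a few lines of Python. This is a sketch under stated assumptions: the names ChannelPeak and parse_fpoly_row are hypothetical, columns are assumed to be whitespace-separated, and peak positions are assumed to be integers as in the example row.

```python
from typing import NamedTuple


class ChannelPeak(NamedTuple):
    """Per-channel values from one fpoly row (hypothetical container)."""
    position: int
    amplitude: float
    sharpness: float
    regularity: float


def parse_fpoly_row(line):
    """Split one fpoly row into the 24 columns described above."""
    fields = line.split()
    if len(fields) != 24:
        raise ValueError(f"expected 24 columns, got {len(fields)}")
    row = {
        "primary_base": fields[0],
        "primary_position": int(fields[1]),
        "secondary_base": fields[2],
        "secondary_position": int(fields[3]),
        "channels": {},
    }
    # Columns 5-24: four groups of five values
    # (channel label, peak position, amplitude, sharpness, regularity).
    for start in range(4, 24, 5):
        label, pos, amp, sharp, reg = fields[start:start + 5]
        row["channels"][label] = ChannelPeak(int(pos), float(amp), float(sharp), float(reg))
    return row


example = ("A 717 G 720 A 717 804.00 0.757839 0.990878 C 717 0.00 -0.000000 0.522925 "
           "G 720 66.00 0.242392 0.904314 T 717 4.00 0.370325 0.833227")
row = parse_fpoly_row(example)
print(row["primary_base"], row["secondary_base"], row["channels"]["G"].amplitude)
```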
- -ifp <input fpoly directory>
- Read the fpoly files from a specified directory. This saves
the time needed to
recalculate the fpoly files and is especially useful when
running polyscan multiple times on the same project.
- -refseq <refseq id>
- Specify which reference sequence to use if there are
multiple reference
sequences in the assembly.
- -pr <secondary/primary peak ratio>
- range [0.0,1.0], default: 0.15
Specify a cutoff for the secondary/primary peak amplitude ratio. Peaks
whose amplitude ratio is lower than this cutoff are forced to have an allele
quality of 0 and are therefore not called by polyscan. We found that reducing this
ratio to 0.10 significantly increases the sensitivity in detecting
low-level signals frequently seen in tumor-derived samples without
significantly reducing specificity.
- -aq <minimal allele quality>
- range [0.0,1.0], default: 0.2
Eliminate low quality peaks (alleles) from analysis. Allele quality
scores are calculated for peaks in each of the four fluorescence
channels based on 7 features:
- max/min peak spacing ratio in a 7 base window centered
around the
current peak
- average sharpness in 7 bases
- average regularity in 7 bases
- ratio of distances between the current peak to the left
primary peak
and to the right primary peak
- sharpness of the current peak
- regularity of the current peak
Some features are similar to phred. Others are unique to PolyScan. An
artificial neural network is trained using these 7 features and the
corresponding phred quality scores using the phred called peaks. It is
then used to estimate allele quality scores for all identified peaks in
the four channels. The scale of the allele quality scores is normalized
to between 0 and 1.0 with 1.0 corresponding to a phred quality score of
54.
- -mpa <minimal peak amplitude>
- range [0,100], default: 50
Force peaks whose amplitude is smaller than the specified value to
have a zero allele quality, so that they are excluded from analysis.
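The interplay of the -pr, -aq and -mpa cutoffs can be sketched as follows. This is only an illustration of the rules as worded on this page, not PolyScan's actual code: the function names are hypothetical, the ratio is assumed to be the peak amplitude divided by the primary peak amplitude, and whether each comparison is strict is an assumption.

```python
def effective_allele_quality(amplitude, primary_amplitude, allele_quality,
                             pr=0.15, mpa=50.0):
    """Apply the -mpa and -pr cutoffs to one peak (hypothetical helper).

    A peak that fails either cutoff is given an allele quality of 0.
    """
    # -mpa: peaks with amplitude below the minimum are zeroed.
    if amplitude < mpa:
        return 0.0
    # -pr: peaks whose amplitude ratio to the primary peak is below the cutoff are zeroed.
    if primary_amplitude > 0 and amplitude / primary_amplitude < pr:
        return 0.0
    return allele_quality


def keep_peak(amplitude, primary_amplitude, allele_quality,
              pr=0.15, aq=0.2, mpa=50.0):
    """Return True if the peak survives all three cutoffs (assumed non-strict -aq test)."""
    quality = effective_allele_quality(amplitude, primary_amplitude, allele_quality, pr, mpa)
    return quality >= aq


# Amplitudes taken from the fpoly example row: secondary G peak 66.00, primary A peak 804.00.
# The allele quality 0.9 is a made-up value for illustration.
print(keep_peak(66.0, 804.0, 0.9))            # ratio is about 0.082, below the default 0.15
print(keep_peak(66.0, 804.0, 0.9, pr=0.05))   # a looser ratio cutoff
```

With the default -pr of 0.15 the secondary peak is dropped because 66/804 is roughly 0.082; lowering the ratio cutoff below that value lets it through, provided it also passes -mpa and -aq.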
- -v display version
- -h display usage
2.2 Flags that affect indel detection
Indel Signature Detection:
When there is a frame-shift between the two alleles of a diploid DNA,
one could observe a sequence of consecutively overlapping peaks in the
sequencing trace. We call such sequences of overlapping peaks Indel
Signatures.
Example:
An indel signature of a 5 bp AAGTT deletion in both sequencing directions.
- -im <percent identity match between secondary seq and
refseq>
- default: 0.85
Indel signatures are detected through multiple subsequence alignment
with the reference sequence (see details in our paper). This option sets
a threshold that requires the subsequence to be similar to the
reference, with some mismatches allowed to account for base-calling
errors. Subsequence alignments with a percent-identity-match smaller than
the specified value are not regarded as credible evidence for a
putative indel signature.
- -qs <length of the secondary seq for indel match>
- default: 20
Adjust the length of the subsequences for the segmented multiple
alignment.
- -mis <minimum indel signature size>
- default: 20
Specify the minimum size for a valid indel signature.
Indel Detection:
Detecting a heterozygous indel requires summarizing indel
signatures from multiple traces (in both directions). PolyScan automatically
computes a confidence score that reflects the strength of the existing
evidence.
- -indel
- Turn on indel detection
- -mms <n bp>
- discard 1bp het indels downstream of a <n bp>
homopolymer, default: 8
The indel signatures detected downstream of homopolymer tracts of <n
bp> are frequently sequencing artifacts. This option allows users to
specify their preferred <n>.
- -maxindelsize <maximum indel size to detect>
- default: 100 bp
Only report indels whose size is smaller than the specified value.
- -indelsource <pos1> <pos2>
- used in conjunction with the -indelgroup option,
default: 1 25
Reads having identical substring between <pos1> and <pos2>
in their read names are grouped together for indel identification. A
clustering algorithm is used to automatically segregate this
user-defined group into smaller subgroups, each containing identical
frame-shifted patterns within a small genomic region (defined by
-indelgroup). Those remaining in the same subgroup after clustering are
jointly analyzed as a population. We found that grouping reads together
effectively enhanced the accuracy of indel detection.
Example usage:
For the following reads, where x and y represent sequencing direction:
H_sample1_amplicon1.x.gz
H_sample1_amplicon1.y.gz
H_sample1_amplicon2.x.gz
H_sample1_amplicon2.y.gz
H_sample2_amplicon1.x.gz
H_sample2_amplicon1.y.gz
H_sample2_amplicon2.x.gz
H_sample2_amplicon2.y.gz
-indelsource 1 8, group all reads into the same group
-indelsource 11 19, split reads into two groups: amplicon1 and amplicon2
-indelsource 1 9, split reads into two groups: H_sample1 and H_sample2
-indelsource 1 19, group paired reads
-indelsource 1 24, each read in its own group
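The grouping rule in the examples above can be reproduced with a short Python sketch. It assumes that <pos1> and <pos2> are 1-based, inclusive character positions in the read name, which is what the examples imply. The function name group_reads is hypothetical, and the sketch covers only the initial grouping by read name, not the clustering step that follows.

```python
from collections import defaultdict


def group_reads(read_names, pos1, pos2):
    """Group read names by the substring between pos1 and pos2 (1-based, inclusive)."""
    groups = defaultdict(list)
    for name in read_names:
        key = name[pos1 - 1:pos2]  # convert 1-based inclusive range to a Python slice
        groups[key].append(name)
    return dict(groups)


reads = [
    "H_sample1_amplicon1.x.gz", "H_sample1_amplicon1.y.gz",
    "H_sample1_amplicon2.x.gz", "H_sample1_amplicon2.y.gz",
    "H_sample2_amplicon1.x.gz", "H_sample2_amplicon1.y.gz",
    "H_sample2_amplicon2.x.gz", "H_sample2_amplicon2.y.gz",
]

# Number of groups produced by each -indelsource setting from the examples.
for pos1, pos2 in [(1, 8), (11, 19), (1, 9), (1, 19), (1, 24)]:
    print(pos1, pos2, len(group_reads(reads, pos1, pos2)))
```

The same slicing would apply to the -source option, which this page describes as the SNP-detection counterpart of -indelsource.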
- -indelgroup <n bp>
- used in conjunction with the -indelsource option,
default: 50bp
Reads from the same indelsource and having indel signatures within
<n bp> are jointly analyzed.
- -indelscore <indel confidence score threshold>
- range [0,100], default: 0
Report only indels with confidence scores greater than the threshold.
2.3 Flags that affect SNP detection
- -genotype 0
- Output the genotypes at all SNP sites. This is the default.
-genotype -1 turns off SNP detection.
- -gtscore <genotype confidence score threshold>
- range [0,100], default: 0
Report genotypes whose confidence scores are greater than the
threshold.
- -quality <minimum average phred quality for
analysis>
- range [10, 30], default: 20
Only genotype positions where the arithmetic average phred quality score
is greater than the threshold. The average is currently estimated from an
11 base pair window.
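The -quality test can be illustrated with a small sketch. This page does not say how the 11 base pair window is placed, so the code assumes a window centered on the position and truncated at the ends of the read. Both function names are hypothetical.

```python
def windowed_mean_quality(quals, index, window=11):
    """Arithmetic mean of phred qualities in a window centered on index.

    Assumption: the window is centered and shortened at the read ends.
    index must be a valid position in quals.
    """
    half = window // 2
    lo = max(0, index - half)
    hi = min(len(quals), index + half + 1)
    return sum(quals[lo:hi]) / (hi - lo)


def passes_quality(quals, index, threshold=20.0, window=11):
    """Return True if the windowed mean quality is greater than the -quality threshold."""
    return windowed_mean_quality(quals, index, window) > threshold


# Made-up qualities: a single good base surrounded by poor ones.
quals = [10] * 5 + [40] + [10] * 5
print(windowed_mean_quality(quals, 5))
print(passes_quality(quals, 5))
```

In this made-up read the center base has a quality of 40, but the windowed average stays below the default threshold of 20, so the position would not be genotyped under these assumptions.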
- -hh <density of het in a 20bp window>
- default: 0.3
Only detect SNPs in regions where the density of heterozygous genotypes
is smaller than the threshold
- -source <pos1> <pos2>
- default: 1 25, group reads for snp detection
Similar to -indelsource but applied to SNP detection
- -nr <noise reduction factor>
- range [0.0,1.0], default: 0.0
Specify noise reduction factor. This option helps remove the background
noise due to peak spilling and paralog amplification. For details
please read the sections of horizontal and vertical scans in our paper.
All options that take input arguments have pre-selected default values
based on our experience. The users are encouraged to find their
favorite parameter settings in their environment. The default
parameters normally lead to a reasonable but not necessarily optimal
outcome.
Output
The output of PolyScan is in a concise and self-explanatory format. The
<PARAMETER> section records the parameters used in this
experiment. The indels and SNPs are reported in the <CONTIG>
sections, each of which contains a section for <INDEL> and a section for
<SNP>. Each indel is reported in a row of 8 columns: reference
position, read position, read name, indel size, indel type, indel
sequence, confidence score, and the average ratio of the heights of the
secondary peaks versus the primary peaks. Each genotype is reported in
a row of 5 or 6 columns: reference position, read position, read name,
genotype, confidence score, and comments.
Example:
<BEGIN_PARAMETERS>
Global Parameters:
Project dir: ..
Fpoly dir: ../fpoly_dir/
Ace file: Gene1.c1.ace.begin
Program: polyscan 2.0
-pr, minimum secondary/primary peak ratio: 0.1500
Parameters for Indel analysis:
-maxindelsize, maximum indel size: 100
-indelgroupsize, indel group size: 100
-qs, indel subsequence size: 20
-im, minimum indelscan identity match: 0.85
-mis, minimum indelscan sigature size: 20
-indelsource, 8, 10
-indelscore, confidence score cutoff for het indel: 0
Parameters for SNP analysis:
-quality, minimum avg Phred qual: 25.00
-hh, maximum allowed SNP density: 0.30
-gtscore, score cutoff for genotyping: 0
SNP Filtering:
<END_PARAMETERS>
<BEGIN CONTIG> Gene1.c1-Contig
<BEGIN_INDEL>
74338 225 H_BS-4000t_030a.g1 1 deletion C 43 0.66
74151 413 H_BS-4002t_030a.g1 4 deletion TGGC 15 0.38
74151 140 H_BS-4003t_066a.g1 4 insertion GTCG 23 0.34
74338 225 H_BS-4004t_030a.g1 1 deletion C 43 0.81
74144 146 H_BS-4004t_066a.g1 4 deletion AGAT 28 0.77
74106 78 H_BS-4005t_030a.b1 3 deletion TTA 5 0.30
74467 100 H_BS-4005t_030a.g1 1 deletion A 38 0.62
74340 224 H_BS-4006t_030a.g1 1 insertion C 58 0.47
<END_INDEL>
<BEGIN_SNP>
74060 74060 Gene1.c1 CC
74060 29 H_BS-3921t_030a.b1 CC 99
74060 31 H_BS-3928t_030a.b1 CC 98
74060 31 H_BS-3929t_030a.b1 CC 99
74060 31 H_BS-3932t_030a.b1 CC 84
74060 30 H_BS-3943t_030a.b1 CC 99
74060 31 H_BS-3945t_030a.b1 CC 99
74060 29 H_BS-3946t_030a.b1 CC 97
74060 28 H_BS-3947t_030a.b1 CC 98
74060 31 H_BS-3948t_030a.b1 CC 97
74060 32 H_BS-3953t_030a.b1 CG 99 heterozygous
74060 32 H_BS-3954t_030a.b1 CC 99
74060 31 H_BS-3958t_030a.b1 GG 98 homozygous_rare
74060 31 H_BS-3959t_030a.b1 CC 96
<END_SNP>
<END_CONTIG>
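Individual result rows from the example can be split into named fields as sketched below. This is an illustration only: the names IndelCall, GenotypeCall, parse_indel_row and parse_genotype_row are hypothetical, and fields are assumed to be whitespace-separated. Each indel row in the example carries eight fields; each per-read genotype row carries five, plus an optional comment. The four-field reference row at the top of the SNP block is not handled.

```python
from typing import NamedTuple, Optional


class IndelCall(NamedTuple):
    ref_position: int
    read_position: int
    read_name: str
    size: int
    kind: str          # "insertion" or "deletion" in the example
    sequence: str
    score: float
    peak_ratio: float  # average secondary/primary peak height ratio


class GenotypeCall(NamedTuple):
    ref_position: int
    read_position: int
    read_name: str
    genotype: str
    score: float
    comment: Optional[str]


def parse_indel_row(line):
    """Parse one indel row of eight whitespace-separated fields."""
    f = line.split()
    if len(f) != 8:
        raise ValueError(f"expected 8 fields, got {len(f)}")
    return IndelCall(int(f[0]), int(f[1]), f[2], int(f[3]), f[4], f[5],
                     float(f[6]), float(f[7]))


def parse_genotype_row(line):
    """Parse one per-read genotype row of five fields plus an optional comment."""
    f = line.split()
    if len(f) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(f)}")
    comment = " ".join(f[5:]) or None
    return GenotypeCall(int(f[0]), int(f[1]), f[2], f[3], float(f[4]), comment)


indel = parse_indel_row("74338 225 H_BS-4000t_030a.g1 1 deletion C 43 0.66")
het = parse_genotype_row("74060 32 H_BS-3953t_030a.b1 CG 99 heterozygous")
print(indel.kind, indel.sequence, het.genotype, het.comment)
```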