PolyScan-3.0 Usage
Updates:
PolyScan-3.0 now supports homozygous indel detection. Heterozygous
indel and SNP identification remain unchanged from polyscan-2.2.
Homozygous indels are found as gaps in the high-quality region of the
reads. A set of criteria is implemented to reduce the false positive
rate by filtering on phred quality and on sequence identity in the
regions flanking the gaps. For 1bp indels, the score increases if
multiple reads support the indel and decreases otherwise. Our tests
indicate that polyscan-3.0 detects homozygous indels with around 80%
specificity and 80% sensitivity.
New Options:
- -hiq <minimum avg phred quality score for homozygous indels>
- Specify the cutoff for 1bp indels. The cutoff for longer indels is
0.9 lower per bp, because longer gaps are less likely to occur by
chance. default: 35
- -him <fraction of flanking bases that match the reference sequence>
- default: 0.95, which requires at least 95% of the flanking bases to
match the reference.
- -nc <a linking table file>
- A linking table can be used to specify detailed information about
the reads and the complex relationships among them. With a linking
table, arbitrary read names can be used. This version supports a
simple .csv or tab-delimited file with four columns:
- Read name, which can be any non-redundant string
- Sample ID, a string describing the DNA origin of the read, which
affects the statistical test in SNP identification. This is a
generalization of the "-source" option used for SNP detection
- IndelGroup ID, a string that describes how reads should be
initially grouped together for joint heterozygous indel detection.
This is a generalization of the "-indelsource" option used for het
indel detection
- Amplicon ID, a string denoting the amplicon that the current read
was sequenced from
- Example:
- H_P00001PCR3029_029a.g1, P00001, 1, 029a
- H_P00002PCR3029_032a.b1, P00002, 1, 032a
- This linking table groups the two reads together for joint het
indel analysis although they are sequenced from two different
amplicons (029a vs 032a) and from two different samples (P00001 vs
P00002).
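The four-column linking table described above can be read with a few lines of Python. This is only an illustrative sketch, not PolyScan's own parser: the function names and dictionary keys are hypothetical, and the sketch assumes there is no header row and that a file is either comma-delimited or tab-delimited throughout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical key names for the four documented columns.
COLUMNS = ("read", "sample", "indel_group", "amplicon")

def read_linking_table(text):
    """Parse a four-column linking table given as comma- or tab-delimited text."""
    delimiter = "\t" if "\t" in text else ","
    rows, seen = [], set()
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        fields = [f.strip() for f in fields]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 4 columns, got {len(fields)}: {fields}")
        row = dict(zip(COLUMNS, fields))
        if row["read"] in seen:
            raise ValueError(f"read names must be unique: {row['read']}")
        seen.add(row["read"])
        rows.append(row)
    return rows

def group_reads(rows, key):
    """Map each value of the chosen column to the read names that carry it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["read"])
    return dict(groups)

table = """\
H_P00001PCR3029_029a.g1, P00001, 1, 029a
H_P00002PCR3029_032a.b1, P00002, 1, 032a
"""
rows = read_linking_table(table)
print(group_reads(rows, "indel_group"))  # both reads share IndelGroup ID "1"
print(group_reads(rows, "sample"))       # one read per Sample ID
```

Grouping on the IndelGroup ID column puts both example reads in one group, while grouping on the Sample ID column keeps them apart, which mirrors the behaviour described for the example table.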
Usage
1. Required flags:
- -pd <project directory>
- Specify the path of the project to be analyzed. PolyScan
assumes that the project directory is organized in the following
structure:
project_dir/edit_dir, the directory for analysis
project_dir/chromat_dir, contains the chromatogram files in scf format
project_dir/phd_dir, contains the phd files created by Phred
project_dir/poly_dir, contains the poly files created by Phred
- -ace <input ace file>
- Specify the Consed ace file that contains the reference sequence,
the alignment of reads to the reference, and the links to the phd
files. The ace file must follow the format described in the Consed
documentation: http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt
Please name your reference sequence *.c1 in the ace file so that
polyscan can recognize it.
- -refseq <refseq id>
- Specify which read in the ace file represents the reference
sequence, usually obtained from sequence databases such as NCBI
GenBank. Reading of the refseq-associated scf, phd, and poly files is
automatically skipped.
Example usage:
polyscan -pd /human/EGFR -ace /human/EGFR/edit_dir/EGFR.c1.ace.1 -refseq EGFR.c1
Or, equivalently, if you are running polyscan in /human/EGFR/edit_dir:
polyscan -pd .. -ace EGFR.c1.ace.1 -refseq EGFR.c1
2. Optional flags:
2.1 Flags that affect both indel and SNP identification
- -o <output file>
- Specify where the analysis result is stored. If omitted, a random
file name will be assigned.
- -contig <input contig name>
- Choose a single contig to analyze. If omitted, all contigs in the
ace file will be analyzed.
- -pp <input polyphred file>
- Analyze the consensus SNP sites listed in the polyphred file. This
enables PolyScan to be used as a filtering (rescoring) tool for
PolyPhred.
- -sd <input snpdetector file>
- Analyze the consensus SNP sites listed in the SNPdetector file. This
enables PolyScan to be used as a filtering (rescoring) tool for
SNPdetector v2.0.
Example usage:
polyscan -pd .. -ace EGFR.c1.ace.1 -refseq EGFR.c1 -pp day1.polyphred.out -sd day1.goodSNPgenotype-a5.csv
- This command forces polyscan to genotype the consensus SNP sites
listed in day1.polyphred.out and day1.goodSNPgenotype-a5.csv
- -ofp <output fpoly directory>
- Dump the fpoly files into a specified directory. "fpoly" files are
formatted similarly to phred "poly" files.
Each row represents a called base position and contains 24 columns
representing:
1. the polyscan called primary base
2. its pixel position in the trace
3. the polyscan called secondary base
4. its pixel position in the trace
5-9. Channel A: channel label, peak position, amplitude, sharpness, regularity
10-14. Channel C: channel label, peak position, amplitude, sharpness, regularity
15-19. Channel G: channel label, peak position, amplitude, sharpness, regularity
20-24. Channel T: channel label, peak position, amplitude, sharpness, regularity
The definitions of sharpness and regularity can be found in our paper.
Example:
A 717 G 720 A 717 804.00 0.757839 0.990878 C 717 0.00 -0.000000 0.522925 G 720 66.00 0.242392 0.904314 T 717 4.00 0.370325 0.833227
- -ifp <input fpoly directory>
- Read the fpoly files from a specified directory. This saves the
time needed to recalculate the fpoly files and is especially useful
when running polyscan multiple times on the same project.
- -pr <secondary/primary peak ratio>
- range [0.0,1.0], default: 0.15
Specify a cutoff for the secondary/primary peak amplitude ratio. Peaks
whose amplitude relative to the primary peak is lower than this ratio
are forced to have a 0 allele quality and are therefore not called by
polyscan. We found that reducing this ratio to 0.10 significantly
increases the sensitivity in detecting low-level signals frequently
seen in tumor-derived samples without significantly reducing
specificity.
- -aq <minimal allele quality>
- range [0.0,1.0], default: 0.2
Eliminate low quality peaks (alleles) from analysis. Allele quality
scores are calculated for peaks in each of the four fluorescence
channels based on 7 features:
- max/min peak spacing ratio in a 7 base window centered around the
current peak
- average sharpness in 7 bases
- average regularity in 7 bases
- ratio of distances between the current peak to the left primary
peak and to the right primary peak
- sharpness of the current peak
- regularity of the current peak
Some features are similar to phred. Others are unique to PolyScan. An
artificial neural network is trained on these 7 features and the
corresponding phred quality scores of the phred called peaks. It is
then used to estimate allele quality scores for all identified peaks in
the four channels. The scale of the allele quality scores is normalized
to between 0 and 1.0, with 1.0 corresponding to a phred quality score
of 54.
- -mpa <minimal peak amplitude>
- range [0,100], default: 50
Force peaks whose amplitude is smaller than the specified value to
have a zero allele quality; they are therefore excluded from analysis.
- -v display version
- -h display usage
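As a rough illustration of the 24-column fpoly row written by -ofp and of the amplitude ratio that -pr thresholds, the following Python sketch parses the example row shown above. The function and field names are hypothetical and not part of PolyScan. The sketch assumes the secondary base is one of A, C, G, or T, and it takes the ratio to be the secondary channel amplitude divided by the primary channel amplitude, which is one reading of the option description rather than a documented formula.

```python
from typing import NamedTuple

class Channel(NamedTuple):
    position: int      # peak position in the trace
    amplitude: float
    sharpness: float
    regularity: float

def parse_fpoly_row(line):
    """Split one 24-column fpoly row into called bases and per-channel peaks."""
    f = line.split()
    if len(f) != 24:
        raise ValueError(f"expected 24 columns, got {len(f)}")
    row = {
        "primary_base": f[0], "primary_pos": int(f[1]),
        "secondary_base": f[2], "secondary_pos": int(f[3]),
        "channels": {},
    }
    # Columns 5-24: four groups of (label, position, amplitude, sharpness, regularity).
    for start in range(4, 24, 5):
        row["channels"][f[start]] = Channel(
            int(f[start + 1]), float(f[start + 2]),
            float(f[start + 3]), float(f[start + 4]),
        )
    return row

def secondary_primary_ratio(row):
    """Assumed definition: secondary amplitude / primary amplitude."""
    primary = row["channels"][row["primary_base"]].amplitude
    secondary = row["channels"][row["secondary_base"]].amplitude
    return secondary / primary if primary > 0 else 0.0

example = ("A 717 G 720 A 717 804.00 0.757839 0.990878 "
           "C 717 0.00 -0.000000 0.522925 "
           "G 720 66.00 0.242392 0.904314 "
           "T 717 4.00 0.370325 0.833227")
row = parse_fpoly_row(example)
ratio = secondary_primary_ratio(row)
print(f"secondary/primary ratio: {ratio:.3f}")
print("meets the default -pr of 0.15:", ratio >= 0.15)
```

Under these assumptions the secondary G peak in the example row (amplitude 66 against a primary amplitude of 804) falls below both the default ratio of 0.15 and the reduced ratio of 0.10.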
2.2 Flags that affect indel detection
Indel Signature Detection:
When there is a frame-shift between the two alleles of a diploid DNA,
one could observe a sequence of consecutively overlapping peaks in the
sequencing trace. We call such sequences of overlapping peaks Indel
Signatures.
Example:
An indel signature of a 5bp AAGTT deletion in both sequencing directions.
- -im <percent identity match between secondary seq and refseq>
- default: 0.85
Indel signatures are detected through multiple subsequence alignment
with the reference sequence (see details in our paper). This option
sets a threshold that requires the subsequence to be similar to the
reference, with some mismatches allowed to account for base-calling
errors. Subsequence alignments with a percent-identity-match smaller
than the specified value are not regarded as credible evidence for a
putative indel signature.
- -qs <length of the secondary seq for indel match>
- default: 20
Adjust the length of the subsequences for the segmented multiple
alignment.
- -mis <minimum indel signature size>
- default: 20
Specify the minimum size for a valid indel signature.
Indel Detection:
PolyScan detects a heterozygous indel by summarizing indel signatures
from multiple traces (in both directions) and automatically computing
a confidence score that reflects the strength of the existing
evidence.
- -indel
- Turn on indel detection
- -mms <n bp>
- discard 1bp het indels downstream of a <n bp> homopolymer,
default: 8
Indel signatures detected downstream of homopolymer tracts of <n bp>
are frequently sequencing artifacts. This option allows users to
specify their preferred <n>.
- -maxindelsize <maximum indel size to detect>
- default: 100 bp
Only report indels smaller than the specified size.
- -indelsource <pos1> <pos2>
- used in conjunction with the -indelgroup option, default: 1 25
Reads with an identical substring between <pos1> and <pos2> in their
read names are grouped together for indel identification. A
clustering algorithm is used to automatically segregate this
user-defined group into smaller subgroups, each containing identical
frame-shifted patterns within a small genomic region (defined by
-indelgroup). Reads remaining in the same subgroup after clustering
are jointly analyzed as a population. We found that grouping reads
together effectively enhances the accuracy of indel detection.
Example usage:
For the following reads, where x and y represent the sequencing direction:
H_sample1_amplicon1.x.gz
H_sample1_amplicon1.y.gz
H_sample1_amplicon2.x.gz
H_sample1_amplicon2.y.gz
H_sample2_amplicon1.x.gz
H_sample2_amplicon1.y.gz
H_sample2_amplicon2.x.gz
H_sample2_amplicon2.y.gz
-indelsource 1 8, group all reads into the same group
-indelsource 11 19, split reads into two groups: amplicon1 and amplicon2
-indelsource 1 9, split reads into two groups: H_sample1 and H_sample2
-indelsource 1 19, group paired reads
-indelsource 1 24, each read in its own group
- -indelgroup <n bp>
- used in conjunction with the -indelsource option,
default: 50bp
Reads from the same indelsource and having indel signatures within
<n bp> are jointly analyzed.
- -indelscore <indel confidence score threshold>
- range [0,100], default: 0
Report only indels with a confidence score greater than the threshold.
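The initial read grouping controlled by -indelsource <pos1> <pos2> can be pictured with a short Python sketch. It treats the positions as 1-based and inclusive, which is the interpretation that reproduces the five example groupings listed for -indelsource; the function name is hypothetical, and the subsequent clustering into subgroups is not modelled.

```python
from collections import defaultdict

def group_by_name_substring(read_names, pos1, pos2):
    """Group reads whose names share the substring at 1-based positions pos1..pos2."""
    groups = defaultdict(list)
    for name in read_names:
        groups[name[pos1 - 1:pos2]].append(name)
    return dict(groups)

# The eight read names from the -indelsource example.
reads = [
    f"H_sample{s}_amplicon{a}.{d}.gz"
    for s in (1, 2) for a in (1, 2) for d in ("x", "y")
]

for pos1, pos2 in [(1, 8), (11, 19), (1, 9), (1, 19), (1, 24)]:
    groups = group_by_name_substring(reads, pos1, pos2)
    print(f"-indelsource {pos1} {pos2}: {len(groups)} group(s) keyed by {sorted(groups)}")
```

The same substring rule applies to the -source option for SNP detection, since that option is described as the SNP counterpart of -indelsource.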
2.3 Flags that affect SNP detection
- -genotype 0
- Output the genotypes at all SNP sites; this is the default.
-genotype -1 turns off SNP detection.
- -gtscore <genotype confidence score threshold>
- range [0,100], default: 0
Report genotypes whose confidence scores are greater than the
threshold.
- -quality <minimum average phred quality for analysis>
- range [10, 30], default: 20
Genotype only positions whose arithmetic average phred quality score
is greater than the threshold. The average is currently estimated from
an 11 base pair window.
- -hh <density of het in a 20bp window>
- default: 0.3
Only detect SNPs in regions where the density of heterozygous genotypes
is smaller than the threshold
- -source <pos1> <pos2>
- default: 1 25, group reads for snp detection
Similar to -indelsource but applied to SNP detection
- -nr <noise reduction factor>
- range [0.0,1.0], default: 0.0
Specify noise reduction factor. This option helps remove the background
noise due to peak spilling and paralog amplification. For details,
please read the sections on horizontal and vertical scans in our paper.
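The windowed quality filter behind -quality can be sketched as follows. This is a minimal illustration, not PolyScan's code: it assumes the 11 base pair window is centered on the position and simply truncated at the ends of the read, neither of which is stated in the option description, and the function names are hypothetical.

```python
def windowed_mean_quality(quals, window=11):
    """Arithmetic mean of phred scores in a window centered on each position.

    Assumption: the window is truncated at the read ends rather than padded.
    """
    half = window // 2
    means = []
    for i in range(len(quals)):
        lo = max(0, i - half)
        hi = min(len(quals), i + half + 1)
        means.append(sum(quals[lo:hi]) / (hi - lo))
    return means

def positions_to_genotype(quals, threshold=20, window=11):
    """Indices whose windowed mean quality is strictly greater than the threshold."""
    means = windowed_mean_quality(quals, window)
    return [i for i, mean in enumerate(means) if mean > threshold]

# A read whose quality rises from poor to good along its length.
quals = [8, 9, 10, 12, 15, 18, 22, 30, 35, 40, 42, 45, 45, 44, 43]
print(positions_to_genotype(quals))                 # default threshold of 20
print(positions_to_genotype(quals, threshold=30))   # stricter setting
```

Raising the threshold shrinks the set of positions considered, which is the trade-off the option exposes.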
All options that take input arguments have pre-selected default values
based on our experience. The users are encouraged to find their
favorite parameter settings in their environment. The default
parameters normally lead to a reasonable but not necessarily optimal
outcome.
Output
The output of PolyScan is in a concise and self-explanatory format. The
<PARAMETER> section records the parameters used in this experiment.
Indels and SNPs are reported in the <CONTIG> sections, each of which
contains a section for <INDEL> and a section for <SNP>. Each
heterozygous indel is reported in a row of 8 columns: reference
position, read position, read name, indel size, indel type, indel
sequence, confidence score, and the average ratio of the heights of the
secondary peaks versus the primary peaks. Homozygous indels are
reported in a similar format, except that the last column represents an
auxiliary score which can be used to further assess the confidence.
Each genotype is reported in a row of 5 or 6 columns: reference
position, read position, read name, genotype, confidence score, and
comments.
Example:
<BEGIN_PARAMETERS>
Global Parameters:
Project dir: ..
Fpoly dir: ../fpoly_dir
Ace file: Gene.c1.ace.begin
Refseq: Gene.c1
Program: polyscan 3.0
-pr, minimum secondary/primary peak ratio: 0.1500
-aq, minimum allele probability: 0.2000
-mpa, minimum peak amplitude: 100
Parameters for Indel analysis:
-hiq, homo indel phred quality cutoff: 35
-him, homo indel local identity match ratio: 0.95
-maxindelsize, maximum indel size: 100
-mis, minimum indel signature size: 20
-qs, indel subsequence size: 20
-im, minimum subsequence identity match: 0.85
-mms, discard 1bp het indels downstream of 8 bp homopolymer
-indelscore, confidence score cutoff for het indel: 0
-indelgroupsize, indel group size: 30
Parameters for SNP analysis:
-quality, minimum avg Phred qual: 25.00
-hh, maximum allowed SNP density: 0.30
-gtscore, score cutoff for genotyping: 0
SNP Filtering:
<END_PARAMETERS>
<BEGIN CONTIG> Gene1.c1-Contig
<BEGIN_INDEL>
74148 150 H_BS-3970tPCR0004869_066a.g1 4 hetDel TCTG 72 0.38
74148 146 H_BS-3975tPCR0004869_066a.g1 4 hetDel TCTG 76 0.52
74151 413 H_BS-4002tPCR0004869_030a.g1 4 hetDel GGCA 57 0.38
74151 140 H_BS-4003tPCR0004869_066a.g1 4 hetDel GGCA 53 0.36
74152 139 H_BS-3948tPCR0004869_066a.g1 3 homoIns CTG 43 0.71
74152 141 H_BS-3958tPCR0004869_066a.g1 4 homoIns TCTG 54 0.72
<END_INDEL>
<BEGIN_SNP>
74060 74060 Gene1.c1 CC
74060 29 H_BS-3921t_030a.b1 CC 99
74060 28 H_BS-3947t_030a.b1 CC 98
74060 31 H_BS-3928t_030a.b1 CC 98
74060 31 H_BS-3929t_030a.b1 CC 99
74060 31 H_BS-3932t_030a.b1 CC 84
74060 30 H_BS-3943t_030a.b1 CC 99
74060 31 H_BS-3945t_030a.b1 CC 99
74060 29 H_BS-3946t_030a.b1 CC 97
74060 31 H_BS-3948t_030a.b1 CC 97
74060 32 H_BS-3953t_030a.b1 CG 99 heterozygous
74060 32 H_BS-3954t_030a.b1 CC 99
74060 31 H_BS-3958t_030a.b1 GG 98 homozygous_rare
74060 31 H_BS-3959t_030a.b1 CC 96
<END_SNP>
<END_CONTIG>
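For downstream processing, the indel and SNP rows can be pulled out of an output file with a small parser. The Python sketch below is an illustration only: it assumes that each section tag and each record sits on its own line, that indel rows have 8 whitespace-separated fields, and that SNP rows have 4 to 6 fields (the reference row in the example carries no score). The function name and dictionary keys are hypothetical.

```python
def parse_polyscan_output(text):
    """Collect indel and SNP records, assuming one tag or record per line."""
    indels, snps = [], []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "<BEGIN_INDEL>":
            section = "indel"
        elif line == "<BEGIN_SNP>":
            section = "snp"
        elif line in ("<END_INDEL>", "<END_SNP>"):
            section = None
        elif section == "indel" and line:
            f = line.split()
            if len(f) != 8:
                raise ValueError(f"unexpected indel row: {line}")
            indels.append({
                "ref_pos": int(f[0]), "read_pos": int(f[1]), "read": f[2],
                "size": int(f[3]), "type": f[4], "sequence": f[5],
                "score": int(f[6]), "last_column": float(f[7]),
            })
        elif section == "snp" and line:
            f = line.split()
            if not 4 <= len(f) <= 6:
                raise ValueError(f"unexpected SNP row: {line}")
            snps.append({
                "ref_pos": int(f[0]), "read_pos": int(f[1]), "read": f[2],
                "genotype": f[3],
                "score": int(f[4]) if len(f) > 4 else None,
                "comment": f[5] if len(f) > 5 else "",
            })
    return indels, snps

sample = """\
<BEGIN_INDEL>
74148 150 H_BS-3970tPCR0004869_066a.g1 4 hetDel TCTG 72 0.38
74152 139 H_BS-3948tPCR0004869_066a.g1 3 homoIns CTG 43 0.71
<END_INDEL>
<BEGIN_SNP>
74060 74060 Gene1.c1 CC
74060 29 H_BS-3921t_030a.b1 CC 99
74060 32 H_BS-3953t_030a.b1 CG 99 heterozygous
<END_SNP>
"""
indels, snps = parse_polyscan_output(sample)
print([i["type"] for i in indels])
print([s["read"] for s in snps if s["comment"] == "heterozygous"])
```

The last indel column is kept under a neutral key because its meaning differs between heterozygous rows (secondary/primary peak height ratio) and homozygous rows (auxiliary score).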