Miropeats discovers regions of sequence similarity among any set of DNA sequences and then presents this similarity information graphically. Sequence similarity searching is a very general tool that forms the basis of many different biological sequence analyses, but it is limited by the verbosity of traditional alignment presentation styles. Miropeats makes conventional DNA sequence comparisons more useful for long stretches of similarity by summarizing extensive, large-scale sequence similarities on a single page of graphics.
The descriptive abilities of Miropeats open research opportunities that would otherwise be impossible, tedious, or difficult. Examples include comparing the repeat structures of entire chromosomes, visualising overlapping sequence fragments in a contig assembly project, and comparing the products of different contig assembly programs. Miropeats was originally written to help contig assembly projects at The Genome Institute, where it was found to be useful in many different roles. The intrinsic inscrutability of a string of 40,000 characters picked from an alphabet of only 4 letters (a typical cosmid assembly project) is made worse because the shotgun sequencing strategy starts with the original contiguous 40 kb DNA sequence split into an 800-piece puzzle. Miropeats helps shotgun assembly, not by solving the puzzle itself, but by helping the researcher gain an overall understanding of the task presented to them. Miropeats can do this because it draws a simple graphic that shows potential joins and cosmid overlaps, and that distinguishes tandem repeats, inverted repeats, oligo repeats and palindromes from each other.
Miropeats can show all repeated DNA sequence segments (the default), or only those repeats with both copies on a single sequence, or only those with the two copies on different sequences. The program also has an adjustable threshold that lets the user choose what length of DNA sequence similarity should be considered significant and worth displaying. This facility allows Miropeats to be used for analysing different features in sequences varying from less than a Kbase to more than a Mbase. If the picture is too complex, raise the threshold; if the picture is not displaying repeats of interest, lower it.
Miropeats itself is just a UNIX C-shell script. All the DNA comparisons are done by calls to another program called ICAass, which is written in ANSI C. Miropeats only has to parse out the position and quality of any matching DNA segments and convert those above the threshold value into PostScript graphics. The output from Miropeats is always a PostScript graphic file unless no repeats were found. Miropeats was written and tested on Solaris 2.3, so it may need altering slightly to run on other UNIX versions. Typical variations between UNIX flavours are print commands (lp|lpr) and 'sed' syntax.
If you need to make a new icaass executable, download the icatools package and type 'make miropeats' or just 'make'. If you type 'make' with no options, you create the lightly optimised variants of the entire ICAtools package, which could be useful anyway. Once 'icaass' and the Miropeats script are available on the user's path, the installation is complete. The program is very flexible about DNA sequence formats and is not demanding of memory or computer cycles. As currently configured, ICAass will work well with any number of sequences that are shorter than 4Mbases.
Miropeats needs to be presented with DNA sequences in one of its recognized formats: EMBL, GenBank, FASTA, Staden, or plain format. The first four listed are complex formats, and any one file may hold any number of sequences of the same format. All the sequence data in plain (unformatted) files is assumed to come from a single sequence. Files suitable for analysis include consensus files from databases (e.g. Xbap - use FASTA format), from local databases (e.g. ACEdb - use dump sequence) and from public databases (e.g. GenBank - use NCBI's Web server).
miropeats consensus_filename - To use the default options on a file of consensus sequences.
miropeats myfile1 myfile2 myfile3 - To use the default options on a set of files together.
miropeats -s 200 chromosome1 - To use a higher threshold of significance on a long sequence.
miropeats -s 50 -onlyintra chromosome1 chromosome2 - A raised threshold with only internal repeats (both copies on a single sequence) being printed.
miropeats -s 150 -onlyinter cosmid1 cosmid2 cosmid3 - To visualise the overlap between three cosmids.
The threshold score is simply defined: the number of matching bases minus the number of mismatching bases. The default threshold score is 30. Changing the threshold score ("-s integer") can be very useful when the default parameters produce a diagram that is too complicated to understand. A new assembly project with many potential joins is always going to be more complicated than a single finished cosmid, so don't be alarmed if your new sequence assembly project looks difficult; it's going to be simpler tomorrow.
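As a rough illustration of that definition, the Python sketch below scores two aligned segments as matches minus mismatches and compares the result with the default of 30. It is not taken from ICAass: the function names are hypothetical, the segments are assumed to be ungapped and of equal length, and treating a score equal to the threshold as displayable is also an assumption.

```python
DEFAULT_THRESHOLD = 30  # default threshold score stated in the text


def threshold_score(seq_a, seq_b):
    """Score two equal-length, ungapped segments: matches minus mismatches.

    Assumption: gaps are not modelled; how ICAass treats them is not
    described here.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("segments must be the same length")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    mismatches = len(seq_a) - matches
    return matches - mismatches


def is_displayed(seq_a, seq_b, threshold=DEFAULT_THRESHOLD):
    """Assumption: a score equal to the threshold counts as displayable."""
    return threshold_score(seq_a, seq_b) >= threshold
```

Under these assumptions, a 40-base match containing 3 mismatches scores 37 - 3 = 34 and would be drawn at the default threshold, but not after raising the threshold with "-s 50".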
If you are only interested in a certain portion of your sequences and want a graphic marking repeated sequence in just those regions, you have to create new subsequence files containing just those selected regions and then run the program again. Miropeats can cope with hundreds of files listed after the command and threshold options.
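Making a subsequence file can be scripted. The following is a minimal sketch, not part of Miropeats: it assumes the input is a FASTA file, takes the region from the first record only, and uses 1-based inclusive coordinates. The function names and the naming scheme for the new record are hypothetical.

```python
def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_region(in_path, out_path, start, end, width=60):
    """Write bases start..end (1-based, inclusive) of the first record.

    Returns the number of bases written to out_path.
    """
    header, sequence = read_fasta(in_path)[0]
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    region = sequence[start - 1:end]
    fields = header.split()
    name = fields[0] if fields else "sequence"
    with open(out_path, "w") as out:
        # Record the coordinates in the new name so the plot stays traceable.
        out.write(f">{name}_{start}_{end}\n")
        for i in range(0, len(region), width):
            out.write(region[i:i + width] + "\n")
    return len(region)
```

The new files can then be listed on the miropeats command line in place of the full-length ones. Note that positions in the resulting graphic will be relative to the start of each extracted region.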

Printrepeats is not very sophisticated about the positioning of sequence fragments on the page and this can lead to interesting information being hidden behind a jumble of crossing lines. One day I would like to write an interactive version of Printrepeats that would display graphics and allow the user to choose which fragments should be drawn and where they should be placed.
When working with human DNA sequences, or any DNA containing many repeats, there is a chance that interesting matches will be lost amongst the many Alus etc. that are not usually of much consequence. It would be nice if there were an option to screen out the interconnect lines from certain classes of well-known repeats.
Sequences should not be longer than 4Mbases.
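A quick check before a long run can catch oversized sequences. The sketch below is only an illustration: it assumes FASTA input, reads "4Mbases" as 4,000,000 bases (the exact figure used by ICAass is not stated here), and uses hypothetical function names.

```python
MAX_BASES = 4_000_000  # assumption: "4Mbases" taken as 4,000,000 bases


def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def over_limit(lines, limit=MAX_BASES):
    """Return the names of records longer than the limit."""
    return [name for name, length in fasta_lengths(lines) if length > limit]
```

Passing an open file handle as `lines` returns the names of any records that exceed the limit; an empty list means every sequence in the file is within it.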
Permission is granted to any individual or institution to use, copy, or redistribute this software so long as it is not sold for profit, provided that this notice and the original copyright notices are retained. Jeremy Parsons makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty. It's academic software, OK!
The latest version of the code should always be available for free from The Littlest Bioinformatics consultancy.
Please send me an email at the address below so that I can keep you informed of bug fixes, and to let me know what improvements you would like. Please include the word Miropeats somewhere in the subject line.