/* CONTIG ASSEMBLY PROGRAM (CAP) copyright (c) 1991 Xiaoqiu Huang The distribution of the program is granted provided no charge is made and the copyright notice is included. Proper attribution of the author as the source of the software would be appreciated: "A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps" (submitted to Genomics, 1991) Xiaoqiu Huang Department of Computer Science Michigan Technological University Houghton, MI 49931 E-mail: huang@cs.mtu.edu ------ call format now is: CAP2 file_of_fragments output_file MIN_OVERLAP PERCENT_MATCH where: file_of_fragments -- input data, in pearson/fasta format output_file -- output data, in fasta format MIN_OVERLAP; -- Minimum length of any overlap PERCENT_MATCH; -- Minimum identity percentage of any overlap (give percent as whole number relative to 100) ------- The CAP program uses a dynamic programming algorithm to compute the maximal-scoring overlapping alignment between two fragments. Fragments in random orientations are assembled into contigs by a greedy approach in order of the overlap scores. CAP is efficient in computer memory: a large number of arbitrarily long fragments can be assembled. The time requirement is acceptable; for example, CAP took 4 hours to assemble 1015 fragments of a total of 252 kb nucleotides on a Sun SPARCstation SLC. The program is written in C and runs on Sun workstations. Below is a description of the parameters in the #define section of CAP. Two specially chosen sets of substitution scores and indel penalties are used by the dynamic programming algorithm: heavy set for regions of low sequencing error rates and light set for fragment ends of high sequencing error rates. (Use integers only.) Heavy set: Light set: MATCH = 2 MATCH = 2 MISMAT = -6 LTMISM = -3 EXTEND = 4 LTEXTEN = 2 In the initial assembly, any overlap must be of length at least OVERLEN, and any overlap/containment must be of identity percentage at least PERCENT. After the initial assembly, the program attempts to join contigs together using weak overlaps. Two contigs are merged if the score of the overlapping alignment is at least CUTOFF. The value for CUTOFF is chosen according to the value for MATCH. DELTA is a parameter in necessary conditions for overlap/containment. Those conditions are used to quickly reject pairs of fragments that could not possibly have an overlap/containment relationship. The dynamic programming algorithm is only applied to pairs of fragments that pass the screening. A large value for DELTA means stringent conditions, where the value for DELTA is a real number at least 8.0. POS5 and POS3 are fragment positions such that the 5' end between base 1 and base POS5, and the 3' end after base POS3 are of high sequencing error rates, say more than 5%. For mismatches and indels occurring in the two ends, light penalties are used. A file of input fragments looks like: >G019uabh ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC >G028uaah CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT TCCCCATCCCATCAGTCT >G022uabh TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT GATTGAT >G023uabh AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC >G006uaah ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC A string after ">" is the name of the following fragment. Only the five upper-case letters A, C, G, T and N are allowed to appear in fragment data. No other characters are allowed. A common mistake is the use of lower case letters in a fragment. To run the program, type a command of form cap file_of_fragments The output goes to the terminal screen. So redirection of the output into a file is necessary. The output consists of three parts: overview of contigs at fragment level, detailed display of contigs at nucleotide level, and consensus sequences. '+' = direct orientation; '-' = reverse complement The output of CAP on the sample input data looks like: #Contig 1 #G022uabh+(0) TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG ATTGATTGATTGATTGAT #G028uaah+(145) CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA A-TTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTAC AGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATAC ATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTC TATAGCCTCCTTCCCCATCCC-ATCAGTCT #G019uabh+(120) ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC #G023uabh-(426) GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGCAAAAAC GAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAG ATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTT TTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAG CATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAAAACAGT TTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT #G006uaah-(496) GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAGATCAATTCTG AGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTTTTTCCTATTT GTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAGCATGACATCT TAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAAAACAGTTTATTTTATG T Slight modifications by S. Smith on Mon Feb 17 10:18:34 EST 1992. These changes allow for command line arguements for several of the hard coded parameters, as well as a slight modification to the output routine to support GDE format. Changes are commented as: Mod by S.S. more mods for use with Macintosh SeqApp program, d.g. gilbert, June 93 */