Tuesday, September 4, 2012

Get duplicate fasta sequences

ALLPATHS-LG had the DEDUP option turned off by default prior to release 42726. So this bit of code identifies exact duplicates either on the same or negative strand. Many of my runs had 30 to 60 Kb of sequence that was duplicated with almost equal amounts on both strands.

 #!/usr/bin/perl  
 use warnings;  
 # Input parameters  
 open FASTA, $ARGV[0] or die $!;  
 my $seqst_temp="";  
 my $seqs = {GENENAME =>my $genename,LEN =>my $qcom};  
 while($line = <FASTA>){  
 if($line=~ /^>/){  
 if($header){  
 if(exists $seqs{$seqst_temp}{GENENAME}){print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n";}  
 $rseqst_temp = $seqst_temp;  
 $rseqst_temp=revcomp($rseqst_temp);  
 if(exists $seqs{$rseqst_temp}{GENENAME}){print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n";}  
 $seqs{$seqst_temp}{GENENAME}=$header;  
 $seqs{$seqst_temp}{LEN}=length $seqst_temp;  
 }  
 chomp $line;  
 $header="";  
 $header=$line;  
 $seqst_temp="";  
 $rseqst_temp="";  
 }  
 else{  
 $line =~ s/[\n\t\f\r_0-9\s]//g;  
 $seqst_temp .= $line;  
 }  
 }#end of while loop  
 if($header){  
 if(exists $seqs{$seqst_temp}{GENENAME}){print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n";}  
 $rseqst_temp = $seqst_temp;  
 $rseqst_temp=revcomp($rseqst_temp);  
 if(exists $seqs{$rseqst_temp}{GENENAME}){print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n";}  
 $seqs{$seqst_temp}{GENENAME}=$header;  
 $seqs{$seqst_temp}{LEN}=length $seqst_temp;  
 }  
 close FASTA;  
 sub revcomp{  
     my $input = shift;  
     my $revcomp = reverse($input);  
     $revcomp =~ tr/ACGTacgt/TGCAtgca/;  
     return $revcomp;  
 }  

Although this does the job, its ugly in having the same bit of code used twice, once within the loop and once after the while loop. May be the end of file can be handled more elegantly.

No comments: