Monday, May 20, 2013

Combining SNP data tables with perl

Continuing with our previous exploits Analyzing in vitro toxicity and expression data, its time to combine SNP data tables using perl.

#!/usr/bin/perl

open CANFAM2, $ARGV[0] or die $!;
open CANFAM3, $ARGV[1] or die $!;

%can2=();
%can3=();

while($line = <CANFAM2>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];

if(exists $can2{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can2{$key}=$value;}

}#end of file while loop

while($line = <CANFAM3>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];


if(exists $can3{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can3{$key}=$value;}

}#end of file while loop


print "SNP_ID\tbases\tchr_canfam2\tpos_canfam2\tchr_canfam3\tpos_canfam3\tdistance_between_versions\n";
$transitions=0;
$transversions=0;

foreach $mykey(sort keys %can2){
@can2tabs=split(/\_/,$can2{$mykey});

if($can2tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
elsif($can2tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}

    if(exists $can3{$mykey}){
        @can3tabs=split(/\_/,$can3{$mykey});
            if($can2tabs[0]=~m/$can3tabs[0]/){$distance=abs($can2tabs[1]-$can3tabs[1]);}else{$distance="NA";}
        print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\t$can3tabs[0]\t$can3tabs[1]\t$distance\n";
    }
    else {
    print "Error:SNP_ID $mykey missing in canfam3.\n";
    print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\tNA\tNA\tNA\n";
    }
}

foreach $mykey(sort keys %can3){
    if(!(exists $can2{$mykey})){
    print "Error:SNP_ID $mykey missing in canfam2.\n";
    @can3tabs=split(/\_/,$can3{$mykey});
    print "$mykey\t$can3tabs[2]\tNA\tNA\t$can3tabs[0]\t$can3tabs[1]\tNA\n";

    if($can3tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
    elsif($can3tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}
    }
}
$titv=$transitions/$transversions;
print "Info: $transitions Transistions\n";
print "Info: $transversions Transversions\n";
print "Info: $titv Ti/Tv ratio\n";

close CANFAM2;
close CANFAM3;


To do something bonus, lets calculate transitions and transversions to infer the Ti/Tv ratio.


Info: 1852527 Transistions
Info: 504318 Transversions
Info: 3.67333111251234 Ti/Tv ratio

Apart from this, 39,352 SNP's were found in only one of the two versions of the assembly. These could be attributed to gaps in the assembly that were filled at a later stage by Illumina sequencing. 

No comments: