IzziD code logo

Topics

Home

Bioinfo

Web

Misc

About

Other IzziDs

IzziDassorted

IzziDtravel

IzziDwetlab

IzziD

How to read a fasta file in Perl

Article created: Oct 31, 2011
Article by: Jeremiah Faith

Writing a DNA or protein sequence to a FASTA file in Perl is pretty easy and there are many ways to do so. Reading a FASTA file however, is more difficult because the sequence spans several lines and you don’t know you’ve reached the end of a sequence until you’ve reached the beginning of the next sequence (at which point you might have gone farther than you wanted into the file). One way around this is to read the file until you see the next sequence and then go back a line on the filehandle, but that is a pretty sloppy way to do things. Another method would be to read the entire contents of the file into a hash or an array and then process them one-at-a-time (see below).

sub read_fasta_file {
   my %seqs;
   my $header;
   my $seq;
   open (IN, $fasta_file) or die "can't open $fasta_file: $!\n";
   while (<IN>) {
      if (/>/) {
         if ($seq) {    
            $seqs{$header} = $seq;
         }              

         $header =~ s/^>//; # remove ">"
         $header =~ s/\s+$//; # remove trailing whitespace

         $seq = ""; # clear out old sequence
      }         
      else {    
         s/\s+//g; # remove whitespace
         $seq .= $_; # add sequence
      }         
   }    
   close IN; # finished with file

   if ($seq) { # handle last sequence
      $seqs{$header} = $seq;
   }    

   return \%seqs;
}

However, the above is not a viable solution when you have a huge file where you need to process one sequence at a time. What we’d really like to do is to focus a while loop NOT on the file itself but on the fasta sequences. The following solution does just that. After a little bit of setup, each round of the while loop calls the function read_fasta_sequence() which puts a new FASTA header and sequence into the %sequence_data hash. This method reads each sequence one-at-a-time, so you can process sequences on-the-fly and not store unnecessary information in memory. The function itself is a little complicated, because it has to keep track of a few things to handle the first and last sequences in the file, but the usage (which is the important part) is straight forwards – just call while(read_fasta_sequence($fh,\%sequence_data)) and process the sequences inside the while loop.

my $fasta_file=shift;

my $fh;
open($fh, $fasta_file) or die "can't open $fasta_file: $!\n";

my %sequence_data;
while (read_fasta_sequence($fh, \%sequence_data)) {
   print ">$sequence_data{header}\n$sequence_data{seq}\n\n";
}

sub read_fasta_sequence {
   my ($fh, $seq_info) = @_;

   $seq_info->{seq} = undef; # clear out previous sequence

   # put the header into place
   $seq_info->{header} = $seq_info->{next_header} if $seq_info->{next_header};

   my $file_not_empty = 0; 
   while (<$fh>) {
      $file_not_empty = 1;
      next if /^\s*$/;  # skip blank lines
      chomp;    

      if (/^>/) { # fasta header line
         my $h = $_;    
         $h =~ s/^>//;  
         if ($seq_info->{header}) {
            $seq_info->{next_header} = $h;
            return $seq_info;   
         }              
         else { # first time through only
            $seq_info->{header} = $h;
         }              
      }         
      else {    
         s/\s+//;  # remove any white space
         $seq_info->{seq} .= $_;
      }         
   }    

   if ($file_not_empty) {
      return $seq_info;
   }    
   else {
      # clean everything up
      $seq_info->{header} = $seq_info->{seq} = $seq_info->{next_header} = undef;

      return;   
   }    
}