How to write a fasta file in Perl

Article created: Oct 13, 2011
Article by: Jeremiah Faith

Writing a DNA or protein sequence to a FASTA file in Perl is pretty easy and there are many ways to do so. Performance is usually not an issue for this task, so it’s best to just use something simple. I use the following code.

sub to_fasta {
        my ($seqName, $seq, $len) = @_;

        # default to 80 characters of sequence per line
        $len = 80 unless $len;

        my $formatted_seq = ">$seqName\n";
        while (my $chunk = substr($seq, 0, $len, "")) {
                $formatted_seq .= "$chunk\n";
        }

        return $formatted_seq;
}

The function requires two arguments: $seqName, which becomes the FASTA header line, and $seq, which is the DNA or protein sequence. $len is an optional argument for the number of characters to put on each line of sequence; it defaults to 80. The core of the function is the while loop with the substr() function. substr() takes a string, an offset, a length, and an optional replacement string. Here, we repeatedly take the first $len characters from the sequence (i.e. the offset is zero) and replace them with nothing (i.e. we remove them from $seq). If the sequence contains fewer than $len characters, substr() returns all of the remaining characters. Finally, when no sequence remains, it returns an empty string, which is false in Perl, and the while loop ends.
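To see the chunking on its own, here is a minimal sketch of the four-argument substr() call applied to a short, made-up sequence. The sequence, the chunk size of 4, and the @chunks array are illustrative assumptions, not part of the function above.

```perl
use strict;
use warnings;

# Hypothetical 10-character sequence, consumed 4 characters at a time.
my $seq = "ACGTACGTAC";
my @chunks;

# Each call returns the first 4 characters and removes them from $seq.
while (my $chunk = substr($seq, 0, 4, "")) {
    push @chunks, $chunk;
}

# The last chunk is shorter because only 2 characters remained.
print join(",", @chunks), "\n";   # prints "ACGT,ACGT,AC"
```

Because the loop tests whether the chunk is true, a chunk consisting only of the single character "0" would also end it early. That does not come up with DNA or protein letters, but it is worth knowing if you reuse the pattern on other strings.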

Caveats for very large sequences

The above function is not ideal for outputting very large sequences, since it builds and returns the entire formatted sequence in memory. To be more memory efficient for longer sequences, you can print the sequence as you go rather than storing it.

sub to_fasta {
        my ($seqName, $seq, $len) = @_;

        # default to 80 characters of sequence per line
        $len = 80 unless $len;

        print ">$seqName\n";
        while (my $chunk = substr($seq, 0, $len, "")) {
                print "$chunk\n";
        }
}
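
The printing version writes to standard output. To send the record to an actual file, one option is to pass a filehandle into the function. The sketch below assumes a hypothetical variant named to_fasta_fh and uses a temporary file from the core File::Temp module as a stand-in for a real output path; the sequence name and bases are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical variant of to_fasta() that prints to a caller-supplied filehandle.
sub to_fasta_fh {
    my ($fh, $seqName, $seq, $len) = @_;

    # default to 80 characters of sequence per line
    $len = 80 unless $len;

    print {$fh} ">$seqName\n";
    while (my $chunk = substr($seq, 0, $len, "")) {
        print {$fh} "$chunk\n";
    }
}

# Write a made-up record to a temporary file, 4 characters per line.
my ($out, $path) = tempfile(SUFFIX => '.fasta', UNLINK => 1);
to_fasta_fh($out, "example_seq", "ACGTACGTAC", 4);
close($out) or die "Could not close $path: $!";

# Read the file back to show what was written.
open(my $in, '<', $path) or die "Could not open $path: $!";
my $written = do { local $/; <$in> };
close($in);
print $written;
```

For a real output file you would replace the tempfile() call with something like open(my $out, '>', 'sequences.fasta') or die ..., where the filename is whatever you choose.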