SeqIO.parse on a fasta.gz

Here is a solution if you want to handle both regular text and gzipped files:

import gzip
from mimetypes import guess_type
from functools import partial
from Bio import SeqIO

input_file = 'input_file.fa.gz'

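# guess_type returns (type, encoding); encoding is 'gzip' for a .gz extension
encoding = guess_type(input_file)[1]
# gzipped input is opened in text mode ('rt'); anything else uses the built-in open
_open = partial(gzip.open, mode='rt') if encoding == 'gzip' else open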

with _open(input_file) as f:
    for record in SeqIO.parse(f, 'fasta'):
        print(record)

Note: this relies on the file having the correct file extension, which I think is reasonable nearly all of the time (and the errors are obvious and explicit if this assumption is not met). However, it is also possible to check the file content itself rather than relying on this assumption.
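For the content check mentioned in the note, one possible approach is to look at the first two bytes of the file: gzip streams start with the magic bytes 0x1f 0x8b. This is only a minimal sketch. The helper names `is_gzipped` and `open_maybe_gzipped` are made up for illustration and are not part of Biopython or the standard library.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def open_maybe_gzipped(path):
    """Open `path` in text mode, transparently decompressing gzip input."""
    if is_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "r")
```

The returned handle can be passed to SeqIO.parse(handle, 'fasta') in the same way as `f` in the snippet above.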

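Are you using Python 3?

If so, changing the mode from "r" to "rt" could solve your problem: gzip.open treats "r" as binary mode, while SeqIO.parse needs a text handle for FASTA.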

import gzip
from Bio import SeqIO

with gzip.open("practicezip.fasta.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)
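To see why the mode matters, here is a small sketch that does not need Biopython. It writes a throwaway gzipped FASTA file (the file name and sequence are invented for the demo) and reads the first line back in both modes.

```python
import gzip
import os
import tempfile

# Write a tiny gzipped FASTA file into a temporary directory (demo data only).
path = os.path.join(tempfile.mkdtemp(), "demo.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(">seq1\nACGT\n")

# Default mode is binary ("rb"), so lines come back as bytes.
with gzip.open(path) as handle:
    binary_line = handle.readline()

# "rt" wraps the stream in a text layer, so lines come back as str.
with gzip.open(path, "rt") as handle:
    text_line = handle.readline()

print(type(binary_line).__name__, type(text_line).__name__)
```

The FASTA parser works on str lines, which is why the handle has to be opened with "rt".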

@klim's answer is good. However, in some cases you don't want to iterate but just select a single entry. In such cases, use the following code:

import pyfastx
fa = pyfastx.Fasta('ATEST.fasta.gz')
s1 = fa['KF530110.1']
fa_sequence = s1.seq

pyfastx creates an additional file: an index of the FASTA entries. With that index in place, lookups are really fast.
