Sequence File Formats in Bioinformatics
In bioinformatics, different file formats are used to store and exchange
sequence data (such as DNA, RNA, or protein sequences). Each format has its own
structure and purpose. The most common sequence file formats are described below.
1. Flat File Format
A flat file is a basic, text-based way of storing sequences along with their
annotations. These files are readable without special software, which makes
them accessible for simple data storage. Flat files may be structured using
delimiters (such as tabs or commas) or left unstructured.
Features of Flat File Format:
• Can contain sequences of nucleotides or proteins.
• Annotations or metadata can be included in a simple or semi-structured format.
• Can be opened in any text editor or software that reads plain text.
Use of Flat File Format:
The flat file format is mainly used in bioinformatics databases to store
sequence data and the associated annotations in a human-readable form.
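As a minimal sketch of reading a delimiter-structured flat file, the example below parses tab-separated records with Python's standard csv module. The three-column layout (identifier, annotation, sequence) and the sample records are assumptions made for illustration only; real flat files define their own layouts.

```python
import csv
import io

# Hypothetical tab-delimited flat file: identifier, annotation, sequence.
FLAT_TEXT = (
    "seq1\texample DNA fragment\tATCGATCG\n"
    "seq2\texample protein fragment\tMAASEFK\n"
)


def read_flat_records(handle):
    """Return one dict per non-empty tab-delimited line."""
    reader = csv.reader(handle, delimiter="\t")
    return [
        {"id": row[0], "annotation": row[1], "sequence": row[2]}
        for row in reader
        if row  # skip blank lines
    ]


records = read_flat_records(io.StringIO(FLAT_TEXT))
for record in records:
    print(record["id"], record["sequence"])
```

Switching the delimiter argument to "," would handle a comma-separated file in the same way.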
2. FASTA Format
FASTA is a widely used format for representing nucleotide (DNA/RNA) and
protein sequences in bioinformatics. It is simple and easily parsed by
bioinformatics tools.
Structure of FASTA Format:
• The first line starts with a ">" symbol followed by a header (sequence identifier, description, etc.).
• The subsequent lines contain the nucleotide or amino acid sequence in plain text.
Example of FASTA Format:
>sequence_1
ATCGATCGATCG
Use of FASTA Format:
FASTA is a standard input format for many bioinformatics applications, such as
sequence alignment, BLAST searching, and sequence database submissions.
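The FASTA structure described above can be read with a few lines of code. The following is a minimal sketch, not a full-featured parser: it assumes every record begins with a ">" header line and that a sequence may span several lines. The function name parse_fasta is chosen here for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = ">sequence_1\nATCGATCGATCG\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))  # prints: sequence_1 12
```

Sequence lines are joined without separators, so a sequence wrapped over many lines comes back as a single string.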
3. GCG (Genetics Computer Group) Format
The GCG format was developed for the GCG software suite and includes
additional data, such as a checksum, to verify the integrity of the sequence.
It is an older format but historically important.
Structure of GCG:
• Contains a header with annotations.
• The sequence follows, often accompanied by a checksum for validation.
Use of GCG format:
The format was historically used with the GCG software suite. It is now largely
obsolete but may still be encountered in older data sets or archives.
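To illustrate how a checksum can guard a sequence against accidental changes, here is a minimal sketch of the scheme commonly attributed to GCG: each character's uppercase ASCII code is multiplied by a weight that cycles from 1 to 57, and the total is taken modulo 10000. The text above does not spell out the algorithm, so treat these details as an assumption and check them against the tool you are working with.

```python
def gcg_checksum(sequence):
    """Weighted character sum modulo 10000 (weights cycle 1..57).

    Assumption: this mirrors the checksum commonly described for GCG files.
    """
    total = 0
    for position, char in enumerate(sequence):
        weight = position % 57 + 1
        total += weight * ord(char.upper())
    return total % 10000


print(gcg_checksum("ACGT"))  # 1*65 + 2*67 + 3*71 + 4*84 = 748
```

Because the weight depends on position, swapping two different residues changes the result, which a plain character count would not detect.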
4. EMBL Format
The EMBL format is used by the European Molecular Biology Laboratory to store
nucleotide sequences and their annotations. It includes structured fields for
various types of data, such as gene features, references, and the sequence itself.
Structure of EMBL Format:
• Starts with an identifier (ID) line.
• Several other lines begin with two-letter codes for metadata (e.g., DE for description, KW for keywords).
• The sequence itself is introduced by the code "SQ".
Example of EMBL Format:
ID SCU49845; SV 1; linear; mRNA; STD; HUM; 5028 BP.
SQ Sequence 5028 BP; 1721 A; 1081 C; 1329 G; 897 T; 0 other;
agctacggtcagcgcccaattgcgcgcaa...
Use of EMBL Format:
Well suited to storing nucleotide sequences with extensive annotations.
Widely used in nucleotide databases such as the European Nucleotide Archive
(ENA) hosted at EMBL-EBI.
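The line-code layout described above lends itself to a simple reader. The sketch below groups lines by their leading two-letter code and collects the residues that follow the SQ line up to the "//" terminator. The toy entry (TOY0001) is hypothetical, and the code assumes a single entry per text block; real EMBL entries carry many more line types.

```python
def parse_embl(text):
    """Return (fields, sequence) for a single EMBL-style entry."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        if in_sequence:
            # Keep letters only: drops spacing and position counters.
            sequence_parts.append("".join(c for c in line if c.isalpha()))
            continue
        code, _, value = line.partition(" ")
        if not code:
            continue  # blank or unexpected indented line
        fields.setdefault(code, []).append(value.strip())
        if code == "SQ":
            in_sequence = True
    return fields, "".join(sequence_parts)


# Hypothetical toy entry, for illustration only.
toy_entry = """ID   TOY0001; SV 1; linear; mRNA; STD; HUM; 20 BP.
DE   Hypothetical toy entry.
SQ   Sequence 20 BP; 5 A; 7 C; 5 G; 3 T; 0 other;
     agctacggtc agcgcccaat        20
//
"""

fields, sequence = parse_embl(toy_entry)
print(fields["DE"][0], len(sequence))
```

Storing each code's values in a list lets repeated line types (for example several DE lines) be kept in order.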
5. Clustal Format
The Clustal format is used for multiple sequence alignments, especially when
comparing DNA, RNA, or protein sequences across different organisms or genes.
It presents the aligned sequences with conserved regions marked for easy analysis.
Structure of Clustal Format:
• Sequences are aligned and shown across several lines.
• Asterisks or other symbols below the sequences highlight conserved residues.
Example of Clustal Format:
seq1 ATCGT---GAC
seq2 ATC-TAGG-A-
Use of Clustal Format:
Widely used to display and analyze multiple sequence alignments, often
generated by the Clustal software.
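The conservation line under an alignment can be derived column by column. The following is a minimal sketch that marks a column with "*" only when every sequence has the same non-gap residue there. Clustal output can also use other symbols for partial similarity, which this sketch does not attempt; the function name conservation_line is an illustrative choice.

```python
def conservation_line(aligned):
    """Return '*' under fully conserved columns and ' ' elsewhere."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        conserved = first != "-" and all(res == first for res in column)
        marks.append("*" if conserved else " ")
    return "".join(marks)


alignment = {"seq1": "ATCGT---GAC", "seq2": "ATC-TAGG-A-"}
for name, seq in alignment.items():
    print(f"{name:<6}{seq}")
# Pad the marker line so it lines up under the sequences.
print(f"{'':<6}{conservation_line(list(alignment.values()))}")
```

For the two sequences shown in the example above, the marked columns are positions 1 to 3, 5, and 10.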
6. Phylip Format
The Phylip format is designed for phylogenetic analysis. It stores nucleotide
or protein sequences in a compact form suitable for input into phylogenetic
inference programs.
Structure of Phylip Format:
• The first line provides the number of sequences and their length.
• Each subsequent line contains a sequence name followed by the actual sequence.
Example of Phylip Format:
Alpha ACGTACGTACGT...
Beta TACG--TACGT...
Use of Phylip Format:
Used mainly in phylogenetic tools for evolutionary analysis of DNA and
protein sequences.
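The header-plus-records layout can be produced from an alignment held in memory. The sketch below writes the count and length on the first line, then one line per sequence with the name padded to ten characters, which is the convention of the classic strict Phylip layout (relaxed variants differ). The toy sequences are hypothetical and are not the truncated ones from the example above.

```python
def to_phylip(alignment):
    """Format {name: aligned_sequence} as sequential Phylip text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    (length,) = lengths
    lines = [f"{len(alignment)} {length}"]
    for name, seq in alignment.items():
        # Names are truncated or padded to exactly ten characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)


# Hypothetical toy alignment, for illustration only.
toy = {"Alpha": "ACGTACGTACGT", "Beta": "TACG--TACGTA"}
print(to_phylip(toy))
```

Checking that all sequences share one length before writing catches unaligned input early, since the header can state only a single length.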
7. Swiss-Prot Format
The Swiss-Prot format is used by the Swiss-Prot protein database and is
designed for storing protein sequences with a strong focus on high-quality
annotations. Each protein entry includes multiple structured fields that
provide detailed metadata.
Structure of Swiss-Prot Format:
• Entries contain fields such as ID (identifier), AC (accession number), and DE (description); the actual sequence follows the "SQ" line.
Example of Swiss-Prot Format:
ID MY_PROTEIN STANDARD; PRT; 450 AA.
SQ SEQUENCE 450 AA; 51500 MW; ...
MAASEFK...
Use of Swiss-Prot Format:
Swiss-Prot format is mainly used for storing protein sequences in a
well-annotated and validated form. It is favored in protein-related research
databases.
Format | Used For | Special Feature |
Flat File | General-purpose storage | Simple text format |
FASTA | Nucleotide and protein sequences | Widely used in sequence alignment |
GCG | Nucleotide and protein sequences | Obsolete, includes checksums |
EMBL | Nucleotide sequences | Rich annotation support |
Several sequence file formats are in use, including flat file, FASTA, GCG,
EMBL, Clustal, Phylip, and Swiss-Prot. These formats are essential in
bioinformatics because they enable the exchange, storage, and analysis of
biological sequence data across various platforms and tools.