RNAlib-2.2.9
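The ClustalW format is a relatively simple text file containing a single multiple sequence alignment of DNA, RNA, or protein sequences. It was first used as an output format for the clustalw programs, but nowadays it may also be generated by various other sequence alignment tools. The specification is straightforward:
The first line starts with the words CLUSTAL W or CLUSTALW.
The header is followed by at least one empty line.
Finally, one or more blocks of sequence data follow, where each block is separated from the next by at least one empty line.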
Each line in a block of sequence data consists of the sequence name followed by the sequence symbols, separated by at least one whitespace character. Usually, the length of a sequence in one block does not exceed 60 symbols. Optionally, an additional whitespace-separated cumulative residue count may follow the sequence symbols. Optionally, a block may be followed by a line depicting the degree of conservation of the respective alignment columns.
Here is an example alignment in ClustalW format:
CLUSTAL W (1.83) multiple sequence alignment

AL031296.1/85969-86120      CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
AANU01225121.1/438-603      CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
AAWR02037329.1/29294-29150  ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU

AL031296.1/85969-86120      UCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603      UCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150  GCUAAUUAGUUGUGAGGACCAACU
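The block structure described above can be read with a few lines of code. The following is a minimal sketch, not the parser RNAlib itself uses: the function name parse_clustal is hypothetical, and it assumes that the header starts with CLUSTAL W or CLUSTALW and that conservation lines begin with whitespace.

```python
def parse_clustal(text):
    """Parse a ClustalW-formatted alignment into {name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(("CLUSTAL W", "CLUSTALW")):
        raise ValueError("missing CLUSTAL W / CLUSTALW header")
    alignment = {}
    for line in lines[1:]:
        # Skip empty separator lines and conservation lines.
        # Assumption: conservation lines start with whitespace.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], if present, is the optional cumulative residue count.
        name, symbols = fields[0], fields[1]
        # Concatenate the pieces of each sequence across blocks.
        alignment[name] = alignment.get(name, "") + symbols
    return alignment
```

Applied to the example alignment, this would yield three entries, each holding the full aligned sequence joined across both blocks.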
Here is an example alignment in Stockholm 1.0 format:
# STOCKHOLM 1.0
#=GF AC RF01293
#=GF ID ACA59
#=GF DE Small nucleolar RNA ACA59
#=GF AU Wilkinson A
#=GF SE Predicted; WAR; Wilkinson A
#=GF SS Predicted; WAR; Wilkinson A
#=GF GA 43.00
#=GF TC 44.90
#=GF NC 40.30
#=GF TP Gene; snRNA; snoRNA; HACA-box;
#=GF BM cmbuild -F CM SEED
#=GF CB cmcalibrate --mpi CM
#=GF SM cmsearch --cpu 4 --verbose --nohmmonly -E 1000 -Z 549862.597050 CM SEQDB
#=GF DR snoRNABase; ACA59;
#=GF DR SO; 0001263; ncRNA_gene;
#=GF DR GO; 0006396; RNA processing;
#=GF DR GO; 0005730; nucleolus;
#=GF RN [1]
#=GF RM 15199136
#=GF RT Human box H/ACA pseudouridylation guide RNA machinery.
#=GF RA Kiss AM, Jady BE, Bertrand E, Kiss T
#=GF RL Mol Cell Biol. 2004;24:5797-5807.
#=GF WK Small_nucleolar_RNA
#=GF SQ 3
AL031296.1/85969-86120      CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603      CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150  ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAUGCUAAUUAGUUGUGAGGACCAACU
#=GC SS_cons                -----((((,<<<<<<<<<___________>>>>>>>>>,,,,<<<<<<<______>>>>>>>,,,,,))))::::::::::::
#=GC RF                     CUGCcccaCAaCacuuguGCCUCaGUUACcCauagguGuAGUGaGgGuggcAaUACccaCcCucgUUgGuggUaAGGAaCAgCU
//
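A record like the one above can be split into sequences and annotation with a small reader. This is only a sketch under stated assumptions: the function name parse_stockholm is hypothetical, each item is assumed to sit on its own line, only the first record is read, and markup other than #=GF and #=GC is ignored.

```python
def parse_stockholm(text):
    """Return (sequences, file_annotation, column_annotation) of the first record."""
    seqs, gf, gc = {}, {}, {}
    lines = iter(text.splitlines())
    if not next(lines, "").startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM' header")
    for line in lines:
        line = line.rstrip()
        if line == "//":          # end of record
            break
        if not line:
            continue
        if line.startswith("#=GF"):
            # Per-file annotation: "#=GF <tag> <text>"; tags may repeat.
            parts = line.split(None, 2)
            if len(parts) == 3:
                gf.setdefault(parts[1], []).append(parts[2])
        elif line.startswith("#=GC"):
            # Per-column annotation: "#=GC <tag> <one symbol per column>".
            parts = line.split(None, 2)
            if len(parts) == 3:
                gc[parts[1]] = gc.get(parts[1], "") + parts[2]
        elif line.startswith("#"):
            continue              # other markup and comments are skipped here
        else:
            parts = line.split()
            if len(parts) >= 2:
                seqs[parts[0]] = seqs.get(parts[0], "") + parts[1]
    return seqs, gf, gc
```

For the example record, gc["SS_cons"] would hold the consensus structure line and gf["DR"] the list of database references.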
Here is an example alignment in FASTA format:
>AL031296.1/85969-86120
CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AANU01225121.1/438-603
CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AAWR02037329.1/29294-29150
---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU
GCUAAUUAGUUGUGAGGACCAACU
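Reading such a file amounts to collecting the lines between consecutive ">" headers. The sketch below uses a hypothetical function name, parse_fasta_alignment, and assumes that the first word after ">" is the sequence name and that all sequences of an alignment must have equal length, gaps included.

```python
def parse_fasta_alignment(text):
    """Parse an aligned FASTA file into {name: aligned sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the first word of the header is the sequence name.
            name = header.split()[0] if header else ""
            records[name] = ""
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name] += line
    # In an alignment, every row must span the same number of columns.
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    return records
```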
The multiple alignment format (MAF) is usually used to store multiple alignments at the DNA level between entire genomes. It consists of independent blocks of aligned sequences which are annotated by their genomic location. Consequently, a MAF-formatted MSA file may contain multiple records. MAF files start with a line
##maf
which is optionally extended by whitespace-delimited key=value pairs. Lines starting with the character "#" are considered comments and are usually ignored.
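The header and comment rules can be sketched as follows. Both function names are hypothetical, and the sketch assumes that keys and values themselves contain no whitespace.

```python
def parse_maf_header(line):
    """Return the key=value pairs of a '##maf' header line as a dict."""
    tokens = line.split()
    if not tokens or tokens[0] != "##maf":
        raise ValueError("not a MAF header line")
    # Each remaining token is expected to look like key=value.
    return dict(tok.split("=", 1) for tok in tokens[1:] if "=" in tok)


def is_maf_comment(line):
    """True for '#' comment lines, excluding the '##maf' header itself."""
    return line.startswith("#") and not line.startswith("##maf")
```

A reader would typically call parse_maf_header on the first line and skip every later line for which is_maf_comment returns True.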
A MAF block starts with the character "a" at the beginning of a line, optionally followed by whitespace-delimited key=value pairs. The following lines start with the character "s" and contain sequence information of the form
s src start size strand srcSize sequence
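where
src is the name of the sequence source,
start is the start of the aligned region within the source (0-based),
size is the length of the aligned region without gap characters,
strand is either "+" or "-", depicting the location of the aligned region relative to the source,
srcSize is the size of the entire sequence source, e.g. the full chromosome, and
sequence is the aligned sequence including gaps depicted by the hyphen "-" character.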
Here is an example alignment in MAF format (taken directly from the UCSC Genome Browser website):
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
# multiz.v7
# maf_project.v5 _tba_right.maf3 mouse _tba_C
# single_cov2.v4 single_cov2 /dev/stdin
a score=23262.0
s hg16.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon 116834 38 + 4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6 53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4 81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
a score=5062.0
s hg16.chr7 27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon 241163 6 + 4622798 TAAAGA
s mm4.chr6 53303881 6 + 151104725 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga
a score=6636.0
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA

The RNAlib can parse and apply data from constraint definition text files, where each constraint is given as a line of whitespace-delimited commands. The syntax we use extends the one used in mfold / UNAfold where each line begins with a command character followed by a set of positions.
Additionally, we introduce several new commands and allow for an optional loop type context specifier in the form of a sequence of characters, and an orientation flag that enables one to force a nucleotide to pair upstream or downstream.
The following set of commands is recognized:
F: Force
P: Prohibit
C: Conflicts/Context dependency
A: Allow (for non-canonical pairs)
E: Soft constraints for unpaired position(s), or base pair(s)

The optional loop type context specifier [WHERE] may be a combination of the following:
E: Exterior loop
H: Hairpin loop
I: Interior loop (enclosing pair)
i: Interior loop (enclosed pair)
M: Multibranch loop (enclosing pair)
m: Multibranch loop (enclosed pair)
A: All loops

If no [WHERE] flags are set, all contexts are considered (equivalent to A).
For particular nucleotides that are forced to pair, the following [ORIENTATION] flags may be used:
U: Upstream
D: Downstream

If no [ORIENTATION] flag is set, both directions are considered.
Sequence positions of nucleotides/base pairs are 1-based and consist of three positions $i$, $j$, and $k$. Alternatively, four positions may be provided as a pair of two position ranges $[i:j]$ and $[k:l]$, using the '-' sign as delimiter within each range, i.e. i-j and k-l.
Below are the resulting general cases that are considered valid constraints:
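Syntax: F i 0 k [WHERE] [ORIENTATION]
Description:
Enforces the set of $k$ consecutive nucleotides starting at position $i$ to be paired. The optional loop type specifier [WHERE] can be used to force them to appear as closing/enclosed pairs of certain types of loops.

Syntax: F i j k [WHERE]
Description: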
Enforces the base pairs $(i,j), \ldots, (i+(k-1), j-(k-1))$ to form. The optional loop type specifier [WHERE] can be used to specify in which loop context the base pairs must appear.

Syntax: P i 0 k [WHERE]
Description:
Prohibits a set of $k$ consecutive nucleotides, starting at position $i$, from participating in base pairing, i.e. makes these positions unpaired. The optional loop type specifier [WHERE] can be used to force the nucleotides to appear within loops of specific types.

Syntax: P i j k [WHERE]
Description:
Prohibits the base pairs $(i,j), \ldots, (i+(k-1), j-(k-1))$ from forming. The optional loop type specifier [WHERE] can be used to specify the type of loop they are disallowed to be the closing or an enclosed pair of.

Syntax: P i-j k-l [WHERE]
Description:
Prohibits any nucleotide $p \in [i:j]$ to pair with any other nucleotide $q \in [k:l]$. The optional loop type specifier [WHERE] can be used to specify the type of loop they are disallowed to be the closing or an enclosed pair of.

Syntax: C i 0 k [WHERE]
Description:
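This command enforces nucleotides to be unpaired, similar to prohibiting nucleotides to be paired as described above. It also marks the $k$ consecutive nucleotides starting at position $i$ as unpaired; however, the [WHERE] flag can be used to enforce specific loop types the nucleotides must appear in.

Syntax: C i j k
Description: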
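Removes all base pairs that conflict with the set of consecutive base pairs $(i,j), \ldots, (i+(k-1), j-(k-1))$. Two base pairs $(i,j)$ and $(p,q)$ conflict with each other if $i < p < j < q$, or $p < i < q < j$.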
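The conflict test is a plain crossing check and can be sketched in a few lines. This illustration assumes the usual definition, under which two pairs $(i,j)$ and $(p,q)$ conflict if $i < p < j < q$ or $p < i < q < j$, and assumes that a command with positions i j k addresses the stacked pairs $(i,j), \ldots, (i+(k-1), j-(k-1))$. All function names are hypothetical and not part of RNAlib.

```python
def pairs_conflict(pair_a, pair_b):
    """True if the two base pairs cross each other."""
    i, j = sorted(pair_a)
    p, q = sorted(pair_b)
    return i < p < j < q or p < i < q < j


def helix_pairs(i, j, k):
    """The k consecutive base pairs (i, j), ..., (i+k-1, j-k+1)."""
    return [(i + n, j - n) for n in range(k)]


def remove_conflicts(candidates, i, j, k):
    """Drop every candidate pair that crosses one of the k helix pairs."""
    helix = helix_pairs(i, j, k)
    return [bp for bp in candidates
            if not any(pairs_conflict(bp, h) for h in helix)]
```

Nested and disjoint pairs are left untouched; only pairs that cross the helix are removed.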

Syntax: A i j k [WHERE]
Description:
This command enables the formation of the consecutive base pairs $(i,j), \ldots, (i+(k-1), j-(k-1))$, no matter if they are canonical or non-canonical. In contrast to the above F and C commands, which remove conflicting base pairs, the A command does not. Therefore, it may be used to allow non-canonical base pair interactions. Since the RNAlib does not contain free energy contributions $E_{ij}$ for non-canonical base pairs $(i,j)$, they are scored as the maximum of similar, known contributions. In terms of a Nussinov-like scoring function, the free energy of non-canonical base pairs is therefore estimated as
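$$E_{ij} = \min \left[ \max_{(i,k) \in \{GC, CG, AU, UA, GU, UG\}} E_{ik}, \max_{(k,j) \in \{GC, CG, AU, UA, GU, UG\}} E_{kj} \right].$$
The optional loop type specifier [WHERE] can be used to specify in which loop context the base pair may appear.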

Syntax: E i 0 k e
Description:
Use this command to apply a pseudo free energy of $e$ to the set of $k$ consecutive nucleotides, starting at position $i$. The pseudo free energy is applied only if these nucleotides are considered unpaired in the recursions, or evaluations, and is expected to be given in kcal/mol.
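
Syntax: E i j k e
Description:
Use this command to apply a pseudo free energy of $e$ to the set of base pairs $(i,j), \ldots, (i+(k-1), j-(k-1))$. Energies are expected to be given in kcal/mol.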
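
To illustrate how such lines decompose, here is a minimal tokenizer sketch. It is not the parser shipped with RNAlib, and all names in it are hypothetical. It assumes that the [WHERE] and [ORIENTATION] flags are separate whitespace-delimited tokens, that the E command carries its pseudo energy as the last token, and that positions i j k with j > 0 denote the stacked pairs $(i,j), \ldots, (i+(k-1), j-(k-1))$.

```python
COMMANDS = {"F", "P", "C", "A", "E"}
ALL_CONTEXTS = "EHIiMm"      # every loop type; the flag "A" selects all of them
ORIENTATIONS = {"U", "D"}    # upstream / downstream


def parse_constraint_line(line):
    """Split one constraint line into command, positions, and optional flags."""
    tokens = line.split()
    if not tokens or tokens[0] not in COMMANDS:
        raise ValueError(f"unknown constraint command in {line!r}")
    cmd, args = tokens[0], tokens[1:]

    if len(args) >= 2 and "-" in args[0] and "-" in args[1]:
        # Four positions given as two ranges, e.g. "P 1-5 20-25".
        i, j = (int(x) for x in args[0].split("-"))
        k, l = (int(x) for x in args[1].split("-"))
        record = {"command": cmd, "ranges": ((i, j), (k, l))}
        rest = args[2:]
    elif len(args) >= 3:
        i, j, k = (int(x) for x in args[:3])
        record = {"command": cmd, "i": i, "j": j, "k": k}
        rest = args[3:]
    else:
        raise ValueError(f"too few positions in {line!r}")

    # Defaults: all loop contexts, both pairing directions, no energy.
    where, orientation, energy = set(ALL_CONTEXTS), None, None
    for tok in rest:
        if cmd == "E":
            energy = float(tok)  # assumption: trailing token is the pseudo energy
        elif tok in ORIENTATIONS:
            orientation = tok
        elif set(tok) <= set(ALL_CONTEXTS + "A"):
            where = set(ALL_CONTEXTS) if "A" in tok else set(tok)
        else:
            raise ValueError(f"unrecognized flag {tok!r}")
    record.update(where=where, orientation=orientation, energy=energy)
    return record


def affected_positions(record):
    """Nucleotides (j == 0) or base pairs (j > 0) addressed by 'i j k'."""
    if "ranges" in record:
        return record["ranges"]
    i, j, k = record["i"], record["j"], record["k"]
    if j == 0:
        return [i + n for n in range(k)]
    # Assumption: k stacked pairs (i, j), ..., (i+k-1, j-k+1).
    return [(i + n, j - n) for n in range(k)]
```

For instance, the line "F 3 0 4 H U" would be read as forcing nucleotides 3 to 6 to pair upstream within hairpin loop context, while "E 10 30 2 -1.5" would attach a pseudo energy of -1.5 to the pairs (10,30) and (11,29).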