This manual documents the command line usage of sina. Please see http://www.arb-silva.de/aligner for a reference to the scientific description of the employed algorithms.
You can view SINA as a one-command pipeline composed of the following stages:
You can enable or disable the middle three stages as required. By default, only the alignment stage is enabled. Briefly, this is what those stages do (see section Options for details on the configuration options accepted by each of the stages).
The default parameters are a pretty good starting point. They were optimized using a large SSU rRNA gene reference MSA. If you want to use SINA for other gene sequences, see section Examples on how to do some simple accuracy benchmarks on them. To improve the results, the parameters you will want to start with are --fs-full-len (set to the typical size of a full-length sequence) and --fs-kmer-len (setting this to 8 may help with more variable or shorter sequences).
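As a rough illustration of the tuning advice above, the following sketch assembles (but does not run) a sina command line with both parameters set. The function name sina_tuned_cmd and the example value 900 are assumptions made for this illustration; only the option names and file names come from this manual.

```sh
# Hypothetical helper, not part of SINA: print a sina command line with the
# two tuning parameters filled in, so it can be reviewed before running.
sina_tuned_cmd() {
    full_len="$1"   # typical size of a full-length sequence of your gene
    kmer_len="$2"   # 8 may help with more variable or shorter sequences
    printf './sina -i %s -o %s --ptdb %s --fs-full-len %s --fs-kmer-len %s\n' \
        mysequences.fasta alignedsequences.fasta reference.arb \
        "$full_len" "$kmer_len"
}
```

For a gene whose full-length sequences are around 900 bases, calling sina_tuned_cmd 900 8 prints a command line that can be checked and then run by hand.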
Alignment difference visualization requires that the input sequences be already aligned in a way compatible with the used reference alignment. For positions at which the original alignment and the alignment computed by SINA differ, output as shown below will be printed to the log:
Dumping pos 1121 through 1141:
---------  4 14 16-17 21 24
G-C-AGUC-  40 <---(%% ORIG %%)
GCA--GUC-  41 <---(## NEW ##)
GCA-AGUC-  0-3 5-13 15 18-20 22-23 25-27 29-39
GCAA-GUC-  28
In this case, the bases 'C' and 'A' were placed in columns other than those of the original alignment. The original alignment is marked with <---(%% ORIG %%). The new alignment is marked with <---(## NEW ##). The numbers to the right of the alignment excerpt indicate the indices of the sequences in the alignment reference (field align_family_slv) which the respective row represents. All-gap columns are not shown. The first line indicates the range of alignment columns displayed.
socket may either be of the format hostname:port, specifying a TCP socket, or of the format :filename, specifying a Unix socket. If no running PT server could be contacted and a Unix socket is specified or hostname is "localhost", a PT server will be started locally. If hostname is "__SGE__" SINA will start and contact a PT server on a cluster node using qrsh(1). Otherwise, ssh(1) will be used to start a PT server on the configured host.
The default is to use the socket "localhost:4040".
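The rules above for the socket format can be sketched as a small shell function. This is only an illustration of the described behaviour for the case that no running PT server could be contacted, not SINA's actual implementation; the function name pt_launch_mode and its output words are hypothetical.

```sh
# Hypothetical helper, not part of SINA: report how a PT server would be
# started for a given socket string, following the rules described above.
pt_launch_mode() {
    case "$1" in
        :?*)          echo "local" ;;   # ":filename" is a Unix socket
        localhost:?*) echo "local" ;;   # TCP socket on this machine
        __SGE__:?*)   echo "qrsh"  ;;   # cluster node via qrsh(1)
        ?*:?*)        echo "ssh"   ;;   # any other host via ssh(1)
        *)            echo "invalid" ;; # neither hostname:port nor :filename
    esac
}
```

The order of the patterns matters: the special host names are tested before the generic hostname:port pattern.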
CAUTION: If a PT server is already running on the configured socket, but its database does not match the database configured with --ptdb, the results will be undefined. The search result retrieved from the PT server identifies sequences using the name field. For completely different databases, this will usually result in SINA being unable to find reference sequences. It may, however, also result in SINA retrieving the wrong sequences.
To determine which orientation is most likely, SINA uses the PT server to search for the sequence in the configured orientations. If an orientation different to the original yields a higher scoring best match, the sequence is modified accordingly.
Normally, SINA will compare the input sequence with all reference sequences found via the PT server search. If the input sequence is a substring of any of the reference sequences, the alignment of the reference sequence of which the input sequence is a substring will be directly transferred to the input sequence.
If the input sequence is found to be an exact match to a reference sequence, this will be noted in the field align_log_slv with the string "copied alignment from identical template sequence". If the input sequence is found to be a substring of a reference sequence, this will be noted with the string "copied alignment from (longer) template sequence". In both cases, the contents of the fields acc and start will also be logged to identify the reference sequence.
If suitable sequences for alignment copying are found, but --realign is set, the sequences will be removed from the alignment reference. This will be noted in the log with the message "sequences [acc list] containing exact candidate removed from family;".
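The distinction between the exact-match and substring cases described above can be illustrated with a short shell function that compares two unaligned sequences. This is a simplified sketch, not SINA's code; the function name copy_kind and its output words are hypothetical.

```sh
# Hypothetical helper, not part of SINA: classify a query sequence against
# one reference sequence (both given as plain, unaligned base strings).
copy_kind() {
    query="$1"
    ref="$2"
    if [ "$query" = "$ref" ]; then
        echo "identical"      # exact match to the reference
    else
        case "$ref" in
            *"$query"*) echo "substring" ;;  # query is contained in a longer reference
            *)          echo "none" ;;       # no alignment copying possible
        esac
    fi
}
```

For example, copy_kind CGU ACGUA reports "substring", since the query occurs inside the longer reference.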
The default setting is to shift the bases surrounding such a large insertion aside as required. This is done by iteratively choosing the nearest free column to the left or right until sufficient columns have been found. Each time bases are encountered between the insertion and the free column, these bases are added to the insertion. The main benefit of this naive approach is that the position and size of insertions that could not be accommodated are known. The message "shifting bases to fit in N bases at pos X to Y" will be logged each time an insertion of length N is attempted between positions X and Y with Y-X
The option forbid configures SINA to instead disallow insertions that will not fit the reference alignment during the dynamic programming stage of sequence alignment. While this option constitutes a loss of optimality of the alignment algorithm if the gap extension penalty (see --pen-gapext below) is different from the gap open penalty (see --pen-gap below), it results in slightly less damage to the alignment accuracy.
The option remove configures SINA to omit bases from insertions as necessary to fit these insertions into the alignment without moving surrounding aligned bases. This option should be handled with care as the original sequence is altered. If the alignment is subjected to column masking or column sampling (such as during tree reconstruction with bootstrapping), omitting bases is safe, as these methods interpret the resulting MSA from a column perspective.
Which option is the most suitable should be carefully considered for each use case. Whenever possible, circumventing the necessity to handle insertions that do not fit into the alignment by simply adding gap columns into the reference alignment is the preferred solution.
This value is likely to change or disappear in future versions.
This value is likely to disappear or change in future versions.
./sina -i mysequences.fasta -o alignedsequences.fasta \
    --ptdb reference.arb
The first time you run this, a PT server will be started and will begin building its index. The index is stored in reference.arb.pt and will only be computed again if reference.arb changes (the decision is based on file timestamps only). Once started, the PT server keeps running, so subsequent sina runs will be much faster. Nonetheless, start-up time may be long if reference.arb is large.
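To estimate in advance whether a run will trigger an index build, the timestamp rule above can be approximated in the shell. This is a sketch under the assumption that the rule amounts to "index missing or older than the database"; the function name index_needs_rebuild is hypothetical and the check is not taken from SINA or the PT server.

```sh
# Hypothetical helper, not part of SINA: succeed (exit status 0) if the
# index file <db>.pt is missing or older than the database file <db>.
index_needs_rebuild() {
    db="$1"
    idx="$db.pt"
    # Test for a missing index first; -nt compares modification times.
    [ ! -e "$idx" ] || [ "$db" -nt "$idx" ]
}
```

A possible use is: if index_needs_rebuild reference.arb; then echo "expect a long first run"; fi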
./sina -i mysequences.fasta -o alignedsequences.fasta \
    --meta-fmt csv \
    --ptdb reference.arb \
    --search --search-db reference.arb --lca-fields tax_slv
The classifications will be written (among other values) to the column labeled "lca_tax_slv" in alignedsequences.fasta.csv.
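One way to pull a single column such as "lca_tax_slv" out of the CSV file is a short awk wrapper. This sketch assumes a header row with unquoted column names and no commas inside field values; check this against your actual output before relying on it. The function name csv_column is hypothetical.

```sh
# Hypothetical helper, not part of SINA: print the values of one named
# column from a simple CSV file (assumes no quoted fields or embedded commas).
csv_column() {
    awk -F',' -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c       { print $c }   # only prints if the column name was found
    ' "$1"
}
```

With the example above, csv_column alignedsequences.fasta.csv lca_tax_slv would list one classification per input sequence.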
./sina -i mysequences.fasta -o mysequences.arb --prealigned
This will generate an ARB file from your aligned sequences suitable for use as a reference MSA. The first word of each FASTA header will be written to the ARB field "name". Make sure these words are unique for each sequence: ARB uses this field to identify sequences, and a duplicate will overwrite the previous sequence with the same name. The remainder of the FASTA header will be written to the field "full_name".
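Before converting, the uniqueness requirement above can be checked with standard tools. The following sketch lists every first header word that occurs more than once; the function name duplicate_fasta_names is hypothetical, and the check uses only grep, sed, awk, sort and uniq rather than anything provided by SINA.

```sh
# Hypothetical helper, not part of SINA: print each FASTA identifier
# (first word of a header line) that appears more than once in the file.
duplicate_fasta_names() {
    grep '^>' "$1" |      # keep header lines only
        sed 's/^>//' |    # drop the leading ">"
        awk '{print $1}' | # first word becomes the ARB "name" field
        sort | uniq -d    # report names seen at least twice
}
```

If duplicate_fasta_names mysequences.fasta prints nothing, all names are unique.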
./sina -i myreference.arb --ptdb myreference.arb \
    -o /dev/null --outtype fasta \
    --fs-leave-query-out --show-dist
The average accuracy will be printed at the end of the SINA run.
sed '/^[^>]/ y/uU/tT/' rna.fasta > dna.fasta
The author of SINA reserves all copyrights and other intellectual property rights. All further rights are at Ribocon GmbH (the "Owner") in legal agreement with the author of SINA and all third parties involved.
If you are interested in commercial use of the SILVA stand-alone software contact sina@ribocon.com.
Personal Use and Evaluation License (PUEL) for SINA Stand-Alone Software
This license applies if you download the SINA Stand-Alone Software Package (the "Product") from www.arb-silva.de. In summary, the license allows you to use the Product free of charge for academic Personal Use or, alternatively, for non-academic, time-limited Evaluation.
Overview: Personal Use (academic) is when you install the Product yourself and you make use of it. You can use the Product within an academic study to process as much data as you like and publish the processed data as long as you follow the terms below. If you deploy the Product to a single or multiple computers for colleagues within your institution, e.g. in the capacity as a system administrator, this would no longer qualify as Personal Use.
Personal Use does NOT include (1) any redistribution of the Product, (2) any kind of Product-based data analysis service for third parties, or (3) integration of the Product into another software.
License Agreement: You should have received a copy of the license agreement with this software in the file LICENSE.txt. If you did not, please visit http://www.arb-silva.de/aligner/sina.