This is the documentation for pipelines based on MACSE v2

1. Overview

MACSE (Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons) provides a complete toolkit dedicated to the multiple alignment of coding sequences that can be leveraged via both the command line and a Graphical User Interface (GUI).

Various strategies can be built using the MACSE toolkit to handle datasets of various sizes and containing various types of sequences (contigs, pseudogenes, barcoding sequences).

We share here some of the pipelines that we have built so far using MACSE. These bash pipelines are encapsulated in Singularity containers so that you do not need to deal with dependencies or configuration issues.

2. Getting started

If you are new to Singularity, you should probably start here:

3. The alignment pipelines

4. The barcoding related pipelines

For barcoding, if you have tens of thousands of sequences to align (e.g. COI-5P or matK), we suggest the following steps:
i) using a reference sequence that you have checked by eye, identify similar sequences in your dataset and reverse complement them when needed, then select a subset of about 100 sequences representative of this dataset;
ii) align these representative sequences;
iii) align (in parallel) each sequence against this representative subset.

This can be done in three command lines using our dedicated pipelines:

We have successfully run this barcoding pipeline on different taxonomic groups. The corresponding data are available on this page.

5. OMM_MACSE and AlFiX pipelines share several common steps and options

Mandatory input and output file options

Both pipelines produce several output files: two alignment files (at the NT and AA levels), a csv file with filtering statistics per sequence, and a fasta file with filtering details (where nucleotides of input sequences are in upper case if present in the final alignment and in lower case otherwise). All output files are stored in a new folder and named with a common prefix so that they do not mix with your own files. Both pipelines therefore have three mandatory options:


option_name | used for | example
--in_seq_file | specifying the INput SEQuence FILE containing the coding nucleotide sequences to be aligned, in fasta format | --in_seq_file LOC_48720_NT.fasta
--out_dir | specifying the OUTput DIRectory in which result files will be stored | --out_dir RES_LOC_48720
--out_file_prefix | specifying the common part of output file names | --out_file_prefix LOC_48720
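
The upper/lower-case convention of the filtering-details fasta file described above makes it easy to check how much of each input sequence was kept. The following is a minimal sketch, not part of the pipelines: the function name `kept_fraction` and the file name `LOC_48720_details.fasta` are hypothetical, so use the actual file name found in your output folder.

```shell
# kept_fraction FILE: for each fasta record, print
# "name <TAB> kept <TAB> total <TAB> fraction", where kept counts upper-case
# letters (present in the final alignment) and total counts all letters.
kept_fraction() {
    LC_ALL=C awk '
        function report() {
            if (seen)
                printf "%s\t%d\t%d\t%.3f\n", name, kept, total, (total ? kept / total : 0)
        }
        /^>/ { report(); seen = 1; name = substr($1, 2); kept = 0; total = 0; next }
        {
            line = $0
            total += gsub(/[A-Za-z]/, "&", line)   # gsub returns the number of letters
            kept  += gsub(/[A-Z]/, "&", line)      # same, upper-case letters only
        }
        END { report() }
    ' "$1"
}

# Hypothetical file name: adapt it to the content of your output folder.
DETAILS="RES_LOC_48720/LOC_48720_details.fasta"
if [ -f "$DETAILS" ]; then
    kept_fraction "$DETAILS"
fi
```

Sequences with a low kept fraction are good candidates for a closer look in the csv file of filtering statistics.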

Basic usage examples

As there are only three mandatory options, you can launch these pipelines with default options using the following commands (see Singularity quick start if needed):
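
As a minimal sketch, assuming hypothetical image file names (`omm_macse.sif`, `alfix.sif`) and that each container exposes its pipeline through the standard Singularity runscript, a default launch could look like this. The `run` helper and the `DRY_RUN` variable are illustrative additions that print the commands instead of executing them; the AlFiX output directory name is also just an example.

```shell
# Hypothetical image file names: replace them with the containers you downloaded.
OMM_MACSE_SIF="./omm_macse.sif"
ALFIX_SIF="./alfix.sif"

# Dry-run helper: prints the command unless DRY_RUN=0, in which case it runs it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# OMM_MACSE with the three mandatory options only.
run singularity run "$OMM_MACSE_SIF" \
    --in_seq_file LOC_48720_NT.fasta \
    --out_dir RES_LOC_48720 \
    --out_file_prefix LOC_48720

# AlFiX takes the same mandatory options (separate output directory chosen here).
run singularity run "$ALFIX_SIF" \
    --in_seq_file LOC_48720_NT.fasta \
    --out_dir RES_LOC_48720_ALFIX \
    --out_file_prefix LOC_48720
```

Setting `DRY_RUN=0` in the environment before running the script executes the commands for real.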

Optional options for saving intermediary files (mostly for debugging purposes)

option_name | used for | example
--out_detail_dir | specifying the output directory that will contain all intermediary files | --out_detail_dir DETAILS_LOC_48720
--save_details | turning ON when intermediary files need to be saved | --save_details
--debug | turning ON to keep the temporary folder created in /tmp/ | --debug

Optional options related to genetic codes and less reliable sequences set

Both pipelines allow you to specify the genetic code that should be used to translate your sequences, and to provide a second input file that contains less reliable sequences (e.g. newly assembled contigs, pseudogenes, etc...) in which frameshifts and stop codons are expected to be more frequent:

option_name | used for | usage example
--genetic_code_number | code_number selecting the relevant genetic code | --genetic_code_number 5
--in_seq_lr_file | specifying the INput SEQuence FILE containing the Less Reliable sequences | --in_seq_lr_file less_reliable_seq_file.fasta
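
Since both input files must be fasta files, a quick sanity check before launching can save a failed run. The sketch below is an illustration only: the helper `is_fasta`, the image name `omm_macse.sif` and the use of `singularity run` are assumptions, while the option names and example values are those of the table above.

```shell
# Hypothetical image file name: replace it with the container you downloaded.
PIPELINE_SIF="./omm_macse.sif"

# is_fasta FILE: succeeds if FILE is non-empty and its first non-blank line
# starts with ">".
is_fasta() {
    [ -s "$1" ] && awk 'NF { ok = ($0 ~ /^>/); exit } END { exit !ok }' "$1"
}

IN="LOC_48720_NT.fasta"
LR="less_reliable_seq_file.fasta"

if is_fasta "$IN" && is_fasta "$LR"; then
    singularity run "$PIPELINE_SIF" \
        --in_seq_file "$IN" \
        --in_seq_lr_file "$LR" \
        --genetic_code_number 5 \
        --out_dir RES_LOC_48720 \
        --out_file_prefix LOC_48720
else
    echo "skipping launch: $IN or $LR is missing or not in fasta format" >&2
fi
```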

Optional options related to filtering steps

Both pipelines include four optional filtering steps. All of them are active by default but can be individually turned OFF, and the minimal percentage of nucleotides used for the final trimming step can be adjusted:

option_name | used for | default value
--no_prefiltering | turning OFF the pre-filtering step | ON
--no_FS_detection | turning OFF the detection of frameshifts (only relevant for the OMM_MACSE pipeline) | ON
--no_filtering | turning OFF the HmmCleaner alignment filtering | ON
--no_postfiltering | turning OFF the post-filtering of the alignment that masks isolated AA | ON
--min_percent_NT_at_ends | setting the minimal proportion of nucleotides that must be present at the first and last sites of the final alignment | 0.7
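
To illustrate how these options combine, here is a hedged sketch that turns OFF the HmmCleaner filtering and relaxes the trimming threshold. The image name, the helper `valid_fraction` and the value 0.5 are illustrative; the check assumes that the threshold is a proportion between 0 and 1, which is inferred from the 0.7 default rather than stated by the pipelines.

```shell
# Hypothetical image file name: replace it with the container you downloaded.
PIPELINE_SIF="./omm_macse.sif"

# valid_fraction X: succeeds if X is a plain decimal number between 0 and 1
# (assumed valid range for --min_percent_NT_at_ends).
valid_fraction() {
    awk -v x="$1" 'BEGIN { exit !((x ~ /^[0-9]*\.?[0-9]+$/) && (x + 0 >= 0) && (x + 0 <= 1)) }'
}

MIN_NT="0.5"   # illustrative value, less stringent than the 0.7 default

if valid_fraction "$MIN_NT"; then
    # Remove the leading "echo" to actually launch the pipeline.
    echo singularity run "$PIPELINE_SIF" \
        --in_seq_file LOC_48720_NT.fasta \
        --out_dir RES_LOC_48720 \
        --out_file_prefix LOC_48720 \
        --no_filtering \
        --min_percent_NT_at_ends "$MIN_NT"
else
    echo "invalid --min_percent_NT_at_ends value: $MIN_NT" >&2
fi
```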

Optional option to allocate more memory (for large datasets)

For datasets containing numerous long sequences, MACSE may need more memory than the default value allocated to the java virtual machine. This can be set using the following option:

option_name | used for | usage example
--java_mem | passing the argument to the jvm via its Xmx option | --java_mem 600m
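
Because the value is handed to the JVM through Xmx, it has to follow the usual JVM size syntax (a number optionally followed by k, m or g). The sketch below checks this before building the command; the helper `valid_xmx`, the image name and the 4g value are illustrative assumptions, and the check is deliberately conservative.

```shell
# Hypothetical image file name: replace it with the container you downloaded.
PIPELINE_SIF="./omm_macse.sif"

# valid_xmx SIZE: succeeds if SIZE looks like a JVM heap size such as 600m or 4g.
valid_xmx() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+[kKmMgG]?$'
}

JAVA_MEM="4g"   # illustrative value for a large dataset

if valid_xmx "$JAVA_MEM"; then
    # Remove the leading "echo" to actually launch the pipeline.
    echo singularity run "$PIPELINE_SIF" \
        --in_seq_file LOC_48720_NT.fasta \
        --out_dir RES_LOC_48720 \
        --out_file_prefix LOC_48720 \
        --java_mem "$JAVA_MEM"
else
    echo "invalid --java_mem value: $JAVA_MEM" >&2
fi
```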

Optional options to specify how frameshifting codons should be exported

The output directory contains several files (see details in the readme_output.txt file in the output directory). The final alignment files (NT and AA) are obtained after replacing STOP codons by "NNN", and frameshifting codons by either "NNN" (default) or "---" using exportAlignment:

option_name | used for | default
--replace_FS_by_gaps | replacing frameshifting codons by "---" instead of "NNN" | OFF ("NNN")

Three options specific to the OMM_MACSE pipeline

Because OMM_MACSE relies on an external tool to rapidly align the amino acid sequences once frameshifts have been detected using a draft alignment performed by MACSE, three additional options are available. The first selects the amino acid alignment tool (MAFFT, Muscle or PRANK); the second passes extra parameters to the alignment tool; the third turns OFF the detection of frameshifts by MACSE. Note that if frameshift detection is turned OFF while some sequences actually contain frameshifts, the resulting alignment will be meaningless, since it will be based on an erroneous translation of the nucleotide sequences.

option_name | used for | usage example
--alignAA_soft | specifying the software to use, MAFFT (default), Muscle or PRANK, for aligning the frameshift-corrected amino acid sequences | --alignAA_soft MUSCLE
--aligner_extra_option | specifying the options to provide to the alignment software | --aligner_extra_option "--localpair --maxiterate 1000"
--no_FS_detection | turning OFF the detection of frameshifts by MACSE | --no_FS_detection

Muscle and PRANK are used with default parameters. MAFFT (the default aligner) is launched using "--localpair --maxiterate 2000", which corresponds to the L-INS-i algorithm and is usually better suited to CDS sequences.
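
One practical point when passing aligner parameters is quoting: the whole parameter string must reach the pipeline as a single argument of --aligner_extra_option. The sketch below makes the argument boundaries visible without launching anything; the helper `show_args` and the image name are hypothetical, and the option values are those of the usage examples above.

```shell
# Hypothetical image file name: replace it with the container you downloaded.
OMM_MACSE_SIF="./omm_macse.sif"

# show_args ARGS...: print each argument between brackets, so that an option
# value containing spaces can be seen to travel as one single argument.
show_args() {
    for arg in "$@"; do
        printf '[%s] ' "$arg"
    done
    echo
}

# The double quotes keep "--localpair --maxiterate 1000" together.
show_args singularity run "$OMM_MACSE_SIF" \
    --in_seq_file LOC_48720_NT.fasta \
    --out_dir RES_LOC_48720 \
    --out_file_prefix LOC_48720 \
    --alignAA_soft MAFFT \
    --aligner_extra_option "--localpair --maxiterate 1000"
```

Once the bracketed output looks right, replacing `show_args` with nothing (so that the line starts with `singularity`) launches the same command.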