FAQ

What are the tunable settings and how can they influence my results?

● Input fasta files type: The input files can be either GENOMES (fully assembled or contigs) or ORFs (open reading frames). For GENOMES, the first step in the pipeline is ORF prediction using Prodigal. Users who already have ORFs and would like to use them as the starting point should choose ORFs for this parameter, in which case ORF prediction is skipped.

● Filter out contigs / orfs of plasmids: The pipeline can as a first step filter out contigs or ORFs that belong to plasmids. This is done by a simple search for the word "plasmid" in the contig or ORF record name.

● Minimum sequence identity & sequence coverage (protein-level) for homologs detection: These parameters are used in the homology search, which is the first stage of orthogroup inference. The homology search is performed using MMseqs2, a fast and sensitive homology search tool. The default values are 40% sequence identity and 70% sequence coverage. These values can be adjusted to increase or decrease the stringency of the homology search. For example, to detect more remote homologs, decrease the sequence identity and coverage thresholds; the inferred orthogroups will then be larger, with more genes in each orthogroup. In contrast, to detect only very close homologs, increase the thresholds; the inferred orthogroups will then be smaller, as they will contain only very similar genes.

● Minimum percent of strains required to consider an orthogroup as part of the core genome: This parameter determines which orthogroups are included in the core proteome (and core genome). By default, this value is set to 100%, so only orthogroups that contain members from all analyzed genomes are included in the core proteome. However, when bacteria from different orders are analyzed, this strict definition can lead to a very small core proteome. In that case, the core threshold can be lowered. For example, when a 70% threshold is used, orthogroups shared by at least 70% of the analyzed genomes are included in the "core" proteome. The tree is then inferred from a larger dataset, albeit one with missing data. As always, the best way to study the impact of tunable parameters on a specific dataset is to perform trial runs using different thresholds.

● Root the species tree according to this outgroup: The outgroup genome is used to root the species tree. By default, no outgroup is used and the produced species tree is unrooted. Alternatively, the user can specify one of the file names (without the file extension) in the uploaded dataset as the outgroup. The outgroup should be a genome that is phylogenetically distant from the ingroup genomes. Note that if the outgroup name is not found in the dataset, it is ignored and the produced tree is unrooted.

● Apply bootstrap over the species tree: The bootstrap values quantify the reliability of all non-trivial splits (branches) in the inferred tree. By default, they are not computed. The user can choose to compute them and present them on the species tree.

● Add orphan genes to orthogroups table: By default, orphan genes (genes that do not belong to any orthogroup) are not included in the orthogroups table (in the result folder 05a_orthogroups). The user can choose to include them as single-gene orthogroups. Of note, orthogroups that contain multiple genes from a single genome are always included in the orthogroups table, regardless of this parameter.
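
Two of the settings above, the plasmid filter and the core threshold, can be illustrated with a minimal Python sketch. This is not the server's actual code: the function names, the case-insensitive match, and the representation of orthogroups as a mapping from orthogroup name to the set of genomes that contribute at least one member are all assumptions made for illustration.

```python
def is_plasmid_record(record_name: str) -> bool:
    """Return True if a FASTA record name contains the word 'plasmid'.

    Case-insensitive matching is an assumption of this sketch.
    """
    return "plasmid" in record_name.lower()


def core_orthogroups(orthogroups, n_genomes, min_percent=100.0):
    """Return names of orthogroups found in at least min_percent of genomes.

    orthogroups: dict mapping orthogroup name -> set of genome names
                 (hypothetical in-memory representation).
    """
    core = []
    for name, genomes in orthogroups.items():
        # Compare without division to avoid floating-point rounding issues.
        if len(genomes) * 100 >= min_percent * n_genomes:
            core.append(name)
    return core


# Hypothetical toy dataset: 4 genomes, 3 orthogroups.
toy = {
    "OG_0": {"g1", "g2", "g3", "g4"},  # present in 100% of genomes
    "OG_1": {"g1", "g2", "g3"},        # present in 75% of genomes
    "OG_2": {"g1"},                    # present in 25% of genomes
}

strict_core = core_orthogroups(toy, n_genomes=4)                    # default 100%
relaxed_core = core_orthogroups(toy, n_genomes=4, min_percent=70)   # 70% threshold
print(strict_core, relaxed_core)
```

With the default threshold only OG_0 qualifies, while lowering the threshold to 70% also admits OG_1, which is the trade-off described above: more data for the tree, at the cost of missing entries.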

Should I include an outgroup genome in my analysis?

An outgroup genome allows reconstructing rooted trees. However, including an outgroup genome may strongly affect the results. For example, core genes are genes shared by all analyzed genomes. A phylogenetically distant outgroup may share only some of the genes shared by all ingroup genomes, thus resulting in sparse data, i.e., a core proteome composed of fewer genes. This, in turn, may produce a less accurate phylogenetic tree. Note that it can also lead to no tree at all, if including the outgroup results in no genes being shared by all genomes (see Why was a species tree not included in my analysis results?). Including a remote outgroup may also introduce a long-branch attraction artifact, as well as biases due to different nucleotide composition in the ingroup vs. outgroup sequences (see Can I trust the produced species phylogenetic tree?). Thus, in cases where an outgroup is available, we recommend running the analysis twice: once with and once without the outgroup. The user is advised to compare these two runs and study the impact of including the outgroup on the specific data being analyzed.

Why does the orthogroup inference step take so long?

The first step in orthogroup inference is homolog detection. When searching for homologs, every gene in each genome is queried against all genes in all other genomes, making this the most computationally intensive step. Even though M1CR0B1AL1Z3R uses MMseqs2, the fastest algorithm currently available for this task (Steinegger M. & Söding J., Nat Biotechnol, 2017), and runs this step in parallel, it might take over 12 hours for a dataset containing ~150 genomes.
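
To see why this step dominates the running time, the following Python sketch counts the genome pairs and the gene-to-gene comparisons that an exhaustive all-against-all search would imply. The figure of 4,000 genes per genome is a hypothetical round number, and the function names are illustrative. MMseqs2 avoids most of these comparisons through prefiltering, so the result describes the size of the search space, not the actual work performed.

```python
def genome_pairs(n_genomes: int) -> int:
    """Number of unordered pairs of distinct genomes."""
    return n_genomes * (n_genomes - 1) // 2


def naive_gene_comparisons(n_genomes: int, genes_per_genome: int) -> int:
    """Gene-to-gene comparisons if every gene were compared with every gene
    in every other genome (each genome pair counted once)."""
    return genome_pairs(n_genomes) * genes_per_genome ** 2


pairs = genome_pairs(150)
comparisons = naive_gene_comparisons(150, 4000)  # 4000 genes/genome is hypothetical
print(pairs)        # 11175
print(comparisons)  # 178800000000
```

Because the number of genome pairs grows quadratically, doubling the number of genomes roughly quadruples the search space.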

How reliable is the orthogroup inference step and is there a way to improve it?

The detection of orthologous genes relies on the correct identification of ORFs. Thus, errors in detecting ORFs can propagate to erroneous detection of orthologs. In addition, the ortholog-identification step relies on homology search algorithms such as BLAST (or in our case, MMseqs2). These algorithms are approximate and may lead to two types of error: reporting a gene as an ortholog when it is not one (a false positive) and failing to identify a true ortholog (a false negative). An ortholog can also be missed because some of the analyzed genomes may not be fully assembled. Finally, horizontal gene transfer (HGT) is a major evolutionary force shaping bacterial evolution and thus, seemingly orthologous sequences can, in fact, represent xenology rather than orthology (xenology is when sequence similarity stems from HGT events rather than from vertical divergence following speciation events). It is critically important to be aware of these potential biases when interpreting the results. An excellent reference that discusses these problems and suggests algorithms to test (and sometimes correct) for such biases is Philippe H., et al. (PLoS Biology, 2011).

Can I trust the multiple sequence alignments of the orthogroups?

Despite substantial advances in multiple alignment theory and the development of ever faster and more accurate alignment programs, generated alignments are still fraught with errors (Thompson JD., et al., PLoS One, 2011; Sela I., et al., Nucleic Acids Res, 2015). In this web server, we use the MAFFT program (Katoh K. and Standley DM., Mol Biol Evol, 2013), which combines accuracy and computational speed, and is one of the most widely used alignment methods. However, other excellent alignment programs exist. For example, the PRANK program (Löytynoja A. and Goldman N., Science, 2008) is considerably slower than MAFFT, but has been shown to yield more accurate alignments than MAFFT in simulation studies (e.g., Sela I., et al., Nucleic Acids Res, 2015). In our web server, we provide the inferred multiple sequence alignments for all the orthogroups, and the user can download these files and realign them using any other alignment method. In addition, we suggest testing the reliability of any specific alignment using existing tools, such as the GUIDANCE2 web server (Sela I., et al., Nucleic Acids Res, 2015).

Can I trust the produced species phylogenetic tree?

Reconstructing accurate phylogenetic trees is one of the "holy grails" of molecular evolution research. It is a notoriously difficult task, known to be affected by many factors, including (i) the quality of the input sequences; (ii) the sequence sampling; (iii) the quality of the input alignment; (iv) the identification of orthologous sequences; (v) missing data; (vi) the assumed evolutionary model; (vii) the level of sequence divergence and saturation; and (viii) stochastic error. Often, reconstructed trees are highly supported, yet they reflect non-phylogenetic signal rather than genuine phylogenetic signal (Philippe H, et al., PLoS Biology, 2011). For example, long tree branches tend to cluster together regardless of their true evolutionary relationships, a phenomenon called "long-branch attraction" (Felsenstein J, Syst Zool, 1978). Similarly, and especially when bacterial sequences are analyzed, genomes with similar GC content can cluster together, generating trees that reflect similarity in nucleotide composition rather than true vertical inheritance (Galtier N and Gouy M, PNAS, 1995). Furthermore, horizontal gene transfer events can bias the inference of orthologous sequences (see How reliable is the orthogroup inference step and is there a way to improve it?), and thus trees inferred from a concatenation of many multiple sequence alignments may reflect "an average" of conflicting gene trees. This average may or may not reflect the desired "vertical" (or species) tree. Eliminating these biases is an active research area. We highly recommend manually inspecting every step of the M1CR0B1AL1Z3R pipeline, including the phylogenetic tree reconstruction. For example, a user can download the core proteome and rerun the tree reconstruction with or without a specific subset of alignments. A user can also test each gene for congruence with the obtained species tree. Genes that are significantly incongruent with the species tree are suspected of having undergone horizontal transfer. One can then reconstruct the tree without the genes suspected of horizontal transfer and compare it to the tree obtained using all of the data. Note that removing genes reduces the noise but also the phylogenetic signal, and there is no consensus regarding the optimal strategy. We recommend Anisimova, et al. (BMC Evolutionary Biology, 2013) and Philippe et al. (PLoS Biology, 2011) for further reading about biases and potential solutions when reconstructing phylogenetic trees.

Why does the species tree reconstruction step take so long?

We aim to reconstruct the phylogenetic tree using state-of-the-art methodologies. To this end, we apply the maximum-likelihood paradigm, which relies on an explicit model of sequence evolution. Specifically, we assume the GTR+I+gamma model, with among-site rate variation modeled using the discrete gamma distribution. The tree is reconstructed using one of the fastest programs, IQ-TREE (Minh BQ., et al., Mol Biol Evol, 2020). Despite the speed of IQ-TREE, the running time increases substantially with the number of sequences and the number of sites in the alignment.

Why was a species tree not included in my analysis results?

There are several scenarios that can lead to results lacking a phylogenetic tree. First, when a dataset consists of fewer than four different genomic sequences, there is only one possible unrooted tree. Hence, there is no point in running the tree search algorithm. Second, there may be no core genes. The tree reconstruction is based on the core proteome, and if no core genes exist (genes shared by all genomes), there are no data from which to reconstruct the tree. An empty core proteome could result from extensive missing data, so that each gene is missing in at least one genome. A way around this is to change the definition of core genes in the Advanced Setting "Minimum percent of strains required to consider an orthogroup as part of the core genome" (see What are the tunable settings and how can they influence my results?). Third, even when more than three sequences are available and core genes do exist, the tree can still be missing. This can happen when some genomic sequences are identical. IQ-TREE reduces the input alignment to unique sequences, i.e., it removes duplicates. If removing duplicates leaves fewer than four sequences, a tree will not be generated.
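
The three scenarios can be summarized as a short Python sketch. It is only an illustration of the logic described above, not the checks actually performed by the pipeline or by IQ-TREE; the function names, the returned messages, and the representation of the core alignment as a mapping from genome name to its concatenated aligned sequence are hypothetical. The helper n_unrooted_topologies uses the standard count of bifurcating unrooted trees, (2n-5)!!, to show why three sequences leave nothing to search.

```python
def n_unrooted_topologies(n_taxa: int) -> int:
    """Number of bifurcating unrooted tree topologies: (2n - 5)!! for n >= 3."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


def tree_skip_reason(core_alignment, n_core_genes):
    """Return why no tree would be produced, or None if tree search can run.

    core_alignment: dict mapping genome name -> concatenated core alignment row
                    (hypothetical representation).
    n_core_genes:   number of orthogroups in the core proteome.
    """
    if len(core_alignment) < 4:
        return "fewer than four genomes"
    if n_core_genes == 0:
        return "empty core proteome"
    # Identical rows collapse to one, mirroring the removal of duplicates.
    if len(set(core_alignment.values())) < 4:
        return "fewer than four unique sequences"
    return None


print(n_unrooted_topologies(3), n_unrooted_topologies(4))  # 1 3

# Hypothetical example: five genomes, but three share an identical sequence.
example = {"g1": "MKV", "g2": "MKV", "g3": "MKV", "g4": "MRV", "g5": "MKI"}
print(tree_skip_reason(example, n_core_genes=1))
```

In the example, five genomes are supplied but only three distinct sequences remain after duplicates are collapsed, so the third scenario applies.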
