Overview

Introduction

Significant technological advances in the last decade have enabled the use of large-scale sequencing, mining, and analysis of bacterial and archaeal genomic data in an unprecedented resolution. Such large-scale analyses have contributed to our understanding of microbial diversity and evolution, both within populations and among different phylogenetic groups. Following sequencing and genome assembly, analyzing microbial genomics data involves various computational steps such as the detecting open reading frames (ORFs), finding orthologous groups, aligning orthologous sequences, and reconstructing phylogenetic trees. These steps as well as additional analysis-specific computations require the use of multiple bioinformatics tools and various ad-hoc programming scripts, making the entire process cumbersome, tedious, and prone to errors due to manual handling. This has prompted laboratories to implement their own in-house analysis pipelines, and thus different analysis applications have begun to emerge. These applications require specific working environments, i.e., operating system, multi-cores machines, and more than basic technological skills, such as installations and running.

The motivation for developing M1CR0B1AL1Z3R

M1CR0B1AL1Z3R (pronounced: microbializer) was developed in order to facilitate large scale microbial genomics analyses. Our goal was to make M1CR0B1AL1Z3R easily accessible to the scientific community and thus to allow handling the abovementioned computational challenges of analyzing dozens of bacterial genomes simultaneously. Such computations are highly demanding, and therefore, M1CR0B1AL1Z3R runs background computational processes in parallel on a high-performance computer cluster. Notably, due to the high computational demand of these analyses, they cannot run on a simple desktop computer. To this end, M1CR0B1AL1Z3R enables the research community to analyze massive genomic data that previously necessitated expensive computational infrastructure and expert knowledge. Visual and textual results, ready for publication or further analysis, are given as an output.

The type of data analyzed by M1CR0B1AL1Z3R

The input to the M1CR0B1AL1Z3R web server is a collection of genomic sequences, one file for each genome. The input data should be binned (in the case of metagenome sequencing, binning is the classification of reads and contigs to different taxonomic units) and assembled (the generation of contigs from sequence reads) beforehand. Of note, we do not assume that the reads were assembled to a single contig. The input to our web server can thus be microbial genomes, where each genome is composed of one or multiple contigs. In the case the user already has the open reading frames (ORFs) of each genome and wants to start the analysis from those, we also accept ORFs as valid inputs.
Currently, the web server is designed for bacterial and archaeal genomes (eukaryotic genomes and transcriptomes are not handled). M1CR0B1AL1Z3R is useful for analyzing genomes from a range of phylogenetic relatedness: from genomic sequences of bacterial isolates of a specific species, to genomes of different species that belong to a diverse phylogenetic group such as the Gammaproteobacteria.

Input specification

The input for the web server is a zipped folder (either .zip or .tar.gz) in which each file is in a FASTA format containing genomic sequence of a different species/strain/isolate. Each file can contain either a fully assembled genome, a collection of contigs (originating from the same genome) or the open reading frames (ORFs) of that genome. We support up to 2120 input genomes, hence the maximum number of FASTA files in the zipped folder is 2120.

Basic terms used in M1CR0B1AL1Z3R

● An open reading frame (ORF) is a part of the gneome that encodes for a protein.

● An Orthogroup or orthologous group is a set of genes that diverged from a single gene in the LCA (last common ancestor) of the analyzed genomes. The genes in an orthogroup have high sequence similarity that can be detected using remote homology search algorithms such as BLAST. Of note, two genes from different genomes in an orthogroup are considered orthologs (=genes that diverged from a specification event) and two genes from the same genome in an orthogroup are considered paralogs (=genes that diverged from a duplication event).

● Species tree reflect the evolution of orthologous sequences, i.e., the vertical phylogenetic relationships among the analyzed genome. In Bacteria and Archaea, many genes do not follow the species tree due to horizontal gene transfer events. Gene trees (are being reconstructed based on a MSA of a single-gene orthologous group and thus) represent the evolutionary relationships among the analyzed genes. Gene trees can differ from the species tree due to horizontal gene transfer and gene duplications events. Trees in general are inferred and not observed and as such they are subjected to biases and errors (see Can I trust the obtained core gene phylogenetic tree?).

● A phyletic pattern matrix is a table in which row i is a genome, column j is an orthogroup and the i,j entry contains 1 if the orthogroup j contains a member of genome i, o.w., 0. Phyletic patterns data are used to infer events of gene loss and gene gain (acquisition), taking into account the phylogenetic relationships among the analyzed genomes.

● Core genes are genes that are shared by all analyzed genomes. The collection of protein sequences that are encoded by these genes is called the core proteome. Core genes are often essential to the organism and hence are never lost. As these genes are found in all analyzed genomes, they are often used to reconstruct the species tree. In contrast, the pan genome is the entire set of genes, i.e., a gene belongs to the pan genome if it is present in at least one of the analyzed genomes.

● GC content: the percentage of the nucleotides G and C out of the total number of nucleotides. GC content varies among genomes.

Output and methodology


The pipeline of M1CR0B1AL1Z3R is composed of the following algorithmic steps:


● M1CR0B1AL1Z3R calculates ANI (average nucleotide identity) values between all pairs of genomes using FastANI (Jain C., et al., Nature communications, 2018).

● M1CR0B1AL1Z3R extract all ORFs from all genomes using Prodigal (Hyatt D., et al., BMC Bioinformatics, 2010). If the inputs are already ORFs, this step is skipped. The ORFs are then translated to amino acids sequences.

● M1CR0B1AL1Z3R uses the protein sequences of each genome to calculate its BUSCO score (a measure of the completeness of the genome assembly, Simão, Felipe A., et al, Bioinformatics, 2015).

● M1CR0B1AL1Z3R infers orthogroups among the protein sequences of all genmoes using a self-implemented version of the OrthoMCL algorithm (Li, Li, Christian J. Stoeckert, and David S. Roos, Genome research, 2003). The algorithm uses MMSEQS2 (Steinegger M. & Soding J., Nat Biotechnol, 2017) to infer homologous sequences and then clusters them using MCL (Van Dongen SM., Thesis, 2000).

● M1CR0B1AL1Z3R reconstructs an amino acid multiple sequence alignment for each orthogroup using MAFFT (Katoh K. & Standley DM., Mol Biol Evol, 2013).

● M1CR0B1AL1Z3R computes the core-proteome (proteins that are shared across all bacterial genomes) and provides the core-proteome alignment, based on which, the species tree is inferred. In cases where there are duplications (i.e., two genomes for which the sequences of the core genes are identical), a reduced core-proteome alignment (containing only unique sequences) is provided as well.

● M1CR0B1AL1Z3R reconstructs a maximum-likelihood-based phylogenetic tree using IQ-TREE (Minh, Bui Quang, et al, Molecular biology and evolution, 2020) based on the inferred core-proteome alignment and visualizes it interactively.

● M1CR0B1AL1Z3R detects orphan genes (genes that are not part of any orthogroup).

● M1CR0B1AL1Z3R adds functional annotation to each orthogroup using the KOfam database (Aramaki, Takuya, et al, Bioinformatics, 2020) which is based on the KEGG Orthology database.

● M1CR0B1AL1Z3R analyzes the codon bias among the different genomes and the different orthogroups (Sharp, Paul M., and Wen-Hsiung Li, Nucleic acids research, 1987)


M1CR0B1AL1Z3R provides the following results in the "All outputs" zip available in the download page (see example here):


● Folder 01_ani: a table and heatmap of the ANI (average nucleotide identity) values between all pairs of genomes.

● Folder 02a_orfs: a file for each genome containing the ORFs of the genome.

● Folder 02b_orfs_plots: ORFs statistics (ORFs count and GC content) per genome.

● Folder 02c_translated_orfs: a file for each genome containing the translated ORFs (amino acid sequences) of the genome.

● Folder 03_genome_completeness: The genome completeness BUSCO score of each genome.

● Folder 04_orphan_genes: A file for each genome containing a list of the orphan genes of that genome (genes that don't have any orthologs in other genomes). Orphan genes may be single-copy, meaning that they have only one copy in the gneome, or multi-copy, meaning that they have multiple copies in the genome (and in that case they are part of an "orphan orthogroup" - orthgroup that contains genes from a single genome).

● Folder 05a_orthogroups: The orthogroups: (1) in a table (csv) format, where each row is an orthogroup and each column is a genome. (2) an annotated table similar to (1) with the addition of KEGG Orthology annotations for each orthogroup as well as the average Codon Adaptation Index (CAI) of all the genes in the orthogroup. (3) A phyletic pattern matrix (a table where row i is a genome, column j is an orthogroup and the i,j entry contains 1 if the orthogroup j contains a member of genome i, o.w., 0).

● Folder 05b_orthogroups_sizes: A bar plot of the orthogroups sizes frequency.

● Folder 06a_orthogroups_dna: A fasta file for each orthogroup containing the UNALIGNED DNA sequences of the members.

● Folder 06b_orthogroups_aa: A fasta file for each orthogroup containing the UNALIGNED AA sequences of the members.

● Folder 06c_orthogroups_aa_msa: A Fasta file for each orthogroup containing the ALIGNED AA sequences of the members.

● Folder 06d_orthogroups_induced_dna_msa_by_aa_msa: A fasta file for each orthogroup containing the ALIGNED DNA sequences (codon alignment) of the members.

● Folder 07a_aligned_core_proteome: A fasta file containing the aligned core-proteome of the species in the dataset and different statistics of it.

● Folder 07b_aligned_core_genome: A fasta file containing the aligned core-genome of the species in the dataset and different statistics of it.

● Folder 08_genome_numeric_representation: A genome numeric representation of the core genome/proteome. This is a fasta file, where the first genome is used as reference and all core genes are numbered according to their order in it. For each other genome, the core genes ids (defined by the reference genome) are written in the order they appear in the genome. This representation can be used to infer translocation events within genomes and inversions of segments of multiple genes.

● Folder 09_species_phylogeny: A reliable maximum likelihood species phylogeny tree.

● Folder 10_codon_bias: (1) W_vectors.csv - A table of the relative adaptiveness (w vector) of each codon in each genome, as defined by Sharp and Li, 1987. (2) Relative_Adaptiveness_scatter_plot.png - A clustering of the genomes based on their W vectors. (3) CAI_table.csv - A table of the Codon Adaptation Index (CAI) average of each orthogroup. (4) CAI_histogram.png - A histogram of the CAI values of the orthogroups.

In addition to the "All outputs" zip, the user can download several outputs separately from the download page (see example here).

back to home page