Reciprocally retained gene families for plant WGMs

Concept

As from ksrates v2.0.0, an additional K_S analysis type called reciprocal_retention and specifically targeting plants has been introduced alongside the existing paranome and colinearity.

This pipeline implements the construction of paralog K_S distributions on the basis of plant gene families known to undergo preferential expansion via whole-genome duplication (WGD), rather than small-scale duplication (SSD), followed by retention in duplicate. This “reciprocal retention” pattern is hypothesized to stem from dosage balance constraints. By targeting gene families known to expand in this manner, the analysis aims to mitigate the noise from SSDs and accentuate the signal of WGD peaks, analogously to what achieved with the construction of anchor pair K_S distributions.

A detailed explanation about the reciprocal retention pattern and the implementation of this method in ksrates is available in our preprint on bioRxiv.

_images/mixed_palm_anchors_recret.svg — Figure 1: Rate-adjusted mixed paralog–ortholog K_S plot for oil palm, showing anchor pair K_S distribution (light gray) and reciprocally retained K_S distribution (dark gray). Both distribution types reduces the SSD background and define two WGM peaks.

This method is able to provide precise and narrow peak signals for recent WGMs, and distinguishable peak signals for old WGMs (e.g. Eudicot gamma triplication), offering an additional resource in the toolbox of WGM inference.

Compared to the anchor pairs, leveraging the reciprocal retention pattern doesn’t require any genome structural annotation (GFF file) and makes the search for K_S values less susceptible to genome fragmentation and rearrangements. Nevertheless, since this particular expansion pattern looses its strength over long evolutionary time due to dosage balance compensation mechanisms, the inference of ancient WGMs remains challenging. Moreover, the limited pool of available reciprocally retained gene families (hundreds, up to few thousands) can produce less populated K_S distributions, depending on the species.

Reciprocal retention pipeline

In Li et al. (2016), the analysis of a dataset of 37 angiosperm species provided 9178 core-angiosperms gene families (orthogroups); later, in Tasdighian et al. (2017), these gene families were ranked based on the strength of their reciprocal retention pattern, employing birth-death stochastic modeling on their gene count data. In ksrates, we utilize the set of gene families spanning up to the top 2000th position in this ranking in order to collect a sufficient amount of K_S values without reintroducing too much of the SSD background.

Warning

It follows that this method is not applicable to taxonomic groups other than plants. Furthermore, as the set of reciprocally retained gene families has been derived solely from angiosperms, its suitability for non-angiosperm plants might also be debated.

The process of obtaining reciprocally retained gene families for the focal species involves reconstructing them from scratch using a clustering algorithm, encompassing the following steps (for comprehensive details, please consult our preprint’s Methods section):

Note

If running the pipeline outside the container, make sure your ksrates manual installation included the step downloading file original_angiosperm_sequences.tar.gz.

Homology search: execute diamond on the merged sequences from the focal species’ FASTA file and the FASTA files originally used for the 37 angiosperms in Li et al. (2016)
Gene family clustering: execute OrthoMCL on the generated diamond table
Matching gene families: match the original top 2000 gene families in the ranking to the newly generated OrthoMCL families, based on shared gene IDs from the 37 angiosperms. Note that an original gene family may match multiple new gene families.
Retaining good matches: keep only the well-matched gene families, i.e. those introducing zero or few extra genes belonging to the 37 angiosperms;
Reconstructing gene families: in case of multiple well-matched gene families, merge them to reconstruct the top gene family
Extracting focal species’ genes: discard all but the focal species’ genes and retain only gene families including at least two focal genes, which is the minimum required for K_S estimate

Runtime, memory and space usage

Compared to the paranome or collinearity analyses, the reciprocal_retention pipeline requires longer runtime (about a day), larger memory usage during execution (about 40 GB) and larger writing space for temporary output files (at least 50 GB). It is therefore recommended to check the space availability on disk in advance, execute the pipeline on a computer cluster with parallelization and remove heavy temporary files after completion.

Note

Users may want to initially run only the whole-paranome and anchor pairs analyses in order to obtain prompter results; the reciprocal retention analysis can be activated in a subsequent run by setting reciprocal_retention = yes in the ksrates configuration file.

The diamond step, taking as input tens of angiosperm FASTA files (focal species plus 37 angosperms), generates an output file of 21 GB. This file can be deleted once the reciprocal retention pipeline is completed. The file location for the example dataset is paralog_distributions/wgd_elaeis/reciprocal_retention/orthomcl/elaeis_original_37.dmd.tsv.

The OrthoMCL step, taking as input the diamond table, generates within the tmp subdirectory the “blast parse output” (BPO) output file, which occupies the same space as the diamond table (i.e. 21 GB). This file can be deleted once the reciprocal retention pipeline is completed. The file location for the example dataset is paralog_distributions/wgd_elaeis/reciprocal_retention/orthomcl/May_16/tmp/all.bpo.

Note on OrthoMCL implementation

OrthoMCL v1.4 was chosen over other tools (e.g. OrthoFinder) to be consistent with the software choice made in the original publication that generated the core-angiosperm gene families (Li et al., 2016). We however make use of an edited version, called OrthoMCLight, where a few optimizations have reduced memory requirement from 100 GB to 40 GB and the runtime from several days to about a day. This was achieved by changing the following:

The blast_parse subroutine avoids loading the entire diamond table into memory and instead processes it line by line;
Multi-threading is supported in the MCL step

Both OrthoMCL mode 3 and 4 are available. Mode 3 is always performed at first execution of the reciprocal retention analysis and by default also from the second execution on: it takes as input the diamond homology table, it parses it into a “blast parse output” (BPO) format typical of OrthoMCL v1.4 and finally runs MCL. Mode 4 instead accepts directly the parsed table file and then runs MCL; if the parsed table has already been generated in a previous ksrates run, the user can provide its filepath through option --parsed-homology-table when running command paralogs-ks.

Compatibility with the original OrthoMCL v1.4 software is supported through the expert configuration file (use_original_orthomcl_version).

Feedback

For any question or feedback concerning output, user experience and implementation of the reciprocal_retention pipeline, feel free to reach out on our GitHub VIB-PSB/ksrates repository.