Usage
This section illustrates how to run ksrates on the use case dataset proposed in the Explained example, where the rate-adjustment is relative to the focal species oil palm (Elaeis guineensis). The use case dataset is stored in the GitHub repository under the example directory. The pipeline steps can be run either through Nextflow (recommended) or manually. In either case it is advised to use a computing cluster.
Note
WSL2 users can enter the Windows file system from the terminal through e.g. cd /mnt/c/Users/your_username.
Run example case as a Nextflow pipeline (recommended)
The ksrates pipeline can be automatically run through Nextflow with a few preparation steps.
Clone the GitHub repository to get the example dataset, access the subdirectory and unzip the sequence data files in there:
git clone https://github.com/VIB-PSB/ksrates
cd ksrates/example
gunzip sequences/*
Prepare the configuration files.
The directory already contains a pre-filled ksrates configuration file for focal species elaeis (config_files/config_elaeis.txt), a pre-filled ksrates expert configuration file (config_files/config_expert.txt) and a Nextflow configuration file template (nextflow.config) to be filled in as described in the Nextflow configuration file section. For more details, refer to the Configuration files section.
Note
To generate a new ksrates configuration file for your own analyses, launch the pipeline (step 3 below) specifying the desired, not yet existing filename after the --config option. When the file is not found, the pipeline produces a template to be filled in as described in the ksrates configuration file section. After that, repeat step 3.
Launch ksrates through the following command line:
nextflow run VIB-PSB/ksrates -profile apptainer --config config_files/config_elaeis.txt --expert config_files/config_expert.txt
Note
As of ksrates v2.0.0, the Nextflow pipeline has been ported to DSL2 syntax and requires at least Nextflow version 22.03.0-edge. Refer to the installation page to learn how to get the latest Nextflow version. You can also launch a specific (e.g. previous) Nextflow version through the NXF_VER environment variable in the command line:
NXF_VER=24.10.5 nextflow run VIB-PSB/ksrates <args>
The first time the command is executed, Nextflow downloads a local copy of the ksrates Nextflow pipeline from the VIB-PSB/ksrates GitHub repository and stores it in the $HOME/.nextflow directory. The -profile parameter specifies which container engine (either Apptainer or Docker) is used to pull the ksrates container from Docker Hub.
Note
Since the Apptainer image is by default stored in the launching folder under work/singularity, it is recommended to specify a “centralized” destination path through apptainer.cacheDir in the Nextflow configuration file.
The ksrates configuration file is specified through the --config parameter, while the ksrates expert configuration file is specified through the --expert parameter. The Nextflow configuration file is automatically detected when named with the Nextflow-reserved nextflow.config filename and when located in the launching directory; alternatively, the user can provide a custom file by specifying its name or path using the -C option (see Nextflow documentation).
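Because a non-existing --config filename makes the pipeline generate a template instead of running the analysis, a mistyped path can go unnoticed at launch. The following is a minimal preflight sketch, assuming it is run from the ksrates/example directory; check_configs is a hypothetical helper name and is not part of ksrates or Nextflow.

```shell
#!/usr/bin/env bash
# Preflight sketch: confirm that every listed configuration file exists.
# check_configs is a hypothetical helper, not a ksrates or Nextflow command.
check_configs() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "Missing configuration file: $f" >&2
            missing=1
        fi
    done
    # Return 0 only if all files were found
    return "$missing"
}

# Example usage (left commented out so that sourcing this file runs nothing):
# check_configs config_files/config_elaeis.txt config_files/config_expert.txt &&
#     nextflow run VIB-PSB/ksrates -profile apptainer \
#         --config config_files/config_elaeis.txt \
#         --expert config_files/config_expert.txt
```

When a new template is actually wanted, simply skip the check and launch the pipeline directly.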
Run example case as a manual pipeline
Alternatively, the pipeline can be run by manually launching each of the individual commands it is composed of. This also makes it possible to re-execute individual steps.
The syntax to run a command depends on how the package is installed:
Local installation:
ksrates [OPTIONS] COMMAND [ARGS]
Apptainer container:
Open an interactive container in which to launch commands with the syntax indicated for the local installation above:
apptainer shell docker://vibpsb/ksrates
Or launch a single command through the container:
apptainer exec docker://vibpsb/ksrates ksrates [OPTIONS] COMMAND [ARGS]
Note
WSL2 users need the -B option to mount the Windows file system in the container (e.g. -B /mnt/c/Users/your_username).
Apptainer downloads the container image from Docker Hub to $HOME/.apptainer/cache and from then on makes use of the local copy.
Docker container:
Open an interactive container in which to launch commands with the syntax indicated for the local installation above:
docker run -it --rm -v $PWD:/temp -w /temp vibpsb/ksrates
Or launch a single command through the container:
docker run --rm -v $PWD:/temp -w /temp vibpsb/ksrates ksrates [OPTIONS] COMMAND [ARGS]
The --rm option removes the container after the command is executed to save disk space (note that the container image will not be removed). The -v option mounts the current working directory in the container, while -w lets the command be run within this directory.
Docker pulls the container image from Docker Hub and from then on makes use of the local copy.
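The three invocation styles above differ only in their prefix, so they can be hidden behind one shell function. This is a sketch under stated assumptions: ksr, KSRATES_RUNTIME and KSRATES_DRY_RUN are illustrative names introduced here, not ksrates features; the container commands are the ones listed above.

```shell
# ksr: hypothetical wrapper that prepends the prefix for the chosen installation.
# KSRATES_RUNTIME (local, apptainer or docker) and KSRATES_DRY_RUN are
# illustrative variable names, not ksrates settings.
ksr() {
    case "${KSRATES_RUNTIME:-local}" in
        local)
            set -- ksrates "$@" ;;
        apptainer)
            set -- apptainer exec docker://vibpsb/ksrates ksrates "$@" ;;
        docker)
            set -- docker run --rm -v "$PWD:/temp" -w /temp vibpsb/ksrates ksrates "$@" ;;
        *)
            echo "Unknown KSRATES_RUNTIME: $KSRATES_RUNTIME" >&2
            return 2 ;;
    esac
    if [ -n "${KSRATES_DRY_RUN:-}" ]; then
        echo "$*"   # print the full command instead of running it
    else
        "$@"
    fi
}

# Example usage:
# KSRATES_RUNTIME=docker ksr init config_files/config_elaeis.txt --expert config_files/config_expert.txt
```

Setting KSRATES_DRY_RUN=1 prints the command that would be executed, which helps when checking the syntax before submitting a long run.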
In order to submit the command as a job on a computer cluster, wrap the command in the appropriate syntax for the cluster executor system/HPC scheduler (e.g. qsub for a Sun Grid Engine (SGE) or compatible cluster or a PBS/Torque family scheduler). It is strongly recommended to run the paralog and ortholog KS estimation steps (see commands below) on a computer cluster.
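As an illustration of such wrapping, the sketch below prints a job script for an SGE-style scheduler. It is not a prescribed setup: make_sge_job is a hypothetical helper, and the parallel environment name (smp here) and the resource requests are site-specific assumptions that must be adapted to the cluster at hand.

```shell
# make_sge_job NAME THREADS COMMAND...
# Hypothetical helper that prints an SGE job script on standard output.
# The parallel environment name "smp" is an assumption; clusters name it differently.
make_sge_job() {
    local name="$1" threads="$2"
    shift 2
    cat <<EOF
#!/bin/bash
#\$ -N $name
#\$ -cwd
#\$ -pe smp $threads
$*
EOF
}

# Example usage:
# make_sge_job paralogs_elaeis 4 ksrates paralogs-ks config_files/config_elaeis.txt \
#     --expert config_files/config_expert.txt --n-threads 4 > paralogs_elaeis.sh
# qsub paralogs_elaeis.sh
```

The number of requested slots should match the value passed to --n-threads so that the job does not use more cores than it reserved.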
An overview of the commands is available by accessing the package help menu (ksrates -h):
generate-config       Generates configuration file.
init                  Initializes rate-adjustment.
orthologs-adjustment  Performs ortholog substitution rate-adjustment.
orthologs-analysis    Computes ortholog divergence times Ks estimates.
orthologs-ks          Performs ortholog Ks estimation.
orthologs-ks-cleanup  Delete all ortholog BLAST tables.
paralogs-analyses     Detects WGD signatures in paralog Ks distribution.
paralogs-ks           Performs paralog Ks estimation.
paralogs-ks-multi     Performs paralog Ks estimation for all species.
plot-orthologs        Generates ortholog Ks distributions plot.
plot-paralogs         Generates rate-adjusted mixed Ks plot.
plot-tree             Generates phylogram with Ks-unit branch lengths.
The order of execution of the single commands to run the whole workflow is the following. We assume here a local installation without the use of a ksrates container.
Clone the GitHub repository to get the example dataset, access the subdirectory and unzip the sequence data files in there:
git clone https://github.com/VIB-PSB/ksrates
cd ksrates/example
gunzip sequences/*
The directory already contains a pre-filled configuration file for focal species elaeis (config_files/config_elaeis.txt) and a pre-filled expert configuration file (config_files/config_expert.txt).
Note
To generate a new configuration file for your own analyses, run the following command to produce a template to be filled in as described in the ksrates configuration file section:
ksrates generate-config path/to/config_filename.txt
Run the initialization script to obtain the ortholog trios for the rate-adjustment (rate_adjustment/elaeis/ortholog_trios_elaeis.tsv) and to extract the species pairs to be run through the wgd ortholog KS analysis (rate_adjustment/elaeis/ortholog_pairs_elaeis.txt):
ksrates init config_files/config_elaeis.txt --expert config_files/config_expert.txt
This step also generates wgd_runs_elaeis.txt in the launching directory, which drafts all the commands to be run in steps 4 and 5.
Launch the wgd paralog KS analysis to estimate the paralog KS values for the focal species:
ksrates paralogs-ks config_files/config_elaeis.txt --expert config_files/config_expert.txt --n-threads 4
The output files are generated in the paralog_distributions/wgd_elaeis directory, i.e. elaeis.ks.tsv for whole-paranome, elaeis.ks_anchors.tsv for anchor pairs and elaeis.ks_recret_top2000.tsv for reciprocally retained gene families.
Using multiple threads to parallelize the analysis will reduce the compute time. The --n-threads option configures the number of threads to use (set this according to your available resources, i.e. CPUs/cores; e.g. 10 or more cores running on a computer cluster).
Launch the wgd ortholog KS analysis to estimate the ortholog KS values for each required species pair. These are listed in rate_adjustment/elaeis/ortholog_pairs_elaeis.txt:
ksrates orthologs-ks config_files/config_elaeis.txt --expert config_files/config_expert.txt elaeis asparagus --n-threads 4
ksrates orthologs-ks config_files/config_elaeis.txt --expert config_files/config_expert.txt elaeis oryza --n-threads 4
ksrates orthologs-ks config_files/config_elaeis.txt --expert config_files/config_expert.txt oryza asparagus --n-threads 4
The output files are generated in the ortholog_distributions directory, e.g. the first command generates file wgd_asparagus_elaeis/asparagus_elaeis.ks.tsv. The two species names are in case-insensitive alphabetical order.
Using multiple threads to parallelize the analysis will reduce the compute time. The --n-threads option configures the number of threads to use (set this according to your available resources, i.e. CPUs/cores; e.g. 10 or more cores running on a computer cluster).
Estimate the mode and associated standard deviation for each ortholog KS distribution:
ksrates orthologs-analysis config_files/config_elaeis.txt --expert config_files/config_expert.txt
The results are stored in a local database, namely a TSV file called by default ortholog_peak_db.tsv and generated by default in the launching directory (see ksrates configuration file).
Plot the ortholog KS distributions for each focal species–other species pair (and each of their trios):
ksrates plot-orthologs config_files/config_elaeis.txt --expert config_files/config_expert.txt
The command generates a PDF file for each species pair with the three ortholog KS distributions obtained from each of the species trios the species pair is involved in. Note that if multiple trios/outgroups exist, the file is a multi-page PDF showing one trio per page. The two species names are in case-insensitive alphabetical order. In this example case there is only the E. guineensis–O. sativa species pair, thus the corresponding PDF file generated is rate_adjustment/elaeis/orthologs_elaeis_oryza.pdf.
Perform the rate-adjustment. Prerequisite: all wgd paralog and ortholog KS analyses (steps 4 and 5) and ortholog KS distribution mode estimates (step 6) must be completed.
ksrates orthologs-adjustment config_files/config_elaeis.txt --expert config_files/config_expert.txt
The branch-specific KS contributions and the rate-adjusted ortholog KS mode estimates are collected in rate_adjustment/elaeis/adjustment_table_elaeis.tsv.
Plot the adjusted mixed paralog–ortholog KS distribution plot (rate_adjustment/elaeis/mixed_elaeis_adjusted.pdf):
ksrates plot-paralogs config_files/config_elaeis.txt --expert config_files/config_expert.txt
Plot the phylogram based on the input phylogenetic tree with branch lengths equal to the KS distances estimated from the ortholog KS distributions (rate_adjustment/elaeis/tree_elaeis_distances.pdf):
ksrates plot-tree config_files/config_elaeis.txt --expert config_files/config_expert.txt
Plot the adjusted mixed paralog–ortholog KS distribution with inferred WGD components:
ksrates paralogs-analyses config_files/config_elaeis.txt --expert config_files/config_expert.txt
The methods used for detecting WGD signatures depend on the paralog analysis settings in the ksrates configuration files. For more details please refer to section Mixture modeling of paralog KS distributions.
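The order of execution described above can be collected in a single script. This is a sketch rather than an official driver: run, run_manual_workflow, DRY_RUN and THREADS are illustrative names, the species pairs are those of the example dataset, and a local installation launched from ksrates/example is assumed.

```shell
#!/usr/bin/env bash
# Sketch of the manual workflow as one function. DRY_RUN, THREADS, run and
# run_manual_workflow are illustrative names, not ksrates options.
CONFIG=config_files/config_elaeis.txt
EXPERT=config_files/config_expert.txt
THREADS="${THREADS:-4}"

# Print the command when DRY_RUN is set, otherwise execute it
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

run_manual_workflow() {
    local pair step
    run ksrates init "$CONFIG" --expert "$EXPERT" || return
    run ksrates paralogs-ks "$CONFIG" --expert "$EXPERT" --n-threads "$THREADS" || return
    # Species pairs of the example dataset; for other datasets take them from
    # the ortholog pairs file written by the init step
    for pair in "elaeis asparagus" "elaeis oryza" "oryza asparagus"; do
        # $pair is intentionally unquoted so that it splits into two species names
        run ksrates orthologs-ks "$CONFIG" --expert "$EXPERT" $pair --n-threads "$THREADS" || return
    done
    # Remaining steps, in the order given above; stop at the first failure
    for step in orthologs-analysis plot-orthologs orthologs-adjustment \
                plot-paralogs plot-tree paralogs-analyses; do
        run ksrates "$step" "$CONFIG" --expert "$EXPERT" || return
    done
}

# Example usage:
# DRY_RUN=1 run_manual_workflow    # list the commands without running them
# run_manual_workflow              # execute the whole workflow
```

Running everything sequentially in one process is convenient for small tests; on a cluster the KS estimation steps are better submitted as separate jobs.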
Finally, the two following commands are not strictly part of the workflow:
Remove all BLAST .tsv files generated by orthologs-ks in order to free disk space:
ksrates orthologs-ks-cleanup path/to/ortholog_distributions --expert config_files/config_expert.txt
The command only acts within the provided path to the ortholog_distributions directory, for example removing wgd_asparagus_elaeis/asparagus_elaeis.blast.tsv and all the other analogous files.
Run the paralog KS analysis for all species provided in the Newick tree, and not only for the focal species:
ksrates paralogs-ks-multi config_files/config_elaeis.txt --expert config_files/config_expert.txt --n-threads 4
For example, it will generate paralog_distributions/wgd_asparagus and paralog_distributions/wgd_oryza with all related paralog output files.
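The naming rule for ortholog output files mentioned above (the two species names joined in case-insensitive alphabetical order) can be sketched as a small helper, for instance to check which KS or BLAST tables exist before or after a cleanup. pair_name and ortholog_file are hypothetical function names; the path pattern is inferred from the two example paths given in this section.

```shell
# pair_name SPECIES1 SPECIES2
# Joins the two names with "_" in case-insensitive alphabetical order.
# Hypothetical helper, not part of ksrates.
pair_name() {
    printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort -f | paste -s -d '_' -
}

# ortholog_file SPECIES1 SPECIES2 SUFFIX
# Builds the expected path of an ortholog output file, following the pattern of
# the examples in the text (e.g. wgd_asparagus_elaeis/asparagus_elaeis.ks.tsv).
ortholog_file() {
    local p
    p=$(pair_name "$1" "$2")
    echo "ortholog_distributions/wgd_$p/$p.$3"
}

# Example usage:
# ls -lh "$(ortholog_file elaeis asparagus blast.tsv)"
```

The helper only builds the path; it does not check that the file exists or that ksrates produced it.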
Practical considerations
When dealing with large input phylogenies it is useful to know that ksrates can be used iteratively, by starting with a small dataset and subsequently adding species to fine-tune the phylogenetic positioning of any hypothesized WGDs. For such iterative analyses the pipeline can reuse data from previous runs, and will only perform additional calculations on the extended dataset where needed.
When ksrates is run, the ortholog KS values for each species pair in the input phylogenetic tree and the associated ortholog KS modes are stored in a local database.
When the ksrates pipeline is subsequently rerun with additional species included in the input phylogeny, ksrates will skip the ortholog KS calculations for any species pair for which an ortholog KS mode has already been stored. The database consists of two tabular files (ortholog_peak_db.tsv and ortholog_ks_list_db.tsv, see Other output for more details) generated/accessed by default in the working directory. Alternatively, a custom path can be specified in the ksrates configuration file.
If a user does not want to reuse an existing ortholog KS mode of a particular species pair and wants instead to re-estimate it from the same input data but using e.g. a different number of bootstrap iterations or KDE bandwidth, the line concerning the mode has to be manually deleted from the ortholog_peak_db.tsv database file. The next ksrates run will re-estimate the mode according to the new parameters by starting from the previously computed ortholog KS estimates for the species pair concerned, thereby skipping the onerous ortholog KS estimation step.
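The manual deletion described above can also be done from the command line. The sketch below is only an illustration: drop_mode_line is a hypothetical helper, and since the way a species pair is written in ortholog_peak_db.tsv is not specified here, the identifying string must be chosen by the user after inspecting the file.

```shell
# drop_mode_line DB_FILE KEY
# Hypothetical helper: removes every line of DB_FILE that contains the literal
# string KEY, after saving a backup as DB_FILE.bak.
# KEY is an assumption left to the user: inspect the file first and choose a
# string that matches only the line of the species pair to re-estimate.
drop_mode_line() {
    local db="$1" key="$2"
    cp "$db" "$db.bak" || return
    # grep exits with status 1 when no line is left to print; that is not an error here
    grep -vF -- "$key" "$db.bak" > "$db" || [ $? -eq 1 ]
}

# Example usage (the key shown is a placeholder):
# drop_mode_line ortholog_peak_db.tsv "<string identifying the species pair>"
# diff ortholog_peak_db.tsv.bak ortholog_peak_db.tsv   # review what was removed
```

Because the match is a plain substring search, reviewing the difference against the backup before rerunning ksrates is a sensible precaution.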