Configuration files

ksrates requires up to three configuration files:

(always required): a ksrates configuration file with all necessary settings for a given analysis.
(required when using the ksrates Nextflow pipeline): a Nextflow configuration file to configure the executor resources (e.g. computer cluster, local computer) and container settings.
(optional) an additional ksrates configuration file with expert and debug settings.

ksrates configuration file

Description

The analysis configuration file is composed of a first section defining the species used, their phylogenetic relationships and their sequence data files, a second section concerning general analysis settings and a third section concerning mainly figure parameters. Below is an example for an analysis with oil palm (Elaeis guineensis) as the focal species.

[SPECIES]
focal_species = elaeis
newick_tree = ((elaeis, oryza), asparagus);

latin_names =       elaeis    : Elaeis guineensis,
                    oryza     : Oryza sativa,
                    asparagus : Asparagus officinalis

fasta_filenames =   elaeis    : sequences/elaeis.fasta,
                    oryza     : sequences/oryza.fasta,
                    asparagus : sequences/asparagus.fasta

gff_filename = sequences/elaeis.gff3

peak_database_path = ortholog_peak_db.tsv
ks_list_database_path = ortholog_ks_list_db.tsv


[ANALYSIS SETTING]
paranome = yes
collinearity = yes
reciprocal_retention = no

gff_feature = mrna
gff_attribute = id

max_number_outgroups = 4
consensus_mode_for_multiple_outgroups = mean among outgroups


[PARAMETERS]
x_axis_max_limit_paralogs_plot = 5
bin_width_paralogs = 0.1
y_axis_max_limit_paralogs_plot = None

num_bootstrap_iterations = 200
divergence_colors =  Red, MediumBlue, Goldenrod, Crimson, ForestGreen, Gray, SaddleBrown, Black

x_axis_max_limit_orthologs_plots = 5
bin_width_orthologs = 0.1

max_ks_paralogs = 5
max_ks_orthologs = 10

The [SPECIES] section includes:

focal_species: name of the focal species. A K_S paralog distribution is generated for this species and its K_S-scale is used as the rate-adjustment reference. It is advised to use a short name (for example, the genus or family name) or common abbreviation here; spaces and underscores are not tolerated. For example, “elaeis” or “eguineensis” instead of “Elaeis guineensis”.
newick_tree: phylogenetic relationships among the involved species as Newick format using leaf nodes only (more info e.g. on this website). It has to contain the focal species as named in parameter focal_species. It is advised to use short names or abbreviations for the other species as well; spaces and underscores are not tolerated.
latin_names: list of associations between each of the species names in parameter newick_tree and their scientific names, which will be used in legends and plot titles. The association is made with a colon (‘:’).
fasta_filenames: list of associations between each of the species names in parameter newick_tree and the corresponding paths to their FASTA files. The association is made with a colon (‘:’).
gff_filename: association between the focal species as named in parameter focal_species and the path to the GFF3 file for the focal species (only required for collinearity analysis). The association is made with a colon (‘:’).
peak_database_path: path to the database of ortholog K_S distribution peaks. If the file is not present yet, it will be automatically generated.
ks_list_database_path: path to the database of ortholog K_S lists. If the file is not present, it will be automatically generated.

The [ANALYSIS SETTING] section includes:

paranome: whether to build/plot the whole-paranome K_S distribution of the focal species (options: “yes” and “no”). [Default: “yes”]
collinearity: whether to build/plot the anchor pair K_S distribution of the focal species (options: “yes” and “no”). A GFF file for the focal species is required, see parameter gff_filename above. [Default: “no”]
reciprocal_retention: whether to build/plot the reciprocally retained K_S distribution of the focal species (options: “yes” and “no”). [Default: “no”]
gff_feature: parsing keyword from the third column of the GFF file (e.g. gene, mrna…). Case insensitive.
gff_attribute: parsing keyword from the ninth column of the GFF file (e.g. id, name…). Case insensitive.
max_number_outgroups: maximum number of trios/outspecies allowed to adjust a divergent pair; if None, all possible outspecies obtained from the phylogenetic tree will be used to form trios and adjust the pair. For more details see below. [Default: 4]
consensus_mode_for_multiple_outgroups: when a divergent pair is adjusted by two or more outgroups, it is possible to get a consensus value considering either the mean among all the rate-adjustments for that pair (“mean among outgroups”) or only one rate-adjustment, mostly likely coming from a close and slowly evolving species (“best outgroup”). [Default: “mean among outgroups”]

The [PARAMETERS] section includes:

For the mixed K_S distributions plot
- x_axis_max_limit_paralogs_plot: highest value of the x axis in the mixed distribution plot. [Default: 5]
- bin_width_paralogs: bin width in paralog K_S distribution histogram. By default there are ten bins per unit. [Default: 0.1]
- y_axis_max_limit_paralogs_plot: customized highest value of the y axis in the mixed plot. [Default: None]
For ortholog divergence K_S
- num_bootstrap_iterations: number of bootstrap iterations for mode estimation. [Default: 200]
- divergence_colors: list of colors assigned to the divergence nodes: all divergence lines coming from the same divergence node share the same color. [Default: 8 colors]
For the ortholog K_S distribution plots
- x_axis_max_limit_orthologs_plots: highest value of the x axis in the ortholog distribution plots. [Default: 5]
- bin_width_orthologs: bin width in ortholog K_S distribution histogram. By default there are ten bins per unit. [Default: 0.1]
K_S value thresholds
- max_ks_paralogs: maximum value accepted for paralog K_S from data table. [Default: 5]
- max_ks_orthologs: maximum value accepted for ortholog K_S from data table. [Default: 10]

Guidelines to set the maximum number of outgroups per rate-adjustment

Parameter max_number_outgroups limits the number of outgroup species used to adjust a species pair; without that, all possible outgroups would be taken. Having multiple rate-adjustments on the same divergence can provide stronger support for the rate-adjusted plot and is therefore advised to adjust with at least 3 or 4 outgroups to have more reliable results.

However, the more the outgroups, the more the number of ortholog distributions that will have to be computed by the wgd ortholog pipeline, which is a quite computationally demanding step. Setting a maximum amount of outgroups lowers the number of rate-adjustments and can therefore save time and resources. It is a good option in case the tree has a complex structure that would collect an unnecessary large number of outgroups or in case the user wants to have a quicker, although somewhat less reliable, result. Note that another option to lower the number of ortholog distributions is to start with a simpler tree structure.

In case mean among outgroup is set for the consensus rate-adjustment value, it is advised to use at least 3 or better 4 outgroups to adjust a species pair in order to buffer the weight of misleading outliers when computing the mean.

Guidelines to set the consensus method for multiple rate-adjustments

A consensus value for the rate-adjustment is needed when multiple rate-adjustments are performed for a species pair. The pipeline computes two consensus strategies, but then generates the divergence lines in the mixed plot according to the method specified in the configuration file under consensus_mode_for_multiple_outgroups.

mean among outgroups: with this option, the final rate-adjustment of a species pair is the mean of the rate-adjustments obtained from all the used outgroups. It is the default method because it avoids to rely on a single voice that could be biased (e.g. bad quality data).
best outgroup: with this option, only the rate-adjustment obtained from the best outgroup is considered for the final rate-adjustment of a species pair. The best outgroup is the one with the smallest OC segment, which is also computed through K_S value decomposition as during relative rate testing. The OC segment is a combined measure of how close is the outgroup and how low is its rate; the smaller the OC segment, the better can the outgroup detect the branch-specific K_S contributions of the two ingroups. The OC is stored in adjustment_table_species.tsv. If one outgroup shows a remarkably smaller OC than the others, then it can be worth it to re-run the pipeline (or just the plotting of the mixed distribution) by setting in the configuration file the best outgroup method. However, it’s first better to check the quality of the rate-adjustment result coming from it, especially if the outgroup species has transcriptome data: its ortholog distributions in orthologs_species1_species2.pdf should have clear peaks in order to give a reliable rate-adjustment.

Nextflow configuration file

The Nextflow configuration file is used to configure various settings for the ksrates Nextflow pipeline, such as the executor (e.g. computing cluster, local computer) and its resources (e.g. number of CPUs/cores and memory to use, cluster queues, walltimes etc.) and use of the ksrates Apptainer (formerly Singularity) or Docker container. We provide a template Nextflow configuration file for the ksrates Nextflow pipeline in the docs subdirectory of the GitHub repository, which can be copied and adapted to the user’s specific resources and requirements. Below, we briefly explain some of the basic key settings. For a more complete description please refer to Nextflow documentation.

profiles {
    docker {
        docker.enabled = true
        docker.runOptions = "-v $PWD:$PWD"
    }
    apptainer {
        apptainer.enabled = true
        apptainer.cacheDir = ''
        apptainer.autoMounts = true
    }
}

executor {
    name = ''
    queueSize =
    cpus =
}

process {
    container = ''

    withName: 'processName' {
        cpus =
        penv = ''
        memory = ''
        time = ''
        clusterOptions = ''
        beforeScript = ''
    }
}

env {
    SOME_ENV_VARIABLE = ''
}

The profiles scope configures the apptainer and docker container settings:
- enable enables or disables the use of the respective container.
- runOptions (only for Docker) mounts the host’s current directory into the container at the same path.
- cacheDir (only for Apptainer) the directory where remote the Apptainer image from Docker Hub is stored. When using a computing cluster it must be a shared folder accessible to all computing nodes.
- autoMounts (only for Apptainer) automatically mounts host paths in the executed container and allows the user to run the pipeline from any directory in a cluster [Default: true]. It requires the user bind control feature in Apptainer installation, which is active by default.
The executor scope configures the underlying system where processes are executed and its overall resources to use:
- name specifies the system type or HPC scheduler to be used (e.g. sge, slurm, pbs, local; for more details see the Nextflow documentation).
- queueSize sets the maximum number of tasks/Nextflow processes handled in parallel by the executor, i.e. for example the number jobs submitted simultaneously on a computer cluster) [Default: 100]. Useful in case of CPU/core/slot usage restriction policies. Set to a value of 1 to configure a fully sequential workflow where no processes are run in parallel.
- cpus sets the maximum number of CPUs/cores made available by the underlying system to the Nextflow pipeline when using a local executor (and only available for the local executor setting). Useful to limit CPU/core usage since by default all available CPUs/cores will be used by the local executor, i.e. when running the whole pipeline on the computer where Nextflow is launched.
The process scope defines the configuration for the processes of the ksrates pipeline:
- container defines the Apptainer or Docker ksrates container image to be used, vibpsb/ksrates:latest. A local copy is pulled from Docker Hub and stored for successive usage.
- withName defines settings for individual processes in the ksrates Nextflow pipeline.
  
  There are 11 processes in the pipeline, 6 of which (checkConfig, setupAdjustment, setParalogAnalysis, setOrthologAnalysis, doRateAdjustment and drawTree) are by default run locally because they execute minimal calculations. The remaining 5 processes (estimatePeaks, plotOrthologDistrib, paralogsAnalyses, wgdParalogs and wgdOrthologs) are instead run by default on a cluster, if available, and can be configured under this section of the Nextflow configuration file. wgdParalogs and wgdOrthologs are the most computationally demanding and it is advised to assign them a higher computational power than the other processes. If available, we suggest to configure about 10 CPUs/cores/slots/threads and about 20GB memory (or, on average, about 2GB per configured CPU) for each of these two processes.
  
  Settings can be tailored to your configured executor (see above) through the use of Nextflow process directives (for a complete list and detailed descriptions see the Nextflow documentation), such as:
  cpus sets the number of CPUs/cores/slots/threads, e.g. 8. It is recommended to set multiple cores for wgdParalogs and wgdOrthologs processes [Default if not set: 1].
  
  penv when using an SGE executor defines the parallel environment to be used when submitting a parallel task.
  
  memory sets how much memory the process is allowed to use, e.g. 16GB.
  
  time defines how long a process is allowed to run.
  
  clusterOptions any native configuration option accepted by your cluster submit command, such as options specific to your cluster and not supported out of the box by Nextflow (e.g. if your cluster doesn’t accept the memory directive because it expects defining the amount of memory per CPU).
  
  beforeScript allows you to execute a custom (Bash) snippet before the main process script is run. This may be useful to initialize the underlying computer cluster environment or for other custom initialisation, for example it can be used to load required dependencies if one of the ksrates containers is not used, provided that the cluster has those dependencies installed. In that case, the required external dependencies (see also the wgd Documentation) for the ksrates Nextflow processes are:
  
  wgdParalogs: Python dependencies listed in requirements.txt, plus BLAST, MUSCLE, MCL, PAML, FastTree and i-ADHoRe (if collinearity analysis is configured).
  
  wgdOrthologs: Python dependencies listed in requirements.txt, plus BLAST, MUSCLE and PAML.
  
  All other processes: Python dependencies listed in requirements.txt.
The env scope allows the definition one or more variable that will be exported in the environment where the workflow tasks will be executed.
The params scope accepts the preserve parameter to keep leftover temporay folders and incomplete files when the pipeline is prematurely interrupted due to an error [Default: false]. Alternatively, --preserve can be provided directly in the Nextflow launching command line:
```
nextflow run VIB-PSB/ksrates --config config_files/config_elaeis.txt --preserve
```

Expert configuration file

This is an optional configuration file that contains several "expert" parameters for fine-tuning the analysis or for development/debug purposes. The file can be provided in the command line through the --expert option. However, when named with default name config_expert.txt and placed in the launching directory, the file is automatically detected without needing the option in the command line.

Syntax for the Nextflow pipeline:

nextflow run VIB-PSB/ksrates --config config_files/config_elaeis.txt --expert path/to/my_expert_config.txt

Syntax for single ksrates commands:

ksrates <command> config_files/config_elaeis.txt --expert path/to/my_expert_config.txt

The following can be used as a template (default values):

[EXPERT PARAMETERS]

logging_level = info
preserve_ks_tmp_files = no
kde_bandwidth_modifier = 0.4
plot_adjustment_arrows = no
max_mixture_model_iterations = 600
num_mixture_model_initializations = 10
extra_paralogs_analyses_methods = no
max_mixture_model_components = 5
max_mixture_model_ks = 5
max_gene_family_size = 200
distribution_peak_estimate = mode
min_ks_anchor_pairs = 0.05
num_reciprocally_retained_gfs = 2000
use_bottom_gfs_instead_of_top = no
use_original_orthomcl_version = no

logging_level: the lowest logging/verbosity level of messages printed to the console/logs (increasing severity levels: notset, debug, info, warning, error, critical). Messages less severe than level will be ignored; notset causes all messages to be processed. [Default: “info”]
preserve_ks_tmp_files: whether to preserve or not the intermediate files generated during the paralogs K_S and ortholog K_S pipelines (options: “yes” and “no”). [Default: “no”]
kde_bandwidth_modifier: modifier to adjust the fitting of the KDE curve on the underlying whole-paranome or anchor K_S distribution. The KDE Scott’s factor internally computed by SciPy tends to produce an overly smooth KDE curve, especially with steep WGD peaks, and therefore it is reduced by multiplying it by a modifier. Decreasing the modifier leads to tighter fits, increasing it leads to smoother fits, and setting it to 1 gives the default KDE factor. Note that a too small factor is likely to take into account data noise. [Default: 0.4]
plot_adjustment_arrows: flag to toggle the plotting of rate-adjustment arrows below the adjusted mixed paralog–ortholog K_S plot. These arrows start from the original unadjusted ortholog divergence K_S estimate and end on the rate-adjusted estimate (options: “yes” and “no”). [Default: “no”]
max_mixture_model_iterations: maximum number of EM iterations for mixture modeling. [Default: 600]
num_mixture_model_initializations: number of times the EM algorithm is initialized (either for the random initialization in the exponential-lognormal mixture model or for k-means in the lognormal mixture model). [Default: 10]
extra_paralogs_analyses_methods: flag to toggle the optional analysis of the paralog K_S distribution with non default mixture model methods (see section Mixture modeling of paralog KS distributions and Supplementary Materials) [Default: “no”]
max_mixture_model_components: maximum number of components considered during execution of the mixture models. [Default: 5]
max_mixture_model_ks: upper limit for the K_S range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5]
max_gene_family_size: maximum number of members that any paralog gene family can have to be included in K_S estimation. Large gene families increase the run time and are often composed of unrelated sequences grouped together by shared protein domains or repetitive sequences. But this is not always the case, so one may want to check manually the gene families in file paralog_distributions/wgd_species/species.mcl.tsv and increase (or even decrease) this number. [Default: 200]
distribution_peak_estimate: the statistical method used to obtain a single ortholog K_S estimate for the divergence time of a species pair from its ortholog distribution or to obtain a single paralog K_S estimate from an anchor K_S cluster or from lognormal components in mixture models (options: “mode” or “median”). [Default: “mode”]
min_ks_anchor_pairs: lower limit for the K_S range used to build the anchor pair K_S distribution. By default, K_S values smaller than 0.05 in the anchor distribition are ignored (weighted as 0), in order to remove noise at recent K_S age that could be mistakenly interpreted as a very recent WGM. Set this value to 0 to instead keep all K_S values when you are actually interested in WGMs with very young age. [Default: 0.05]
num_reciprocally_retained_gfs: number of gene families at the top of the reciprocal retention ranking that will be used to build the related K_S distribution. [Default: 2000]
use_bottom_gfs_instead_of_top: use the bottom-ranked reciprocally retained GFs instead of the top-ranked ones (not recommended; only meant for comparison purposes with top-ranked GFs)
use_original_orthomcl_version: allows compatibility with the original OrthoMCL v1.4 version; by default it is used a modified faster version called OrthoMCLight. [Default: “no”]