Mixture modeling of paralog K_S distributions

The interpretation of mixed paralog–ortholog K_S distributions can be challenged by the fact that paralog WGD peaks are often not clearly distinguishable due to the progressive WGD signal erosion over time and potential overlapping of multiple peaks from successive WGDs. In order to more objectively define the K_S age of potential WGD peaks, a clustering feature based on mixture modeling has been implemented in ksrates.

Three methods are available: anchor K_S clustering, exponential-lognormal mixture modeling and lognormal-only mixture modeling. A short overview can be found below. For an extended description of these methods, please refer to the Supplementary Materials of the preprint.

Warning

Please be aware that mixture modeling results on K_S distributions should be interpreted cautiously because mixture models tend to overestimate the number of components present in the target K_S distribution and hence the number of WGDs.

The available methods depend on the K_S analysis type(s) selected in the ksrates configuration file:

When colinearity analysis is requested (colinearity = yes), the anchor pair K_S distribution is analyzed through a clustering based on the anchor pair K_S values found in colinear segment pairs; lognormal mixture modeling is also optionally available.

When paranome analysis is requested (paranome = yes), the whole-paranome K_S distribution is analyzed through exponential-lognormal mixture modeling; lognormal mixture modeling is also optionally available.

When reciprocal retention analysis is requested (reciprocal_retention = yes) the reciprocally retained K_S distribution is analyzed through lognormal mixture modeling.

Optional methods are activated through the expert configuration file (parameter extra_paralogs_analyses_methods). They follow the scheme illustrated in the table below; for example, when colinearity and paranome analyses are requested (fourth row), by default only the anchor K_S clustering is performed, while three other methods are optionally available.

Default methods are marked by ✓, while optional methods are marked by (✓).
Analysis setup	Anchor K_S clustering	Exp-log MM paranome	Log MM anchors	Log MM paranome	Log MM rec.ret
Colinearity-only	✓		(✓)
Paranome-only		✓		(✓)
Reciprocal retention-only					✓
Colinearity & Paranome	✓	(✓)	(✓)	(✓)
Colinearity & Rec.retention	✓		(✓)		✓
Paranome & Rec.retention		✓		(✓)	✓
Colinearity & Paranome & Rec.retention	✓	(✓)	(✓)	(✓)	✓

Anchor K_S clustering

A clustering approach is used in order to classify anchor pair K_S values into groups that potentially stem from different WGDs in the ancestry of the focal species. The approach does not fit the anchor pair K_S values directly, but instead uses lognormal mixture modeling to cluster median K_S values calculated for the collinear segment pairs (pairs of sequence regions with conserved gene content and order) that the anchor pairs reside on. Segment pairs originated by the same WGD are likely to share a similar median K_S age and to fall into the same cluster.

Then, in order to show the original anchor pair K_S data, each median K_S value is replaced by the K_S list of the segment pair it derives from. Small, flat or too old anchor pair K_S clusters are finally removed from the dataset because their link to a real WGD event is ambiguous or unlikely.

Exponential-lognormal mixture model

ksrates implements a custom exponential-lognormal mixture model (ELMM) algorithm for analyzing whole-paranome K_S distributions. The exponential component is used to model the L-shaped background of a whole-paranome K_S distribution generated by small-scale duplications, while one or more lognormal components are used to model potential WGD peaks.

Since adequate initialization of the component parameters is crucial for obtaining decent mixture modeling results, ksrates uses three different initialization approaches and ultimately chooses the best one based on the Bayesian Information Criterion (BIC):

Data-driven initialization, which infers component parameters from the shape of the K_S distribution. The exponential component is initialized to match the height of the left boundary of the distribution, while each lognormal component is initialized in correspondence of a peak.
Initialization with random component parameters, by default with two to five random components.
Hybrid initialization, where one random lognormal component is added to the data-driven initialized components in the attempt to compensate for possible overlooked signals.

In all three strategies, an extra “buffer” lognormal component is initialized by default at the right boundary of the K_S range to avoid that other components stretch towards higher values in an attempt to fit the hard-to-fit distribution tail.

Data output file format

Among its outputs (for a complete list see section Output files and directory organization), the ELMM produces a tabular (TSV) file and a text file where parameters and fitting results are stored:

elmm_species_parameters.tsv:
- The model initialization approach is stored in column 1 Model according to a numerical code (1: data-driven, 2: hybrid data-driven plus a random lognormal component, 3: random initialization with exponential component and one lognormal component, 4: random initialization with exponential component and two lognormal components; higher numbers feature random initialization with exponential component and increasing number of lognormal components).
- The initialization round is stored in column 2 Initialization. By default each model type (except type 1) is initialized and fitted 10 times, so this column shows numbers from 1 to 10.
- The BIC and loglikelihood scores for the fitted model are stored in columns 3 BIC and 4 Loglikelihood.
- The number of algorithm iterations needed to reach convergence is stored in column 5 Convergence. If greater than (by default) 600 iterations would be needed, convergence is not reached and the cell will show NA.
- The fitted exponential rate parameter and its component weight are stored in columns 6 Exponential_Rate and 7 Exponential_Weight.
- The mean, standard deviation and weight of the fitted Normal components used to define the correspondent lognormal components are stored in columns 8 to 10: Normal_Mean, Normal_SD and Normal_Weight. When there are multiple lognormal components, the data for each of them are stored in a separate row (the number of rows for each model and initialization is thus equal to the number of lognormal components).

_images/elmm.png — This file section shows the result for the first initalization of model 5: each row stores the same data for the exponential component plus the data for one of the three lognormal components.

elmm_species_parameters.txt reports the results in a more descriptive and easy-to-read logging format.

Lognormal mixture model

Logormal mixture modeling (LMM) makes use of only lognormal components and is available for whole-paranome, anchor pair and reciprocally retained K_S distributions.

The lognormal components are initialized through k-means and fitted by default with two to five components. For each number of components the mixture model is initialized multiple times and the best fit is chosen according to the largest log-likelihood. Among the resulting four models (one for each number of components), the best fitting model is taken to be the one with the lowest BIC score.

Warning

LMM is more prone than ELMM to detect spurious peaks in the left side of the whole-paranome K_S distribution because it is less suitable to fit the small-scale duplication background.

Data output file format

Among its outputs (for a complete list see section Output files and directory organization), the LMM produces tabular (TSV) files and text files where parameters and fitting results are stored:

lmm_species_parameters.tsv for whole-paranome, anchor pairs and/or reciprocally retained paralogs:
- The model type is stored in column 1 Model according to a numerical code (1: one lognormal component, 2: two lognormal components, 3: three lognormal components; and so on).
- The BIC and loglikelihood scores for the fitted model are stored in columns 2 BIC and 3 Loglikelihood
- The number of algorithm iterations needed to reach convergence is stored in column 4 Convergence. If greater than (by default) 600 iterations would be needed, convergence is not reached and the cell will show NA.
- The mean, standard deviation and weight of the fitted Normal components used to define the correspondent lognormal components are stored in columns 5 to 7: Normal_Mean, Normal_SD and Normal_Weight. When there are multiple lognormal components, the data for each of them are stored in a separate row (the number of rows per model is thus equal to the number of components).

_images/lmm.png — This file section shows the result for model 5: each row stores the data for one of the five lognormal components.

lmm_species_parameters.txt collects the model results in a more descriptive and easy-to-read TXT logging format; it is generated for whole-paranome, anchor pairs and/or reciprocally retained paralogs.

Mixture modeling of paralog KS distributions

Anchor KS clustering

Exponential-lognormal mixture model

Data output file format

Lognormal mixture model

Data output file format

Mixture modeling of paralog K_S distributions

Anchor K_S clustering