damuta.models module

class HierarchicalTandemLda(dataset: DataSet, type_col: str, n_damage_sigs=None, n_misrepair_sigs=None, alpha_bias=0.1, psi_bias=0.01, beta_bias=0.1, phi_obs=None, etaC_obs=None, etaT_obs=None, opt_method='ADVI', init_strategy='kmeans', init_signatures=None, seed=2021)

Bases: TandemLda

Bayesian inference of mutational signatures and their activities using a Hierarchical Tandem LDA model.

This class fits COSMIC-style mutational signatures using a Hierarchical Tandem LDA model, where damage signatures and misrepair signatures have separate sets of activities. A tissue-type hierarchical prior is fitted over damage-misrepair signature associations to improve the interpretability of misrepair activity specificities.

Parameters:
  • dataset (DataSet) – Dataset containing mutation counts and sample annotations for fitting.

  • n_damage_sigs (int) – Number of damage signatures to fit.

  • n_misrepair_sigs (int) – Number of misrepair signatures to fit.

  • type_col (str) – Name of the annotation column containing the tissue type of each sample.

  • alpha_bias (float or numpy.ndarray of shape (32,)) – Dirichlet concentration parameter for damage signature trinucleotide context priors.

  • psi_bias (float or numpy.ndarray of shape (n_damage_sigs,)) – Dirichlet concentration parameter for damage signature activity priors.

  • beta_bias (float or numpy.ndarray of shape (6,)) – Dirichlet concentration parameter for misrepair signature substitution type priors.

  • gamma_bias (float or numpy.ndarray of shape (n_misrepair_sigs,)) – Dirichlet concentration parameter for misrepair signature activity priors.

  • opt_method (str) – Optimization method: “ADVI” for mean-field inference or “FullRankADVI” for full-rank inference.

  • seed (int) – Random seed for reproducibility.

model

PyMC3 model instance.

Type:

pymc3.Model

model_kwargs

Dictionary of parameters used for constructing the model.

Type:

dict

approx

PyMC3 approximation object created via self.fit().

Type:

pymc3.approximations.Approximation

run_id

Unique identifier for the current run, used for saving checkpoint files.

Type:

str

get_estimated_W(n_draws=1)

Extract signature activity tensor from model posterior.

Parameters:

n_draws (int, default=1) – Number of posterior samples to draw.

Returns:

4D array of signature activities with shape (n_draws, n_samples, n_damage_sigs, n_misrepair_sigs).

Return type:

np.ndarray

class Lda(dataset: DataSet, n_sigs=None, alpha_bias=0.1, psi_bias=0.01, tau_obs=None, opt_method='ADVI', init_strategy='kmeans', init_signatures=None, seed=2021)

Bases: Model

Bayesian inference of mutational signatures and their activities.

Fit COSMIC-style mutational signatures using a Latent Dirichlet Allocation (LDA) model.

Parameters:
  • dataset (DataSet) – Data object containing mutation counts for fitting.

  • n_sigs (int) – Number of signatures to infer.

  • alpha_bias (float or numpy.ndarray of shape (96,), default=0.1) – Dirichlet concentration parameter for signature prior. Controls the sparsity of inferred signatures.

  • psi_bias (float or numpy.ndarray of shape (n_sigs,), default=0.01) – Dirichlet concentration parameter for signature activity prior. Controls the sparsity of signature activities.

  • tau_obs (numpy.ndarray, optional) – Observed signatures to include in the model.

  • opt_method ({'ADVI', 'FullRankADVI'}, default='ADVI') – Optimization method for variational inference. ‘ADVI’ for mean-field, ‘FullRankADVI’ for full-rank.

  • init_strategy ({'uniform', 'kmeans', 'from_sigs'}, default='kmeans') – Strategy for initializing signatures.

  • init_signatures (SignatureSet, optional) – Pre-defined signatures for initialization when init_strategy is ‘from_sigs’.

  • seed (int, default=2021) – Random seed for reproducibility.
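
As a rough illustration of how the concentration parameters shape the prior, the sketch below draws from a symmetric Dirichlet using only the Python standard library and compares how concentrated the draws are. It does not use damuta: `sample_dirichlet` and `mean_largest_weight` are hypothetical helpers, and the contrast value 10.0 is an arbitrary choice, not a library setting.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_weight(alpha, size=96, n_draws=200, seed=2021):
    """Average of the largest component over several symmetric Dirichlet draws."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet([alpha] * size, rng)) for _ in range(n_draws)]
    return sum(largest) / n_draws


# 0.1 is the documented default for alpha_bias; 10.0 is an arbitrary contrast.
sparse_prior = mean_largest_weight(0.1)
dense_prior = mean_largest_weight(10.0)
print(f"mean largest weight, alpha=0.1:  {sparse_prior:.3f}")
print(f"mean largest weight, alpha=10.0: {dense_prior:.3f}")
```

Values below 1 put most of the probability mass on a few of the 96 categories, which is the sense in which a small alpha_bias or psi_bias favors sparse signatures or activities.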

model

PyMC3 model instance.

Type:

pymc3.Model

model_kwargs

Dictionary of parameters used in model construction.

Type:

dict

approx

Variational approximation object. Created after calling self.fit().

Type:

pymc3.approximations.Approximation

run_id

Unique identifier for the current run. Used for checkpoint files and wandb logging.

Type:

str

n_sigs

Number of signatures being fit.

Type:

int

dataset

Input dataset used for fitting.

Type:

DataSet

fit(n_iter=30000, **kwargs)

Fit the model to the data using variational inference.

sample_posterior(n_samples=1000)

Sample from the fitted posterior distribution.

get_signatures()

Extract inferred signatures from the fitted model.

get_activities()

Extract inferred signature activities from the fitted model.

get_estimated_SignatureSet(n_draws=1)

Construct a SignatureSet from the model posterior.

get_estimated_activities_DataFrame(n_draws=1)

Extract activities as a DataFrame.

get_estimated_signatures(n_draws=1)

Extract signatures from the model posterior.

class TandemLda(dataset: DataSet, n_damage_sigs=None, n_misrepair_sigs=None, alpha_bias=0.1, psi_bias=0.01, beta_bias=0.1, gamma_bias=0.01, phi_obs=None, etaC_obs=None, etaT_obs=None, opt_method='ADVI', init_strategy='kmeans', init_signatures=None, seed=2021)

Bases: Model

Bayesian inference of mutational signatures and their activities using a Tandem LDA model.

This class fits COSMIC-style mutational signatures using a Tandem Latent Dirichlet Allocation (LDA) model, where damage signatures and misrepair signatures each have their own set of activities.

Parameters:
  • dataset (DataSet) – Data for fitting the model.

  • n_damage_sigs (int) – Number of damage signatures to fit.

  • n_misrepair_sigs (int) – Number of misrepair signatures to fit.

  • alpha_bias (float or numpy.ndarray of shape (32,)) – Dirichlet concentration parameter on (0, inf) for damage signatures. Determines the prior probability of trinucleotide context types appearing in inferred damage signatures.

  • psi_bias (float or numpy.ndarray of shape (n_damage_sigs,)) – Dirichlet concentration parameter on (0, inf) for damage signature activities. Determines the prior probability of each damage signature activity.

  • beta_bias (float or numpy.ndarray of shape (6,)) – Dirichlet concentration parameter on (0, inf) for misrepair signatures. Determines the prior probability of substitution types appearing in inferred misrepair signatures.

  • gamma_bias (float or numpy.ndarray of shape (n_misrepair_sigs,)) – Dirichlet concentration parameter on (0, inf) for misrepair signature activities. Determines the prior probability of each misrepair signature activity.

  • opt_method (str) – Optimization method for variational inference. Either “ADVI” for mean-field inference or “FullRankADVI” for full-rank inference.

  • seed (int) – Random seed for reproducibility.
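
The shapes (32,) and (6,) above follow from the standard single-base-substitution classification: 2 pyrimidine reference bases with 16 flanking-base pairs give 32 trinucleotide contexts, and each reference base has 3 possible substitutions, giving 6 substitution types and 96 mutation types in total. The sketch below enumerates them with the standard library. The ordering shown is an assumption for illustration and may not match the ordering damuta uses internally.

```python
from itertools import product

BASES = "ACGT"
PYRIMIDINES = "CT"

# 32 trinucleotide contexts: 5' flank, pyrimidine reference base, 3' flank.
contexts = [
    f"{five}{ref}{three}"
    for ref in PYRIMIDINES
    for five, three in product(BASES, repeat=2)
]

# 6 substitution types: each pyrimidine mutating to one of the other 3 bases.
substitutions = [
    f"{ref}>{alt}" for ref in PYRIMIDINES for alt in BASES if alt != ref
]

# 96 mutation types: every context paired with the 3 substitutions
# that share its reference base, written in the usual A[C>T]G style.
mutation_types = [
    f"{ctx[0]}[{sub}]{ctx[2]}"
    for ctx in contexts
    for sub in substitutions
    if sub[0] == ctx[1]
]

print(len(contexts), len(substitutions), len(mutation_types))
```

An array passed as alpha_bias therefore carries one concentration per context, and one passed as beta_bias carries one per substitution type.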

model

PyMC3 model instance.

Type:

pymc3.Model

model_kwargs

Dictionary of parameters passed when constructing the model (e.g., hyperprior values).

Type:

dict

approx

PyMC3 approximation object created via self.fit().

Type:

pymc3.approximations.Approximation

run_id

Unique identifier for the current run, used for saving checkpoint files and in wandb if enabled.

Type:

str

get_estimated_SignatureSet(n_draws=1)

Construct a SignatureSet object from model posterior samples.

Parameters:

n_draws (int, default=1) – Number of posterior samples to draw. If >1, samples are averaged in the signature space.

Returns:

SignatureSet constructed from damage and misrepair signatures.

Return type:

SignatureSet
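
To make "averaged in the signature space" concrete, here is a minimal sketch of an element-wise mean over several draws, where each draw is a matrix with one row per signature. `average_draws` is a hypothetical helper and the numbers are made up. How damuta performs the averaging internally is not shown in this documentation.

```python
def average_draws(draws):
    """Element-wise mean over a list of equally shaped matrices (lists of rows)."""
    n = len(draws)
    return [
        [sum(values) / n for values in zip(*rows)]  # same entry across draws
        for rows in zip(*draws)                     # same row across draws
    ]


# Two illustrative draws of 2 signatures over 3 categories; rows sum to 1.
draw_a = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
draw_b = [[0.4, 0.5, 0.1], [0.0, 0.4, 0.6]]

mean_signatures = average_draws([draw_a, draw_b])
for row in mean_signatures:
    print([round(v, 3) for v in row], "sum =", round(sum(row), 3))
```

Because every draw has rows that sum to 1, the averaged rows also sum to 1, so the result can still be read as a set of signatures.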

get_estimated_W(n_draws=1)

Extract signature activity tensor from model posterior.

Parameters:

n_draws (int, default=1) – Number of posterior samples to draw.

Returns:

4D array of signature activities with shape (n_draws, n_samples, n_damage_sigs, n_misrepair_sigs).

Return type:

np.ndarray
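
A short sketch of how an array with the documented shape might be summarised with NumPy. The array here is synthetic and is not produced by damuta. It is an assumption, not something this documentation states, that each (draw, sample) slice sums to 1 and that summing over one signature axis gives the marginal activities of the other signature type.

```python
import numpy as np

n_draws, n_samples, n_damage_sigs, n_misrepair_sigs = 5, 3, 4, 2
rng = np.random.default_rng(2021)

# Synthetic stand-in for a get_estimated_W result: each (draw, sample)
# slice is a probability table over damage-misrepair pairs (assumed).
flat = rng.dirichlet(
    np.ones(n_damage_sigs * n_misrepair_sigs), size=(n_draws, n_samples)
)
W = flat.reshape(n_draws, n_samples, n_damage_sigs, n_misrepair_sigs)

# Average over posterior draws (axis 0), leaving one table per sample.
W_mean = W.mean(axis=0)

# Sum out one signature axis to get per-sample marginals of the other.
damage_marginal = W_mean.sum(axis=2)     # shape (n_samples, n_damage_sigs)
misrepair_marginal = W_mean.sum(axis=1)  # shape (n_samples, n_misrepair_sigs)

print(W.shape, W_mean.shape, damage_marginal.shape, misrepair_marginal.shape)
```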

get_estimated_activities(n_draws=1)
get_estimated_activities_DataFrame(n_draws=1)

Extract damage and misrepair signature activities as DataFrames.

Parameters:

n_draws (int, default=1) – Number of posterior samples to draw. If >1, samples are averaged in the activity space.

Returns:

Tuple of (theta, gamma) DataFrames where theta contains damage signature activities and gamma contains misrepair signature activities.

Return type:

tuple

get_estimated_connections_DataFrame(n_draws=1)

Extract damage-misrepair signature connections as a DataFrame.

Parameters:

n_draws (int, default=1) – Number of posterior samples to draw. If >1, samples are averaged in the connection space.

Returns:

DataFrame with damage-misrepair signature combinations as columns, with column names like ‘D1_M1’, ‘D1_M2’, etc.

Return type:

pd.DataFrame
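
The column naming can be sketched as follows, assuming 1-based indices with the misrepair index varying fastest, which is consistent with the 'D1_M1', 'D1_M2' example above. `connection_column_names` and `flatten_connections` are hypothetical helpers, and the matrix values are made up.

```python
def connection_column_names(n_damage_sigs, n_misrepair_sigs):
    """Build names like 'D1_M1', 'D1_M2', ... (misrepair index varies fastest)."""
    return [
        f"D{d}_M{m}"
        for d in range(1, n_damage_sigs + 1)
        for m in range(1, n_misrepair_sigs + 1)
    ]


def flatten_connections(matrix):
    """Flatten a damage-by-misrepair matrix row by row, matching the names."""
    return [value for row in matrix for value in row]


names = connection_column_names(2, 3)

# Illustrative 2 x 3 connection matrix: one row per damage signature.
connections = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
row = dict(zip(names, flatten_connections(connections)))
print(row)
```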

get_estimated_signatures(n_draws=1)

Extract damage and misrepair signatures from model posterior.

Parameters:

n_draws (int, default=1) – Number of posterior samples to draw.

Returns:

Tuple (phi, eta), where phi is the array of damage signatures and eta is the array of misrepair signatures, reshaped to (n_draws, n_sigs, 6).

Return type:

tuple