damuta.sim module
- encode_counts(counts)
Encode mutation counts into a format suitable for topic modeling.
This function takes a DataFrame of mutation counts and encodes it into two lists of indices: one for the 32 mutation types and another for the 6 possible base changes.
- Parameters:
counts (pandas.DataFrame) – A DataFrame where each row represents a sample and each column represents a mutation type (96 mutation types in total).
- Returns:
A tuple containing two elements:
list of lists: Each inner list contains indices (0-31) representing the 32 mutation types for each mutation in each sample.
list of lists: Each inner list contains indices (0-5) representing the 6 possible base changes for each mutation in each sample.
- Return type:
tuple
Notes
Each of the 96 mutation types is decomposed into one of 32 context types and one of 6 base changes. This two-part encoding is the input format used by topic-modeling approaches to mutation signature analysis.
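The decomposition described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `encode_counts_sketch`, the COSMIC-like label format (`A[C>A]A`), the sorted ordering of contexts, and the use of dicts in place of a DataFrame are all assumptions made for this example.

```python
from itertools import product

SUBS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]  # 6 base changes
BASES = "ACGT"

# 96 labels in an assumed COSMIC-like format, e.g. "A[C>A]A"
MUT96 = [f"{l}[{s}]{r}" for s in SUBS for l, r in product(BASES, repeat=2)]
# 32 contexts (5' base + reference base + 3' base), assumed sorted order
CONTEXTS = sorted({m[0] + m[2] + m[6] for m in MUT96})


def encode_counts_sketch(rows):
    """rows: one dict per sample, mapping a 96-type label to its count."""
    ctx_idx, sub_idx = [], []
    for row in rows:
        c, s = [], []
        for label, n in row.items():
            context = label[0] + label[2] + label[6]
            # Emit one (context, base change) index pair per observed mutation
            c.extend([CONTEXTS.index(context)] * n)
            s.extend([SUBS.index(label[2:5])] * n)
        ctx_idx.append(c)
        sub_idx.append(s)
    return ctx_idx, sub_idx


ctx, sub = encode_counts_sketch([{"A[C>A]A": 2, "A[T>G]A": 1}])
# ctx == [[0, 0, 4]], sub == [[0, 0, 5]]
```

Each inner list has one entry per mutation, so a sample with N mutations yields two lists of length N.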
- sim_from_sigs(tau, tau_hyperprior, S, N, I=None, seed=None)
Simulate mutation data from predefined signatures.
This function generates simulated mutation data from the given signatures and hyperparameters. It draws sample-specific activities from a Dirichlet prior and then generates mutation counts for each sample.
Parameters:
- tau : pandas.DataFrame
Predefined signatures in COSMIC format.
- tau_hyperprior : float
Concentration parameter for the Dirichlet prior on signature activities.
- S : int
Number of samples to simulate.
- N : int
Number of mutations per sample.
- I : int, optional
Number of signatures to use. If None, all signatures in tau are used.
- seed : int, optional
Random seed for reproducibility.
Returns:
- data : pandas.DataFrame
Simulated mutation data. Each row represents a sample, and each column represents a mutation type.
- sim_params : dict
Dictionary containing simulation parameters:
  - ‘tau’: The signatures used for simulation.
  - ‘tau_activities’: The generated sample-specific activities.
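A minimal sketch of this kind of simulation, using only the standard library, might look as follows. It is not the library's code: the helper name `sim_from_sigs_sketch`, the use of nested lists in place of DataFrames, and the multinomial sampling step are assumptions made for illustration, and the optional `I` argument is omitted.

```python
import random


def sim_from_sigs_sketch(tau, tau_hyperprior, S, N, seed=None):
    """tau: list of signatures, each a list of probabilities over mutation types."""
    rng = random.Random(seed)
    n_sigs, n_types = len(tau), len(tau[0])
    data, activities = [], []
    for _ in range(S):
        # Symmetric Dirichlet draw, obtained by normalising Gamma variates
        g = [rng.gammavariate(tau_hyperprior, 1.0) for _ in range(n_sigs)]
        total = sum(g)
        theta = [x / total for x in g]
        # Per-sample mutation-type distribution: activity-weighted mix of signatures
        probs = [sum(theta[i] * tau[i][m] for i in range(n_sigs))
                 for m in range(n_types)]
        # Draw N mutations and tally them per mutation type
        counts = [0] * n_types
        for m in rng.choices(range(n_types), weights=probs, k=N):
            counts[m] += 1
        data.append(counts)
        activities.append(theta)
    return data, {"tau": tau, "tau_activities": activities}


toy_tau = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]  # two toy signatures, three types
data, sim_params = sim_from_sigs_sketch(toy_tau, 1.0, S=4, N=50, seed=0)
```

Smaller values of the concentration parameter tend to produce samples dominated by a few signatures, while larger values spread activity more evenly.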
- sim_parametric(n_damage_sigs, n_misrepair_sigs, S, N, alpha_bias=0.9, psi_bias=0.1, gamma_bias=0.1, beta_bias=0.9, seed=1333)
Simulate data using a parametric model with damage and misrepair signatures.
This function generates simulated mutation data from a model with separate damage and misrepair processes. It randomly draws damage signatures (phi), sample-specific activities (theta), misrepair signatures (eta), and the interactions between them (A), using the Dirichlet concentration parameters described below.
Parameters:
- n_damage_sigs : int
Number of damage signatures to simulate.
- n_misrepair_sigs : int
Number of misrepair signatures to simulate.
- S : int
Number of samples to simulate.
- N : int
Number of mutations per sample.
- alpha_bias : float, optional
Concentration parameter for the Dirichlet prior on damage signatures (default: 0.9).
- psi_bias : float, optional
Concentration parameter for the Dirichlet prior on sample-specific activities (default: 0.1).
- gamma_bias : float, optional
Concentration parameter for the Dirichlet prior on misrepair signature activities (default: 0.1).
- beta_bias : float, optional
Concentration parameter for the Dirichlet prior on misrepair signatures (default: 0.9).
- seed : int, optional
Random seed for reproducibility (default: 1333).
Returns:
- tuple
A tuple containing two elements:
  1. pandas.DataFrame: Simulated mutation counts for each sample and mutation type.
  2. dict: Dictionary containing the generated model parameters and intermediate results.
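One way to picture the damage/misrepair model is the standard-library sketch below. It is a simplified illustration, not the library's implementation. The following are all assumptions of this example: the helper names, the shapes of phi, theta, eta and A, the split of each misrepair signature into one 3-way distribution per reference base, the convention that contexts 0-15 are C-centred, the column layout `3 * context + change`, and the dictionary keys.

```python
import random


def dirichlet(rng, alpha, size):
    """Symmetric Dirichlet draw, obtained by normalising Gamma variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(g)
    return [x / total for x in g]


def sim_parametric_sketch(n_damage_sigs, n_misrepair_sigs, S, N,
                          alpha_bias=0.9, psi_bias=0.1,
                          gamma_bias=0.1, beta_bias=0.9, seed=1333):
    rng = random.Random(seed)
    J, K = n_damage_sigs, n_misrepair_sigs
    # phi: each damage signature is a distribution over 32 contexts
    phi = [dirichlet(rng, alpha_bias, 32) for _ in range(J)]
    # eta: each misrepair signature holds one 3-way distribution per
    # reference base (assumed split: index 0 for C, index 1 for T)
    eta = [[dirichlet(rng, beta_bias, 3) for _ in range(2)] for _ in range(K)]
    # theta: per-sample activities over damage signatures
    theta = [dirichlet(rng, psi_bias, J) for _ in range(S)]
    # A: per-sample, per-damage-signature weights over misrepair signatures
    A = [[dirichlet(rng, gamma_bias, K) for _ in range(J)] for _ in range(S)]

    counts = [[0] * 96 for _ in range(S)]
    for s in range(S):
        for _ in range(N):
            j = rng.choices(range(J), weights=theta[s])[0]   # damage signature
            c = rng.choices(range(32), weights=phi[j])[0]    # damaged context
            k = rng.choices(range(K), weights=A[s][j])[0]    # misrepair signature
            ref = 0 if c < 16 else 1                         # assumed context layout
            b = rng.choices(range(3), weights=eta[k][ref])[0]  # base change
            counts[s][3 * c + b] += 1
    return counts, {"phi": phi, "theta": theta, "eta": eta, "A": A}


counts, params = sim_parametric_sketch(2, 3, S=3, N=40, seed=1)
```

Each simulated mutation is the product of two draws, a context from a damage signature and a base change from a misrepair signature, which is what lets the two processes be modelled separately.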