damuta.base module
- class DataSet(counts: DataFrame, annotation: DataFrame | None = None)
Bases: object
Container for tabular data, allowing simple access to a mutation data set and the corresponding annotation for each sample.
DataSet is instantiated from a pandas dataframe of mutation counts and, optionally, a pandas dataframe of sample annotations with the same number of rows. The dataframe index is taken as sample ids. All samples that appear in counts should also appear in annotation, and vice versa. Mutation types are expected to be in COSMIC format (e.g. A[C>A]A).
- Parameters:
counts (pd.DataFrame) – Nx96 dataframe of mutation counts, one sample per row. Index is assumed to be sample ids.
annotation (pd.DataFrame) – NxF dataframe of meta-data features to annotate samples with. Index is assumed to be sample ids.
Examples
>>> import pandas as pd
>>> counts = pd.read_csv('tests/test_data/pcawg_counts.csv', index_col=0, header=0)
>>> annotation = pd.read_csv('tests/test_data/pcawg_cancer_types.csv', index_col=0, header=0)
>>> pcawg = DataSet(counts, annotation)
>>> pcawg.n_samples
2778
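The input requirements described above (96 COSMIC-format mutation type columns, and the same sample ids in both dataframes) can be checked with plain pandas before constructing a DataSet. The following is a minimal sketch under stated assumptions: the toy sample ids, the `tissue_type` column, and the helper `check_same_samples` are hypothetical, and the column order used here is only one possible ordering, not necessarily the one damuta expects.

```python
import numpy as np
import pandas as pd

# Build the 96 COSMIC-style mutation type labels, e.g. "A[C>A]A".
# NOTE: this column order is an assumption made for illustration.
bases = "ACGT"
substitutions = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
mutation_types = [
    f"{five}[{sub}]{three}"
    for sub in substitutions
    for five in bases
    for three in bases
]

# Toy data: three samples with random counts (illustrative values only).
rng = np.random.default_rng(0)
sample_ids = ["sample_1", "sample_2", "sample_3"]
counts = pd.DataFrame(
    rng.integers(0, 50, size=(len(sample_ids), len(mutation_types))),
    index=sample_ids,
    columns=mutation_types,
)
annotation = pd.DataFrame(
    {"tissue_type": ["Liver", "Breast", "Liver"]},  # hypothetical column
    index=["sample_3", "sample_1", "sample_2"],     # same ids, different order
)


def check_same_samples(counts, annotation):
    """Return the ids missing from each frame (both lists empty if consistent)."""
    missing_from_annotation = counts.index.difference(annotation.index)
    missing_from_counts = annotation.index.difference(counts.index)
    return list(missing_from_annotation), list(missing_from_counts)


missing_from_annotation, missing_from_counts = check_same_samples(counts, annotation)

# Reorder annotation rows so they line up with the rows of counts.
annotation_aligned = annotation.loc[counts.index]
```

Aligning the annotation to the row order of counts is a defensive step in this sketch; whether DataSet performs such an alignment itself is not stated in this reference.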
- annotate_tissue_types(type_col) → array
Set a specified column of annotation as the sample tissue type.
Tissue type information is used by hierarchical models to create a tissue-type prior. See HierarchicalTandemLda for more details.
- annotation: DataFrame = None
- counts: DataFrame
- property ids: list
List sample ids in dataset
- property n_samples: int
Number of samples in dataset
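As an illustration of the kind of array a tissue-type annotation can be reduced to, the sketch below encodes a categorical annotation column as one integer code per sample with `pd.factorize`. This is an assumption about what a hierarchical prior might consume, not a description of what `annotate_tissue_types` does internally; the `tissue_type` column name and the sample ids are hypothetical.

```python
import pandas as pd

# Hypothetical annotation frame, indexed by sample id.
annotation = pd.DataFrame(
    {"tissue_type": ["Liver", "Breast", "Liver", "Skin"]},
    index=["s1", "s2", "s3", "s4"],
)

# factorize assigns integer codes in order of first appearance and
# returns the unique labels alongside them.
codes, tissue_names = pd.factorize(annotation["tissue_type"])

# Map each code back to its label to confirm the encoding is lossless.
decoded = [tissue_names[code] for code in codes]
```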
- class SignatureSet(signatures: DataFrame)
Bases: object
Container for tabular data, allowing simple access to a set of mutational signature definitions.
- Parameters:
signatures (pd.DataFrame) – Nx96 dataframe of signature definitions, one signature per row. Rows must sum to 1.
Examples
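The requirement that each row sums to 1 can be met by normalising a non-negative matrix row by row. The following is a sketch using pandas only, with randomly generated values and hypothetical signature names; it does not call SignatureSet itself.

```python
import numpy as np
import pandas as pd

# Four hypothetical signatures with arbitrary non-negative weights.
rng = np.random.default_rng(1)
raw = pd.DataFrame(
    rng.random((4, 96)),
    index=[f"sig_{i}" for i in range(4)],
)

# Divide every row by its own total so that each row sums to 1.
signatures = raw.div(raw.sum(axis=1), axis=0)

# Floating-point sums are compared with a tolerance rather than with ==.
rows_sum_to_one = bool(np.allclose(signatures.sum(axis=1), 1.0))
```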
- classmethod from_damage_misrepair(damage_signatures: DataFrame, misrepair_signatures: DataFrame)
- property index: int
Names of signatures in dataset
- property n_damage_sigs: DataFrame
Number of damage signatures in dataset
Damage signatures represent the distribution of mutations over 32 trinucleotide contexts. They are computed by marginalizing over substitution classes.
- property n_misrepair_sigs: DataFrame
Number of misrepair signatures in dataset
Misrepair signatures represent the distribution of mutations over 6 substitution types. They are computed by marginalizing over trinucleotide context classes.
- property n_sigs: int
Number of signatures in dataset
- signatures: DataFrame
- summarize_separation() DataFrame
Summary statistics of pair-wise cosine distances for signatures, damage signatures, and misrepair signatures.
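The marginalisation and separation summary described above can be sketched with pandas and NumPy. In this sketch, damage signatures are obtained by summing over the 6 substitution classes within each of the 32 trinucleotide contexts, misrepair signatures by summing over contexts within each substitution class, and pair-wise cosine distances are then summarised for all three. The helpers `marginalize` and `pairwise_cosine_distances`, the toy signatures, and the use of `describe()` for the summary are assumptions made for illustration; the statistics actually returned by summarize_separation may differ.

```python
import numpy as np
import pandas as pd

# 96 COSMIC-style labels such as "A[C>T]G" (column order is illustrative).
bases = "ACGT"
substitutions = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
mutation_types = [
    f"{five}[{sub}]{three}"
    for sub in substitutions
    for five in bases
    for three in bases
]

# Toy signature set: five random signatures, rows normalised to sum to 1.
rng = np.random.default_rng(0)
raw = rng.random((5, 96))
signatures = pd.DataFrame(
    raw / raw.sum(axis=1, keepdims=True),
    index=[f"sig_{i}" for i in range(5)],
    columns=mutation_types,
)


def marginalize(frame, key):
    """Sum the columns whose labels map to the same key; rows stay signatures."""
    return frame.T.groupby(key).sum().T


# "A[C>T]G" -> trinucleotide context "ACG" and substitution class "C>T".
damage = marginalize(signatures, lambda label: label[0] + label[2] + label[6])
misrepair = marginalize(signatures, lambda label: label[2:5])


def pairwise_cosine_distances(frame):
    """Cosine distance (1 - cosine similarity) for every unordered pair of rows."""
    values = frame.to_numpy(dtype=float)
    unit = values / np.linalg.norm(values, axis=1, keepdims=True)
    similarity = unit @ unit.T
    upper = np.triu_indices(len(frame), k=1)  # each pair once, no self-pairs
    return 1.0 - similarity[upper]


# One column of summary statistics per signature representation.
summary = pd.DataFrame(
    {
        "signatures": pd.Series(pairwise_cosine_distances(signatures)).describe(),
        "damage": pd.Series(pairwise_cosine_distances(damage)).describe(),
        "misrepair": pd.Series(pairwise_cosine_distances(misrepair)).describe(),
    }
)
print(summary)
```

Because marginalising only regroups probability mass, every row of `damage` and `misrepair` still sums to 1, which is a quick sanity check on the grouping keys.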