damuta.base module

class DataSet(counts: DataFrame, annotation: DataFrame | None = None)

Bases: object

Container for tabular data, allowing simple access to a mutation data set and corresponding annotation for each sample.

DataSet is instantiated from a pandas dataframe of mutation counts and, optionally, a pandas dataframe of sample annotations with the same number of rows. The index of each dataframe is taken as the sample ids. All samples that appear in counts should also appear in annotation, and vice versa. Mutation types are expected to be in COSMIC format (e.g. A[C>A]A).

Parameters:

counts (pd.DataFrame) – Nx96 dataframe of mutation counts, one sample per row. Index is assumed to be sample ids.
annotation (pd.DataFrame, optional) – NxF dataframe of metadata features to annotate samples with. Index is assumed to be sample ids.
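
As an illustration of the COSMIC format mentioned above, the 96 mutation type labels can be enumerated with the standard library alone. This is a minimal sketch: the helper name cosmic_mutation_types is hypothetical, and the ordering used here (by substitution class, then by flanking bases) is an assumption, since this page does not state which column order counts is expected to follow.

```python
from itertools import product

BASES = "ACGT"
# Six pyrimidine-centred substitution classes used in COSMIC-style labels.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def cosmic_mutation_types():
    """Return the 96 labels of the form 5'[ref>alt]3', e.g. 'A[C>A]A'.

    Hypothetical helper; the ordering is an assumption, not damuta's.
    """
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


mutation_types = cosmic_mutation_types()
print(len(mutation_types), mutation_types[0])
```

A list like this can be compared against counts.columns before building a DataSet, to catch missing or misspelled mutation types early.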

Examples

>>> import pandas as pd
>>> counts = pd.read_csv('tests/test_data/pcawg_counts.csv', index_col = 0, header = 0)
>>> annotation = pd.read_csv('tests/test_data/pcawg_cancer_types.csv', index_col = 0, header = 0)
>>> pcawg = DataSet(counts, annotation)
>>> pcawg.n_samples
2778
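
Because every sample in counts should also appear in annotation, and vice versa, it can help to verify the two indexes before constructing a DataSet. The following is a sketch using pandas only: check_sample_ids is a hypothetical helper and the toy frames are made up for illustration. It does not describe what DataSet itself checks internally.

```python
import pandas as pd


def check_sample_ids(counts: pd.DataFrame, annotation: pd.DataFrame) -> None:
    """Raise ValueError unless both frames index exactly the same samples.

    Hypothetical pre-check; not part of damuta.
    """
    only_in_counts = counts.index.difference(annotation.index)
    only_in_annotation = annotation.index.difference(counts.index)
    if len(only_in_counts) or len(only_in_annotation):
        raise ValueError(
            f"unannotated samples: {list(only_in_counts)}; "
            f"annotated samples without counts: {list(only_in_annotation)}"
        )


# Toy data: the same two sample ids, listed in a different order.
counts = pd.DataFrame({"A[C>A]A": [3, 0]}, index=["s1", "s2"])
annotation = pd.DataFrame({"tissue": ["liver", "lung"]}, index=["s2", "s1"])

check_sample_ids(counts, annotation)       # passes: the id sets match
annotation = annotation.loc[counts.index]  # reorder rows to match counts
```

Reordering annotation with .loc keeps the rows of both frames aligned, which is a cautious choice when it is unknown whether downstream code matches samples by label or by position.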

annotate_tissue_types(type_col) → array

Set a specified column of annotation as the sample tissue type.

Tissue type information is used by hierarchical models to create a tissue-type prior. See HierarchicalTandemLda for more details.

annotation: DataFrame = None

counts: DataFrame

property ids: list: List sample ids in dataset

property n_samples: int: Number of samples in dataset

class SignatureSet(signatures: DataFrame)

Bases: object

Container for tabular data, allowing simple access to a set of mutational signature definitions.

Parameters:

signatures (pd.DataFrame) – Nx96 dataframe of signature definitions, one signature per row. Rows must sum to 1.

Examples
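
A sketch of preparing a signatures dataframe that satisfies the constraints above (Nx96, each row summing to 1). The weights are random placeholders, the signature names are made up, and the column order is an assumption. The resulting frame is the kind of input SignatureSet(signatures) expects, but the constructor call itself is left out so that the snippet runs without damuta installed.

```python
from itertools import product

import numpy as np
import pandas as pd

SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
# 96 COSMIC-style column labels (ordering is an assumption).
columns = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]

rng = np.random.default_rng(0)
# Placeholder non-negative weights for three made-up signatures.
raw = pd.DataFrame(
    rng.random((3, 96)), index=["sig_a", "sig_b", "sig_c"], columns=columns
)

# Divide each row by its own total so that every signature sums to 1.
signatures = raw.div(raw.sum(axis=1), axis=0)
row_totals = signatures.sum(axis=1)
```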

classmethod from_damage_misrepair(damage_signatures: DataFrame, misrepair_signatures: DataFrame)

property index: int: Names of signatures in dataset

property n_damage_sigs: DataFrame

Number of damage signatures in dataset

Damage signatures represent the distribution of mutations over 32 trinucleotide contexts. They are computed by marginalizing over substitution classes.

property n_misrepair_sigs: DataFrame

Number of misrepair signatures in dataset

Misrepair signatures represent the distribution of mutations over 6 substitution types. They are computed by marginalizing over trinucleotide context classes.
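
The two marginalizations described above can be sketched with a pandas groupby over the mutation type labels. This is an illustration under assumptions: the helper names context and substitution are hypothetical, the labels are assumed to follow the A[C>A]A pattern, and damuta's own computation may differ in detail.

```python
from itertools import product

import pandas as pd

SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
types = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]

# A single uniform signature as placeholder data (each type has weight 1/96).
signatures = pd.DataFrame([[1 / 96] * 96], index=["uniform"], columns=types)


def context(label: str) -> str:
    """'A[C>A]A' -> 'ACA': the reference trinucleotide."""
    return label[0] + label[2] + label[6]


def substitution(label: str) -> str:
    """'A[C>A]A' -> 'C>A': the substitution class."""
    return label[2:5]


# Sum over substitution classes: 96 types -> 32 trinucleotide contexts.
damage = signatures.T.groupby(context).sum().T
# Sum over trinucleotide contexts: 96 types -> 6 substitution types.
misrepair = signatures.T.groupby(substitution).sum().T
```

Grouping on the transposed frame keeps the example independent of column order, since each column is assigned to its group by label rather than by position.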

property n_sigs: int: Number of signatures in dataset

signatures: DataFrame

summarize_separation() → DataFrame: Summary statistics of pairwise cosine distances between signatures, between damage signatures, and between misrepair signatures.
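
To make the notion of pairwise cosine distance concrete, here is a dependency-free sketch. The function names and toy vectors are hypothetical, and the choice of min, mean and max as the summary statistics is an assumption, since this page does not list which statistics summarize_separation reports.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def summarize_pairwise(rows):
    """Summarize distances over all unordered pairs of named vectors."""
    distances = [
        cosine_distance(rows[a], rows[b]) for a, b in combinations(rows, 2)
    ]
    return {"min": min(distances), "mean": mean(distances), "max": max(distances)}


# Toy three-dimensional "signatures"; real ones would have 96, 32 or 6 entries.
toy = {
    "sig_a": [1.0, 0.0, 0.0],
    "sig_b": [0.0, 1.0, 0.0],
    "sig_c": [0.5, 0.5, 0.0],
}
summary = summarize_pairwise(toy)
```

Here sig_a and sig_b share no mass, so their distance is 1, while sig_c overlaps with both and lies closer to each.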