damuta.base module

class DataSet(counts: DataFrame, annotation: DataFrame | None = None)

Bases: object

Container for tabular data, allowing simple access to a mutation data set and corresponding annotation for each sample.

DataSet is instantiated from a pandas dataframe of mutation counts and, optionally, a pandas dataframe of sample annotations with the same number of rows. The dataframe index is taken as sample ids. All samples that appear in counts should also appear in annotation, and vice versa. Mutation types are expected to be in COSMIC format (e.g. A[C>A]A).

Parameters:
  • counts (pd.DataFrame) – Nx96 dataframe of mutation counts, one sample per row. Index is assumed to be sample ids.

  • annotation (pd.DataFrame) – Optional NxF dataframe of metadata features to annotate samples with. Index is assumed to be sample ids.
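
The requirement that counts and annotation share the same sample ids can be checked with plain pandas before constructing a DataSet. The following is a minimal sketch with hypothetical toy data (two samples, two of the 96 mutation types); it does not call damuta and is not necessarily how DataSet validates its inputs.

```python
import pandas as pd

# Hypothetical toy data: two samples, two of the 96 mutation-type columns.
counts = pd.DataFrame(
    {"A[C>A]A": [3, 0], "A[C>A]C": [1, 5]},
    index=["sample_1", "sample_2"],
)
annotation = pd.DataFrame(
    {"tissue": ["Liver", "Skin"]},
    index=["sample_2", "sample_1"],  # same ids, different order
)

# Sample ids present in one table but not the other (empty if they match).
mismatched = counts.index.symmetric_difference(annotation.index)
assert mismatched.empty, f"unmatched sample ids: {list(mismatched)}"

# Reorder annotation so its rows line up with counts.
annotation = annotation.loc[counts.index]
```

Reordering with .loc keeps each annotation row attached to its sample id, so both tables end up in the same row order.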

Examples

>>> import pandas as pd
>>> counts = pd.read_csv('tests/test_data/pcawg_counts.csv', index_col = 0, header = 0)
>>> annotation = pd.read_csv('tests/test_data/pcawg_cancer_types.csv', index_col = 0, header = 0)
>>> pcawg = DataSet(counts, annotation)
>>> pcawg.n_samples
2778
annotate_tissue_types(type_col) → array

Set a specified column of annotation as the sample tissue type

Tissue type information is used by hierarchical models to create a tissue-type prior. See class HierarchicalTandemLda for more details.

annotation: DataFrame = None
counts: DataFrame
property ids: list

List sample ids in dataset

property n_samples: int

Number of samples in dataset

class SignatureSet(signatures: DataFrame)

Bases: object

Container for tabular data, allowing simple access to a set of mutational signature definitions.

Parameters:

signatures (pd.DataFrame) – Nx96 dataframe of signature definitions, one signature per row. Rows must sum to 1.
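
A table of raw signature weights can be brought into the "rows must sum to 1" form with pandas. This is a minimal sketch with hypothetical values and only three mutation-type columns (a real table would have 96); it illustrates the constraint and does not use damuta.

```python
import numpy as np
import pandas as pd

# Hypothetical raw weights for two signatures over three mutation types.
raw = pd.DataFrame(
    [[2.0, 1.0, 1.0], [1.0, 3.0, 0.0]],
    index=["sig_a", "sig_b"],
    columns=["A[C>A]A", "A[C>A]C", "A[C>A]G"],
)

# Divide each row by its total so every signature sums to 1.
signatures = raw.div(raw.sum(axis=1), axis=0)

# Check the constraint up to floating-point tolerance.
assert np.allclose(signatures.sum(axis=1), 1.0)
```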

Examples

classmethod from_damage_misrepair(damage_signatures: DataFrame, misrepair_signatures: DataFrame)
property index: int

Names of signatures in dataset

property n_damage_sigs: DataFrame

Number of damage signatures in dataset

Damage signatures represent the distribution of mutations over 32 trinucleotide contexts. They are computed by marginalizing over substitution classes.
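
Marginalizing over substitution classes amounts to summing the columns that share a trinucleotide context. The sketch below assumes COSMIC-format column labels and uses a hypothetical helper, trinucleotide_context, with toy values; it illustrates the idea and may differ from damuta's own implementation.

```python
import pandas as pd


def trinucleotide_context(mutation_type: str) -> str:
    """Map a COSMIC label such as 'A[C>A]A' to its context 'ACA'."""
    return mutation_type[0] + mutation_type[2] + mutation_type[6]


# Hypothetical signature over four mutation types (a real one has 96).
signatures = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4]],
    index=["sig_a"],
    columns=["A[C>A]A", "A[C>G]A", "A[C>T]A", "T[T>C]G"],
)

# Group mutation types by context and sum, i.e. marginalize over
# substitutions. Transposing lets groupby act on the column labels.
damage = signatures.T.groupby(trinucleotide_context).sum().T
```

With all 96 mutation types present, the result has 32 columns, one per trinucleotide context.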

property n_misrepair_sigs: DataFrame

Number of misrepair signatures in dataset

Misrepair signatures represent the distribution of mutations over 6 substitution types. They are computed by marginalizing over trinucleotide context classes.
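
Marginalizing over trinucleotide contexts amounts to summing the columns that share a substitution type. The sketch below assumes COSMIC-format column labels and uses a hypothetical helper, substitution_type, with toy values; it illustrates the idea and may differ from damuta's own implementation.

```python
import pandas as pd


def substitution_type(mutation_type: str) -> str:
    """Map a COSMIC label such as 'A[C>A]A' to its substitution 'C>A'."""
    return mutation_type[2:5]


# Hypothetical signature over four mutation types (a real one has 96).
signatures = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4]],
    index=["sig_a"],
    columns=["A[C>A]A", "T[C>A]G", "A[C>T]A", "T[T>C]G"],
)

# Group mutation types by substitution and sum, i.e. marginalize over
# contexts. Transposing lets groupby act on the column labels.
misrepair = signatures.T.groupby(substitution_type).sum().T
```

With all 96 mutation types present, the result has 6 columns, one per substitution type.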

property n_sigs: int

Number of signatures in dataset

signatures: DataFrame
summarize_separation() → DataFrame

Summary statistics of pairwise cosine distances for signatures, damage signatures, and misrepair signatures.
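
For intuition, pairwise cosine distances and their summary statistics can be computed by hand with NumPy and pandas. This is a minimal sketch on hypothetical toy signatures; the exact statistics and layout returned by summarize_separation are not specified here, so pandas' describe() stands in as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical signatures over three mutation types (a real table has 96).
signatures = pd.DataFrame(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]],
    index=["sig_a", "sig_b", "sig_c"],
    columns=["A[C>A]A", "A[C>A]C", "A[C>A]G"],
)

# Cosine distance = 1 - cosine similarity, for every pair of rows.
values = signatures.to_numpy()
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
distances = 1.0 - unit @ unit.T

# Keep each unordered pair once (upper triangle, diagonal excluded).
rows, cols = np.triu_indices(len(signatures), k=1)
pairwise = pd.Series(distances[rows, cols], name="cosine_distance")

# Summary statistics: count, mean, std, min, quartiles, max.
summary = pairwise.describe()
```

The same steps could be repeated on damage and misrepair tables to compare how well separated each kind of signature is.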