Simulating and Loading Data

Damuta provides two classes for input data: DataSet and SignatureSet.

DataSet ensures that a counts dataframe and sample annotation can be easily aligned via matching on sample ids.

SignatureSet provides some simple methods for summarizing and understanding mutational signatures, as well as for extracting damage and misrepair signatures from COSMIC-format signatures.

Simulating Data

We can simulate a dataset of mutation counts using the function sim_parametric. We will simulate 500 samples containing 10,000 mutations each, with varying activities of 10 damage signatures and 8 misrepair signatures.

[1]:

from damuta.sim import sim_parametric

counts, params = sim_parametric(S=500, N=10000, n_damage_sigs=10, n_misrepair_sigs=8, seed=1992)
counts.head()

WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

[1]:

	A[C>A]A	A[C>A]C	A[C>A]G	A[C>A]T	C[C>A]A	C[C>A]C	C[C>A]G	C[C>A]T	G[C>A]A	G[C>A]C	...	C[T>G]G	C[T>G]T	G[T>G]A	G[T>G]C	G[T>G]G	G[T>G]T	T[T>G]A	T[T>G]C	T[T>G]G	T[T>G]T
simulated_sample_0	55	16	63	87	20	23	142	29	35	31	...	157	162	42	66	59	95	49	29	76	50
simulated_sample_1	122	92	133	135	73	43	116	19	17	22	...	118	168	21	4	92	29	51	104	101	55
simulated_sample_2	140	227	481	172	358	360	106	56	171	142	...	24	23	23	13	53	55	57	10	43	4
simulated_sample_3	19	18	38	56	9	190	31	35	75	29	...	387	96	69	461	426	212	552	43	30	203
simulated_sample_4	194	188	236	238	178	97	16	54	40	79	...	18	95	15	8	34	7	22	134	80	63

5 rows × 96 columns

Next, let’s use DataSet to organize our data.

We will simulate some metadata to annotate our 500 samples. Sample annotation is most applicable to pan-cancer data and is necessary when fitting Damuta’s HierarchicalTandemLda model, but the annotation slot of the DataSet is also useful for holding clinical metadata about each sample.

Note: Pan-cancer data is not required. All Damuta models can just as easily fit a dataset where all samples come from the same tissue type.

At minimum, the DataSet class acts as a container for a pandas DataFrame of mutation type counts. The metadata annotation is also a pandas DataFrame, in tidy format (i.e. each row is a sample and each column is a feature). Both the count data and the annotation data are indexed by sample id.
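
The role of the shared sample id index can be illustrated with plain pandas. The snippet below is a minimal sketch using hypothetical toy data (the sample ids s1 to s3 and the two count columns are made up); it shows the general idea of aligning two tables on their index and is not a description of how DataSet is implemented.

```python
import pandas as pd

# Hypothetical toy counts: three samples, two mutation types
counts = pd.DataFrame(
    {"A[C>A]A": [5, 2, 9], "A[C>A]C": [1, 0, 4]},
    index=["s1", "s2", "s3"],
)

# Hypothetical annotation for the same samples, listed in a different order
annotation = pd.DataFrame(
    {"tissue_type": ["Kidney-RCC", "Breast-AdenoCA", "Kidney-RCC"]},
    index=["s3", "s1", "s2"],
)

# Both tables must describe exactly the same set of samples
assert set(counts.index) == set(annotation.index)

# Reorder the annotation rows so they follow the order of the counts index
aligned = annotation.loc[counts.index]
print(aligned)
```

Because the match is made on the index rather than on row position, the two tables do not need to be sorted the same way before they are paired.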

In this example, we simulated the counts dataframe, but in principle any trinucleotide count data that can be loaded with pd.read_csv can be used.
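
As a rough illustration of loading counts from a file, the sketch below reads a small CSV with pd.read_csv. The CSV text, the sample ids, and the sample_id column name are hypothetical stand-ins; with real data, the file path would be passed to pd.read_csv in place of the in-memory buffer.

```python
import io

import pandas as pd

# Stand-in for a file on disk (hypothetical contents, three mutation types only)
csv_text = """sample_id,A[C>A]A,A[C>A]C,A[C>A]G
sample_a,12,3,7
sample_b,0,5,1
"""

# index_col=0 makes the sample id the DataFrame index
loaded_counts = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Mutation counts should be non-negative
assert (loaded_counts.values >= 0).all()
print(loaded_counts)
```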

[ ]:

import numpy as np
import pandas as pd
import damuta as da

# pick from 3 tissues
tissues = np.array(["Breast-AdenoCA", "Kidney-RCC", "ColoRect-AdenoCA"])

# pick from primary or metastatic tumour
types = np.array(['primary', 'metastatic'])


# randomly assign tissue type to samples
annotation = pd.DataFrame.from_dict({"tissue_type": tissues[np.random.choice(3,500)],
                                     "tumour_type": types[np.random.choice(2,500)]
                                     })
annotation = annotation.set_index(counts.index)
annotation.head()

	tissue_type	tumour_type
simulated_sample_0	Breast-AdenoCA	primary
simulated_sample_1	ColoRect-AdenoCA	metastatic
simulated_sample_2	ColoRect-AdenoCA	metastatic
simulated_sample_3	Breast-AdenoCA	primary
simulated_sample_4	ColoRect-AdenoCA	metastatic

Pair the counts and metadata with the DataSet class.

[3]:

simulated_data = da.DataSet(counts, annotation)
print(f"simulated_data contains {simulated_data.n_samples} samples")
print(simulated_data.ids[0:5])

simulated_data contains 500 samples
['simulated_sample_0', 'simulated_sample_1', 'simulated_sample_2', 'simulated_sample_3', 'simulated_sample_4']

Loading Signature Data

Lastly, let’s retrieve a set of mutational signatures from the COSMIC database.

[4]:

signatures = pd.read_csv("https://cancer.sanger.ac.uk/signatures/documents/452/COSMIC_v3.2_SBS_GRCh37.txt", sep='\t', index_col=0, header=0)

COSMIC = da.SignatureSet(signatures.T)
print(f"COSMIC contains {COSMIC.n_sigs} signatures")

COSMIC contains 78 signatures

Every COSMIC mutational signature can be rewritten as a product of a damage signature and a misrepair signature.

[5]:

from damuta.plotting import plot_damage_signatures, plot_misrepair_signatures, plot_cosmic_signatures

plot_damage_signatures(COSMIC.damage_signatures.loc[["SBS2", "SBS5", "SBS6"]])
plot_misrepair_signatures(COSMIC.misrepair_signatures.loc[["SBS2", "SBS5", "SBS6"]])
plot_cosmic_signatures(COSMIC.signatures.loc[["SBS2", "SBS5", "SBS6"]])

[5]:

<seaborn.axisgrid.FacetGrid at 0x2b4c3cf129a0>

The COSMIC signatures are a high-quality reference set, as can be seen in their high degree of separation (low cosine similarity between different signatures). However, the damage and misrepair components of this signature set are more similar to one another.

[6]:

COSMIC.summarize_separation()

[6]:

	Mutational signature similarity	Damage signature similarity	Misrepair signature similarity
count	3003.000000	3003.000000	3003.000000
mean	0.188203	0.339645	0.684479
std	0.188601	0.240654	0.196205
min	0.000346	0.001997	0.194228
25%	0.047759	0.137200	0.526309
50%	0.118585	0.286935	0.703217
75%	0.276046	0.518048	0.850554
max	0.979184	0.980951	0.999338
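
The count row above can be checked by hand: a set of n signatures has n(n-1)/2 unordered pairs, so the 78 COSMIC signatures give 3003 comparisons. The sketch below computes pairwise cosine similarities for a small set of random toy signatures using NumPy and pandas. The toy data and the helper n_pairs are hypothetical, and the code illustrates the general calculation; it is not necessarily how summarize_separation is implemented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy set: 4 signatures over 96 mutation types, each row sums to 1
sigs = pd.DataFrame(
    rng.dirichlet(np.ones(96), size=4),
    index=["sig_a", "sig_b", "sig_c", "sig_d"],
)

# Scale each row to unit length so that dot products are cosine similarities
unit = sigs.values / np.linalg.norm(sigs.values, axis=1, keepdims=True)
cosine = unit @ unit.T

# Keep each unordered pair once by taking the strict upper triangle
rows, cols = np.triu_indices(len(sigs), k=1)
pairwise = pd.Series(cosine[rows, cols])

# Summary statistics over all pairs (count, mean, quartiles, ...)
print(pairwise.describe())


def n_pairs(n):
    """Number of unordered pairs among n signatures."""
    return n * (n - 1) // 2


print(n_pairs(78))
```

A low mean and upper quartile in such a summary indicate that most signatures in the set point in clearly different directions.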

DAMUTA Signatures

We can also initialize a signature set from just the damage and misrepair signature definitions.

[7]:

import os
# Path on the author's machine; change this to your own copy of docs/examples
os.chdir('/home/harrigan/damuta-package/docs/examples')

[8]:

DAMUTA = da.SignatureSet.from_damage_misrepair(
    pd.read_csv('example_data/damage_signatures.csv', index_col=0),
    pd.read_csv('example_data/misrepair_signatures.csv', index_col=0))

[9]:

plot_damage_signatures(DAMUTA.damage_signatures.loc[["D1", "D2", "D3"]])
plot_misrepair_signatures(DAMUTA.misrepair_signatures.loc[["M1", "M2", "M3"]])

[9]:

<seaborn.axisgrid.FacetGrid at 0x2b4c3e0e3a60>

In this case, the signature set will also contain the outer product: all (unweighted) combinations of damage and misrepair signatures.

[10]:

DAMUTA.signatures

[10]:

	A[C>A]A	A[C>A]C	A[C>A]G	A[C>A]T	C[C>A]A	C[C>A]C	C[C>A]G	C[C>A]T	G[C>A]A	G[C>A]C	...	C[T>G]G	C[T>G]T	G[T>G]A	G[T>G]C	G[T>G]G	G[T>G]T	T[T>G]A	T[T>G]C	T[T>G]G	T[T>G]T
D1_M1	0.018900	0.006087	0.049454	0.006764	0.003098	0.003505	1.931383e-02	0.016475	0.032477	0.012318	...	0.000178	0.001199	0.000683	0.000283	0.000313	0.000501	0.003581	0.000477	0.000749	0.000913
D1_M2	0.045850	0.014768	0.119973	0.016409	0.007516	0.008502	4.685471e-02	0.039969	0.078787	0.029884	...	0.001342	0.009065	0.005168	0.002140	0.002369	0.003789	0.027079	0.003609	0.005662	0.006902
D1_M3	0.001538	0.000495	0.004024	0.000550	0.000252	0.000285	1.571713e-03	0.001341	0.002643	0.001002	...	0.000005	0.000032	0.000018	0.000008	0.000008	0.000014	0.000097	0.000013	0.000020	0.000025
D1_M4	0.001658	0.000534	0.004339	0.000593	0.000272	0.000308	1.694615e-03	0.001446	0.002850	0.001081	...	0.000316	0.002132	0.001215	0.000503	0.000557	0.000891	0.006369	0.000849	0.001332	0.001623
D1_M5	0.059906	0.019295	0.156753	0.021440	0.009820	0.011109	6.121874e-02	0.052222	0.102941	0.039046	...	0.000079	0.000531	0.000303	0.000125	0.000139	0.000222	0.001585	0.000211	0.000331	0.000404
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
D18_M2	0.036187	0.088653	0.017828	0.138128	0.001980	0.006050	1.689988e-03	0.084667	0.033732	0.060501	...	0.000455	0.001164	0.001911	0.002536	0.000773	0.001977	0.000882	0.001352	0.000847	0.000382
D18_M3	0.001214	0.002974	0.000598	0.004633	0.000066	0.000203	5.668964e-05	0.002840	0.001132	0.002029	...	0.000002	0.000004	0.000007	0.000009	0.000003	0.000007	0.000003	0.000005	0.000003	0.000001
D18_M4	0.001309	0.003206	0.000645	0.004996	0.000072	0.000219	6.112254e-05	0.003062	0.001220	0.002188	...	0.000107	0.000274	0.000449	0.000596	0.000182	0.000465	0.000207	0.000318	0.000199	0.000090
D18_M5	0.047281	0.115832	0.023294	0.180473	0.002587	0.007904	2.208080e-03	0.110624	0.044073	0.079049	...	0.000027	0.000068	0.000112	0.000148	0.000045	0.000116	0.000052	0.000079	0.000050	0.000022
D18_M6	0.000021	0.000052	0.000010	0.000080	0.000001	0.000004	9.836452e-07	0.000049	0.000020	0.000035	...	0.000117	0.000298	0.000489	0.000649	0.000198	0.000506	0.000226	0.000346	0.000217	0.000098

108 rows × 96 columns
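
The row labels above run from D1_M1 to D18_M6, which is consistent with 18 damage signatures combined with 6 misrepair signatures (18 × 6 = 108 rows). The sketch below shows one way such combinations could be formed with NumPy. It rests on two assumptions that are not taken from the Damuta source code: a damage signature is a distribution over the 32 trinucleotide contexts (16 with a central C, then 16 with a central T), and a misrepair signature holds, for each reference base, a distribution over its 3 possible substitutions. The toy signatures are random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_damage, n_misrepair = 3, 2

# Assumed layout: one distribution over 32 contexts per damage signature
damage = rng.dirichlet(np.ones(32), size=n_damage)  # shape (D, 32)

# Assumed layout: per reference base (C, T), a distribution over 3 substitutions
misrepair = rng.dirichlet(np.ones(3), size=(n_misrepair, 2))  # shape (M, 2, 3)

# Split contexts by reference base: (D, 2, 16)
damage_by_ref = damage.reshape(n_damage, 2, 16)

# combined[d, m, ref, sub, ctx] = damage[d, ref, ctx] * misrepair[m, ref, sub]
combined = np.einsum("drc,mrs->dmrsc", damage_by_ref, misrepair)

# Flatten to one row of 96 mutation types per (damage, misrepair) pair
labels = [
    f"D{d + 1}_M{m + 1}"
    for d in range(n_damage)
    for m in range(n_misrepair)
]
toy_signatures = pd.DataFrame(
    combined.reshape(n_damage * n_misrepair, 96), index=labels
)

# Each combined signature is itself a distribution over the 96 types
print(toy_signatures.sum(axis=1))
```

Under these assumptions each combined row sums to 1, because the misrepair probabilities sum to 1 within each reference base and the damage probabilities sum to 1 across all contexts.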

[11]:

plot_cosmic_signatures(DAMUTA.signatures.loc[["D1_M1", "D7_M4", "D18_M6"]])

[11]:

<seaborn.axisgrid.FacetGrid at 0x2b4c3d53c9a0>

[12]:

DAMUTA.summarize_separation()

[12]:

	Mutational signature similarity	Damage signature similarity	Misrepair signature similarity
count	5778.000000	153.000000	15.000000
mean	0.176980	0.270253	0.524924
std	0.194173	0.209966	0.220816
min	0.000980	0.031818	0.254266
25%	0.047516	0.120673	0.320952
50%	0.109671	0.184103	0.504210
75%	0.235473	0.367171	0.643424
max	0.999640	0.981909	0.893437