{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Models\n",
"\n",
"DAMUTA provides several latent variable models for probabilistic mutational signature analysis. Here, models are visualized with [graphviz](https://graphviz.org/).\n",
"\n",
"PyMC3 has some tutorials to help you get familiar with [Dirichlet-multinomial](https://www.pymc.io/projects/examples/en/latest/mixture_models/dirichlet_mixture_of_multinomials.html) models.\n",
"\n",
"## Choosing a Model\n",
"\n",
"### Overview of available models (Lda, TandemLda, HierarchicalTandemLda)\n",
"\n",
"DAMUTA offers three main types of models:\n",
"\n",
"1. **Lda (Latent Dirichlet Allocation)**:\n",
"   - Provided primarily as a baseline for comparison with other probabilistic models.\n",
"   - Not recommended for fitting COSMIC activities in practice.\n",
"\n",
"2. **TandemLda (Tandem Latent Dirichlet Allocation)**:\n",
"   - Use when you don't have hierarchical sample information.\n",
"   - Suitable for most mutational signature analysis tasks.\n",
"\n",
"3. **HierarchicalTandemLda (Hierarchical Tandem Latent Dirichlet Allocation)**:\n",
"   - Use when you have hierarchical sample information (e.g., tissue type).\n",
"   - Provides more nuanced analysis by incorporating sample metadata.\n",
"\n",
"### When to use each model\n",
"\n",
"- **HierarchicalTandemLda**:\n",
"  - Choose this model if you have hierarchical sample information, such as tissue type.\n",
"  - Its hierarchical prior incorporates information shared across samples of the same type.\n",
"\n",
"- **TandemLda**:\n",
"  - Use when you don't have hierarchical sample information.\n",
"  - Suitable for standard mutational signature analysis.\n",
"\n",
"- **Lda**:\n",
"  - Primarily used as a baseline for comparison with the other probabilistic models.\n",
"  - Not recommended for fitting COSMIC activities in practice.\n",
"  - If you need to fit COSMIC activities, consider using tools like SigProfiler or deconstructSigs instead.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import pymc3 as pm\n",
"import arviz as az\n",
"import numpy as np\n",
"import pandas as pd\n",
"import damuta as da\n",
"import matplotlib.pyplot as plt\n",
"from damuta.models import Lda, TandemLda, HierarchicalTandemLda"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Load data\n",
"counts = pd.read_csv('example_data/pcawg_counts.csv', index_col=0)\n",
"annotation = pd.read_csv('example_data/pcawg_cancer_types.csv', index_col=0)\n",
"pcawg = da.DataSet(counts, annotation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Baseline model\n",
"\n",
"A Dirichlet-multinomial model set up like latent Dirichlet allocation (LDA). It infers COSMIC-format, 96-dimensional mutational signatures and their activities.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Construct the PyMC3 model, then render its graph with graphviz\n",
"lda = Lda(pcawg, n_sigs=20)\n",
"lda._build_model(**lda._model_kwargs)\n",
"pm.model_graph.model_to_graphviz(lda.model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tandem LDA\n",
"\n",
"Two LDAs at once! This model infers damage and misrepair signatures and their activities.\n",
"\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Construct the PyMC3 model, then render its graph with graphviz\n",
"t_lda = TandemLda(pcawg, n_damage_sigs=18, n_misrepair_sigs=6)\n",
"t_lda._build_model(**t_lda._model_kwargs)\n",
"pm.model_graph.model_to_graphviz(t_lda.model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Hierarchical Tandem LDA\n",
"\n",
"The full Hierarchical Tandem LDA model is similar to the Tandem LDA model, with an added hierarchical prior that incorporates information shared across tissue types. It infers damage and misrepair signatures, their activities, and their tissue-specific sparsity.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# type_col names the annotation column that holds each sample's tissue type\n",
"ht_lda = HierarchicalTandemLda(pcawg, type_col=\"pcawg_class\", n_damage_sigs=18, n_misrepair_sigs=6)\n",
"ht_lda._build_model(**ht_lda._model_kwargs)\n",
"pm.model_graph.model_to_graphviz(ht_lda.model)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "damuta-dev",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.20"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}