{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Models\n", "\n", "DAMUTA provides several latent variable models for probabilistic mutational signature analysis. In this notebook, each model is visualized with [Graphviz](https://graphviz.org/).\n", "\n", "The PyMC documentation has a tutorial to help you get familiar with [Dirichlet-multinomial](https://www.pymc.io/projects/examples/en/latest/mixture_models/dirichlet_mixture_of_multinomials.html) models.\n", "\n", "## Choosing a Model\n", "\n", "### Overview of available models (Lda, TandemLda, HierarchicalTandemLda)\n", "\n", "DAMUTA offers three main types of models:\n", "\n", "1. **Lda (Latent Dirichlet Allocation)**:\n", " - Provided primarily as a baseline for comparison with other probabilistic models.\n", " - Not recommended for fitting COSMIC activities in practice.\n", "\n", "2. **TandemLda (Tandem Latent Dirichlet Allocation)**:\n", " - Use when you don't have hierarchical sample information.\n", " - Suitable for most mutational signature analysis tasks.\n", "\n", "3. 
**HierarchicalTandemLda (Hierarchical Tandem Latent Dirichlet Allocation)**:\n", " - Use when you have hierarchical sample information (e.g., tissue type).\n", " - Provides more nuanced analysis by incorporating sample metadata.\n", "\n", "### When to use each model\n", "\n", "- **HierarchicalTandemLda**: choose this model when you have hierarchical sample information, such as tissue type. Incorporating that sample metadata allows a more detailed analysis.\n", "- **TandemLda**: choose this model when you don't have hierarchical sample information. It is suitable for standard mutational signature analysis.\n", "- **Lda**: use this model only as a baseline for comparing other probabilistic models. It is not recommended for fitting COSMIC activities in practice; for that, consider tools like SigProfiler or deconstructSigs instead." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import pymc3 as pm\n", "import arviz as az\n", "import numpy as np\n", "import pandas as pd\n", "import damuta as da\n", "import matplotlib.pyplot as plt\n", "from damuta.models import Lda, TandemLda, HierarchicalTandemLda" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Load mutation counts and cancer-type annotations into a DAMUTA DataSet\n", "counts = pd.read_csv('example_data/pcawg_counts.csv', index_col=0)\n", "annotation = pd.read_csv('example_data/pcawg_cancer_types.csv', index_col=0)\n", "pcawg = da.DataSet(counts, annotation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Baseline model\n", "\n", "A Dirichlet-multinomial model set up like latent Dirichlet allocation (LDA). 
Infers COSMIC-format 96-dimensional mutational signatures and their activities.\n", "\n", "![LDA Model](example_data/lda.png)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Model graph (Graphviz rendering of the Lda model)\n", "Plates: 2,778 x 96; 20 x 96; 2,778 x 20\n", "Nodes: gamma_tau ~ Gamma; tau ~ Deterministic; gamma_theta ~ Gamma; theta ~ Deterministic; B ~ Deterministic; corpus ~ Multinomial; data ~ Data\n", "Edges: gamma_tau -> tau; tau -> B; gamma_theta -> theta; theta -> B; B -> corpus; corpus -> data" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda = Lda(pcawg, n_sigs=20)\n", "# Build the PyMC3 model so its graph can be rendered\n", "lda._build_model(**lda._model_kwargs)\n", "pm.model_graph.model_to_graphviz(lda.model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tandem LDA\n", "\n", "Two LDAs at once! Infers Damage and Misrepair signatures and their activities. 
\n", "\n", "![Tandem LDA Model](example_data/t_lda.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t_lda = TandemLda(pcawg, n_damage_sigs=18, n_misrepair_sigs=6)\n", "# Build the PyMC3 model so its graph can be rendered\n", "t_lda._build_model(**t_lda._model_kwargs)\n", "pm.model_graph.model_to_graphviz(t_lda.model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hierarchical Tandem LDA\n", "\n", "The full Hierarchical Tandem LDA model extends the Tandem LDA model with a hierarchical prior that incorporates information shared across tissue types. It infers Damage and Misrepair signatures, their activities, and their tissue-specific sparsity.\n", "\n", "![Hierarchical Tandem LDA Model](example_data/ht_lda.png)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Model graph (Graphviz rendering of the HierarchicalTandemLda model)\n", "Plates: 2,778 x 96; 18 x 32; 2,778 x 18; 37 x 6; 2,778 x 6; 18 x 2,778 x 6; 6 x 3; 6 x 2 x 3\n", "Nodes: gamma_phi ~ Gamma; phi ~ Deterministic; gamma_theta ~ Gamma; theta ~ Deterministic; a_t ~ Gamma; b_t ~ Gamma; gamma ~ Gamma; gamma_A ~ Gamma; A ~ Deterministic; gamma_etaC ~ Gamma; etaC ~ Deterministic; gamma_etaT ~ Gamma; etaT ~ Deterministic; eta ~ Deterministic; B ~ Deterministic; corpus ~ Multinomial; data ~ Data\n", "Edges: gamma_phi -> phi; phi -> B; gamma_theta -> theta; theta -> B; a_t -> gamma; b_t -> gamma; gamma -> gamma_A; gamma_A -> A; A -> B; gamma_etaC -> etaC; etaC -> eta; gamma_etaT -> etaT; etaT -> eta; eta -> B; B -> corpus; corpus -> data" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ht_lda = HierarchicalTandemLda(pcawg, type_col=\"pcawg_class\", n_damage_sigs=18, n_misrepair_sigs=6)\n", "# Build the PyMC3 model so its graph can be rendered\n", "ht_lda._build_model(**ht_lda._model_kwargs)\n", "pm.model_graph.model_to_graphviz(ht_lda.model)" ] } ], "metadata": { "kernelspec": { "display_name": "damuta-dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.20" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }