.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/plot_soil_example.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_plot_soil_example.py: Soil microbiome networks =========================================== Microbial abundance networks are a typical application of Graphical Lasso. [ref12]_ We have a look at a `dataset of 88 soil samples `_ from North and South America. We do two steps of preprocessing: * We filter on OTUs with a minimum frequeny of 100, obtaining :math:`N=88` samples of counts from :math:`p=116` OTUs. * We replace zero counts by adding a pseudocount of 1 to all entries. Bacterial composition is correlated with environmental variables, such as temperature, moisture or pH. In this experiment, we want to demonstrate that these confounding factors can be reconstructed using Graphical Lasso with latent variables. .. GENERATED FROM PYTHON SOURCE LINES 17-31 .. code-block:: Python # sphinx_gallery_thumbnail_number = 2 import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from scipy import stats from gglasso.helper.utils import sparsity, zero_replacement, normalize, log_transform from gglasso.problem import glasso_problem from gglasso.helper.basic_linalg import scale_array_by_diagonal .. GENERATED FROM PYTHON SOURCE LINES 32-34 For this, we first load the dataset and compute relative abundances (this is done by ``normalize``). Hence, we obtain compositional data where each sample is on the unit simplex. Typically, Graphical Lasso is not applied to compositional data directly. We apply the centered log-ratio transform (using ``log_transform``). .. GENERATED FROM PYTHON SOURCE LINES 34-45 .. code-block:: Python soil = pd.read_csv('../data/soil/processed/soil_116.csv', sep=',', index_col = 0).T print(soil.head()) X = normalize(soil) X = log_transform(X) (p,N) = X.shape print("Shape of the transformed data: (p,N)=", (p,N)) .. rst-class:: sphx-glr-script-out .. code-block:: none 103.CA2 103.CO3 103.SR3 103.IE2 ... 103.AR1 103.TL1 103.HI4 103.BB1 1124701 16.0 15.0 2.0 9.0 ... 1.0 1.0 1.0 1.0 697997 3.0 5.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 203969 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 205391 1.0 1.0 1.0 2.0 ... 1.0 7.0 1.0 1.0 843189 1.0 1.0 1.0 1.0 ... 1.0 2.0 2.0 1.0 [5 rows x 89 columns] Shape of the transformed data: (p,N)= (116, 89) .. GENERATED FROM PYTHON SOURCE LINES 46-48 The dataset also contains the pH value for each sample. We do not make use of this for estimating the network. We also calculate the sampling depth, i.e. the number of total counts per sample. .. GENERATED FROM PYTHON SOURCE LINES 48-59 .. code-block:: Python ph = pd.read_csv('../data/soil/processed/ph.csv', sep=',', index_col = 0) ph = ph.reindex(soil.columns) print(ph.head()) depth = soil.sum(axis=0) metadata = pd.read_table('../data/soil/original/88soils_modified_metadata.txt', index_col=0) temperature = metadata["annual_season_temp"].reindex(ph.index) .. rst-class:: sphx-glr-script-out .. code-block:: none ph 103.CA2 8.02 103.CO3 6.02 103.SR3 6.95 103.IE2 5.52 103.BP1 7.53 .. GENERATED FROM PYTHON SOURCE LINES 60-62 We compute the empirical covariance matrix and scale it to correlations. Then, we create an instance of ``glasso_problem`` and do model selection using a grid search. Note that we set ``latent=True`` because we want to account for unobserved latent factors. .. GENERATED FROM PYTHON SOURCE LINES 62-79 .. code-block:: Python S0 = np.cov(X.values, bias = True) S = scale_array_by_diagonal(S0) P = glasso_problem(S, N, latent = True, do_scaling = False) print(P) lambda1_range = np.logspace(0.5,-1.5,8) mu1_range = np.logspace(1.5,-0.2,6) modelselect_params = {'lambda1_range': lambda1_range, 'mu1_range': mu1_range} P.model_selection(modelselect_params = modelselect_params, method = 'eBIC', gamma = 0.25) print(P.reg_params) .. rst-class:: sphx-glr-script-out .. code-block:: none SINGLE GRAPHICAL LASSO PROBLEM WITH LATENT VARIABLES Regularization parameters: {'lambda1': None, 'mu1': None} ADMM terminated after 38 iterations with status: optimal. ADMM terminated after 41 iterations with status: optimal. ADMM terminated after 26 iterations with status: optimal. ADMM terminated after 25 iterations with status: optimal. ADMM terminated after 57 iterations with status: optimal. ADMM terminated after 61 iterations with status: optimal. ADMM terminated after 34 iterations with status: optimal. ADMM terminated after 41 iterations with status: optimal. ADMM terminated after 26 iterations with status: optimal. ADMM terminated after 25 iterations with status: optimal. ADMM terminated after 57 iterations with status: optimal. ADMM terminated after 61 iterations with status: optimal. ADMM terminated after 34 iterations with status: optimal. ADMM terminated after 41 iterations with status: optimal. ADMM terminated after 26 iterations with status: optimal. ADMM terminated after 25 iterations with status: optimal. ADMM terminated after 57 iterations with status: optimal. ADMM terminated after 61 iterations with status: optimal. ADMM terminated after 126 iterations with status: optimal. ADMM terminated after 53 iterations with status: optimal. ADMM terminated after 27 iterations with status: optimal. ADMM terminated after 24 iterations with status: optimal. ADMM terminated after 57 iterations with status: optimal. ADMM terminated after 61 iterations with status: optimal. ADMM terminated after 148 iterations with status: optimal. ADMM terminated after 149 iterations with status: optimal. ADMM terminated after 52 iterations with status: optimal. ADMM terminated after 38 iterations with status: optimal. ADMM terminated after 56 iterations with status: optimal. ADMM terminated after 62 iterations with status: optimal. ADMM terminated after 95 iterations with status: optimal. ADMM terminated after 66 iterations with status: optimal. ADMM terminated after 66 iterations with status: optimal. ADMM terminated after 58 iterations with status: optimal. ADMM terminated after 55 iterations with status: optimal. ADMM terminated after 62 iterations with status: optimal. ADMM terminated after 116 iterations with status: optimal. ADMM terminated after 53 iterations with status: optimal. ADMM terminated after 53 iterations with status: optimal. ADMM terminated after 53 iterations with status: optimal. ADMM terminated after 63 iterations with status: optimal. ADMM terminated after 82 iterations with status: optimal. ADMM terminated after 126 iterations with status: optimal. ADMM terminated after 55 iterations with status: optimal. ADMM terminated after 56 iterations with status: optimal. ADMM terminated after 56 iterations with status: optimal. ADMM terminated after 56 iterations with status: optimal. ADMM terminated after 84 iterations with status: optimal. {'lambda1': 0.22758459260747887, 'mu1': 6.606934480075961} .. GENERATED FROM PYTHON SOURCE LINES 80-95 The Graphical Lasso solution is of the form :math:`\Theta - L` where :math:`\Theta` is sparse and :math:`L` has low rank. We use the low rank component of the Graphical Lasso solution in order to do a robust PCA. For this, we use the eigendecomposition .. math:: L = V \Sigma V^T where the columns of :math:`V` are the orthonormal eigenvecors and :math:`\Sigma` is diagonal containing the eigenvalues. Denote the columns of :math:`V` corresponding only to positive eigenvalues with :math:`\tilde{V} \in \mathbb{R}^{p\times r}` and :math:`\tilde{\Sigma} \in \mathbb{R}^{r\times r}` accordingly, where :math:`r=\mathrm{rank}(L)`. Then we have .. math:: L = \tilde{V} \tilde{\Sigma} \tilde{V}^T. Now we project the data matrix :math:`X\in \mathbb{R}^{p\times N}` onto the eigenspaces of :math:`L^{-1}` - which are the same as of :math:`L` - by computing .. math:: U := X^T \tilde{V}\tilde{\Sigma}. .. GENERATED FROM PYTHON SOURCE LINES 95-119 .. code-block:: Python def robust_PCA(X, L, inverse=True): sig, V = np.linalg.eigh(L) # sort eigenvalues in descending order sig = sig[::-1] V = V[:,::-1] ind = np.argwhere(sig > 1e-9) if inverse: loadings = V[:,ind] @ np.diag(np.sqrt(1/sig[ind])) else: loadings = V[:,ind] @ np.diag(np.sqrt(sig[ind])) # compute the projection zu = X.values.T @ loadings return zu, loadings, np.round(sig[ind].squeeze(),3) L = P.solution.lowrank_ proj, loadings, eigv = robust_PCA(X, L, inverse=True) r = np.linalg.matrix_rank(L) .. GENERATED FROM PYTHON SOURCE LINES 120-125 We scanned the correlation between metadata variables and the two low-rank components. The largest correlations were found for a) pH and b) temperature. We compute the projections from the PCA implied by Graphical Lasso and plot * the projection of each sample onto the first low-rank component vs. the original pH value. * the projection of each sample onto the second low-rank component vs. the original temperature value. .. GENERATED FROM PYTHON SOURCE LINES 127-129 pH vs. PCA1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. GENERATED FROM PYTHON SOURCE LINES 129-139 .. code-block:: Python fig, ax = plt.subplots(1,1) im = ax.scatter(proj[:,0], ph, c = depth, cmap = plt.cm.Blues, vmin = 0) cbar = fig.colorbar(im) cbar.set_label("Sampling depth") ax.set_xlabel(f"PCA component 1 with eigenvalue {eigv[0]}") ax.set_ylabel("pH") print("Spearman correlation between pH and 1st component: {0}, p-value: {1}".format(stats.spearmanr(ph, proj[:,0])[0], stats.spearmanr(ph, proj[:,0])[1])) .. image-sg:: /auto_examples/images/sphx_glr_plot_soil_example_001.png :alt: plot soil example :srcset: /auto_examples/images/sphx_glr_plot_soil_example_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Spearman correlation between pH and 1st component: -0.8599836576738641, p-value: 3.777130551783766e-27 .. GENERATED FROM PYTHON SOURCE LINES 140-142 Temperature vs. PCA2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. GENERATED FROM PYTHON SOURCE LINES 142-153 .. code-block:: Python fig, ax = plt.subplots(1,1) im = ax.scatter(proj[:,1], temperature, c = depth, cmap = plt.cm.Blues, vmin = 0) cbar = fig.colorbar(im) cbar.set_label("Sampling depth") ax.set_xlabel(f"PCA component 2 with eigenvalue {eigv[1]}") ax.set_ylabel("Temperature") print("Spearman correlation between temperature and 2nd component: {0}, p-value: {1}".format(stats.spearmanr(temperature, proj[:,1])[0], stats.spearmanr(temperature, proj[:,1])[1])) .. image-sg:: /auto_examples/images/sphx_glr_plot_soil_example_002.png :alt: plot soil example :srcset: /auto_examples/images/sphx_glr_plot_soil_example_002.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Spearman correlation between temperature and 2nd component: -0.5900116855285238, p-value: 1.1690533739643573e-09 .. GENERATED FROM PYTHON SOURCE LINES 154-154 We see that the projection of the sample data onto the low-rank components of Graphical Lasso is highly correlated to environmental confounders. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 36.600 seconds) .. _sphx_glr_download_auto_examples_plot_soil_example.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_soil_example.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_soil_example.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_soil_example.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_