.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_soil_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_soil_example.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_soil_example.py:


Soil microbiome networks
===========================================

Microbial abundance networks are a typical application of Graphical Lasso. [ref12]_

We have a look at a `dataset of 88 soil samples <https://pubmed.ncbi.nlm.nih.gov/19502440/>`_ from North and South America. We do two steps of preprocessing:

* We filter on OTUs with a minimum frequeny of 100, obtaining :math:`N=88` samples of counts from :math:`p=116` OTUs.
* We replace zero counts by adding a pseudocount of 1 to all entries.

Bacterial composition is correlated with environmental variables, such as temperature, moisture or pH.
In this experiment, we want to demonstrate that these confounding factors can be reconstructed using Graphical Lasso with latent variables.

 
.. GENERATED FROM PYTHON SOURCE LINES 17-31

.. code-block:: Python

    # sphinx_gallery_thumbnail_number = 2

    import numpy as np
    import pandas as pd

    import seaborn as sns
    import matplotlib.pyplot as plt
    from scipy import stats

    from gglasso.helper.utils import sparsity, zero_replacement, normalize, log_transform
    from gglasso.problem import glasso_problem
    from gglasso.helper.basic_linalg import scale_array_by_diagonal


.. GENERATED FROM PYTHON SOURCE LINES 32-34

For this, we first load the dataset and compute relative abundances (this is done by ``normalize``). Hence, we obtain compositional data where each sample is on the unit simplex.
Typically, Graphical Lasso is not applied to compositional data directly. We apply the centered log-ratio transform (using ``log_transform``). 

.. GENERATED FROM PYTHON SOURCE LINES 34-45

.. code-block:: Python


    soil = pd.read_csv('../data/soil/processed/soil_116.csv', sep=',', index_col = 0).T
    print(soil.head())

    X = normalize(soil)

    X = log_transform(X)
    (p,N) = X.shape

    print("Shape of the transformed data: (p,N)=", (p,N))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

             103.CA2  103.CO3  103.SR3  103.IE2  ...  103.AR1  103.TL1  103.HI4  103.BB1
    1124701     16.0     15.0      2.0      9.0  ...      1.0      1.0      1.0      1.0
    697997       3.0      5.0      1.0      1.0  ...      1.0      1.0      1.0      1.0
    203969       1.0      1.0      1.0      1.0  ...      1.0      1.0      1.0      1.0
    205391       1.0      1.0      1.0      2.0  ...      1.0      7.0      1.0      1.0
    843189       1.0      1.0      1.0      1.0  ...      1.0      2.0      2.0      1.0

    [5 rows x 89 columns]
    Shape of the transformed data: (p,N)= (116, 89)


.. GENERATED FROM PYTHON SOURCE LINES 46-48

The dataset also contains the pH value for each sample. We do not make use of this for estimating the network.
We also calculate the sampling depth, i.e. the number of total counts per sample.

.. GENERATED FROM PYTHON SOURCE LINES 48-59

.. code-block:: Python


    ph = pd.read_csv('../data/soil/processed/ph.csv', sep=',', index_col = 0)
    ph = ph.reindex(soil.columns)
    print(ph.head())

    depth = soil.sum(axis=0)

    metadata = pd.read_table('../data/soil/original/88soils_modified_metadata.txt', index_col=0)

    temperature = metadata["annual_season_temp"].reindex(ph.index)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

               ph
    103.CA2  8.02
    103.CO3  6.02
    103.SR3  6.95
    103.IE2  5.52
    103.BP1  7.53


.. GENERATED FROM PYTHON SOURCE LINES 60-62

We compute the empirical covariance matrix and scale it to correlations. Then, we create an instance of ``glasso_problem`` and do model selection using a grid search.
Note that we set ``latent=True`` because we want to account for unobserved latent factors.

.. GENERATED FROM PYTHON SOURCE LINES 62-79

.. code-block:: Python


    S0 = np.cov(X.values, bias = True)
    S = scale_array_by_diagonal(S0)


    P = glasso_problem(S, N, latent = True, do_scaling = False)
    print(P)

    lambda1_range = np.logspace(0.5,-1.5,8)
    mu1_range = np.logspace(1.5,-0.2,6)

    modelselect_params = {'lambda1_range': lambda1_range, 'mu1_range': mu1_range}

    P.model_selection(modelselect_params = modelselect_params, method = 'eBIC', gamma = 0.25)

    print(P.reg_params)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

 
    SINGLE GRAPHICAL LASSO PROBLEM WITH LATENT VARIABLES
    Regularization parameters:
    {'lambda1': None, 'mu1': None}
    ADMM terminated after 38 iterations with status: optimal.
    ADMM terminated after 41 iterations with status: optimal.
    ADMM terminated after 26 iterations with status: optimal.
    ADMM terminated after 25 iterations with status: optimal.
    ADMM terminated after 57 iterations with status: optimal.
    ADMM terminated after 61 iterations with status: optimal.
    ADMM terminated after 34 iterations with status: optimal.
    ADMM terminated after 41 iterations with status: optimal.
    ADMM terminated after 26 iterations with status: optimal.
    ADMM terminated after 25 iterations with status: optimal.
    ADMM terminated after 57 iterations with status: optimal.
    ADMM terminated after 61 iterations with status: optimal.
    ADMM terminated after 34 iterations with status: optimal.
    ADMM terminated after 41 iterations with status: optimal.
    ADMM terminated after 26 iterations with status: optimal.
    ADMM terminated after 25 iterations with status: optimal.
    ADMM terminated after 57 iterations with status: optimal.
    ADMM terminated after 61 iterations with status: optimal.
    ADMM terminated after 126 iterations with status: optimal.
    ADMM terminated after 53 iterations with status: optimal.
    ADMM terminated after 27 iterations with status: optimal.
    ADMM terminated after 24 iterations with status: optimal.
    ADMM terminated after 57 iterations with status: optimal.
    ADMM terminated after 61 iterations with status: optimal.
    ADMM terminated after 148 iterations with status: optimal.
    ADMM terminated after 149 iterations with status: optimal.
    ADMM terminated after 52 iterations with status: optimal.
    ADMM terminated after 38 iterations with status: optimal.
    ADMM terminated after 56 iterations with status: optimal.
    ADMM terminated after 62 iterations with status: optimal.
    ADMM terminated after 95 iterations with status: optimal.
    ADMM terminated after 66 iterations with status: optimal.
    ADMM terminated after 66 iterations with status: optimal.
    ADMM terminated after 58 iterations with status: optimal.
    ADMM terminated after 55 iterations with status: optimal.
    ADMM terminated after 62 iterations with status: optimal.
    ADMM terminated after 116 iterations with status: optimal.
    ADMM terminated after 53 iterations with status: optimal.
    ADMM terminated after 53 iterations with status: optimal.
    ADMM terminated after 53 iterations with status: optimal.
    ADMM terminated after 63 iterations with status: optimal.
    ADMM terminated after 82 iterations with status: optimal.
    ADMM terminated after 126 iterations with status: optimal.
    ADMM terminated after 55 iterations with status: optimal.
    ADMM terminated after 56 iterations with status: optimal.
    ADMM terminated after 56 iterations with status: optimal.
    ADMM terminated after 56 iterations with status: optimal.
    ADMM terminated after 84 iterations with status: optimal.
    {'lambda1': 0.22758459260747887, 'mu1': 6.606934480075961}


.. GENERATED FROM PYTHON SOURCE LINES 80-95

The Graphical Lasso solution is of the form :math:`\Theta - L` where :math:`\Theta` is sparse and :math:`L` has low rank.
We use the low rank component of the Graphical Lasso solution in order to do a robust PCA. For this, we use the eigendecomposition

.. math::
  L = V \Sigma V^T
where the columns of :math:`V` are the orthonormal eigenvecors and :math:`\Sigma` is diagonal containing the eigenvalues.
Denote the columns of :math:`V` corresponding only to positive eigenvalues with :math:`\tilde{V} \in \mathbb{R}^{p\times r}` and :math:`\tilde{\Sigma} \in \mathbb{R}^{r\times r}` accordingly, where :math:`r=\mathrm{rank}(L)`. Then we have 

.. math::
  L = \tilde{V} \tilde{\Sigma} \tilde{V}^T.
Now we project the data matrix :math:`X\in \mathbb{R}^{p\times N}` onto the eigenspaces of :math:`L^{-1}` - which are the same as of :math:`L` - by computing

.. math::
  U := X^T \tilde{V}\tilde{\Sigma}.


.. GENERATED FROM PYTHON SOURCE LINES 95-119

.. code-block:: Python


    def robust_PCA(X, L, inverse=True):
        sig, V = np.linalg.eigh(L)
       
        # sort eigenvalues in descending order
        sig = sig[::-1]
        V = V[:,::-1]    
    
        ind = np.argwhere(sig > 1e-9)
    
        if inverse:
            loadings = V[:,ind] @ np.diag(np.sqrt(1/sig[ind]))
        else:
            loadings = V[:,ind] @ np.diag(np.sqrt(sig[ind]))
    
        # compute the projection
        zu = X.values.T @ loadings
    
        return zu, loadings, np.round(sig[ind].squeeze(),3)

    L = P.solution.lowrank_
    proj, loadings, eigv = robust_PCA(X, L, inverse=True)
    r = np.linalg.matrix_rank(L)


.. GENERATED FROM PYTHON SOURCE LINES 120-125

We scanned the correlation between metadata variables and the two low-rank components. The largest correlations were found for a) pH and b) temperature.
We compute the projections from the PCA implied by Graphical Lasso and plot 

* the projection of each sample onto the first low-rank component vs. the original pH value.
* the projection of each sample onto the second low-rank component vs. the original temperature value.

.. GENERATED FROM PYTHON SOURCE LINES 127-129

pH vs. PCA1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 129-139

.. code-block:: Python

    fig, ax = plt.subplots(1,1)
    im = ax.scatter(proj[:,0], ph, c = depth, cmap = plt.cm.Blues, vmin = 0)
    cbar = fig.colorbar(im)
    cbar.set_label("Sampling depth")
    ax.set_xlabel(f"PCA component 1 with eigenvalue {eigv[0]}")
    ax.set_ylabel("pH")

    print("Spearman correlation between pH and 1st component: {0}, p-value: {1}".format(stats.spearmanr(ph, proj[:,0])[0], 
                                                                                  stats.spearmanr(ph, proj[:,0])[1]))


.. image-sg:: /auto_examples/images/sphx_glr_plot_soil_example_001.png
   :alt: plot soil example
   :srcset: /auto_examples/images/sphx_glr_plot_soil_example_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Spearman correlation between pH and 1st component: -0.8599836576738641, p-value: 3.777130551783766e-27


.. GENERATED FROM PYTHON SOURCE LINES 140-142

Temperature vs. PCA2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 142-153

.. code-block:: Python

    fig, ax = plt.subplots(1,1)
    im = ax.scatter(proj[:,1], temperature, c = depth, cmap = plt.cm.Blues, vmin = 0)
    cbar = fig.colorbar(im)
    cbar.set_label("Sampling depth")
    ax.set_xlabel(f"PCA component 2 with eigenvalue {eigv[1]}")
    ax.set_ylabel("Temperature")

    
    print("Spearman correlation between temperature and 2nd component: {0}, p-value: {1}".format(stats.spearmanr(temperature, proj[:,1])[0], 
                                                                                  stats.spearmanr(temperature, proj[:,1])[1]))
    

.. image-sg:: /auto_examples/images/sphx_glr_plot_soil_example_002.png
   :alt: plot soil example
   :srcset: /auto_examples/images/sphx_glr_plot_soil_example_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Spearman correlation between temperature and 2nd component: -0.5900116855285238, p-value: 1.1690533739643573e-09


.. GENERATED FROM PYTHON SOURCE LINES 154-154

We see that the projection of the sample data onto the low-rank components of Graphical Lasso is highly correlated to environmental confounders.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 36.600 seconds)


.. _sphx_glr_download_auto_examples_plot_soil_example.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_soil_example.ipynb <plot_soil_example.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_soil_example.py <plot_soil_example.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_soil_example.zip <plot_soil_example.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_