Skip to content

Integrative single-cell transcriptome analysis reveals a subpopulation of fibroblasts associated with favorable prognosis of liver cancer patients

Authors H. Wang, C. Feng, M. Lu

Introduction

An integrative single cell transcriptomic data was performed between normal and tumor patients from the liver tissue. Here we show the power of integration-ready biomedical molecular data to identify markers associated with fibroblast subpopulation.

Normal and tumor single cell datasets are shown to readily integrate due to the standardised data structure on Liver OmixAtlas.

In order to evaluate the clinical significance of those markers, we utilise curated transcriptomics data for liver hepatocellular carcinoma to perform survival analysis in cancer patients. The overall analysis showed that overexpression of SPARCL1 is associated with favourable prognosis of liver cancer patients.

Import functions

Please follow the instructions below

1. Check the appendix section below to install necessary packages
2. Refresh the kernel after installation is complete

import re
from cmapPy.pandasGEXpress.parse_gct import parse
from matplotlib.pyplot import rc_context
import itertools
import pandas as pd
import numpy as np
import GEOparse
from sh import gunzip
import wget
import urllib.request
import os

Defining custom functions to use

def get_data_geo(accession):

    if not os.path.isfile('./metadata/'+accession+'_family.soft.gz'): 

        GEOparse.get_GEO(geo=accession, destdir="./metadata/", silent=True)  
        gunzip('./metadata/'+accession+'_family.soft.gz')

    os.makedirs(accession, exist_ok = True)

    with open('./metadata/'+accession+'_family.soft', 'rt') as f:
        for line in f:
            if('!Series_supplementary_file' in line):
                file = str(line.partition("=")[2]).strip()
                urllib.request.urlretrieve(file, accession+'/'+file.split('/')[-1])
            else:
                continue
    f.close()

1. Querying Biomedical Molecular data

1.1 On Polly Liver OmixAtlas

1.1.1 Install Polly Python

!pip3 install polly-python --user
Looking in indexes: https://pypi.org/simple, http://54.245.179.143:80/
Collecting polly-python
  Downloading https://files.pythonhosted.org/packages/a5/5c/3a3d40cdd4d61e99e7da4a0c2899a6d5e81c093667abf25aba2f157d18db/polly_python-0.0.4-py3-none-any.whl
Requirement already satisfied: idna in /usr/local/lib/python3.6/dist-packages (from polly-python) (2.8)
Collecting python-magic (from polly-python)
  Downloading https://files.pythonhosted.org/packages/d3/99/c89223c6547df268596899334ee77b3051f606077317023617b1c43162fb/python_magic-0.4.24-py2.py3-none-any.whl
Collecting postpy2 (from polly-python)
  Downloading https://files.pythonhosted.org/packages/7e/25/d98f5ea87d937bf44963be78b504dbb3ddea1c6ddfcc0985b02d062b41dd/postpy2-0.0.6-py3-none-any.whl
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from polly-python) (1.1.5)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from polly-python) (2020.6.20)
Requirement already satisfied: chardet in /usr/local/lib/python3.6/dist-packages (from polly-python) (3.0.4)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.6/dist-packages (from polly-python) (1.26.3)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from polly-python) (2.25.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (2.8.0)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (2019.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.7.3->pandas->polly-python) (1.14.0)
Installing collected packages: python-magic, postpy2, polly-python
Successfully installed polly-python-0.0.4 postpy2-0.0.6 python-magic-0.4.24
WARNING: You are using pip version 19.2.3, however version 21.1.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
from polly.omixatlas import OmixAtlas

1.1.2 Connect to Polly OmixAtlas

Get Authentication Tokens

#REFRESH_TOKEN = "ENTER REFRESH TOKEN WITHIN THE DOUBLE QUOTES"
repo_client = OmixAtlas(REFRESH_TOKEN)

See a list of available Omix Atlases

pd.DataFrame.from_dict(repo_client.get_all_omixatlas()['data'])
repo_name repo_id indexes dataset_count disease_count diseases organism_count organisms sources datatypes sample_count
0 liveromix_atlas 1615965444377 {'gct_metadata': 'liveromix_atlas_gct_metadata... 6760 761 [normal, carcinoma, hepatocellular, obesity, n... 22 [homo sapiens, mus musculus, rattus norvegicus... [geo, lincs, tcga, metabolomics workbench, met... [transcriptomics, mutation, metabolomics, sing... 1738079

1.1.3 Get Liver OmixAtlas Summary

LiverOmixAltasID = "1615965444377"
LiverOmixAtlas_summary = repo_client.omixatlas_summary(LiverOmixAltasID)
pd.DataFrame.from_dict(LiverOmixAtlas_summary, orient="index")
repo_name repo_id indexes dataset_count disease_count diseases organism_count organisms sources datatypes sample_count
data liveromix_atlas 1615965444377 {'gct_metadata': 'liveromix_atlas_gct_metadata... 6760 761 [normal, carcinoma, hepatocellular, obesity, n... 22 [homo sapiens, mus musculus, rattus norvegicus... [geo, lincs, tcga, metabolomics workbench, met... [transcriptomics, mutation, metabolomics, sing... 1738079
#Data sources in Liver OmixAtlas
LiverOmixAtlas_summary['data'].get('sources')
['geo',
 'lincs',
 'tcga',
 'metabolomics workbench',
 'metabolights',
 'ccle',
 'gene expression omnibus (geo)',
 'cptac',
 'depmap',
 'human protein atlas']
#Data Types in Liver OmixAtlas
LiverOmixAtlas_summary['data'].get('datatypes')
['transcriptomics',
 'mutation',
 'metabolomics',
 'single cell',
 'proteomics',
 'lipidomics',
 'mirna',
 'drug screens',
 'gene dependency',
 'gene effect']
LiverOmixAtlas_summary['data'].get('indexes')
{'gct_metadata': 'liveromix_atlas_gct_metadata',
 'h5ad_metadata': 'liveromix_atlas_h5ad_metadata',
 'csv': 'liveromix_atlas_csv',
 'files': 'liveromix_atlas_files',
 'json': 'liveromix_atlas_json',
 'ipynb': 'liveromix_atlas_ipynb',
 'gct_data': 'liveromix_atlas_gct_data',
 'h5ad_data': 'liveromix_atlas_h5ad_data'}

1.1.4 Querying Single Cell RNASeq data on Liver OmixAtlas

All data in Liver Atlas are structured with metadata fields tagged with ontologies and controlled vocabularies, which simplifies finding relevant datasets

To access, filter, and search the metadata in Liver OmixAtlas, the function mentioned below can be used: query_metadata (“query written in SQL”)

For integrative single cell analysis, the authors used single cell transcriptomics data from:
1. 5 normal patients GSE115469
2. 12 cancer patients GSE125449

The study investigated a total of 8439 cells from normal patients and and 9946 cells from liver cancer tissues.

For querying the Single Cell RNASeq datasets we utilised the GEO Accession ID that was made available in the publication.

Fetching liver tumor dataset

sc_tumor_query = """select * FROM liveromix_atlas_files 
        WHERE kw_data_type = 'Single cell'
        AND organism = 'Homo sapiens'
        AND (disease = 'Carcinoma, Hepatocellular'
            OR disease = 'Tumor')
        """

sc_tumor_query_df = repo_client.query_metadata(sc_tumor_query)

In this case the query returned the datasets that are required, but requires additional filtering

sc_tumor_query_df
file_type tissue disease dataset_id organism dataset_source platform description kw_data_type kw_cell_type curation_version source_process publication kw_cell_line kw_drug kw_gene kw_repo kw_package kw_key kw_bucket kw_filetype kw_region kw_location kw_timestamp summary overall_design pubmed_id total_num_samples total_num_cells
0 h5ad [liver, peripheral blood] [Carcinoma, Hepatocellular] GSE98638_GPL16791 [Homo sapiens] GEO GPL16791 [Landscape of infiltrating T cells in liver ca... Single cell [helper T cell, regulatory T cell] g2 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [CD8A, CD4, IL2RA, TTR, PTCHD3, PTH] liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE98638_GPL16791.... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626796397637 NaN NaN NaN NaN NaN
1 h5ad [liver] [Cholangiocarcinoma, Carcinoma, Hepatocellular] GSE125449_GPL20301 Homo sapiens GEO GPL20301 Tumor cell biodiversity drives microenvironmen... Single cell [] NaN NaN NaN NaN NaN NaN liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE125449_GPL20301... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626795852462 Single-cell transcriptome profiling of liver c... [A total of 19 tumors were profiled. Set 1 con... 31588021 NaN NaN
2 h5ad [liver] Tumor GSE125449_GPL18573 Homo sapiens GEO GPL18573 Tumor cell biodiversity drives microenvironmen... Single cell [] NaN NaN NaN NaN NaN NaN liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE125449_GPL18573... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626622880810 Single-cell transcriptome profiling of liver c... [A total of 19 tumors were profiled. Set 1 con... 31588021 NaN NaN
3 h5ad [blood] [Carcinoma, Hepatocellular] GSE107747_GPL16791 [Homo sapiens] GEO GPL16791 Single-cell RNA sequencing of circulating tumo... Single cell [neoplastic cell] g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [None] liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE107747_GPL16791... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626795662655 NaN NaN NaN 2 10184
4 h5ad [liver] [Carcinoma, Hepatocellular] GSE112271_GPL16791 [Homo sapiens] GEO GPL16791 Single-cell RNA of multiregional sampling in h... Single cell [None] g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [None] liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE112271_GPL16791... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626795689768 NaN NaN NaN 7 31279

We perform a partial match using the GSE ID provided in the publication

sc_tumor_query_df_filt = sc_tumor_query_df[sc_tumor_query_df['dataset_id'].str.contains('GSE125449')]
sc_tumor_query_df_filt
file_type tissue disease dataset_id organism dataset_source platform description kw_data_type kw_cell_type curation_version source_process publication kw_cell_line kw_drug kw_gene kw_repo kw_package kw_key kw_bucket kw_filetype kw_region kw_location kw_timestamp summary overall_design pubmed_id total_num_samples total_num_cells
1 h5ad [liver] [Cholangiocarcinoma, Carcinoma, Hepatocellular] GSE125449_GPL20301 Homo sapiens GEO GPL20301 Tumor cell biodiversity drives microenvironmen... Single cell [] NaN NaN NaN NaN NaN NaN liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE125449_GPL20301... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626795852462 Single-cell transcriptome profiling of liver c... [A total of 19 tumors were profiled. Set 1 con... 31588021 NaN NaN
2 h5ad [liver] Tumor GSE125449_GPL18573 Homo sapiens GEO GPL18573 Tumor cell biodiversity drives microenvironmen... Single cell [] NaN NaN NaN NaN NaN NaN liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE125449_GPL18573... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626622880810 Single-cell transcriptome profiling of liver c... [A total of 19 tumors were profiled. Set 1 con... 31588021 NaN NaN
sc_tumor_id = list(sc_tumor_query_df_filt['dataset_id'])
sc_tumor_id
['GSE125449_GPL20301', 'GSE125449_GPL18573']

Download single cell tumor data from OmixAtlas

url = repo_client.download_data(LiverOmixAltasID, sc_tumor_id[0]).get('data')
file_name = sc_tumor_id[0]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")
0
url = repo_client.download_data(LiverOmixAltasID, sc_tumor_id[1]).get('data')
file_name = sc_tumor_id[1]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")
0

Fetching normal liver dataset

sc_normal_query = """select * FROM liveromix_atlas_files 
        WHERE disease = 'Normal' 
        AND kw_data_type = 'Single cell'
        AND organism = 'Homo sapiens'"""

sc_normal_query_df = repo_client.query_metadata(sc_normal_query)
sc_normal_query_df
file_type tissue disease dataset_id organism dataset_source platform description kw_data_type kw_cell_type curation_version source_process publication kw_cell_line kw_drug kw_gene total_num_samples total_num_cells kw_repo kw_package kw_key kw_bucket kw_filetype kw_region kw_location kw_timestamp authors year geo_series_id geo_platform_id pubmed_ids title summary submission_date species cell_types
0 h5ad [skin, liver, lung] [Normal] GSE133345_GPL20795 [Homo sapiens] GEO GPL20795 Deciphering human macrophage development at si... Single cell [tissue-resident macrophage, embryonic cell] g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [PTPRC, GYPA] 9 1268 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE133345_GPL20795... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626363659950 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 h5ad [liver] Normal GSE115469_GPL16791 Homo sapiens GEO 10x Single cell RNA sequencing of human liver reve... Single cell [] NaN NaN https://www.nature.com/articles/s41467-018-063... NaN NaN NaN NaN NaN liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE115469_GPL16791... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626622850659 Sonya A. MacParland, Ian D McGilvray 2018 NaN NaN NaN NaN NaN NaN NaN NaN
2 h5ad [liver] [AIDS Dementia Complex, Normal] GSE148796_GPL18573 [Homo sapiens] GEO GPL18573 Identification of pathogenic TRAIL-expressing ... Single cell [macrophage, T cell, natural killer cell, hema... g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [CD4, RAG2, TNFSF10] 4 12129 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE148796_GPL18573... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626796167819 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 h5ad [liver, bile] [Normal] GSE141183_GPL16791 [Homo sapiens] GEO GPL16791 RNA-sequencing with human liver organoids deri... Single cell [pluripotent stem cell, hepatocyte] g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [Bile Acids and Salts, bosentan] [None] 1 5120 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE141183_GPL16791... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626795963571 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 h5ad [liver] [Normal] GSE130073_GPL16791 [Homo sapiens] GEO GPL16791 Single cell RNA-sequencing of iPS cells derive... Single cell [stem cell, hepatocyte, Kupffer cell, hepatic ... g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [retinoic acid] [None] 1 3985 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE130073_GPL16791... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626795871475 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 h5ad [liver, muscle] [Normal] GSE107552_GPL11154 [Homo sapiens] GEO GPL11154 Mapping human pluripotent stem cell differenti... Single cell [progenitor cell, stromal cell, pluripotent st... g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [sodium(1+)] [None] 10 1934 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE107552_GPL11154... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626363174715 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 h5ad [skin, spleen, heart, blood, stomach, esophagu... [Normal] GSE159929_GPL20795 [Homo sapiens] GEO GPL20795 Single-cell transcriptome profiling of an adul... Single cell [B cell, stromal cell, T cell, fibroblast] g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [TRBV20OR9-2, BCR] 15 82889 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE159929_GPL20795... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626364686464 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 h5ad [liver] Normal GSE124395_GPL16791 Homo sapiens GEO mCelSeq2 A human liver cell atlas reveals heterogeneity... Single cell [] NaN NaN https://www.nature.com/articles/s41586-019-1373-2 NaN NaN NaN NaN NaN liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE124395_GPL16791... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626622864853 Nadim Aizarani, Dominic Grün 2019 NaN NaN NaN NaN NaN NaN NaN NaN
8 h5ad [spleen, liver, bone marrow, peripheral blood] [Normal] GSE137539_GPL24676 [Homo sapiens] GEO GPL24676 Single-cell transcriptome profiling reveals ne... Single cell [granulocyte monocyte progenitor cell, mature ... g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [KIT, CD33] 3 29622 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE137539_GPL24676... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626795936357 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 h5ad [liver] [Normal] GSE150226_GPL19415 [Homo sapiens, Mus musculus] GEO GPL19415 Heterogeneity and chimerism of endothelial cel... Single cell [endothelial cell] g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [None] 6 3723 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE150226_GPL19415... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626796207901 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 h5ad [thymus, liver] [Normal] GSE133341_GPL20795 [Homo sapiens] GEO GPL20795 Single-cell RNA Sequencing Resolves Spatiotemp... Single cell [stromal cell, pluripotent stem cell, T cell, ... g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [None] 11 29722 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE133341_GPL20795... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626363621023 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 h5ad [placenta, spleen, thymus, liver, stomach, bon... [Normal] GSE108097_GPL22245 [Homo sapiens, Mus musculus] GEO GPL22245 Mapping Mouse Cell Atlas by Microwell-seq Single cell [mesenchymal stem cell] g3 connector https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [ES-E14, CJ7] [None] [None] 1 390 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE108097_GPL22245... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626363202288 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 NaN [liver, heart] [Normal] GSE90749_GPL11154 Homo sapiens GEO NaN Induced Pluripotent Stem Cell Differentiation ... Single cell [pluripotent stem cell, mononuclear cell, hepa... g3 NaN https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [SORT1, PSRC1, CELSR2] 204 204 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE90749_GPL11154.... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626796394998 NaN NaN GSE90749 GPL11154 [] Induced Pluripotent Stem Cell Differentiation ... Genome-wide association studies (GWAS) have hi... Dec 01 2016 Human [NA, MSC, Tissue_stem_cells, Fibroblasts, Smoo...
13 NaN [liver] [Normal] GSE81252_GPL16791 Homo sapiens GEO NaN Multilineage communication regulates human liv... Single cell [mesenchymal cell, endothelial cell, mesenchym... g3 NaN https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [None] [None] [VEGFA] 761 761 liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE81252_GPL16791.... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626796391895 NaN NaN GSE81252 GPL16791 [27669147, 28614297] Multilineage communication regulates human liv... Conventional 2-D differentiation from pluripot... May 09 2016 Human [Hepatocytes, iPS_cells, Endothelial_cells, Sm...

We perform a partial match using the GSE ID provided in the publication

sc_normal_query_df_filt = sc_normal_query_df[sc_normal_query_df['dataset_id'].str.contains('GSE115469')]
sc_normal_query_df_filt
file_type tissue disease dataset_id organism dataset_source platform description kw_data_type kw_cell_type curation_version source_process publication kw_cell_line kw_drug kw_gene total_num_samples total_num_cells kw_repo kw_package kw_key kw_bucket kw_filetype kw_region kw_location kw_timestamp authors year geo_series_id geo_platform_id pubmed_ids title summary submission_date species cell_types
1 h5ad [liver] Normal GSE115469_GPL16791 Homo sapiens GEO 10x Single cell RNA sequencing of human liver reve... Single cell [] NaN NaN https://www.nature.com/articles/s41467-018-063... NaN NaN NaN NaN NaN liveromix_atlas liver_atlas/data liver_atlas/data/SingleCell/GSE115469_GPL16791... discover-prod-datalake-v1 h5ad us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626622850659 Sonya A. MacParland, Ian D McGilvray 2018 NaN NaN NaN NaN NaN NaN NaN NaN
sc_normal_id = list(sc_normal_query_df_filt['dataset_id'])
sc_normal_id
['GSE115469_GPL16791']

Download single cell normal data from OmixAtlas

url = repo_client.download_data(LiverOmixAltasID, sc_normal_id[0]).get('data')
file_name = sc_normal_id[0]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")
0

1.1.5 Querying Bulk RNASeq data on Liver OmixAtlas

For instance, we are required to perform RNASeq data analysis using transcriptomics data from TCGA.
Using the curated fields we narrow down our search to find the datasets that correspond to Liver Hepatocellular Carcinoma in Homo Sapiens

liver_HCC_query = """select * FROM liveromix_atlas_files 
                    WHERE organism = 'Homo sapiens' 
                    AND disease = 'Carcinoma, Hepatocellular' """
repo_client.query_metadata(liver_HCC_query)
Showing 1 - 100 of 1185 matching results
disease kw_drug kw_cell_line kw_cell_type publication tissue organism dataset_id kw_data_type description dataset_source curation_version platform year total_num_samples experimental_design kw_repo kw_package kw_key kw_bucket kw_filetype kw_region kw_location kw_timestamp study_id release_date kw_instrument_type publication_title kw_analysis_type is_public kw_sample_source operation data_repository kw_gene data_required file_type source_process publication_name author abstract overall_design summary type donor_age donor_sex sample_type
0 [Carcinoma, Hepatocellular, Hepatoblastoma] [repaglinide, vanillin, SB 216763, Nedocromil,... [Hep 3B2.1-7, Hep-G2, HLF, HuH-1, HuH-6, Huh-7... [None] https://www.biorxiv.org/content/10.1101/720243v1 [liver] Homo sapiens DEPMAP_19Q4_primary-screen-mfi_liver Drug Screens Median fluorescence intensities for liver cell... DEPMAP g2 Flow cytometry 2019 21 {'categorical_variables': {'kw_curated_cell_li... liveromix_atlas liver_atlas/data liver_atlas/data/DrugScreens/DEPMAP_19Q4_prima... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626017983908 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 [Carcinoma, Hepatocellular, Liver Cirrhosis, N... [None] [None] [None] https://www.ebi.ac.uk/metabolights/MTBLS105 [liver, blood plasma] [Homo sapiens] MTBLS105_m_mtbls105_GC_SIM_mass_spectrometry Metabolomics This study evaluates changes in metabolite lev... Metabolights g2 NaN NaN 267 NaN liveromix_atlas liver_atlas/data liver_atlas/data/Metabolomics/MTBLS105_m_mtbls... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626017990182 MTBLS105 2015:08:13 5975C Series GC/MSD (Agilent);Agilent GC/MS, L... GC-MS based plasma metabolomics for HCC biomar... mass spectrometry true blood plasma NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 [Carcinoma, Hepatocellular, Liver Cirrhosis, F... [None] [None] [None] https://www.ebi.ac.uk/metabolights/MTBLS17 [liver, bile] [Homo sapiens] MTBLS17_m_live_mtbls17pos_metabolite profiling... Metabolomics Characterizing the metabolic changes pertainin... Metabolights g2 NaN NaN 1050 NaN liveromix_atlas liver_atlas/data liver_atlas/data/Metabolomics/MTBLS17_m_live_m... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626018020896 MTBLS17 2013:06:30 UPLC Q-TOF Premier (Waters) Utilization of metabolomics to identify serum ... mass spectrometry true blood serum NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 [Hepatic Encephalopathy, Carcinoma, Hepatocell... [None] [None] [None] https://www.ebi.ac.uk/metabolights/MTBLS582 [liver] [Homo sapiens] MTBLS582_m_mtbls582_NEG_mass_spectrometry Lipidomics Obesity is tightly linked to hepatic steatosis... Metabolights g2 NaN NaN 12 NaN liveromix_atlas liver_atlas/data liver_atlas/data/Metabolomics/MTBLS582_m_mtbls... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626018028414 MTBLS582 2018:04:26 Exactive (Thermo Scientific) Global Analyses of Selective Insulin Resistanc... mass spectrometry true Hep-G2 cell NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 [Carcinoma, Hepatocellular, Liver Cirrhosis, F... [None] [None] [None] https://www.ebi.ac.uk/metabolights/MTBLS19 [liver, bile] [Homo sapiens] MTBLS19_m_neg_MTBLS19_metabolite profiling_mas... Metabolomics Although hepatocellular carcinoma (HCC) has be... Metabolights g2 NaN NaN 360 NaN liveromix_atlas liver_atlas/data liver_atlas/data/Metabolomics/MTBLS19_m_neg_MT... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626018026372 MTBLS19 2013:06:30 UPLC Q-TOF Premier (Waters) LC−MS Based Serum Metabolomics for Identificat... mass spectrometry true blood serum NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 [Carcinoma, Hepatocellular] [BGJ-398] [Hep-G2] [None] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [liver] Homo sapiens lincs_GSE70138_HEPG2_BRD-K42728290 Transcriptomics Gene expression profile of Hep-G2 on treating ... LINCS g3 LINCS1000 2018 388 {'categorical_variables': {'drug': [{'name': '... liveromix_atlas liver_atlas/data liver_atlas/data/Transcriptomics/lincs_GSE7013... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626152712356 NaN NaN NaN NaN NaN true NaN {'is_normalized': 'true', 'batch_corrected_var... lincs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15 Male Tumor
96 [Carcinoma, Hepatocellular] [BKM120] [Hep-G2] [None] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [liver] Homo sapiens lincs_GSE70138_HEPG2_BRD-K42191735 Transcriptomics Gene expression profile of Hep-G2 on treating ... LINCS g3 LINCS1000 2018 388 {'categorical_variables': {'drug': [{'name': '... liveromix_atlas liver_atlas/data liver_atlas/data/Transcriptomics/lincs_GSE7013... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626151670741 NaN NaN NaN NaN NaN true NaN {'is_normalized': 'true', 'batch_corrected_var... lincs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15 Male Tumor
97 [Carcinoma, Hepatocellular] [Osimertinib] [Hep-G2] [None] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [liver] Homo sapiens lincs_GSE70138_HEPG2_BRD-K42805893 Transcriptomics Gene expression profile of Hep-G2 on treating ... LINCS g3 LINCS1000 2018 387 {'categorical_variables': {'drug': [{'name': '... liveromix_atlas liver_atlas/data liver_atlas/data/Transcriptomics/lincs_GSE7013... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626153236520 NaN NaN NaN NaN NaN true NaN {'is_normalized': 'true', 'batch_corrected_var... lincs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15 Male Tumor
98 [Carcinoma, Hepatocellular] [BX795] [Hep-G2] [None] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [liver] Homo sapiens lincs_GSE70138_HEPG2_BRD-K47983010 Transcriptomics Gene expression profile of Hep-G2 on treating ... LINCS g3 LINCS1000 2018 388 {'categorical_variables': {'drug': [{'name': '... liveromix_atlas liver_atlas/data liver_atlas/data/Transcriptomics/lincs_GSE7013... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626158523464 NaN NaN NaN NaN NaN true NaN {'is_normalized': 'true', 'batch_corrected_var... lincs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15 Male Tumor
99 [Carcinoma, Hepatocellular] [2-((aminocarbonyl)amino)-5-(4-fluorophenyl)-3... [Hep-G2] [None] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... [liver] Homo sapiens lincs_GSE70138_HEPG2_BRD-K51575138 Transcriptomics Gene expression profile of Hep-G2 on treating ... LINCS g3 LINCS1000 2018 388 {'categorical_variables': {'drug': [{'name': '... liveromix_atlas liver_atlas/data liver_atlas/data/Transcriptomics/lincs_GSE7013... discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626166038957 NaN NaN NaN NaN NaN true NaN {'is_normalized': 'true', 'batch_corrected_var... lincs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15 Male Tumor

100 rows × 46 columns

The initial search returns Liver heaptocellular carcinoma detaasets from multiple sources and different data types.
We are interested in liver cancer dataset from TCGA of type Transcriptomics.

Therefore, we narrow down our search to only return datasets from source TCGA and type Transcriptomics.

liver_HCC_query = """select * FROM liveromix_atlas_files 
                    WHERE organism = 'Homo sapiens' 
                    AND disease = 'Carcinoma, Hepatocellular' 
                    AND dataset_source = 'TCGA'
                    AND kw_data_type = 'Transcriptomics'"""

liver_HCC_query_df = repo_client.query_metadata(liver_HCC_query)
liver_HCC_query_df
disease kw_drug kw_cell_type kw_cell_line publication tissue organism dataset_id kw_data_type description dataset_source curation_version total_num_samples kw_repo kw_package kw_key kw_bucket kw_filetype kw_region kw_location kw_timestamp
0 [Carcinoma, Hepatocellular, Carcinoma, Adenoca... [Doxorubicin, sorafenib tosylate, gemcitabine,... [None] [None] https://www.cell.com/cell/fulltext/S0092-8674(... [liver] Homo sapiens LIHC_RNASeq_TCGA Transcriptomics Liver hepatocellular carcinoma RNASeq data TCGA g3 424 liveromix_atlas liver_atlas/data liver_atlas/data/RNAseq/LIHC_RNASeq_TCGA.gct discover-prod-datalake-v1 gct us-west-2 https://discover-prod-datalake-v1.s3-us-west-2... 1626708777124
liver_hcc_id = list(liver_HCC_query_df['dataset_id'])
liver_hcc_id
['LIHC_RNASeq_TCGA']
url = repo_client.download_data(LiverOmixAltasID, liver_hcc_id[0]).get('data')
file_name = liver_hcc_id[0]+".gct"
os.system(f"wget -O '{file_name}' '{url}'")
0

1.2 Outside Polly

1.2.1 Querying Single Cell RNASeq data on NCBI GEO

We query the datasets using the original source. The normal and tumor single cell datasets for liver patients were found on GEO. We used the GSE Accession ID to fetch the data and metadata in the format made available by the submitters.

Fetching liver tumor single cell data from GEO NCBI

# Fetching liver tumor data from GEO
get_data_geo('GSE125449')

Fetching normal liver single cell data from GEO NCBI

# Fetching liver tumor data from GEO
get_data_geo('GSE115469')

Data structure and format

cd GSE125449/ && ls -l
total 45268
-rw-r--r-- 1 polly users    26016 Jun 30 05:14 GSE125449_Set1_barcodes.tsv.gz
-rw-r--r-- 1 polly users   158399 Jun 30 05:14 GSE125449_Set1_genes.tsv.gz
-rw-r--r-- 1 polly users 26707779 Jun 30 05:14 GSE125449_Set1_matrix.mtx.gz
-rw-r--r-- 1 polly users    33526 Jun 30 05:14 GSE125449_Set1_samples.txt.gz
-rw-r--r-- 1 polly users    23861 Jun 30 05:14 GSE125449_Set2_barcodes.tsv.gz
-rw-r--r-- 1 polly users   153470 Jun 30 05:14 GSE125449_Set2_genes.tsv.gz
-rw-r--r-- 1 polly users 19204961 Jun 30 05:14 GSE125449_Set2_matrix.mtx.gz
-rw-r--r-- 1 polly users    31132 Jun 30 05:14 GSE125449_Set2_samples.txt.gz
cd ../GSE115469/ && ls -l 
total 105536
-rw-r--r-- 1 polly users 108065293 Jun 30 05:15 GSE115469_Data.csv.gz

1.2.2 Querying Bulk RNASeq data on TCGA

Fetching mRNA data from TCGA

library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-LIHC",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - FPKM",
                  legacy = FALSE)
GDCdownload(query)
data <- GDCprepare(query)

Time taken to fetch data: ~ 10 mins

Downloading data for project TCGA-LIHC
GDCdownload will download 424 files. A total of 207.74981 MB
Downloading as: Tue_Jun_29_11_38_54_2021.tar.gz
Downloading: 210 MB
|=====================================================================|100%                      
Completed after 3 m 
Starting to add information to samples
 => Add clinical information to samples
 => Adding TCGA molecular information from marker papers
 => Information will have prefix 'paper_' 
From the 60483 genes we couldn't map 3881
user  system elapsed 
265.095  61.711 519.107

Data Structure and format

class: RangedSummarizedExperiment 
dim: 56602 424 
metadata(1): data_release
assays(1): HTSeq - FPKM
rownames(56602): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
  ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id
colnames(424): TCGA-DD-AAE4-01A-11R-A41C-07 TCGA-CC-A3M9-01A-11R-A213-07 ...
  TCGA-DD-AAE0-01A-11R-A41C-07 TCGA-DD-A3A8-11A-11R-A22L-07
colData names(64): barcode patient ... released sample.aux

2. Single Cell data integration

2.1 Reading single cell datasets

Reading h5ad as an anndata object and converting it to a Seurat object contained assay, gene and sample metadata

suppressMessages(library("Seurat"))
suppressMessages(library("anndata"))
suppressMessages(library("dplyr"))
suppressMessages(library("SingleCellExperiment"))
suppressMessages(library("clustermole"))
options(warn=-1)
#Create a Seurat object 
get_seurat_obj <- function(adata) {

  # Get the expression matrix
  exprs <- t(as.matrix(adata$X))
  colnames(exprs) <- adata$obs_names
  rownames(exprs) <- adata$var_names
  # Create the Seurat object
  seurat <- CreateSeuratObject(exprs)
  # Set the expression assay
  seurat <- SetAssayData(seurat, "data", exprs)
  # Add observation metadata
  seurat <- AddMetaData(seurat, adata$obs)

  return(seurat)
}
adata_normal <- read_h5ad('GSE115469_GPL16791.h5ad')
adata_tumor_1 <- read_h5ad('GSE125449_GPL18573.h5ad')
adata_tumor_2 <- read_h5ad('GSE125449_GPL20301.h5ad')
normal_seurat <- get_seurat_obj(adata_normal)
tumor_seurat_1 <- get_seurat_obj(adata_tumor_1)
tumor_seurat_2 <- get_seurat_obj(adata_tumor_2)
#Add condition column to metadata
normal_seurat@meta.data$condition <- c('Normal')
tumor_seurat_1@meta.data$condition <- c('Tumor')
tumor_seurat_2@meta.data$condition <- c('Tumor')

normal_seurat <- PercentageFeatureSet(normal_seurat, pattern = "^MT-", col.name = "percent.mt")
tumor_seurat_1 <- PercentageFeatureSet(tumor_seurat_1, pattern = "^MT-", col.name = "percent.mt")
tumor_seurat_2 <- PercentageFeatureSet(tumor_seurat_2, pattern = "^MT-", col.name = "percent.mt")

2.2 Identification of cell types for normal single cell dataset

# find markers for every cluster compared to all remaining cells, report only the positive ones
Idents(object = normal_seurat) <- normal_seurat@meta.data$clusters
liver.markers <- FindAllMarkers(normal_seurat,min.pct = 0.25, logfc.threshold = 0.25)
head(liver.markers)
A data.frame: 6 × 7
p_valavg_log2FCpct.1pct.2p_val_adjclustergene
<dbl><dbl><dbl><dbl><dbl><int><chr>
HSD11B100.74738400.8960.20701HSD11B1
APOM00.69099390.9130.20401APOM
PON300.51024620.8580.17701PON3
TTC3600.47610220.8270.19301TTC36
F1000.40804880.5830.10401F10
BCHE00.38248720.7490.15601BCHE
Type <- c()
clusters <- c()
for (i in seq(1,20, by=1)) {
    liver.markers.filt <- as.data.frame(liver.markers %>% filter(cluster == as.character(i)) %>% top_n(n = 25, wt = avg_log2FC))
    my_overlaps <- clustermole_overlaps(genes = liver.markers.filt$gene, species = "hs")
    type <- as.character(my_overlaps[1,5])
    Type <- c(Type, type)
    clusters <- c(clusters, i)
}


clust_map <- as.data.frame(cbind(clusters,Type))
head(clust_map)
A data.frame: 6 × 2
clustersType
<fct><fct>
11Hepatocytes
22DURANTE_ADULT_OLFACTORY_NEUROEPITHELIUM_CD8_T_CELLS
33Hepatocytes
44AIZARANI_LIVER_C25_KUPFFER_CELLS_4
55Hepatocytes
66AIZARANI_LIVER_C11_HEPATOCYTES_1
#The cell type names must be cleaned in order for them to be clustered together
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('NK_NKT_CELLS','NK Cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('B cell\\D+','B cell',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD71\\D+','Erythroid cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_NK_CELLS','NK Cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('KUPFFER_CELLS','Kupffer Cells',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('MVECS','Endothelial cells',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD8_T_CELLS','CD8 T cell',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD4_T_CELLS','CD4 T cell',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('AIZARANI_LIVER_C\\d+_','',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('_\\d+$','',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('HEPATOCYTES','Hepatocytes',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\bPlasma cell\\b','Plasma cells',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_PLASMA_CELLS','Plasma cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_PLASMA_CELL','Plasma cells',x))                               
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('EPCAM_POS_\\D+','EPCAM+',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_B_CELL','B cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('CUI_DEVELOPING_HEART_C3_FIBROBLAST_LIKE_CELL','Others',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('Erythrocytes_\\D+','Erythrocytes',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('FAN_EMBRYONIC_CTX_NSC','Others',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('Lymph_endothel_cell','Other endothelial cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_Epithelial_cells','Epithelial cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('HAY_BONE_MARROW_PRO_B','Others',x))
cell_type_clust <- aggregate(clusters ~ Type, clust_map, paste, collapse=',')
cell_type_clust
A data.frame: 12 × 2
Typeclusters
<chr><chr>
B cell 16
CD8 T cell 2
Cholangiocytes 17
Endothelial cells 11,13
Erythroid cells 19
Gamma delta T cells 18
Hepatic stellate cells20
Hepatocytes 1,3,5,6,14,15
Kupffer Cells 4,10
LSECS 12
NK Cells 8,9
Plasma cells 7
#Add cell types to seurat metadata
row.meta <- rownames(normal_seurat@meta.data)
normal_seurat@meta.data <- merge(normal_seurat@meta.data, clust_map, by = 'clusters')
rownames(normal_seurat@meta.data) <- row.meta

2.3 Single cell data integration

seurat <- merge(normal_seurat, y = c(tumor_seurat_1, tumor_seurat_2))

seurat.list <- SplitObject(seurat, split.by = "condition")
#Clear memory 
rm(normal_seurat, tumor_seurat_1, tumor_seurat_2)

Performing normalisation and scaling using SCTransform

seurat.list[["Normal"]] <- SCTransform(seurat.list[["Normal"]], vars.to.regress = 'percent.mt')
seurat.list[["Tumor"]] <- SCTransform(seurat.list[["Tumor"]], vars.to.regress = 'percent.mt')
features <- SelectIntegrationFeatures(object.list = seurat.list, nfeatures = 3000)
seurat.list <- PrepSCTIntegration(object.list = seurat.list, anchor.features = features)
liver.anchors <- FindIntegrationAnchors(object.list = seurat.list, normalization.method = "SCT", anchor.features = features)
liver.combined.sct <- IntegrateData(anchorset = liver.anchors, normalization.method = "SCT")
Merging dataset 1 into 2

Extracting anchors for merged samples

Finding integration vectors

Finding integration vector weights

Integrating data

Compute PCA and UMAP using first 40 principal components

liver.combined.sct <- RunPCA(liver.combined.sct, verbose = FALSE)
liver.combined.sct <- RunUMAP(liver.combined.sct, reduction = "pca", dims = 1:40)

Perform SNN clustering of cells using first 40 principal components

liver.combined.sct <- FindNeighbors(liver.combined.sct, dims = 1:40)
liver.combined.sct <- FindClusters(liver.combined.sct, resolution = 0.8)

Plot the UMAP clustering of cell type and cluster populations

DimPlot(liver.combined.sct, reduction = "umap", group.by = 'seurat_clusters', cols = DiscretePalette(30), label = TRUE)
DimPlot(liver.combined.sct, reduction = "umap", group.by = 'Type', cols = DiscretePalette(19))

Marker analysis of fibroblast population

# find markers for fibroblast clusters compared to all remaining cells, report only the positive ones
DefaultAssay(liver.combined.sct) <- "RNA"
Idents(object = liver.combined.sct) <- liver.combined.sct@meta.data$seurat_clusters
clus7.markers <- FindMarkers(liver.combined.sct, ident.1 = '7', min.pct = 0.25)
clus15markers <- FindMarkers(liver.combined.sct, ident.1 = '15', min.pct = 0.25)

Selecting markers from Fibroblast sub populations

VlnPlot(liver.combined.sct, features = "COL1A1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "COL1A2", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "COL3A1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "BGN", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "LUM", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "SPARCL1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "GJA4", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "OAZ2", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "ADAMTS4", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "GPR4", idents = c("7","15"))

We observed that fibrogenic genes (e.g., COL1A1, COL1A2, COL3A1 ) and proteoglycan encoding genes (e.g., LUM, BGN ) were overexpressed in cluster 15. Cluster 7, however showed high overall expression of SPARCL1, GJA4.

In order to assess the clinical significance bulk RNASeq data was fetched from Liver OmixAtlas for Heaptocellular Carcinoma.

3. Survival analysis using bulk RNASeq data

Reading the mRNA data

# Parsing mRNA expression data 
gct = parse('./LIHC_RNASeq_TCGA.gct')

# column metadata
clinical_data = gct.col_metadata_df

# expression data
expression_data = gct.data_df

Getting paired normal and tumor samples

clinical_data_paired = clinical_data[clinical_data.duplicated(subset = 'patient', keep = False)]
clinical_data_paired.head()
chd patient barcode sample shortLetterCode definition sample_submitter_id sample_type_id sample_id sample_type state initial_weight pathology_report_uuid submitter_id oct_embedded is_ffpe tissue_type synchronous_malignancy ajcc_pathologic_stage tumor_stage last_known_disease_status tissue_or_organ_of_origin primary_diagnosis prior_malignancy prior_treatment ajcc_staging_system_edition ajcc_pathologic_t morphology ajcc_pathologic_n ajcc_pathologic_m classification_of_tumor diagnosis_id icd_10_code site_of_resection_or_biopsy tumor_grade progression_or_recurrence alcohol_history exposure_id weight height bmi ... therapy_regimen.drug_4 therapy_regimen_other.drug_4 total_dose.drug_4 tx_on_clinical_trial.drug_4 AFP>300 Baylor GCC whole genome seq bcr_drug_barcode.drug_5 bcr_drug_uuid.drug_5 form_completion_date.drug_5 pharmaceutical_therapy_drug_name.drug_5 clinical_trial_drug_classification.drug_5 pharmaceutical_therapy_type.drug_5 pharmaceutical_tx_started_days_to.drug_5 pharmaceutical_tx_ongoing_indicator.drug_5 pharmaceutical_tx_ended_days_to.drug_5 treatment_best_response.drug_5 days_to_stem_cell_transplantation.drug_5 pharm_regimen.drug_5 pharm_regimen_other.drug_5 pharma_adjuvant_cycles_count.drug_5 pharma_type_other.drug_5 pharmaceutical_tx_dose_units.drug_5 pharmaceutical_tx_total_dose_units.drug_5 prescribed_dose.drug_5 regimen_number.drug_5 route_of_administration.drug_5 stem_cell_transplantation.drug_5 stem_cell_transplantation_type.drug_5 therapy_regimen.drug_5 therapy_regimen_other.drug_5 total_dose.drug_5 tx_on_clinical_trial.drug_5 kw_curated_disease kw_curated_drug kw_curated_tissue kw_curated_cell_type kw_curated_cell_line kw_curated_genetic_mod_type kw_curated_modified_gene kw_curated_gene
cid
TCGA-BC-A10Q-01A-11R-A131-07 TCGA-BC-A10Q TCGA-BC-A10Q-01A-11R-A131-07 TCGA-BC-A10Q-01A TP Primary solid Tumor TCGA-BC-A10Q-01A 1 bf225dc1-4309-4304-849e-cbcc23c8442c Primary Tumor released 340 10117C34-85C2-49DE-B2AD-F8D0D4570DA9 TCGA-BC-A10Q false FALSE Not Reported No NaN not reported not reported Liver Hepatocellular carcinoma, NOS no No 5th T2 8170/3 NX MX not reported c896279a-640a-57f8-91c5-4d1b674cdcca C22.0 Liver not reported not reported Not Reported 41801e4c-8d70-5266-82ae-382fc1dd4a20 NaN NaN NaN ... PROGRESSION [Not Applicable] [Not Available] [Not Available] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Carcinoma, Hepatocellular Doxorubicin liver none none none none none
TCGA-BC-A10Q-11A-11R-A131-07 TCGA-BC-A10Q TCGA-BC-A10Q-11A-11R-A131-07 TCGA-BC-A10Q-11A NT Solid Tissue Normal TCGA-BC-A10Q-11A 11 5d3bf988-3331-4412-bc17-b8f4650a8623 Solid Tissue Normal released 340 NaN TCGA-BC-A10Q false FALSE Not Reported No NaN not reported not reported Liver Hepatocellular carcinoma, NOS no No 5th T2 8170/3 NX MX not reported c896279a-640a-57f8-91c5-4d1b674cdcca C22.0 Liver not reported not reported Not Reported 41801e4c-8d70-5266-82ae-382fc1dd4a20 NaN NaN NaN ... PROGRESSION [Not Applicable] [Not Available] [Not Available] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Normal Doxorubicin liver none none none none none
TCGA-BC-A10R-01A-11R-A131-07 TCGA-BC-A10R TCGA-BC-A10R-01A-11R-A131-07 TCGA-BC-A10R-01A TP Primary solid Tumor TCGA-BC-A10R-01A 1 995180f2-5aa7-47b9-bb3f-585f94b2457a Primary Tumor released 180 65BE2C95-CADA-44FC-9313-14529AFF773E TCGA-BC-A10R false FALSE Not Reported No NaN not reported not reported Liver Hepatocellular carcinoma, NOS no No 5th T3 8170/3 NX MX not reported ef53ae43-7ab9-546f-b40c-71018f4bf37a C22.0 Liver not reported not reported Not Reported d17d5f9e-bcd9-5955-baaa-f03189d6006a NaN NaN NaN ... NaN NaN NaN NaN No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Carcinoma, Hepatocellular none liver none none none none none
TCGA-BC-A10R-11A-11R-A131-07 TCGA-BC-A10R TCGA-BC-A10R-11A-11R-A131-07 TCGA-BC-A10R-11A NT Solid Tissue Normal TCGA-BC-A10R-11A 11 1ea00561-3d1d-49d0-a3f7-d34eb8fec235 Solid Tissue Normal released 1620 NaN TCGA-BC-A10R false FALSE Not Reported No NaN not reported not reported Liver Hepatocellular carcinoma, NOS no No 5th T3 8170/3 NX MX not reported ef53ae43-7ab9-546f-b40c-71018f4bf37a C22.0 Liver not reported not reported Not Reported d17d5f9e-bcd9-5955-baaa-f03189d6006a NaN NaN NaN ... NaN NaN NaN NaN No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Normal none liver none none none none none
TCGA-BC-A10T-01A-11R-A131-07 TCGA-BC-A10T TCGA-BC-A10T-01A-11R-A131-07 TCGA-BC-A10T-01A TP Primary solid Tumor TCGA-BC-A10T-01A 1 d2f6cea0-ecec-49f1-a851-e332e79ce098 Primary Tumor released 1000 D8103C33-F303-4D97-81AF-CC3D06E79285 TCGA-BC-A10T false FALSE Not Reported No NaN not reported not reported Liver Hepatocellular carcinoma, NOS no No 5th T4 8170/3 NX MX not reported b5925a40-cb9f-5954-9cfb-c7ab86404e60 C22.0 Liver not reported not reported Not Reported a1d6658a-accb-5864-a6b2-31e176486998 86.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Carcinoma, Hepatocellular none liver none none none none none

5 rows × 301 columns

#Subset the expression data using the paired normal and tumor samples
expression_data_paired = expression_data[expression_data.columns.intersection(clinical_data_paired.index)]
expression_data_paired.to_csv('TCGA_LIHC_paired.csv', sep = '\t')
clinical_data_paired.to_csv('TCGA_LIHC_clinical.csv', sep = '\t')

Preparing TCGA clinical data for survival analysis

#Load R packages
suppressMessages(library("readr"))
suppressMessages(library("gplots"))
suppressMessages(library("survival"))
suppressMessages(library("survminer"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("gProfileR"))
suppressMessages(library("genefilter"))
expression_data <- as.data.frame(read_delim("TCGA_LIHC_paired.csv", "\t", escape_double = FALSE, trim_ws = TRUE, col_types = cols()))
rownames(expression_data) <- expression_data[,1]
expression_data <- expression_data[,-1]

clinical_data <- as.data.frame(read_delim("TCGA_LIHC_clinical.csv", "\t", escape_double = FALSE, trim_ws = TRUE, col_types = cols()))
rownames(clinical_data) <- clinical_data[,1]
clinical_data <- clinical_data[,-1]
# create a new boolean variable that has TRUE for dead patients
# and FALSE for live patients
clinical_data$deceased = clinical_data$vital_status == "Dead"

# create an "overall survival" variable that is equal to days_to_death
# for dead patients, and to days_to_last_follow_up for patients who
# are still alive
clinical_data$overall_survival = ifelse(clinical_data$deceased,
                                   clinical_data$days_to_death,
                                   clinical_data$days_to_last_follow_up)

# show first 10 samples
clinical_data_survival = clinical_data[,c('vital_status','deceased','days_to_death','days_to_last_follow_up','overall_survival')]
head(clinical_data_survival)
A data.frame: 6 × 5
vital_statusdeceaseddays_to_deathdays_to_last_follow_upoverall_survival
<chr><lgl><dbl><dbl><dbl>
TCGA-BC-A10Q-01A-11R-A131-07DeadTRUE1135NA1135
TCGA-BC-A10Q-11A-11R-A131-07DeadTRUE1135NA1135
TCGA-BC-A10R-01A-11R-A131-07DeadTRUE 308NA 308
TCGA-BC-A10R-11A-11R-A131-07DeadTRUE 308NA 308
TCGA-BC-A10T-01A-11R-A131-07DeadTRUE 837NA 837
TCGA-BC-A10T-11A-11R-A131-07DeadTRUE 837NA 837

Examining the association of cluster specific gene sets with clinical data

  1. Specific Cluster 24 gene set G24 : ( COL1A, CFH, LUM and PDGFRA ).
  2. Specific Cluster 8 gene set G3 : ( SPARCL1, GJA4, ADAMTS4 and GPR4 ).
survival_gene_exp = function(gene_name) {

    expression_mat = t(expression_data)
    # get the expression values for the selected gene
    clinical_data$gene_name = expression_mat[rownames(clinical_data), gene_name]

    # find the median value of the gene and print it
    median_value = median(clinical_data$gene_name)

    # divide patients in two groups, up and down regulated.
    # if the patient expression is greater or equal to them median we put it
    # among the "up-regulated", otherwise among the "down-regulated"
    clinical_data$gene = ifelse(clinical_data$gene_name >= median_value, "UP", "DOWN")

    # we can fit a survival model, like we did in the previous section
    fit = survfit(Surv(overall_survival, deceased) ~ gene, data=clinical_data)

    return(list(fit, clinical_data))

}

Survival Analysis of Cluster 8 genes

We observe that overexpression of SPARCL1 gene showed a favourable prognosis in liver cancer patients.

# and finally, we produce a Kaplan-Meier plot
surv = survival_gene_exp("SPARCL1")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("SPARCL1"))

surv = survival_gene_exp("GJA4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("GJA4"))

surv = survival_gene_exp("ADAMTS4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("ADAMTS4"))

surv = survival_gene_exp("GPR4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("GPR4"))

Appendix

Package installations

NOTE : For R packages, the cells must be run in the following order as shown below

!pip3 install cmapPy --user
!pip3 install GEOparse --user
!pip3 install sh --user
!pip3 install wget --user
!sudo R -e 'BiocManager::install(c("genefilter", "SingleCellExperiment","GSEABase", "GSVA", "singscore"))'
!sudo R -e 'install.packages(c("anndata","clustermole","gridExtra"), repos = "https://cloud.r-project.org/")'
!sudo R -e 'install.packages(c("survival", "survminer", "gProfileR", "RColorBrewer", "gplots"), repos = "https://cloud.r-project.org/")'