Integrative single-cell transcriptome analysis reveals a subpopulation of fibroblasts associated with favorable prognosis of liver cancer patients

Authors H. Wang, C. Feng, M. Lu

Introduction

An integrative single cell transcriptomic data was performed between normal and tumor patients from the liver tissue. Here we show the power of integration-ready biomedical molecular data to identify markers associated with fibroblast subpopulation.

Normal and tumor single cell datasets are shown to readily integrate due to the standardised data structure on Liver OmixAtlas.

In order to evaluate the clinical significance of those markers, we utilise curated transcriptomics data for liver hepatocellular carcinoma to perform survival analysis in cancer patients. The overall analysis showed that overexpression of SPARCL1 is associated with favourable prognosis of liver cancer patients.

Import functions

Please follow the instructions below

1. Check the appendix section below to install necessary packages
2. Refresh the kernel after installation is complete

import re
from cmapPy.pandasGEXpress.parse_gct import parse
from matplotlib.pyplot import rc_context
import itertools
import pandas as pd
import numpy as np
import GEOparse
from sh import gunzip
import wget
import urllib.request
import os

Defining custom functions to use

def get_data_geo(accession):

    if not os.path.isfile('./metadata/'+accession+'_family.soft.gz'): 

        GEOparse.get_GEO(geo=accession, destdir="./metadata/", silent=True)  
        gunzip('./metadata/'+accession+'_family.soft.gz')

    os.makedirs(accession, exist_ok = True)

    with open('./metadata/'+accession+'_family.soft', 'rt') as f:
        for line in f:
            if('!Series_supplementary_file' in line):
                file = str(line.partition("=")[2]).strip()
                urllib.request.urlretrieve(file, accession+'/'+file.split('/')[-1])
            else:
                continue
    f.close()

1. Querying Biomedical Molecular data

1.1 On Polly Liver OmixAtlas

1.1.1 Install Polly Python

!pip3 install polly-python --user

Looking in indexes: https://pypi.org/simple, http://54.245.179.143:80/
Collecting polly-python
  Downloading https://files.pythonhosted.org/packages/a5/5c/3a3d40cdd4d61e99e7da4a0c2899a6d5e81c093667abf25aba2f157d18db/polly_python-0.0.4-py3-none-any.whl
Requirement already satisfied: idna in /usr/local/lib/python3.6/dist-packages (from polly-python) (2.8)
Collecting python-magic (from polly-python)
  Downloading https://files.pythonhosted.org/packages/d3/99/c89223c6547df268596899334ee77b3051f606077317023617b1c43162fb/python_magic-0.4.24-py2.py3-none-any.whl
Collecting postpy2 (from polly-python)
  Downloading https://files.pythonhosted.org/packages/7e/25/d98f5ea87d937bf44963be78b504dbb3ddea1c6ddfcc0985b02d062b41dd/postpy2-0.0.6-py3-none-any.whl
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from polly-python) (1.1.5)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from polly-python) (2020.6.20)
Requirement already satisfied: chardet in /usr/local/lib/python3.6/dist-packages (from polly-python) (3.0.4)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.6/dist-packages (from polly-python) (1.26.3)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from polly-python) (2.25.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (2.8.0)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (2019.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.7.3->pandas->polly-python) (1.14.0)
Installing collected packages: python-magic, postpy2, polly-python
Successfully installed polly-python-0.0.4 postpy2-0.0.6 python-magic-0.4.24
[33mWARNING: You are using pip version 19.2.3, however version 21.1.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m

from polly.omixatlas import OmixAtlas

1.1.2 Connect to Polly OmixAtlas

Get Authentication Tokens

#REFRESH_TOKEN = "ENTER REFRESH TOKEN WITHIN THE DOUBLE QUOTES"
repo_client = OmixAtlas(REFRESH_TOKEN)

See a list of available Omix Atlases

pd.DataFrame.from_dict(repo_client.get_all_omixatlas()['data'])

	repo_name	repo_id	indexes	dataset_count	disease_count	diseases	organism_count	organisms	sources	datatypes	sample_count
0	liveromix_atlas	1615965444377	{'gct_metadata': 'liveromix_atlas_gct_metadata...	6760	761	[normal, carcinoma, hepatocellular, obesity, n...	22	[homo sapiens, mus musculus, rattus norvegicus...	[geo, lincs, tcga, metabolomics workbench, met...	[transcriptomics, mutation, metabolomics, sing...	1738079

1.1.3 Get Liver OmixAtlas Summary

LiverOmixAltasID = "1615965444377"
LiverOmixAtlas_summary = repo_client.omixatlas_summary(LiverOmixAltasID)
pd.DataFrame.from_dict(LiverOmixAtlas_summary, orient="index")

	repo_name	repo_id	indexes	dataset_count	disease_count	diseases	organism_count	organisms	sources	datatypes	sample_count
data	liveromix_atlas	1615965444377	{'gct_metadata': 'liveromix_atlas_gct_metadata...	6760	761	[normal, carcinoma, hepatocellular, obesity, n...	22	[homo sapiens, mus musculus, rattus norvegicus...	[geo, lincs, tcga, metabolomics workbench, met...	[transcriptomics, mutation, metabolomics, sing...	1738079

#Data sources in Liver OmixAtlas
LiverOmixAtlas_summary['data'].get('sources')

['geo',
 'lincs',
 'tcga',
 'metabolomics workbench',
 'metabolights',
 'ccle',
 'gene expression omnibus (geo)',
 'cptac',
 'depmap',
 'human protein atlas']

#Data Types in Liver OmixAtlas
LiverOmixAtlas_summary['data'].get('datatypes')

['transcriptomics',
 'mutation',
 'metabolomics',
 'single cell',
 'proteomics',
 'lipidomics',
 'mirna',
 'drug screens',
 'gene dependency',
 'gene effect']

LiverOmixAtlas_summary['data'].get('indexes')

{'gct_metadata': 'liveromix_atlas_gct_metadata',
 'h5ad_metadata': 'liveromix_atlas_h5ad_metadata',
 'csv': 'liveromix_atlas_csv',
 'files': 'liveromix_atlas_files',
 'json': 'liveromix_atlas_json',
 'ipynb': 'liveromix_atlas_ipynb',
 'gct_data': 'liveromix_atlas_gct_data',
 'h5ad_data': 'liveromix_atlas_h5ad_data'}

1.1.4 Querying Single Cell RNASeq data on Liver OmixAtlas

All data in Liver Atlas are structured with metadata fields tagged with ontologies and controlled vocabularies, which simplifies finding relevant datasets

To access, filter, and search the metadata in Liver OmixAtlas, the function mentioned below can be used: query_metadata (“query written in SQL”)

For integrative single cell analysis, the authors used single cell transcriptomics data from:
1. 5 normal patients GSE115469
2. 12 cancer patients GSE125449

The study investigated a total of 8439 cells from normal patients and and 9946 cells from liver cancer tissues.

For querying the Single Cell RNASeq datasets we utilised the GEO Accession ID that was made available in the publication.

Fetching liver tumor dataset

sc_tumor_query = """select * FROM liveromix_atlas_files 
        WHERE kw_data_type = 'Single cell'
        AND organism = 'Homo sapiens'
        AND (disease = 'Carcinoma, Hepatocellular'
            OR disease = 'Tumor')
        """

sc_tumor_query_df = repo_client.query_metadata(sc_tumor_query)

In this case the query returned the datasets that are required, but requires additional filtering

sc_tumor_query_df

	file_type	tissue	disease	dataset_id	organism	dataset_source	platform	description	kw_data_type	kw_cell_type	curation_version	source_process	publication	kw_cell_line	kw_drug	kw_gene	kw_repo	kw_package	kw_key	kw_bucket	kw_filetype	kw_region	kw_location	kw_timestamp	summary	overall_design	pubmed_id	total_num_samples	total_num_cells
0	h5ad	[liver, peripheral blood]	[Carcinoma, Hepatocellular]	GSE98638_GPL16791	[Homo sapiens]	GEO	GPL16791	[Landscape of infiltrating T cells in liver ca...	Single cell	[helper T cell, regulatory T cell]	g2	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[CD8A, CD4, IL2RA, TTR, PTCHD3, PTH]	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE98638_GPL16791....	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626796397637	NaN	NaN	NaN	NaN	NaN
1	h5ad	[liver]	[Cholangiocarcinoma, Carcinoma, Hepatocellular]	GSE125449_GPL20301	Homo sapiens	GEO	GPL20301	Tumor cell biodiversity drives microenvironmen...	Single cell	[]	NaN	NaN	NaN	NaN	NaN	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE125449_GPL20301...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626795852462	Single-cell transcriptome profiling of liver c...	[A total of 19 tumors were profiled. Set 1 con...	31588021	NaN	NaN
2	h5ad	[liver]	Tumor	GSE125449_GPL18573	Homo sapiens	GEO	GPL18573	Tumor cell biodiversity drives microenvironmen...	Single cell	[]	NaN	NaN	NaN	NaN	NaN	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE125449_GPL18573...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626622880810	Single-cell transcriptome profiling of liver c...	[A total of 19 tumors were profiled. Set 1 con...	31588021	NaN	NaN
3	h5ad	[blood]	[Carcinoma, Hepatocellular]	GSE107747_GPL16791	[Homo sapiens]	GEO	GPL16791	Single-cell RNA sequencing of circulating tumo...	Single cell	[neoplastic cell]	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[None]	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE107747_GPL16791...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626795662655	NaN	NaN	NaN	2	10184
4	h5ad	[liver]	[Carcinoma, Hepatocellular]	GSE112271_GPL16791	[Homo sapiens]	GEO	GPL16791	Single-cell RNA of multiregional sampling in h...	Single cell	[None]	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[None]	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE112271_GPL16791...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626795689768	NaN	NaN	NaN	7	31279

We perform a partial match using the GSE ID provided in the publication

sc_tumor_query_df_filt = sc_tumor_query_df[sc_tumor_query_df['dataset_id'].str.contains('GSE125449')]
sc_tumor_query_df_filt

	file_type	tissue	disease	dataset_id	organism	dataset_source	platform	description	kw_data_type	kw_cell_type	curation_version	source_process	publication	kw_cell_line	kw_drug	kw_gene	kw_repo	kw_package	kw_key	kw_bucket	kw_filetype	kw_region	kw_location	kw_timestamp	summary	overall_design	pubmed_id	total_num_samples	total_num_cells
1	h5ad	[liver]	[Cholangiocarcinoma, Carcinoma, Hepatocellular]	GSE125449_GPL20301	Homo sapiens	GEO	GPL20301	Tumor cell biodiversity drives microenvironmen...	Single cell	[]	NaN	NaN	NaN	NaN	NaN	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE125449_GPL20301...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626795852462	Single-cell transcriptome profiling of liver c...	[A total of 19 tumors were profiled. Set 1 con...	31588021	NaN	NaN
2	h5ad	[liver]	Tumor	GSE125449_GPL18573	Homo sapiens	GEO	GPL18573	Tumor cell biodiversity drives microenvironmen...	Single cell	[]	NaN	NaN	NaN	NaN	NaN	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE125449_GPL18573...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626622880810	Single-cell transcriptome profiling of liver c...	[A total of 19 tumors were profiled. Set 1 con...	31588021	NaN	NaN

sc_tumor_id = list(sc_tumor_query_df_filt['dataset_id'])
sc_tumor_id

['GSE125449_GPL20301', 'GSE125449_GPL18573']

Download single cell tumor data from OmixAtlas

url = repo_client.download_data(LiverOmixAltasID, sc_tumor_id[0]).get('data')
file_name = sc_tumor_id[0]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")

url = repo_client.download_data(LiverOmixAltasID, sc_tumor_id[1]).get('data')
file_name = sc_tumor_id[1]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")

Fetching normal liver dataset

sc_normal_query = """select * FROM liveromix_atlas_files 
        WHERE disease = 'Normal' 
        AND kw_data_type = 'Single cell'
        AND organism = 'Homo sapiens'"""

sc_normal_query_df = repo_client.query_metadata(sc_normal_query)
sc_normal_query_df

	file_type	tissue	disease	dataset_id	organism	dataset_source	platform	description	kw_data_type	kw_cell_type	curation_version	source_process	publication	kw_cell_line	kw_drug	kw_gene	total_num_samples	total_num_cells	kw_repo	kw_package	kw_key	kw_bucket	kw_filetype	kw_region	kw_location	kw_timestamp	authors	year	geo_series_id	geo_platform_id	pubmed_ids	title	summary	submission_date	species	cell_types
0	h5ad	[skin, liver, lung]	[Normal]	GSE133345_GPL20795	[Homo sapiens]	GEO	GPL20795	Deciphering human macrophage development at si...	Single cell	[tissue-resident macrophage, embryonic cell]	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[PTPRC, GYPA]	9	1268	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE133345_GPL20795...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626363659950	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	h5ad	[liver]	Normal	GSE115469_GPL16791	Homo sapiens	GEO	10x	Single cell RNA sequencing of human liver reve...	Single cell	[]	NaN	NaN	https://www.nature.com/articles/s41467-018-063...	NaN	NaN	NaN	NaN	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE115469_GPL16791...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626622850659	Sonya A. MacParland, Ian D McGilvray	2018	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	h5ad	[liver]	[AIDS Dementia Complex, Normal]	GSE148796_GPL18573	[Homo sapiens]	GEO	GPL18573	Identification of pathogenic TRAIL-expressing ...	Single cell	[macrophage, T cell, natural killer cell, hema...	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[CD4, RAG2, TNFSF10]	4	12129	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE148796_GPL18573...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626796167819	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	h5ad	[liver, bile]	[Normal]	GSE141183_GPL16791	[Homo sapiens]	GEO	GPL16791	RNA-sequencing with human liver organoids deri...	Single cell	[pluripotent stem cell, hepatocyte]	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[Bile Acids and Salts, bosentan]	[None]	1	5120	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE141183_GPL16791...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626795963571	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	h5ad	[liver]	[Normal]	GSE130073_GPL16791	[Homo sapiens]	GEO	GPL16791	Single cell RNA-sequencing of iPS cells derive...	Single cell	[stem cell, hepatocyte, Kupffer cell, hepatic ...	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[retinoic acid]	[None]	1	3985	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE130073_GPL16791...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626795871475	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	h5ad	[liver, muscle]	[Normal]	GSE107552_GPL11154	[Homo sapiens]	GEO	GPL11154	Mapping human pluripotent stem cell differenti...	Single cell	[progenitor cell, stromal cell, pluripotent st...	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[sodium(1+)]	[None]	10	1934	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE107552_GPL11154...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626363174715	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	h5ad	[skin, spleen, heart, blood, stomach, esophagu...	[Normal]	GSE159929_GPL20795	[Homo sapiens]	GEO	GPL20795	Single-cell transcriptome profiling of an adul...	Single cell	[B cell, stromal cell, T cell, fibroblast]	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[TRBV20OR9-2, BCR]	15	82889	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE159929_GPL20795...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626364686464	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	h5ad	[liver]	Normal	GSE124395_GPL16791	Homo sapiens	GEO	mCelSeq2	A human liver cell atlas reveals heterogeneity...	Single cell	[]	NaN	NaN	https://www.nature.com/articles/s41586-019-1373-2	NaN	NaN	NaN	NaN	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE124395_GPL16791...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626622864853	Nadim Aizarani, Dominic Grün	2019	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	h5ad	[spleen, liver, bone marrow, peripheral blood]	[Normal]	GSE137539_GPL24676	[Homo sapiens]	GEO	GPL24676	Single-cell transcriptome profiling reveals ne...	Single cell	[granulocyte monocyte progenitor cell, mature ...	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[KIT, CD33]	3	29622	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE137539_GPL24676...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626795936357	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	h5ad	[liver]	[Normal]	GSE150226_GPL19415	[Homo sapiens, Mus musculus]	GEO	GPL19415	Heterogeneity and chimerism of endothelial cel...	Single cell	[endothelial cell]	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[None]	6	3723	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE150226_GPL19415...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626796207901	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	h5ad	[thymus, liver]	[Normal]	GSE133341_GPL20795	[Homo sapiens]	GEO	GPL20795	Single-cell RNA Sequencing Resolves Spatiotemp...	Single cell	[stromal cell, pluripotent stem cell, T cell, ...	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[None]	11	29722	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE133341_GPL20795...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626363621023	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	h5ad	[placenta, spleen, thymus, liver, stomach, bon...	[Normal]	GSE108097_GPL22245	[Homo sapiens, Mus musculus]	GEO	GPL22245	Mapping Mouse Cell Atlas by Microwell-seq	Single cell	[mesenchymal stem cell]	g3	connector	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[ES-E14, CJ7]	[None]	[None]	1	390	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE108097_GPL22245...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626363202288	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	NaN	[liver, heart]	[Normal]	GSE90749_GPL11154	Homo sapiens	GEO	NaN	Induced Pluripotent Stem Cell Differentiation ...	Single cell	[pluripotent stem cell, mononuclear cell, hepa...	g3	NaN	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[SORT1, PSRC1, CELSR2]	204	204	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE90749_GPL11154....	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626796394998	NaN	NaN	GSE90749	GPL11154	[]	Induced Pluripotent Stem Cell Differentiation ...	Genome-wide association studies (GWAS) have hi...	Dec 01 2016	Human	[NA, MSC, Tissue_stem_cells, Fibroblasts, Smoo...
13	NaN	[liver]	[Normal]	GSE81252_GPL16791	Homo sapiens	GEO	NaN	Multilineage communication regulates human liv...	Single cell	[mesenchymal cell, endothelial cell, mesenchym...	g3	NaN	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[None]	[None]	[VEGFA]	761	761	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE81252_GPL16791....	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626796391895	NaN	NaN	GSE81252	GPL16791	[27669147, 28614297]	Multilineage communication regulates human liv...	Conventional 2-D differentiation from pluripot...	May 09 2016	Human	[Hepatocytes, iPS_cells, Endothelial_cells, Sm...

We perform a partial match using the GSE ID provided in the publication

sc_normal_query_df_filt = sc_normal_query_df[sc_normal_query_df['dataset_id'].str.contains('GSE115469')]
sc_normal_query_df_filt

	file_type	tissue	disease	dataset_id	organism	dataset_source	platform	description	kw_data_type	kw_cell_type	curation_version	source_process	publication	kw_cell_line	kw_drug	kw_gene	total_num_samples	total_num_cells	kw_repo	kw_package	kw_key	kw_bucket	kw_filetype	kw_region	kw_location	kw_timestamp	authors	year	geo_series_id	geo_platform_id	pubmed_ids	title	summary	submission_date	species	cell_types
1	h5ad	[liver]	Normal	GSE115469_GPL16791	Homo sapiens	GEO	10x	Single cell RNA sequencing of human liver reve...	Single cell	[]	NaN	NaN	https://www.nature.com/articles/s41467-018-063...	NaN	NaN	NaN	NaN	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/SingleCell/GSE115469_GPL16791...	discover-prod-datalake-v1	h5ad	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626622850659	Sonya A. MacParland, Ian D McGilvray	2018	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

sc_normal_id = list(sc_normal_query_df_filt['dataset_id'])
sc_normal_id

['GSE115469_GPL16791']

Download single cell normal data from OmixAtlas

url = repo_client.download_data(LiverOmixAltasID, sc_normal_id[0]).get('data')
file_name = sc_normal_id[0]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")

1.1.5 Querying Bulk RNASeq data on Liver OmixAtlas

For instance, we are required to perform RNASeq data analysis using transcriptomics data from TCGA.
Using the curated fields we narrow down our search to find the datasets that correspond to Liver Hepatocellular Carcinoma in Homo Sapiens

liver_HCC_query = """select * FROM liveromix_atlas_files 
                    WHERE organism = 'Homo sapiens' 
                    AND disease = 'Carcinoma, Hepatocellular' """
repo_client.query_metadata(liver_HCC_query)

Showing 1 - 100 of 1185 matching results

	disease	kw_drug	kw_cell_line	kw_cell_type	publication	tissue	organism	dataset_id	kw_data_type	description	dataset_source	curation_version	platform	year	total_num_samples	experimental_design	kw_repo	kw_package	kw_key	kw_bucket	kw_filetype	kw_region	kw_location	kw_timestamp	study_id	release_date	kw_instrument_type	publication_title	kw_analysis_type	is_public	kw_sample_source	operation	data_repository	kw_gene	data_required	file_type	source_process	publication_name	author	abstract	overall_design	summary	type	donor_age	donor_sex	sample_type
0	[Carcinoma, Hepatocellular, Hepatoblastoma]	[repaglinide, vanillin, SB 216763, Nedocromil,...	[Hep 3B2.1-7, Hep-G2, HLF, HuH-1, HuH-6, Huh-7...	[None]	https://www.biorxiv.org/content/10.1101/720243v1	[liver]	Homo sapiens	DEPMAP_19Q4_primary-screen-mfi_liver	Drug Screens	Median fluorescence intensities for liver cell...	DEPMAP	g2	Flow cytometry	2019	21	{'categorical_variables': {'kw_curated_cell_li...	liveromix_atlas	liver_atlas/data	liver_atlas/data/DrugScreens/DEPMAP_19Q4_prima...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626017983908	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	[Carcinoma, Hepatocellular, Liver Cirrhosis, N...	[None]	[None]	[None]	https://www.ebi.ac.uk/metabolights/MTBLS105	[liver, blood plasma]	[Homo sapiens]	MTBLS105_m_mtbls105_GC_SIM_mass_spectrometry	Metabolomics	This study evaluates changes in metabolite lev...	Metabolights	g2	NaN	NaN	267	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/Metabolomics/MTBLS105_m_mtbls...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626017990182	MTBLS105	2015:08:13	5975C Series GC/MSD (Agilent);Agilent GC/MS, L...	GC-MS based plasma metabolomics for HCC biomar...	mass spectrometry	true	blood plasma	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	[Carcinoma, Hepatocellular, Liver Cirrhosis, F...	[None]	[None]	[None]	https://www.ebi.ac.uk/metabolights/MTBLS17	[liver, bile]	[Homo sapiens]	MTBLS17_m_live_mtbls17pos_metabolite profiling...	Metabolomics	Characterizing the metabolic changes pertainin...	Metabolights	g2	NaN	NaN	1050	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/Metabolomics/MTBLS17_m_live_m...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626018020896	MTBLS17	2013:06:30	UPLC Q-TOF Premier (Waters)	Utilization of metabolomics to identify serum ...	mass spectrometry	true	blood serum	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	[Hepatic Encephalopathy, Carcinoma, Hepatocell...	[None]	[None]	[None]	https://www.ebi.ac.uk/metabolights/MTBLS582	[liver]	[Homo sapiens]	MTBLS582_m_mtbls582_NEG_mass_spectrometry	Lipidomics	Obesity is tightly linked to hepatic steatosis...	Metabolights	g2	NaN	NaN	12	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/Metabolomics/MTBLS582_m_mtbls...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626018028414	MTBLS582	2018:04:26	Exactive (Thermo Scientific)	Global Analyses of Selective Insulin Resistanc...	mass spectrometry	true	Hep-G2 cell	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	[Carcinoma, Hepatocellular, Liver Cirrhosis, F...	[None]	[None]	[None]	https://www.ebi.ac.uk/metabolights/MTBLS19	[liver, bile]	[Homo sapiens]	MTBLS19_m_neg_MTBLS19_metabolite profiling_mas...	Metabolomics	Although hepatocellular carcinoma (HCC) has be...	Metabolights	g2	NaN	NaN	360	NaN	liveromix_atlas	liver_atlas/data	liver_atlas/data/Metabolomics/MTBLS19_m_neg_MT...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626018026372	MTBLS19	2013:06:30	UPLC Q-TOF Premier (Waters)	LC−MS Based Serum Metabolomics for Identificat...	mass spectrometry	true	blood serum	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	[Carcinoma, Hepatocellular]	[BGJ-398]	[Hep-G2]	[None]	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[liver]	Homo sapiens	lincs_GSE70138_HEPG2_BRD-K42728290	Transcriptomics	Gene expression profile of Hep-G2 on treating ...	LINCS	g3	LINCS1000	2018	388	{'categorical_variables': {'drug': [{'name': '...	liveromix_atlas	liver_atlas/data	liver_atlas/data/Transcriptomics/lincs_GSE7013...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626152712356	NaN	NaN	NaN	NaN	NaN	true	NaN	{'is_normalized': 'true', 'batch_corrected_var...	lincs	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15	Male	Tumor
96	[Carcinoma, Hepatocellular]	[BKM120]	[Hep-G2]	[None]	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[liver]	Homo sapiens	lincs_GSE70138_HEPG2_BRD-K42191735	Transcriptomics	Gene expression profile of Hep-G2 on treating ...	LINCS	g3	LINCS1000	2018	388	{'categorical_variables': {'drug': [{'name': '...	liveromix_atlas	liver_atlas/data	liver_atlas/data/Transcriptomics/lincs_GSE7013...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626151670741	NaN	NaN	NaN	NaN	NaN	true	NaN	{'is_normalized': 'true', 'batch_corrected_var...	lincs	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15	Male	Tumor
97	[Carcinoma, Hepatocellular]	[Osimertinib]	[Hep-G2]	[None]	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[liver]	Homo sapiens	lincs_GSE70138_HEPG2_BRD-K42805893	Transcriptomics	Gene expression profile of Hep-G2 on treating ...	LINCS	g3	LINCS1000	2018	387	{'categorical_variables': {'drug': [{'name': '...	liveromix_atlas	liver_atlas/data	liver_atlas/data/Transcriptomics/lincs_GSE7013...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626153236520	NaN	NaN	NaN	NaN	NaN	true	NaN	{'is_normalized': 'true', 'batch_corrected_var...	lincs	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15	Male	Tumor
98	[Carcinoma, Hepatocellular]	[BX795]	[Hep-G2]	[None]	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[liver]	Homo sapiens	lincs_GSE70138_HEPG2_BRD-K47983010	Transcriptomics	Gene expression profile of Hep-G2 on treating ...	LINCS	g3	LINCS1000	2018	388	{'categorical_variables': {'drug': [{'name': '...	liveromix_atlas	liver_atlas/data	liver_atlas/data/Transcriptomics/lincs_GSE7013...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626158523464	NaN	NaN	NaN	NaN	NaN	true	NaN	{'is_normalized': 'true', 'batch_corrected_var...	lincs	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15	Male	Tumor
99	[Carcinoma, Hepatocellular]	[2-((aminocarbonyl)amino)-5-(4-fluorophenyl)-3...	[Hep-G2]	[None]	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...	[liver]	Homo sapiens	lincs_GSE70138_HEPG2_BRD-K51575138	Transcriptomics	Gene expression profile of Hep-G2 on treating ...	LINCS	g3	LINCS1000	2018	388	{'categorical_variables': {'drug': [{'name': '...	liveromix_atlas	liver_atlas/data	liver_atlas/data/Transcriptomics/lincs_GSE7013...	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626166038957	NaN	NaN	NaN	NaN	NaN	true	NaN	{'is_normalized': 'true', 'batch_corrected_var...	lincs	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15	Male	Tumor

100 rows × 46 columns

The initial search returns Liver heaptocellular carcinoma detaasets from multiple sources and different data types.
We are interested in liver cancer dataset from TCGA of type Transcriptomics.

Therefore, we narrow down our search to only return datasets from source TCGA and type Transcriptomics.

liver_HCC_query = """select * FROM liveromix_atlas_files 
                    WHERE organism = 'Homo sapiens' 
                    AND disease = 'Carcinoma, Hepatocellular' 
                    AND dataset_source = 'TCGA'
                    AND kw_data_type = 'Transcriptomics'"""

liver_HCC_query_df = repo_client.query_metadata(liver_HCC_query)
liver_HCC_query_df

	disease	kw_drug	kw_cell_type	kw_cell_line	publication	tissue	organism	dataset_id	kw_data_type	description	dataset_source	curation_version	total_num_samples	kw_repo	kw_package	kw_key	kw_bucket	kw_filetype	kw_region	kw_location	kw_timestamp
0	[Carcinoma, Hepatocellular, Carcinoma, Adenoca...	[Doxorubicin, sorafenib tosylate, gemcitabine,...	[None]	[None]	https://www.cell.com/cell/fulltext/S0092-8674(...	[liver]	Homo sapiens	LIHC_RNASeq_TCGA	Transcriptomics	Liver hepatocellular carcinoma RNASeq data	TCGA	g3	424	liveromix_atlas	liver_atlas/data	liver_atlas/data/RNAseq/LIHC_RNASeq_TCGA.gct	discover-prod-datalake-v1	gct	us-west-2	https://discover-prod-datalake-v1.s3-us-west-2...	1626708777124

liver_hcc_id = list(liver_HCC_query_df['dataset_id'])
liver_hcc_id

['LIHC_RNASeq_TCGA']

url = repo_client.download_data(LiverOmixAltasID, liver_hcc_id[0]).get('data')
file_name = liver_hcc_id[0]+".gct"
os.system(f"wget -O '{file_name}' '{url}'")

1.2 Outside Polly

1.2.1 Querying Single Cell RNASeq data on NCBI GEO

We query the datasets using the original source. The normal and tumor single cell datasets for liver patients were found on GEO. We used the GSE Accession ID to fetch the data and metadata in the format made available by the submitters.

Fetching liver tumor single cell data from GEO NCBI

# Fetching liver tumor data from GEO
get_data_geo('GSE125449')

Fetching normal liver single cell data from GEO NCBI

# Fetching liver tumor data from GEO
get_data_geo('GSE115469')

Data structure and format

cd GSE125449/ && ls -l

total 45268
-rw-r--r-- 1 polly users    26016 Jun 30 05:14 [0m[01;31mGSE125449_Set1_barcodes.tsv.gz[0m
-rw-r--r-- 1 polly users   158399 Jun 30 05:14 [01;31mGSE125449_Set1_genes.tsv.gz[0m
-rw-r--r-- 1 polly users 26707779 Jun 30 05:14 [01;31mGSE125449_Set1_matrix.mtx.gz[0m
-rw-r--r-- 1 polly users    33526 Jun 30 05:14 [01;31mGSE125449_Set1_samples.txt.gz[0m
-rw-r--r-- 1 polly users    23861 Jun 30 05:14 [01;31mGSE125449_Set2_barcodes.tsv.gz[0m
-rw-r--r-- 1 polly users   153470 Jun 30 05:14 [01;31mGSE125449_Set2_genes.tsv.gz[0m
-rw-r--r-- 1 polly users 19204961 Jun 30 05:14 [01;31mGSE125449_Set2_matrix.mtx.gz[0m
-rw-r--r-- 1 polly users    31132 Jun 30 05:14 [01;31mGSE125449_Set2_samples.txt.gz[0m

cd ../GSE115469/ && ls -l

total 105536
-rw-r--r-- 1 polly users 108065293 Jun 30 05:15 [0m[01;31mGSE115469_Data.csv.gz[0m

1.2.2 Querying Bulk RNASeq data on TCGA

Fetching mRNA data from TCGA

library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-LIHC",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - FPKM",
                  legacy = FALSE)
GDCdownload(query)
data <- GDCprepare(query)

Time taken to fetch data: ~ 10 mins

Downloading data for project TCGA-LIHC
GDCdownload will download 424 files. A total of 207.74981 MB
Downloading as: Tue_Jun_29_11_38_54_2021.tar.gz
Downloading: 210 MB
|=====================================================================|100%                      
Completed after 3 m 
Starting to add information to samples
 => Add clinical information to samples
 => Adding TCGA molecular information from marker papers
 => Information will have prefix 'paper_' 
From the 60483 genes we couldn't map 3881
user  system elapsed 
265.095  61.711 519.107

Data Structure and format

class: RangedSummarizedExperiment 
dim: 56602 424 
metadata(1): data_release
assays(1): HTSeq - FPKM
rownames(56602): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
  ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id
colnames(424): TCGA-DD-AAE4-01A-11R-A41C-07 TCGA-CC-A3M9-01A-11R-A213-07 ...
  TCGA-DD-AAE0-01A-11R-A41C-07 TCGA-DD-A3A8-11A-11R-A22L-07
colData names(64): barcode patient ... released sample.aux

2. Single Cell data integration

2.1 Reading single cell datasets

Reading h5ad as an anndata object and converting it to a Seurat object contained assay, gene and sample metadata

suppressMessages(library("Seurat"))
suppressMessages(library("anndata"))
suppressMessages(library("dplyr"))
suppressMessages(library("SingleCellExperiment"))
suppressMessages(library("clustermole"))
options(warn=-1)

#Create a Seurat object 
get_seurat_obj <- function(adata) {

  # Get the expression matrix
  exprs <- t(as.matrix(adata$X))
  colnames(exprs) <- adata$obs_names
  rownames(exprs) <- adata$var_names
  # Create the Seurat object
  seurat <- CreateSeuratObject(exprs)
  # Set the expression assay
  seurat <- SetAssayData(seurat, "data", exprs)
  # Add observation metadata
  seurat <- AddMetaData(seurat, adata$obs)

  return(seurat)
}

adata_normal <- read_h5ad('GSE115469_GPL16791.h5ad')
adata_tumor_1 <- read_h5ad('GSE125449_GPL18573.h5ad')
adata_tumor_2 <- read_h5ad('GSE125449_GPL20301.h5ad')

normal_seurat <- get_seurat_obj(adata_normal)
tumor_seurat_1 <- get_seurat_obj(adata_tumor_1)
tumor_seurat_2 <- get_seurat_obj(adata_tumor_2)

#Add condition column to metadata
normal_seurat@meta.data$condition <- c('Normal')
tumor_seurat_1@meta.data$condition <- c('Tumor')
tumor_seurat_2@meta.data$condition <- c('Tumor')

normal_seurat <- PercentageFeatureSet(normal_seurat, pattern = "^MT-", col.name = "percent.mt")
tumor_seurat_1 <- PercentageFeatureSet(tumor_seurat_1, pattern = "^MT-", col.name = "percent.mt")
tumor_seurat_2 <- PercentageFeatureSet(tumor_seurat_2, pattern = "^MT-", col.name = "percent.mt")

2.2 Identification of cell types for normal single cell dataset

# find markers for every cluster compared to all remaining cells, report only the positive ones
Idents(object = normal_seurat) <- normal_seurat@meta.data$clusters
liver.markers <- FindAllMarkers(normal_seurat,min.pct = 0.25, logfc.threshold = 0.25)

head(liver.markers)

A data.frame: 6 × 7
	p_val	avg_log2FC	pct.1	pct.2	p_val_adj	cluster	gene
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<int>	<chr>
HSD11B1	0	0.7473840	0.896	0.207	0	1	HSD11B1
APOM	0	0.6909939	0.913	0.204	0	1	APOM
PON3	0	0.5102462	0.858	0.177	0	1	PON3
TTC36	0	0.4761022	0.827	0.193	0	1	TTC36
F10	0	0.4080488	0.583	0.104	0	1	F10
BCHE	0	0.3824872	0.749	0.156	0	1	BCHE

Type <- c()
clusters <- c()
for (i in seq(1,20, by=1)) {
    liver.markers.filt <- as.data.frame(liver.markers %>% filter(cluster == as.character(i)) %>% top_n(n = 25, wt = avg_log2FC))
    my_overlaps <- clustermole_overlaps(genes = liver.markers.filt$gene, species = "hs")
    type <- as.character(my_overlaps[1,5])
    Type <- c(Type, type)
    clusters <- c(clusters, i)
}


clust_map <- as.data.frame(cbind(clusters,Type))
head(clust_map)

A data.frame: 6 × 2
	clusters	Type
	<fct>	<fct>
1	1	Hepatocytes
2	2	DURANTE_ADULT_OLFACTORY_NEUROEPITHELIUM_CD8_T_CELLS
3	3	Hepatocytes
4	4	AIZARANI_LIVER_C25_KUPFFER_CELLS_4
5	5	Hepatocytes
6	6	AIZARANI_LIVER_C11_HEPATOCYTES_1

#The cell type names must be cleaned in order for them to be clustered together
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('NK_NKT_CELLS','NK Cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('B cell\\D+','B cell',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD71\\D+','Erythroid cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_NK_CELLS','NK Cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('KUPFFER_CELLS','Kupffer Cells',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('MVECS','Endothelial cells',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD8_T_CELLS','CD8 T cell',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD4_T_CELLS','CD4 T cell',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('AIZARANI_LIVER_C\\d+_','',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('_\\d+$','',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('HEPATOCYTES','Hepatocytes',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\bPlasma cell\\b','Plasma cells',x)) 
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_PLASMA_CELLS','Plasma cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_PLASMA_CELL','Plasma cells',x))                               
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('EPCAM_POS_\\D+','EPCAM+',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_B_CELL','B cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('CUI_DEVELOPING_HEART_C3_FIBROBLAST_LIKE_CELL','Others',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('Erythrocytes_\\D+','Erythrocytes',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('FAN_EMBRYONIC_CTX_NSC','Others',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('Lymph_endothel_cell','Other endothelial cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_Epithelial_cells','Epithelial cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('HAY_BONE_MARROW_PRO_B','Others',x))

cell_type_clust <- aggregate(clusters ~ Type, clust_map, paste, collapse=',')
cell_type_clust

A data.frame: 12 × 2
Type	clusters
<chr>	<chr>
B cell	16
CD8 T cell	2
Cholangiocytes	17
Endothelial cells	11,13
Erythroid cells	19
Gamma delta T cells	18
Hepatic stellate cells	20
Hepatocytes	1,3,5,6,14,15
Kupffer Cells	4,10
LSECS	12
NK Cells	8,9
Plasma cells	7

#Add cell types to seurat metadata
row.meta <- rownames(normal_seurat@meta.data)
normal_seurat@meta.data <- merge(normal_seurat@meta.data, clust_map, by = 'clusters')
rownames(normal_seurat@meta.data) <- row.meta

2.3 Single cell data integration

seurat <- merge(normal_seurat, y = c(tumor_seurat_1, tumor_seurat_2))

seurat.list <- SplitObject(seurat, split.by = "condition")

#Clear memory 
rm(normal_seurat, tumor_seurat_1, tumor_seurat_2)

Performing normalisation and scaling using SCTransform

seurat.list[["Normal"]] <- SCTransform(seurat.list[["Normal"]], vars.to.regress = 'percent.mt')
seurat.list[["Tumor"]] <- SCTransform(seurat.list[["Tumor"]], vars.to.regress = 'percent.mt')

features <- SelectIntegrationFeatures(object.list = seurat.list, nfeatures = 3000)
seurat.list <- PrepSCTIntegration(object.list = seurat.list, anchor.features = features)
liver.anchors <- FindIntegrationAnchors(object.list = seurat.list, normalization.method = "SCT", anchor.features = features)

liver.combined.sct <- IntegrateData(anchorset = liver.anchors, normalization.method = "SCT")

Merging dataset 1 into 2

Extracting anchors for merged samples

Finding integration vectors

Finding integration vector weights

Integrating data

Compute PCA and UMAP using first 40 principal components

liver.combined.sct <- RunPCA(liver.combined.sct, verbose = FALSE)
liver.combined.sct <- RunUMAP(liver.combined.sct, reduction = "pca", dims = 1:40)

Perform SNN clustering of cells using first 40 principal components

liver.combined.sct <- FindNeighbors(liver.combined.sct, dims = 1:40)
liver.combined.sct <- FindClusters(liver.combined.sct, resolution = 0.8)

Plot the UMAP clustering of cell type and cluster populations

DimPlot(liver.combined.sct, reduction = "umap", group.by = 'seurat_clusters', cols = DiscretePalette(30), label = TRUE)
DimPlot(liver.combined.sct, reduction = "umap", group.by = 'Type', cols = DiscretePalette(19))

Marker analysis of fibroblast population

# find markers for fibroblast clusters compared to all remaining cells, report only the positive ones
DefaultAssay(liver.combined.sct) <- "RNA"
Idents(object = liver.combined.sct) <- liver.combined.sct@meta.data$seurat_clusters
clus7.markers <- FindMarkers(liver.combined.sct, ident.1 = '7', min.pct = 0.25)
clus15markers <- FindMarkers(liver.combined.sct, ident.1 = '15', min.pct = 0.25)

Selecting markers from Fibroblast sub populations

VlnPlot(liver.combined.sct, features = "COL1A1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "COL1A2", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "COL3A1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "BGN", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "LUM", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "SPARCL1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "GJA4", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "OAZ2", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "ADAMTS4", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "GPR4", idents = c("7","15"))

We observed that fibrogenic genes (e.g., COL1A1, COL1A2, COL3A1 ) and proteoglycan encoding genes (e.g., LUM, BGN ) were overexpressed in cluster 15. Cluster 7, however showed high overall expression of SPARCL1, GJA4.

In order to assess the clinical significance bulk RNASeq data was fetched from Liver OmixAtlas for Heaptocellular Carcinoma.

3. Survival analysis using bulk RNASeq data

Reading the mRNA data

# Parsing mRNA expression data 
gct = parse('./LIHC_RNASeq_TCGA.gct')

# column metadata
clinical_data = gct.col_metadata_df

# expression data
expression_data = gct.data_df

Getting paired normal and tumor samples

clinical_data_paired = clinical_data[clinical_data.duplicated(subset = 'patient', keep = False)]
clinical_data_paired.head()

chd	patient	barcode	sample	shortLetterCode	definition	sample_submitter_id	sample_type_id	sample_id	sample_type	state	initial_weight	pathology_report_uuid	submitter_id	oct_embedded	is_ffpe	tissue_type	synchronous_malignancy	ajcc_pathologic_stage	tumor_stage	last_known_disease_status	tissue_or_organ_of_origin	primary_diagnosis	prior_malignancy	prior_treatment	ajcc_staging_system_edition	ajcc_pathologic_t	morphology	ajcc_pathologic_n	ajcc_pathologic_m	classification_of_tumor	diagnosis_id	icd_10_code	site_of_resection_or_biopsy	tumor_grade	progression_or_recurrence	alcohol_history	exposure_id	weight	height	bmi	...	therapy_regimen.drug_4	therapy_regimen_other.drug_4	total_dose.drug_4	tx_on_clinical_trial.drug_4	AFP>300	Baylor GCC whole genome seq	bcr_drug_barcode.drug_5	bcr_drug_uuid.drug_5	form_completion_date.drug_5	pharmaceutical_therapy_drug_name.drug_5	clinical_trial_drug_classification.drug_5	pharmaceutical_therapy_type.drug_5	pharmaceutical_tx_started_days_to.drug_5	pharmaceutical_tx_ongoing_indicator.drug_5	pharmaceutical_tx_ended_days_to.drug_5	treatment_best_response.drug_5	days_to_stem_cell_transplantation.drug_5	pharm_regimen.drug_5	pharm_regimen_other.drug_5	pharma_adjuvant_cycles_count.drug_5	pharma_type_other.drug_5	pharmaceutical_tx_dose_units.drug_5	pharmaceutical_tx_total_dose_units.drug_5	prescribed_dose.drug_5	regimen_number.drug_5	route_of_administration.drug_5	stem_cell_transplantation.drug_5	stem_cell_transplantation_type.drug_5	therapy_regimen.drug_5	therapy_regimen_other.drug_5	total_dose.drug_5	tx_on_clinical_trial.drug_5	kw_curated_disease	kw_curated_drug	kw_curated_tissue	kw_curated_cell_type	kw_curated_cell_line	kw_curated_genetic_mod_type	kw_curated_modified_gene	kw_curated_gene
cid
TCGA-BC-A10Q-01A-11R-A131-07	TCGA-BC-A10Q	TCGA-BC-A10Q-01A-11R-A131-07	TCGA-BC-A10Q-01A	TP	Primary solid Tumor	TCGA-BC-A10Q-01A	1	bf225dc1-4309-4304-849e-cbcc23c8442c	Primary Tumor	released	340	10117C34-85C2-49DE-B2AD-F8D0D4570DA9	TCGA-BC-A10Q	false	FALSE	Not Reported	No	NaN	not reported	not reported	Liver	Hepatocellular carcinoma, NOS	no	No	5th	T2	8170/3	NX	MX	not reported	c896279a-640a-57f8-91c5-4d1b674cdcca	C22.0	Liver	not reported	not reported	Not Reported	41801e4c-8d70-5266-82ae-382fc1dd4a20	NaN	NaN	NaN	...	PROGRESSION	[Not Applicable]	[Not Available]	[Not Available]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Carcinoma, Hepatocellular	Doxorubicin	liver	none	none	none	none	none
TCGA-BC-A10Q-11A-11R-A131-07	TCGA-BC-A10Q	TCGA-BC-A10Q-11A-11R-A131-07	TCGA-BC-A10Q-11A	NT	Solid Tissue Normal	TCGA-BC-A10Q-11A	11	5d3bf988-3331-4412-bc17-b8f4650a8623	Solid Tissue Normal	released	340	NaN	TCGA-BC-A10Q	false	FALSE	Not Reported	No	NaN	not reported	not reported	Liver	Hepatocellular carcinoma, NOS	no	No	5th	T2	8170/3	NX	MX	not reported	c896279a-640a-57f8-91c5-4d1b674cdcca	C22.0	Liver	not reported	not reported	Not Reported	41801e4c-8d70-5266-82ae-382fc1dd4a20	NaN	NaN	NaN	...	PROGRESSION	[Not Applicable]	[Not Available]	[Not Available]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Normal	Doxorubicin	liver	none	none	none	none	none
TCGA-BC-A10R-01A-11R-A131-07	TCGA-BC-A10R	TCGA-BC-A10R-01A-11R-A131-07	TCGA-BC-A10R-01A	TP	Primary solid Tumor	TCGA-BC-A10R-01A	1	995180f2-5aa7-47b9-bb3f-585f94b2457a	Primary Tumor	released	180	65BE2C95-CADA-44FC-9313-14529AFF773E	TCGA-BC-A10R	false	FALSE	Not Reported	No	NaN	not reported	not reported	Liver	Hepatocellular carcinoma, NOS	no	No	5th	T3	8170/3	NX	MX	not reported	ef53ae43-7ab9-546f-b40c-71018f4bf37a	C22.0	Liver	not reported	not reported	Not Reported	d17d5f9e-bcd9-5955-baaa-f03189d6006a	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Carcinoma, Hepatocellular	none	liver	none	none	none	none	none
TCGA-BC-A10R-11A-11R-A131-07	TCGA-BC-A10R	TCGA-BC-A10R-11A-11R-A131-07	TCGA-BC-A10R-11A	NT	Solid Tissue Normal	TCGA-BC-A10R-11A	11	1ea00561-3d1d-49d0-a3f7-d34eb8fec235	Solid Tissue Normal	released	1620	NaN	TCGA-BC-A10R	false	FALSE	Not Reported	No	NaN	not reported	not reported	Liver	Hepatocellular carcinoma, NOS	no	No	5th	T3	8170/3	NX	MX	not reported	ef53ae43-7ab9-546f-b40c-71018f4bf37a	C22.0	Liver	not reported	not reported	Not Reported	d17d5f9e-bcd9-5955-baaa-f03189d6006a	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Normal	none	liver	none	none	none	none	none
TCGA-BC-A10T-01A-11R-A131-07	TCGA-BC-A10T	TCGA-BC-A10T-01A-11R-A131-07	TCGA-BC-A10T-01A	TP	Primary solid Tumor	TCGA-BC-A10T-01A	1	d2f6cea0-ecec-49f1-a851-e332e79ce098	Primary Tumor	released	1000	D8103C33-F303-4D97-81AF-CC3D06E79285	TCGA-BC-A10T	false	FALSE	Not Reported	No	NaN	not reported	not reported	Liver	Hepatocellular carcinoma, NOS	no	No	5th	T4	8170/3	NX	MX	not reported	b5925a40-cb9f-5954-9cfb-c7ab86404e60	C22.0	Liver	not reported	not reported	Not Reported	a1d6658a-accb-5864-a6b2-31e176486998	86.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Carcinoma, Hepatocellular	none	liver	none	none	none	none	none

5 rows × 301 columns

#Subset the expression data using the paired normal and tumor samples
expression_data_paired = expression_data[expression_data.columns.intersection(clinical_data_paired.index)]

expression_data_paired.to_csv('TCGA_LIHC_paired.csv', sep = '\t')
clinical_data_paired.to_csv('TCGA_LIHC_clinical.csv', sep = '\t')

Preparing TCGA clinical data for survival analysis

#Load R packages
suppressMessages(library("readr"))
suppressMessages(library("gplots"))
suppressMessages(library("survival"))
suppressMessages(library("survminer"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("gProfileR"))
suppressMessages(library("genefilter"))

expression_data <- as.data.frame(read_delim("TCGA_LIHC_paired.csv", "\t", escape_double = FALSE, trim_ws = TRUE, col_types = cols()))
rownames(expression_data) <- expression_data[,1]
expression_data <- expression_data[,-1]

clinical_data <- as.data.frame(read_delim("TCGA_LIHC_clinical.csv", "\t", escape_double = FALSE, trim_ws = TRUE, col_types = cols()))
rownames(clinical_data) <- clinical_data[,1]
clinical_data <- clinical_data[,-1]

# create a new boolean variable that has TRUE for dead patients
# and FALSE for live patients
clinical_data$deceased = clinical_data$vital_status == "Dead"

# create an "overall survival" variable that is equal to days_to_death
# for dead patients, and to days_to_last_follow_up for patients who
# are still alive
clinical_data$overall_survival = ifelse(clinical_data$deceased,
                                   clinical_data$days_to_death,
                                   clinical_data$days_to_last_follow_up)

# show first 10 samples
clinical_data_survival = clinical_data[,c('vital_status','deceased','days_to_death','days_to_last_follow_up','overall_survival')]
head(clinical_data_survival)

A data.frame: 6 × 5
	vital_status	deceased	days_to_death	days_to_last_follow_up	overall_survival
	<chr>	<lgl>	<dbl>	<dbl>	<dbl>
TCGA-BC-A10Q-01A-11R-A131-07	Dead	TRUE	1135	NA	1135
TCGA-BC-A10Q-11A-11R-A131-07	Dead	TRUE	1135	NA	1135
TCGA-BC-A10R-01A-11R-A131-07	Dead	TRUE	308	NA	308
TCGA-BC-A10R-11A-11R-A131-07	Dead	TRUE	308	NA	308
TCGA-BC-A10T-01A-11R-A131-07	Dead	TRUE	837	NA	837
TCGA-BC-A10T-11A-11R-A131-07	Dead	TRUE	837	NA	837

Examining the association of cluster specific gene sets with clinical data

Specific Cluster 24 gene set G24 : ( COL1A, CFH, LUM and PDGFRA ).
Specific Cluster 8 gene set G3 : ( SPARCL1, GJA4, ADAMTS4 and GPR4 ).

survival_gene_exp = function(gene_name) {

    expression_mat = t(expression_data)
    # get the expression values for the selected gene
    clinical_data$gene_name = expression_mat[rownames(clinical_data), gene_name]

    # find the median value of the gene and print it
    median_value = median(clinical_data$gene_name)

    # divide patients in two groups, up and down regulated.
    # if the patient expression is greater or equal to them median we put it
    # among the "up-regulated", otherwise among the "down-regulated"
    clinical_data$gene = ifelse(clinical_data$gene_name >= median_value, "UP", "DOWN")

    # we can fit a survival model, like we did in the previous section
    fit = survfit(Surv(overall_survival, deceased) ~ gene, data=clinical_data)

    return(list(fit, clinical_data))

}

Survival Analysis of Cluster 8 genes

We observe that overexpression of SPARCL1 gene showed a favourable prognosis in liver cancer patients.

# and finally, we produce a Kaplan-Meier plot
surv = survival_gene_exp("SPARCL1")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("SPARCL1"))

surv = survival_gene_exp("GJA4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("GJA4"))

surv = survival_gene_exp("ADAMTS4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("ADAMTS4"))

surv = survival_gene_exp("GPR4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("GPR4"))

Appendix

Package installations

NOTE : For R packages, the cells must be run in the following order as shown below

!pip3 install cmapPy --user
!pip3 install GEOparse --user
!pip3 install sh --user
!pip3 install wget --user

!sudo R -e 'BiocManager::install(c("genefilter", "SingleCellExperiment","GSEABase", "GSVA", "singscore"))'

!sudo R -e 'install.packages(c("anndata","clustermole","gridExtra"), repos = "https://cloud.r-project.org/")'

!sudo R -e 'install.packages(c("survival", "survminer", "gProfileR", "RColorBrewer", "gplots"), repos = "https://cloud.r-project.org/")'