Integrative single-cell transcriptome analysis reveals a subpopulation of fibroblasts associated with favorable prognosis of liver cancer patients
Authors H. Wang, C. Feng, M. Lu
Introduction
An integrative single cell transcriptomic data was performed between normal and tumor patients from the liver tissue. Here we show the power of integration-ready biomedical molecular data to identify markers associated with fibroblast subpopulation.
Normal and tumor single cell datasets are shown to readily integrate due to the standardised data structure on Liver OmixAtlas.
In order to evaluate the clinical significance of those markers, we utilise curated transcriptomics data for liver hepatocellular carcinoma to perform survival analysis in cancer patients. The overall analysis showed that overexpression of SPARCL1 is associated with favourable prognosis of liver cancer patients.
Import functions
Please follow the instructions below
1. Check the appendix section below to install necessary packages
2. Refresh the kernel after installation is complete
import re
from cmapPy.pandasGEXpress.parse_gct import parse
from matplotlib.pyplot import rc_context
import itertools
import pandas as pd
import numpy as np
import GEOparse
from sh import gunzip
import wget
import urllib.request
import os
Defining custom functions to use
def get_data_geo(accession):
if not os.path.isfile('./metadata/'+accession+'_family.soft.gz'):
GEOparse.get_GEO(geo=accession, destdir="./metadata/", silent=True)
gunzip('./metadata/'+accession+'_family.soft.gz')
os.makedirs(accession, exist_ok = True)
with open('./metadata/'+accession+'_family.soft', 'rt') as f:
for line in f:
if('!Series_supplementary_file' in line):
file = str(line.partition("=")[2]).strip()
urllib.request.urlretrieve(file, accession+'/'+file.split('/')[-1])
else:
continue
f.close()
1. Querying Biomedical Molecular data
1.1 On Polly Liver OmixAtlas
1.1.1 Install Polly Python
Looking in indexes: https://pypi.org/simple, http://54.245.179.143:80/
Collecting polly-python
Downloading https://files.pythonhosted.org/packages/a5/5c/3a3d40cdd4d61e99e7da4a0c2899a6d5e81c093667abf25aba2f157d18db/polly_python-0.0.4-py3-none-any.whl
Requirement already satisfied: idna in /usr/local/lib/python3.6/dist-packages (from polly-python) (2.8)
Collecting python-magic (from polly-python)
Downloading https://files.pythonhosted.org/packages/d3/99/c89223c6547df268596899334ee77b3051f606077317023617b1c43162fb/python_magic-0.4.24-py2.py3-none-any.whl
Collecting postpy2 (from polly-python)
Downloading https://files.pythonhosted.org/packages/7e/25/d98f5ea87d937bf44963be78b504dbb3ddea1c6ddfcc0985b02d062b41dd/postpy2-0.0.6-py3-none-any.whl
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from polly-python) (1.1.5)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from polly-python) (2020.6.20)
Requirement already satisfied: chardet in /usr/local/lib/python3.6/dist-packages (from polly-python) (3.0.4)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.6/dist-packages (from polly-python) (1.26.3)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from polly-python) (2.25.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (2.8.0)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->polly-python) (2019.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.7.3->pandas->polly-python) (1.14.0)
Installing collected packages: python-magic, postpy2, polly-python
Successfully installed polly-python-0.0.4 postpy2-0.0.6 python-magic-0.4.24
[33mWARNING: You are using pip version 19.2.3, however version 21.1.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
1.1.2 Connect to Polly OmixAtlas
Get Authentication Tokens
#REFRESH_TOKEN = "ENTER REFRESH TOKEN WITHIN THE DOUBLE QUOTES"
repo_client = OmixAtlas(REFRESH_TOKEN)
See a list of available Omix Atlases
repo_name | repo_id | indexes | dataset_count | disease_count | diseases | organism_count | organisms | sources | datatypes | sample_count | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | liveromix_atlas | 1615965444377 | {'gct_metadata': 'liveromix_atlas_gct_metadata... | 6760 | 761 | [normal, carcinoma, hepatocellular, obesity, n... | 22 | [homo sapiens, mus musculus, rattus norvegicus... | [geo, lincs, tcga, metabolomics workbench, met... | [transcriptomics, mutation, metabolomics, sing... | 1738079 |
1.1.3 Get Liver OmixAtlas Summary
LiverOmixAltasID = "1615965444377"
LiverOmixAtlas_summary = repo_client.omixatlas_summary(LiverOmixAltasID)
pd.DataFrame.from_dict(LiverOmixAtlas_summary, orient="index")
repo_name | repo_id | indexes | dataset_count | disease_count | diseases | organism_count | organisms | sources | datatypes | sample_count | |
---|---|---|---|---|---|---|---|---|---|---|---|
data | liveromix_atlas | 1615965444377 | {'gct_metadata': 'liveromix_atlas_gct_metadata... | 6760 | 761 | [normal, carcinoma, hepatocellular, obesity, n... | 22 | [homo sapiens, mus musculus, rattus norvegicus... | [geo, lincs, tcga, metabolomics workbench, met... | [transcriptomics, mutation, metabolomics, sing... | 1738079 |
['geo',
'lincs',
'tcga',
'metabolomics workbench',
'metabolights',
'ccle',
'gene expression omnibus (geo)',
'cptac',
'depmap',
'human protein atlas']
['transcriptomics',
'mutation',
'metabolomics',
'single cell',
'proteomics',
'lipidomics',
'mirna',
'drug screens',
'gene dependency',
'gene effect']
{'gct_metadata': 'liveromix_atlas_gct_metadata',
'h5ad_metadata': 'liveromix_atlas_h5ad_metadata',
'csv': 'liveromix_atlas_csv',
'files': 'liveromix_atlas_files',
'json': 'liveromix_atlas_json',
'ipynb': 'liveromix_atlas_ipynb',
'gct_data': 'liveromix_atlas_gct_data',
'h5ad_data': 'liveromix_atlas_h5ad_data'}
1.1.4 Querying Single Cell RNASeq data on Liver OmixAtlas
All data in Liver Atlas are structured with metadata fields tagged with ontologies and controlled vocabularies, which simplifies finding relevant datasets
To access, filter, and search the metadata in Liver OmixAtlas, the function mentioned below can be used: query_metadata
(“query written in SQL”)
For integrative single cell analysis, the authors used single cell transcriptomics data from:
1. 5 normal patients GSE115469
2. 12 cancer patients GSE125449
The study investigated a total of 8439 cells from normal patients and and 9946 cells from liver cancer tissues.
For querying the Single Cell RNASeq datasets we utilised the GEO Accession ID that was made available in the publication.
Fetching liver tumor dataset
sc_tumor_query = """select * FROM liveromix_atlas_files
WHERE kw_data_type = 'Single cell'
AND organism = 'Homo sapiens'
AND (disease = 'Carcinoma, Hepatocellular'
OR disease = 'Tumor')
"""
sc_tumor_query_df = repo_client.query_metadata(sc_tumor_query)
In this case the query returned the datasets that are required, but requires additional filtering
file_type | tissue | disease | dataset_id | organism | dataset_source | platform | description | kw_data_type | kw_cell_type | curation_version | source_process | publication | kw_cell_line | kw_drug | kw_gene | kw_repo | kw_package | kw_key | kw_bucket | kw_filetype | kw_region | kw_location | kw_timestamp | summary | overall_design | pubmed_id | total_num_samples | total_num_cells | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | h5ad | [liver, peripheral blood] | [Carcinoma, Hepatocellular] | GSE98638_GPL16791 | [Homo sapiens] | GEO | GPL16791 | [Landscape of infiltrating T cells in liver ca... | Single cell | [helper T cell, regulatory T cell] | g2 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [CD8A, CD4, IL2RA, TTR, PTCHD3, PTH] | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE98638_GPL16791.... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626796397637 | NaN | NaN | NaN | NaN | NaN |
1 | h5ad | [liver] | [Cholangiocarcinoma, Carcinoma, Hepatocellular] | GSE125449_GPL20301 | Homo sapiens | GEO | GPL20301 | Tumor cell biodiversity drives microenvironmen... | Single cell | [] | NaN | NaN | NaN | NaN | NaN | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE125449_GPL20301... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626795852462 | Single-cell transcriptome profiling of liver c... | [A total of 19 tumors were profiled. Set 1 con... | 31588021 | NaN | NaN |
2 | h5ad | [liver] | Tumor | GSE125449_GPL18573 | Homo sapiens | GEO | GPL18573 | Tumor cell biodiversity drives microenvironmen... | Single cell | [] | NaN | NaN | NaN | NaN | NaN | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE125449_GPL18573... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626622880810 | Single-cell transcriptome profiling of liver c... | [A total of 19 tumors were profiled. Set 1 con... | 31588021 | NaN | NaN |
3 | h5ad | [blood] | [Carcinoma, Hepatocellular] | GSE107747_GPL16791 | [Homo sapiens] | GEO | GPL16791 | Single-cell RNA sequencing of circulating tumo... | Single cell | [neoplastic cell] | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [None] | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE107747_GPL16791... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626795662655 | NaN | NaN | NaN | 2 | 10184 |
4 | h5ad | [liver] | [Carcinoma, Hepatocellular] | GSE112271_GPL16791 | [Homo sapiens] | GEO | GPL16791 | Single-cell RNA of multiregional sampling in h... | Single cell | [None] | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [None] | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE112271_GPL16791... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626795689768 | NaN | NaN | NaN | 7 | 31279 |
We perform a partial match using the GSE ID provided in the publication
sc_tumor_query_df_filt = sc_tumor_query_df[sc_tumor_query_df['dataset_id'].str.contains('GSE125449')]
sc_tumor_query_df_filt
file_type | tissue | disease | dataset_id | organism | dataset_source | platform | description | kw_data_type | kw_cell_type | curation_version | source_process | publication | kw_cell_line | kw_drug | kw_gene | kw_repo | kw_package | kw_key | kw_bucket | kw_filetype | kw_region | kw_location | kw_timestamp | summary | overall_design | pubmed_id | total_num_samples | total_num_cells | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | h5ad | [liver] | [Cholangiocarcinoma, Carcinoma, Hepatocellular] | GSE125449_GPL20301 | Homo sapiens | GEO | GPL20301 | Tumor cell biodiversity drives microenvironmen... | Single cell | [] | NaN | NaN | NaN | NaN | NaN | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE125449_GPL20301... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626795852462 | Single-cell transcriptome profiling of liver c... | [A total of 19 tumors were profiled. Set 1 con... | 31588021 | NaN | NaN |
2 | h5ad | [liver] | Tumor | GSE125449_GPL18573 | Homo sapiens | GEO | GPL18573 | Tumor cell biodiversity drives microenvironmen... | Single cell | [] | NaN | NaN | NaN | NaN | NaN | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE125449_GPL18573... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626622880810 | Single-cell transcriptome profiling of liver c... | [A total of 19 tumors were profiled. Set 1 con... | 31588021 | NaN | NaN |
['GSE125449_GPL20301', 'GSE125449_GPL18573']
Download single cell tumor data from OmixAtlas
url = repo_client.download_data(LiverOmixAltasID, sc_tumor_id[0]).get('data')
file_name = sc_tumor_id[0]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")
0
url = repo_client.download_data(LiverOmixAltasID, sc_tumor_id[1]).get('data')
file_name = sc_tumor_id[1]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")
0
Fetching normal liver dataset
sc_normal_query = """select * FROM liveromix_atlas_files
WHERE disease = 'Normal'
AND kw_data_type = 'Single cell'
AND organism = 'Homo sapiens'"""
sc_normal_query_df = repo_client.query_metadata(sc_normal_query)
sc_normal_query_df
file_type | tissue | disease | dataset_id | organism | dataset_source | platform | description | kw_data_type | kw_cell_type | curation_version | source_process | publication | kw_cell_line | kw_drug | kw_gene | total_num_samples | total_num_cells | kw_repo | kw_package | kw_key | kw_bucket | kw_filetype | kw_region | kw_location | kw_timestamp | authors | year | geo_series_id | geo_platform_id | pubmed_ids | title | summary | submission_date | species | cell_types | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | h5ad | [skin, liver, lung] | [Normal] | GSE133345_GPL20795 | [Homo sapiens] | GEO | GPL20795 | Deciphering human macrophage development at si... | Single cell | [tissue-resident macrophage, embryonic cell] | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [PTPRC, GYPA] | 9 | 1268 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE133345_GPL20795... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626363659950 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | h5ad | [liver] | Normal | GSE115469_GPL16791 | Homo sapiens | GEO | 10x | Single cell RNA sequencing of human liver reve... | Single cell | [] | NaN | NaN | https://www.nature.com/articles/s41467-018-063... | NaN | NaN | NaN | NaN | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE115469_GPL16791... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626622850659 | Sonya A. MacParland, Ian D McGilvray | 2018 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | h5ad | [liver] | [AIDS Dementia Complex, Normal] | GSE148796_GPL18573 | [Homo sapiens] | GEO | GPL18573 | Identification of pathogenic TRAIL-expressing ... | Single cell | [macrophage, T cell, natural killer cell, hema... | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [CD4, RAG2, TNFSF10] | 4 | 12129 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE148796_GPL18573... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626796167819 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | h5ad | [liver, bile] | [Normal] | GSE141183_GPL16791 | [Homo sapiens] | GEO | GPL16791 | RNA-sequencing with human liver organoids deri... | Single cell | [pluripotent stem cell, hepatocyte] | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [Bile Acids and Salts, bosentan] | [None] | 1 | 5120 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE141183_GPL16791... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626795963571 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | h5ad | [liver] | [Normal] | GSE130073_GPL16791 | [Homo sapiens] | GEO | GPL16791 | Single cell RNA-sequencing of iPS cells derive... | Single cell | [stem cell, hepatocyte, Kupffer cell, hepatic ... | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [retinoic acid] | [None] | 1 | 3985 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE130073_GPL16791... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626795871475 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | h5ad | [liver, muscle] | [Normal] | GSE107552_GPL11154 | [Homo sapiens] | GEO | GPL11154 | Mapping human pluripotent stem cell differenti... | Single cell | [progenitor cell, stromal cell, pluripotent st... | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [sodium(1+)] | [None] | 10 | 1934 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE107552_GPL11154... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626363174715 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | h5ad | [skin, spleen, heart, blood, stomach, esophagu... | [Normal] | GSE159929_GPL20795 | [Homo sapiens] | GEO | GPL20795 | Single-cell transcriptome profiling of an adul... | Single cell | [B cell, stromal cell, T cell, fibroblast] | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [TRBV20OR9-2, BCR] | 15 | 82889 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE159929_GPL20795... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626364686464 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | h5ad | [liver] | Normal | GSE124395_GPL16791 | Homo sapiens | GEO | mCelSeq2 | A human liver cell atlas reveals heterogeneity... | Single cell | [] | NaN | NaN | https://www.nature.com/articles/s41586-019-1373-2 | NaN | NaN | NaN | NaN | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE124395_GPL16791... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626622864853 | Nadim Aizarani, Dominic Grün | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | h5ad | [spleen, liver, bone marrow, peripheral blood] | [Normal] | GSE137539_GPL24676 | [Homo sapiens] | GEO | GPL24676 | Single-cell transcriptome profiling reveals ne... | Single cell | [granulocyte monocyte progenitor cell, mature ... | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [KIT, CD33] | 3 | 29622 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE137539_GPL24676... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626795936357 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | h5ad | [liver] | [Normal] | GSE150226_GPL19415 | [Homo sapiens, Mus musculus] | GEO | GPL19415 | Heterogeneity and chimerism of endothelial cel... | Single cell | [endothelial cell] | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [None] | 6 | 3723 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE150226_GPL19415... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626796207901 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | h5ad | [thymus, liver] | [Normal] | GSE133341_GPL20795 | [Homo sapiens] | GEO | GPL20795 | Single-cell RNA Sequencing Resolves Spatiotemp... | Single cell | [stromal cell, pluripotent stem cell, T cell, ... | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [None] | 11 | 29722 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE133341_GPL20795... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626363621023 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | h5ad | [placenta, spleen, thymus, liver, stomach, bon... | [Normal] | GSE108097_GPL22245 | [Homo sapiens, Mus musculus] | GEO | GPL22245 | Mapping Mouse Cell Atlas by Microwell-seq | Single cell | [mesenchymal stem cell] | g3 | connector | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [ES-E14, CJ7] | [None] | [None] | 1 | 390 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE108097_GPL22245... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626363202288 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | NaN | [liver, heart] | [Normal] | GSE90749_GPL11154 | Homo sapiens | GEO | NaN | Induced Pluripotent Stem Cell Differentiation ... | Single cell | [pluripotent stem cell, mononuclear cell, hepa... | g3 | NaN | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [SORT1, PSRC1, CELSR2] | 204 | 204 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE90749_GPL11154.... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626796394998 | NaN | NaN | GSE90749 | GPL11154 | [] | Induced Pluripotent Stem Cell Differentiation ... | Genome-wide association studies (GWAS) have hi... | Dec 01 2016 | Human | [NA, MSC, Tissue_stem_cells, Fibroblasts, Smoo... |
13 | NaN | [liver] | [Normal] | GSE81252_GPL16791 | Homo sapiens | GEO | NaN | Multilineage communication regulates human liv... | Single cell | [mesenchymal cell, endothelial cell, mesenchym... | g3 | NaN | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [None] | [None] | [VEGFA] | 761 | 761 | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE81252_GPL16791.... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626796391895 | NaN | NaN | GSE81252 | GPL16791 | [27669147, 28614297] | Multilineage communication regulates human liv... | Conventional 2-D differentiation from pluripot... | May 09 2016 | Human | [Hepatocytes, iPS_cells, Endothelial_cells, Sm... |
We perform a partial match using the GSE ID provided in the publication
sc_normal_query_df_filt = sc_normal_query_df[sc_normal_query_df['dataset_id'].str.contains('GSE115469')]
sc_normal_query_df_filt
file_type | tissue | disease | dataset_id | organism | dataset_source | platform | description | kw_data_type | kw_cell_type | curation_version | source_process | publication | kw_cell_line | kw_drug | kw_gene | total_num_samples | total_num_cells | kw_repo | kw_package | kw_key | kw_bucket | kw_filetype | kw_region | kw_location | kw_timestamp | authors | year | geo_series_id | geo_platform_id | pubmed_ids | title | summary | submission_date | species | cell_types | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | h5ad | [liver] | Normal | GSE115469_GPL16791 | Homo sapiens | GEO | 10x | Single cell RNA sequencing of human liver reve... | Single cell | [] | NaN | NaN | https://www.nature.com/articles/s41467-018-063... | NaN | NaN | NaN | NaN | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/SingleCell/GSE115469_GPL16791... | discover-prod-datalake-v1 | h5ad | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626622850659 | Sonya A. MacParland, Ian D McGilvray | 2018 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
['GSE115469_GPL16791']
Download single cell normal data from OmixAtlas
url = repo_client.download_data(LiverOmixAltasID, sc_normal_id[0]).get('data')
file_name = sc_normal_id[0]+".h5ad"
os.system(f"wget -O '{file_name}' '{url}'")
0
1.1.5 Querying Bulk RNASeq data on Liver OmixAtlas
For instance, we are required to perform RNASeq data analysis using transcriptomics data from TCGA.
Using the curated fields we narrow down our search to find the datasets that correspond to Liver Hepatocellular Carcinoma in Homo Sapiens
liver_HCC_query = """select * FROM liveromix_atlas_files
WHERE organism = 'Homo sapiens'
AND disease = 'Carcinoma, Hepatocellular' """
repo_client.query_metadata(liver_HCC_query)
Showing 1 - 100 of 1185 matching results
disease | kw_drug | kw_cell_line | kw_cell_type | publication | tissue | organism | dataset_id | kw_data_type | description | dataset_source | curation_version | platform | year | total_num_samples | experimental_design | kw_repo | kw_package | kw_key | kw_bucket | kw_filetype | kw_region | kw_location | kw_timestamp | study_id | release_date | kw_instrument_type | publication_title | kw_analysis_type | is_public | kw_sample_source | operation | data_repository | kw_gene | data_required | file_type | source_process | publication_name | author | abstract | overall_design | summary | type | donor_age | donor_sex | sample_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [Carcinoma, Hepatocellular, Hepatoblastoma] | [repaglinide, vanillin, SB 216763, Nedocromil,... | [Hep 3B2.1-7, Hep-G2, HLF, HuH-1, HuH-6, Huh-7... | [None] | https://www.biorxiv.org/content/10.1101/720243v1 | [liver] | Homo sapiens | DEPMAP_19Q4_primary-screen-mfi_liver | Drug Screens | Median fluorescence intensities for liver cell... | DEPMAP | g2 | Flow cytometry | 2019 | 21 | {'categorical_variables': {'kw_curated_cell_li... | liveromix_atlas | liver_atlas/data | liver_atlas/data/DrugScreens/DEPMAP_19Q4_prima... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626017983908 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | [Carcinoma, Hepatocellular, Liver Cirrhosis, N... | [None] | [None] | [None] | https://www.ebi.ac.uk/metabolights/MTBLS105 | [liver, blood plasma] | [Homo sapiens] | MTBLS105_m_mtbls105_GC_SIM_mass_spectrometry | Metabolomics | This study evaluates changes in metabolite lev... | Metabolights | g2 | NaN | NaN | 267 | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/Metabolomics/MTBLS105_m_mtbls... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626017990182 | MTBLS105 | 2015:08:13 | 5975C Series GC/MSD (Agilent);Agilent GC/MS, L... | GC-MS based plasma metabolomics for HCC biomar... | mass spectrometry | true | blood plasma | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | [Carcinoma, Hepatocellular, Liver Cirrhosis, F... | [None] | [None] | [None] | https://www.ebi.ac.uk/metabolights/MTBLS17 | [liver, bile] | [Homo sapiens] | MTBLS17_m_live_mtbls17pos_metabolite profiling... | Metabolomics | Characterizing the metabolic changes pertainin... | Metabolights | g2 | NaN | NaN | 1050 | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/Metabolomics/MTBLS17_m_live_m... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626018020896 | MTBLS17 | 2013:06:30 | UPLC Q-TOF Premier (Waters) | Utilization of metabolomics to identify serum ... | mass spectrometry | true | blood serum | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | [Hepatic Encephalopathy, Carcinoma, Hepatocell... | [None] | [None] | [None] | https://www.ebi.ac.uk/metabolights/MTBLS582 | [liver] | [Homo sapiens] | MTBLS582_m_mtbls582_NEG_mass_spectrometry | Lipidomics | Obesity is tightly linked to hepatic steatosis... | Metabolights | g2 | NaN | NaN | 12 | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/Metabolomics/MTBLS582_m_mtbls... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626018028414 | MTBLS582 | 2018:04:26 | Exactive (Thermo Scientific) | Global Analyses of Selective Insulin Resistanc... | mass spectrometry | true | Hep-G2 cell | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | [Carcinoma, Hepatocellular, Liver Cirrhosis, F... | [None] | [None] | [None] | https://www.ebi.ac.uk/metabolights/MTBLS19 | [liver, bile] | [Homo sapiens] | MTBLS19_m_neg_MTBLS19_metabolite profiling_mas... | Metabolomics | Although hepatocellular carcinoma (HCC) has be... | Metabolights | g2 | NaN | NaN | 360 | NaN | liveromix_atlas | liver_atlas/data | liver_atlas/data/Metabolomics/MTBLS19_m_neg_MT... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626018026372 | MTBLS19 | 2013:06:30 | UPLC Q-TOF Premier (Waters) | LC−MS Based Serum Metabolomics for Identificat... | mass spectrometry | true | blood serum | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | [Carcinoma, Hepatocellular] | [BGJ-398] | [Hep-G2] | [None] | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [liver] | Homo sapiens | lincs_GSE70138_HEPG2_BRD-K42728290 | Transcriptomics | Gene expression profile of Hep-G2 on treating ... | LINCS | g3 | LINCS1000 | 2018 | 388 | {'categorical_variables': {'drug': [{'name': '... | liveromix_atlas | liver_atlas/data | liver_atlas/data/Transcriptomics/lincs_GSE7013... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626152712356 | NaN | NaN | NaN | NaN | NaN | true | NaN | {'is_normalized': 'true', 'batch_corrected_var... | lincs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 | Male | Tumor |
96 | [Carcinoma, Hepatocellular] | [BKM120] | [Hep-G2] | [None] | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [liver] | Homo sapiens | lincs_GSE70138_HEPG2_BRD-K42191735 | Transcriptomics | Gene expression profile of Hep-G2 on treating ... | LINCS | g3 | LINCS1000 | 2018 | 388 | {'categorical_variables': {'drug': [{'name': '... | liveromix_atlas | liver_atlas/data | liver_atlas/data/Transcriptomics/lincs_GSE7013... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626151670741 | NaN | NaN | NaN | NaN | NaN | true | NaN | {'is_normalized': 'true', 'batch_corrected_var... | lincs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 | Male | Tumor |
97 | [Carcinoma, Hepatocellular] | [Osimertinib] | [Hep-G2] | [None] | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [liver] | Homo sapiens | lincs_GSE70138_HEPG2_BRD-K42805893 | Transcriptomics | Gene expression profile of Hep-G2 on treating ... | LINCS | g3 | LINCS1000 | 2018 | 387 | {'categorical_variables': {'drug': [{'name': '... | liveromix_atlas | liver_atlas/data | liver_atlas/data/Transcriptomics/lincs_GSE7013... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626153236520 | NaN | NaN | NaN | NaN | NaN | true | NaN | {'is_normalized': 'true', 'batch_corrected_var... | lincs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 | Male | Tumor |
98 | [Carcinoma, Hepatocellular] | [BX795] | [Hep-G2] | [None] | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [liver] | Homo sapiens | lincs_GSE70138_HEPG2_BRD-K47983010 | Transcriptomics | Gene expression profile of Hep-G2 on treating ... | LINCS | g3 | LINCS1000 | 2018 | 388 | {'categorical_variables': {'drug': [{'name': '... | liveromix_atlas | liver_atlas/data | liver_atlas/data/Transcriptomics/lincs_GSE7013... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626158523464 | NaN | NaN | NaN | NaN | NaN | true | NaN | {'is_normalized': 'true', 'batch_corrected_var... | lincs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 | Male | Tumor |
99 | [Carcinoma, Hepatocellular] | [2-((aminocarbonyl)amino)-5-(4-fluorophenyl)-3... | [Hep-G2] | [None] | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | [liver] | Homo sapiens | lincs_GSE70138_HEPG2_BRD-K51575138 | Transcriptomics | Gene expression profile of Hep-G2 on treating ... | LINCS | g3 | LINCS1000 | 2018 | 388 | {'categorical_variables': {'drug': [{'name': '... | liveromix_atlas | liver_atlas/data | liver_atlas/data/Transcriptomics/lincs_GSE7013... | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626166038957 | NaN | NaN | NaN | NaN | NaN | true | NaN | {'is_normalized': 'true', 'batch_corrected_var... | lincs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 | Male | Tumor |
100 rows × 46 columns
The initial search returns Liver heaptocellular carcinoma detaasets from multiple sources and different data types.
We are interested in liver cancer dataset from TCGA of type Transcriptomics.
Therefore, we narrow down our search to only return datasets from source TCGA and type Transcriptomics.
liver_HCC_query = """select * FROM liveromix_atlas_files
WHERE organism = 'Homo sapiens'
AND disease = 'Carcinoma, Hepatocellular'
AND dataset_source = 'TCGA'
AND kw_data_type = 'Transcriptomics'"""
liver_HCC_query_df = repo_client.query_metadata(liver_HCC_query)
liver_HCC_query_df
disease | kw_drug | kw_cell_type | kw_cell_line | publication | tissue | organism | dataset_id | kw_data_type | description | dataset_source | curation_version | total_num_samples | kw_repo | kw_package | kw_key | kw_bucket | kw_filetype | kw_region | kw_location | kw_timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [Carcinoma, Hepatocellular, Carcinoma, Adenoca... | [Doxorubicin, sorafenib tosylate, gemcitabine,... | [None] | [None] | https://www.cell.com/cell/fulltext/S0092-8674(... | [liver] | Homo sapiens | LIHC_RNASeq_TCGA | Transcriptomics | Liver hepatocellular carcinoma RNASeq data | TCGA | g3 | 424 | liveromix_atlas | liver_atlas/data | liver_atlas/data/RNAseq/LIHC_RNASeq_TCGA.gct | discover-prod-datalake-v1 | gct | us-west-2 | https://discover-prod-datalake-v1.s3-us-west-2... | 1626708777124 |
['LIHC_RNASeq_TCGA']
url = repo_client.download_data(LiverOmixAltasID, liver_hcc_id[0]).get('data')
file_name = liver_hcc_id[0]+".gct"
os.system(f"wget -O '{file_name}' '{url}'")
0
1.2 Outside Polly
1.2.1 Querying Single Cell RNASeq data on NCBI GEO
We query the datasets using the original source. The normal and tumor single cell datasets for liver patients were found on GEO. We used the GSE Accession ID to fetch the data and metadata in the format made available by the submitters.
Fetching liver tumor single cell data from GEO NCBI
Fetching normal liver single cell data from GEO NCBI
Data structure and format
total 45268
-rw-r--r-- 1 polly users 26016 Jun 30 05:14 [0m[01;31mGSE125449_Set1_barcodes.tsv.gz[0m
-rw-r--r-- 1 polly users 158399 Jun 30 05:14 [01;31mGSE125449_Set1_genes.tsv.gz[0m
-rw-r--r-- 1 polly users 26707779 Jun 30 05:14 [01;31mGSE125449_Set1_matrix.mtx.gz[0m
-rw-r--r-- 1 polly users 33526 Jun 30 05:14 [01;31mGSE125449_Set1_samples.txt.gz[0m
-rw-r--r-- 1 polly users 23861 Jun 30 05:14 [01;31mGSE125449_Set2_barcodes.tsv.gz[0m
-rw-r--r-- 1 polly users 153470 Jun 30 05:14 [01;31mGSE125449_Set2_genes.tsv.gz[0m
-rw-r--r-- 1 polly users 19204961 Jun 30 05:14 [01;31mGSE125449_Set2_matrix.mtx.gz[0m
-rw-r--r-- 1 polly users 31132 Jun 30 05:14 [01;31mGSE125449_Set2_samples.txt.gz[0m
total 105536
-rw-r--r-- 1 polly users 108065293 Jun 30 05:15 [0m[01;31mGSE115469_Data.csv.gz[0m
1.2.2 Querying Bulk RNASeq data on TCGA
Fetching mRNA data from TCGA
library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-LIHC",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - FPKM",
legacy = FALSE)
GDCdownload(query)
data <- GDCprepare(query)
Time taken to fetch data: ~ 10 mins
Downloading data for project TCGA-LIHC
GDCdownload will download 424 files. A total of 207.74981 MB
Downloading as: Tue_Jun_29_11_38_54_2021.tar.gz
Downloading: 210 MB
|=====================================================================|100%
Completed after 3 m
Starting to add information to samples
=> Add clinical information to samples
=> Adding TCGA molecular information from marker papers
=> Information will have prefix 'paper_'
From the 60483 genes we couldn't map 3881
user system elapsed
265.095 61.711 519.107
Data Structure and format
class: RangedSummarizedExperiment
dim: 56602 424
metadata(1): data_release
assays(1): HTSeq - FPKM
rownames(56602): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id
colnames(424): TCGA-DD-AAE4-01A-11R-A41C-07 TCGA-CC-A3M9-01A-11R-A213-07 ...
TCGA-DD-AAE0-01A-11R-A41C-07 TCGA-DD-A3A8-11A-11R-A22L-07
colData names(64): barcode patient ... released sample.aux
2. Single Cell data integration
2.1 Reading single cell datasets
Reading h5ad as an anndata object and converting it to a Seurat object contained assay, gene and sample metadata
suppressMessages(library("Seurat"))
suppressMessages(library("anndata"))
suppressMessages(library("dplyr"))
suppressMessages(library("SingleCellExperiment"))
suppressMessages(library("clustermole"))
options(warn=-1)
#Create a Seurat object
get_seurat_obj <- function(adata) {
# Get the expression matrix
exprs <- t(as.matrix(adata$X))
colnames(exprs) <- adata$obs_names
rownames(exprs) <- adata$var_names
# Create the Seurat object
seurat <- CreateSeuratObject(exprs)
# Set the expression assay
seurat <- SetAssayData(seurat, "data", exprs)
# Add observation metadata
seurat <- AddMetaData(seurat, adata$obs)
return(seurat)
}
adata_normal <- read_h5ad('GSE115469_GPL16791.h5ad')
adata_tumor_1 <- read_h5ad('GSE125449_GPL18573.h5ad')
adata_tumor_2 <- read_h5ad('GSE125449_GPL20301.h5ad')
normal_seurat <- get_seurat_obj(adata_normal)
tumor_seurat_1 <- get_seurat_obj(adata_tumor_1)
tumor_seurat_2 <- get_seurat_obj(adata_tumor_2)
#Add condition column to metadata
normal_seurat@meta.data$condition <- c('Normal')
tumor_seurat_1@meta.data$condition <- c('Tumor')
tumor_seurat_2@meta.data$condition <- c('Tumor')
normal_seurat <- PercentageFeatureSet(normal_seurat, pattern = "^MT-", col.name = "percent.mt")
tumor_seurat_1 <- PercentageFeatureSet(tumor_seurat_1, pattern = "^MT-", col.name = "percent.mt")
tumor_seurat_2 <- PercentageFeatureSet(tumor_seurat_2, pattern = "^MT-", col.name = "percent.mt")
2.2 Identification of cell types for normal single cell dataset
# find markers for every cluster compared to all remaining cells, report only the positive ones
Idents(object = normal_seurat) <- normal_seurat@meta.data$clusters
liver.markers <- FindAllMarkers(normal_seurat,min.pct = 0.25, logfc.threshold = 0.25)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene | |
---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <chr> | |
HSD11B1 | 0 | 0.7473840 | 0.896 | 0.207 | 0 | 1 | HSD11B1 |
APOM | 0 | 0.6909939 | 0.913 | 0.204 | 0 | 1 | APOM |
PON3 | 0 | 0.5102462 | 0.858 | 0.177 | 0 | 1 | PON3 |
TTC36 | 0 | 0.4761022 | 0.827 | 0.193 | 0 | 1 | TTC36 |
F10 | 0 | 0.4080488 | 0.583 | 0.104 | 0 | 1 | F10 |
BCHE | 0 | 0.3824872 | 0.749 | 0.156 | 0 | 1 | BCHE |
Type <- c()
clusters <- c()
for (i in seq(1,20, by=1)) {
liver.markers.filt <- as.data.frame(liver.markers %>% filter(cluster == as.character(i)) %>% top_n(n = 25, wt = avg_log2FC))
my_overlaps <- clustermole_overlaps(genes = liver.markers.filt$gene, species = "hs")
type <- as.character(my_overlaps[1,5])
Type <- c(Type, type)
clusters <- c(clusters, i)
}
clust_map <- as.data.frame(cbind(clusters,Type))
head(clust_map)
clusters | Type | |
---|---|---|
<fct> | <fct> | |
1 | 1 | Hepatocytes |
2 | 2 | DURANTE_ADULT_OLFACTORY_NEUROEPITHELIUM_CD8_T_CELLS |
3 | 3 | Hepatocytes |
4 | 4 | AIZARANI_LIVER_C25_KUPFFER_CELLS_4 |
5 | 5 | Hepatocytes |
6 | 6 | AIZARANI_LIVER_C11_HEPATOCYTES_1 |
#The cell type names must be cleaned in order for them to be clustered together
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('NK_NKT_CELLS','NK Cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('B cell\\D+','B cell',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD71\\D+','Erythroid cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_NK_CELLS','NK Cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('KUPFFER_CELLS','Kupffer Cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('MVECS','Endothelial cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD8_T_CELLS','CD8 T cell',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_CD4_T_CELLS','CD4 T cell',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('AIZARANI_LIVER_C\\d+_','',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('_\\d+$','',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('HEPATOCYTES','Hepatocytes',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\bPlasma cell\\b','Plasma cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_PLASMA_CELLS','Plasma cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_PLASMA_CELL','Plasma cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('EPCAM_POS_\\D+','EPCAM+',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_B_CELL','B cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('CUI_DEVELOPING_HEART_C3_FIBROBLAST_LIKE_CELL','Others',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('Erythrocytes_\\D+','Erythrocytes',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('FAN_EMBRYONIC_CTX_NSC','Others',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('Lymph_endothel_cell','Other endothelial cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('\\D+_Epithelial_cells','Epithelial cells',x))
clust_map$Type <- sapply(clust_map$Type, function(x) gsub('HAY_BONE_MARROW_PRO_B','Others',x))
Type | clusters |
---|---|
<chr> | <chr> |
B cell | 16 |
CD8 T cell | 2 |
Cholangiocytes | 17 |
Endothelial cells | 11,13 |
Erythroid cells | 19 |
Gamma delta T cells | 18 |
Hepatic stellate cells | 20 |
Hepatocytes | 1,3,5,6,14,15 |
Kupffer Cells | 4,10 |
LSECS | 12 |
NK Cells | 8,9 |
Plasma cells | 7 |
#Add cell types to seurat metadata
row.meta <- rownames(normal_seurat@meta.data)
normal_seurat@meta.data <- merge(normal_seurat@meta.data, clust_map, by = 'clusters')
rownames(normal_seurat@meta.data) <- row.meta
2.3 Single cell data integration
seurat <- merge(normal_seurat, y = c(tumor_seurat_1, tumor_seurat_2))
seurat.list <- SplitObject(seurat, split.by = "condition")
Performing normalisation and scaling using SCTransform
seurat.list[["Normal"]] <- SCTransform(seurat.list[["Normal"]], vars.to.regress = 'percent.mt')
seurat.list[["Tumor"]] <- SCTransform(seurat.list[["Tumor"]], vars.to.regress = 'percent.mt')
features <- SelectIntegrationFeatures(object.list = seurat.list, nfeatures = 3000)
seurat.list <- PrepSCTIntegration(object.list = seurat.list, anchor.features = features)
liver.anchors <- FindIntegrationAnchors(object.list = seurat.list, normalization.method = "SCT", anchor.features = features)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
Integrating data
Compute PCA and UMAP using first 40 principal components
liver.combined.sct <- RunPCA(liver.combined.sct, verbose = FALSE)
liver.combined.sct <- RunUMAP(liver.combined.sct, reduction = "pca", dims = 1:40)
Perform SNN clustering of cells using first 40 principal components
liver.combined.sct <- FindNeighbors(liver.combined.sct, dims = 1:40)
liver.combined.sct <- FindClusters(liver.combined.sct, resolution = 0.8)
Plot the UMAP clustering of cell type and cluster populations
DimPlot(liver.combined.sct, reduction = "umap", group.by = 'seurat_clusters', cols = DiscretePalette(30), label = TRUE)
DimPlot(liver.combined.sct, reduction = "umap", group.by = 'Type', cols = DiscretePalette(19))
Marker analysis of fibroblast population
# find markers for fibroblast clusters compared to all remaining cells, report only the positive ones
DefaultAssay(liver.combined.sct) <- "RNA"
Idents(object = liver.combined.sct) <- liver.combined.sct@meta.data$seurat_clusters
clus7.markers <- FindMarkers(liver.combined.sct, ident.1 = '7', min.pct = 0.25)
clus15markers <- FindMarkers(liver.combined.sct, ident.1 = '15', min.pct = 0.25)
Selecting markers from Fibroblast sub populations
VlnPlot(liver.combined.sct, features = "COL1A1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "COL1A2", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "COL3A1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "BGN", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "LUM", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "SPARCL1", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "GJA4", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "OAZ2", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "ADAMTS4", idents = c("7","15"))
VlnPlot(liver.combined.sct, features = "GPR4", idents = c("7","15"))
We observed that fibrogenic genes (e.g., COL1A1, COL1A2, COL3A1 ) and proteoglycan encoding genes (e.g., LUM, BGN ) were overexpressed in cluster 15. Cluster 7, however showed high overall expression of SPARCL1, GJA4.
In order to assess the clinical significance bulk RNASeq data was fetched from Liver OmixAtlas for Heaptocellular Carcinoma.
3. Survival analysis using bulk RNASeq data
Reading the mRNA data
# Parsing mRNA expression data
gct = parse('./LIHC_RNASeq_TCGA.gct')
# column metadata
clinical_data = gct.col_metadata_df
# expression data
expression_data = gct.data_df
Getting paired normal and tumor samples
clinical_data_paired = clinical_data[clinical_data.duplicated(subset = 'patient', keep = False)]
clinical_data_paired.head()
chd | patient | barcode | sample | shortLetterCode | definition | sample_submitter_id | sample_type_id | sample_id | sample_type | state | initial_weight | pathology_report_uuid | submitter_id | oct_embedded | is_ffpe | tissue_type | synchronous_malignancy | ajcc_pathologic_stage | tumor_stage | last_known_disease_status | tissue_or_organ_of_origin | primary_diagnosis | prior_malignancy | prior_treatment | ajcc_staging_system_edition | ajcc_pathologic_t | morphology | ajcc_pathologic_n | ajcc_pathologic_m | classification_of_tumor | diagnosis_id | icd_10_code | site_of_resection_or_biopsy | tumor_grade | progression_or_recurrence | alcohol_history | exposure_id | weight | height | bmi | ... | therapy_regimen.drug_4 | therapy_regimen_other.drug_4 | total_dose.drug_4 | tx_on_clinical_trial.drug_4 | AFP>300 | Baylor GCC whole genome seq | bcr_drug_barcode.drug_5 | bcr_drug_uuid.drug_5 | form_completion_date.drug_5 | pharmaceutical_therapy_drug_name.drug_5 | clinical_trial_drug_classification.drug_5 | pharmaceutical_therapy_type.drug_5 | pharmaceutical_tx_started_days_to.drug_5 | pharmaceutical_tx_ongoing_indicator.drug_5 | pharmaceutical_tx_ended_days_to.drug_5 | treatment_best_response.drug_5 | days_to_stem_cell_transplantation.drug_5 | pharm_regimen.drug_5 | pharm_regimen_other.drug_5 | pharma_adjuvant_cycles_count.drug_5 | pharma_type_other.drug_5 | pharmaceutical_tx_dose_units.drug_5 | pharmaceutical_tx_total_dose_units.drug_5 | prescribed_dose.drug_5 | regimen_number.drug_5 | route_of_administration.drug_5 | stem_cell_transplantation.drug_5 | stem_cell_transplantation_type.drug_5 | therapy_regimen.drug_5 | therapy_regimen_other.drug_5 | total_dose.drug_5 | tx_on_clinical_trial.drug_5 | kw_curated_disease | kw_curated_drug | kw_curated_tissue | kw_curated_cell_type | kw_curated_cell_line | kw_curated_genetic_mod_type | kw_curated_modified_gene | kw_curated_gene |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cid | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TCGA-BC-A10Q-01A-11R-A131-07 | TCGA-BC-A10Q | TCGA-BC-A10Q-01A-11R-A131-07 | TCGA-BC-A10Q-01A | TP | Primary solid Tumor | TCGA-BC-A10Q-01A | 1 | bf225dc1-4309-4304-849e-cbcc23c8442c | Primary Tumor | released | 340 | 10117C34-85C2-49DE-B2AD-F8D0D4570DA9 | TCGA-BC-A10Q | false | FALSE | Not Reported | No | NaN | not reported | not reported | Liver | Hepatocellular carcinoma, NOS | no | No | 5th | T2 | 8170/3 | NX | MX | not reported | c896279a-640a-57f8-91c5-4d1b674cdcca | C22.0 | Liver | not reported | not reported | Not Reported | 41801e4c-8d70-5266-82ae-382fc1dd4a20 | NaN | NaN | NaN | ... | PROGRESSION | [Not Applicable] | [Not Available] | [Not Available] | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Carcinoma, Hepatocellular | Doxorubicin | liver | none | none | none | none | none |
TCGA-BC-A10Q-11A-11R-A131-07 | TCGA-BC-A10Q | TCGA-BC-A10Q-11A-11R-A131-07 | TCGA-BC-A10Q-11A | NT | Solid Tissue Normal | TCGA-BC-A10Q-11A | 11 | 5d3bf988-3331-4412-bc17-b8f4650a8623 | Solid Tissue Normal | released | 340 | NaN | TCGA-BC-A10Q | false | FALSE | Not Reported | No | NaN | not reported | not reported | Liver | Hepatocellular carcinoma, NOS | no | No | 5th | T2 | 8170/3 | NX | MX | not reported | c896279a-640a-57f8-91c5-4d1b674cdcca | C22.0 | Liver | not reported | not reported | Not Reported | 41801e4c-8d70-5266-82ae-382fc1dd4a20 | NaN | NaN | NaN | ... | PROGRESSION | [Not Applicable] | [Not Available] | [Not Available] | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Normal | Doxorubicin | liver | none | none | none | none | none |
TCGA-BC-A10R-01A-11R-A131-07 | TCGA-BC-A10R | TCGA-BC-A10R-01A-11R-A131-07 | TCGA-BC-A10R-01A | TP | Primary solid Tumor | TCGA-BC-A10R-01A | 1 | 995180f2-5aa7-47b9-bb3f-585f94b2457a | Primary Tumor | released | 180 | 65BE2C95-CADA-44FC-9313-14529AFF773E | TCGA-BC-A10R | false | FALSE | Not Reported | No | NaN | not reported | not reported | Liver | Hepatocellular carcinoma, NOS | no | No | 5th | T3 | 8170/3 | NX | MX | not reported | ef53ae43-7ab9-546f-b40c-71018f4bf37a | C22.0 | Liver | not reported | not reported | Not Reported | d17d5f9e-bcd9-5955-baaa-f03189d6006a | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Carcinoma, Hepatocellular | none | liver | none | none | none | none | none |
TCGA-BC-A10R-11A-11R-A131-07 | TCGA-BC-A10R | TCGA-BC-A10R-11A-11R-A131-07 | TCGA-BC-A10R-11A | NT | Solid Tissue Normal | TCGA-BC-A10R-11A | 11 | 1ea00561-3d1d-49d0-a3f7-d34eb8fec235 | Solid Tissue Normal | released | 1620 | NaN | TCGA-BC-A10R | false | FALSE | Not Reported | No | NaN | not reported | not reported | Liver | Hepatocellular carcinoma, NOS | no | No | 5th | T3 | 8170/3 | NX | MX | not reported | ef53ae43-7ab9-546f-b40c-71018f4bf37a | C22.0 | Liver | not reported | not reported | Not Reported | d17d5f9e-bcd9-5955-baaa-f03189d6006a | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Normal | none | liver | none | none | none | none | none |
TCGA-BC-A10T-01A-11R-A131-07 | TCGA-BC-A10T | TCGA-BC-A10T-01A-11R-A131-07 | TCGA-BC-A10T-01A | TP | Primary solid Tumor | TCGA-BC-A10T-01A | 1 | d2f6cea0-ecec-49f1-a851-e332e79ce098 | Primary Tumor | released | 1000 | D8103C33-F303-4D97-81AF-CC3D06E79285 | TCGA-BC-A10T | false | FALSE | Not Reported | No | NaN | not reported | not reported | Liver | Hepatocellular carcinoma, NOS | no | No | 5th | T4 | 8170/3 | NX | MX | not reported | b5925a40-cb9f-5954-9cfb-c7ab86404e60 | C22.0 | Liver | not reported | not reported | Not Reported | a1d6658a-accb-5864-a6b2-31e176486998 | 86.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Carcinoma, Hepatocellular | none | liver | none | none | none | none | none |
5 rows × 301 columns
#Subset the expression data using the paired normal and tumor samples
expression_data_paired = expression_data[expression_data.columns.intersection(clinical_data_paired.index)]
expression_data_paired.to_csv('TCGA_LIHC_paired.csv', sep = '\t')
clinical_data_paired.to_csv('TCGA_LIHC_clinical.csv', sep = '\t')
Preparing TCGA clinical data for survival analysis
#Load R packages
suppressMessages(library("readr"))
suppressMessages(library("gplots"))
suppressMessages(library("survival"))
suppressMessages(library("survminer"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("gProfileR"))
suppressMessages(library("genefilter"))
expression_data <- as.data.frame(read_delim("TCGA_LIHC_paired.csv", "\t", escape_double = FALSE, trim_ws = TRUE, col_types = cols()))
rownames(expression_data) <- expression_data[,1]
expression_data <- expression_data[,-1]
clinical_data <- as.data.frame(read_delim("TCGA_LIHC_clinical.csv", "\t", escape_double = FALSE, trim_ws = TRUE, col_types = cols()))
rownames(clinical_data) <- clinical_data[,1]
clinical_data <- clinical_data[,-1]
# create a new boolean variable that has TRUE for dead patients
# and FALSE for live patients
clinical_data$deceased = clinical_data$vital_status == "Dead"
# create an "overall survival" variable that is equal to days_to_death
# for dead patients, and to days_to_last_follow_up for patients who
# are still alive
clinical_data$overall_survival = ifelse(clinical_data$deceased,
clinical_data$days_to_death,
clinical_data$days_to_last_follow_up)
# show first 10 samples
clinical_data_survival = clinical_data[,c('vital_status','deceased','days_to_death','days_to_last_follow_up','overall_survival')]
head(clinical_data_survival)
vital_status | deceased | days_to_death | days_to_last_follow_up | overall_survival | |
---|---|---|---|---|---|
<chr> | <lgl> | <dbl> | <dbl> | <dbl> | |
TCGA-BC-A10Q-01A-11R-A131-07 | Dead | TRUE | 1135 | NA | 1135 |
TCGA-BC-A10Q-11A-11R-A131-07 | Dead | TRUE | 1135 | NA | 1135 |
TCGA-BC-A10R-01A-11R-A131-07 | Dead | TRUE | 308 | NA | 308 |
TCGA-BC-A10R-11A-11R-A131-07 | Dead | TRUE | 308 | NA | 308 |
TCGA-BC-A10T-01A-11R-A131-07 | Dead | TRUE | 837 | NA | 837 |
TCGA-BC-A10T-11A-11R-A131-07 | Dead | TRUE | 837 | NA | 837 |
Examining the association of cluster specific gene sets with clinical data
- Specific Cluster 24 gene set G24 : ( COL1A, CFH, LUM and PDGFRA ).
- Specific Cluster 8 gene set G3 : ( SPARCL1, GJA4, ADAMTS4 and GPR4 ).
survival_gene_exp = function(gene_name) {
expression_mat = t(expression_data)
# get the expression values for the selected gene
clinical_data$gene_name = expression_mat[rownames(clinical_data), gene_name]
# find the median value of the gene and print it
median_value = median(clinical_data$gene_name)
# divide patients in two groups, up and down regulated.
# if the patient expression is greater or equal to them median we put it
# among the "up-regulated", otherwise among the "down-regulated"
clinical_data$gene = ifelse(clinical_data$gene_name >= median_value, "UP", "DOWN")
# we can fit a survival model, like we did in the previous section
fit = survfit(Surv(overall_survival, deceased) ~ gene, data=clinical_data)
return(list(fit, clinical_data))
}
Survival Analysis of Cluster 8 genes
We observe that overexpression of SPARCL1 gene showed a favourable prognosis in liver cancer patients.
# and finally, we produce a Kaplan-Meier plot
surv = survival_gene_exp("SPARCL1")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("SPARCL1"))
surv = survival_gene_exp("GJA4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("GJA4"))
surv = survival_gene_exp("ADAMTS4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("ADAMTS4"))
surv = survival_gene_exp("GPR4")
ggsurvplot(surv[[1]], data=surv[[2]], pval=T, risk.table=T, title=paste("GPR4"))
Appendix
Package installations
NOTE : For R packages, the cells must be run in the following order as shown below
!pip3 install cmapPy --user
!pip3 install GEOparse --user
!pip3 install sh --user
!pip3 install wget --user
!sudo R -e 'BiocManager::install(c("genefilter", "SingleCellExperiment","GSEABase", "GSVA", "singscore"))'