Downloading Datasets
The OmixAtlas class lets users work with the functional properties of an OmixAtlas: create and update an OmixAtlas, get a summary of its contents, add, insert, or update the schema, add, update, or delete datasets, query metadata, download data, save data to a workspace, and so on.
Parameters:
- token (str, default: None) – token copied from Polly.
Usage
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(token)
download_data
To download any dataset, the following function can be used to get the signed URL of the dataset. The data can be downloaded by clicking on this URL. NOTE: This signed URL expires after 60 minutes from when it is generated.
The repo_name OR repo_id of an OmixAtlas can be identified by calling the get_all_omixatlas() function. The dataset_id can be obtained by querying the metadata at the dataset level using query_metadata().
This data can be parsed into a data frame for better accessibility using the code under the examples section.
Parameters:
- repo_key (str) – repo_id OR repo_name. This is a mandatory field.
- payload (dict) – The payload is a JSON file which should be as per the structure defined for schema.
- internal_call (bool, default: False) – True if being called internally by other functions. Default is False.
Raises:
- apiErrorException – A parameter is empty or has the wrong datatype; see the error detail.
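The examples later on this page read the signed URL from the response as data → attributes → download_url. A minimal sketch of that lookup, together with a check for the 60-minute expiry mentioned above, might look like the following. The helper names extract_download_url and is_url_expired are hypothetical (they are not part of polly-python), and the sample response is a placeholder with the same shape as the one used in the examples.

```python
from datetime import datetime, timedelta, timezone

# The 60-minute lifetime is the one stated in the note on signed URLs.
SIGNED_URL_LIFETIME = timedelta(minutes=60)


def extract_download_url(response):
    """Return the signed URL from a download_data() response.

    Assumes the nested layout used in the examples on this page:
    response["data"]["attributes"]["download_url"].
    """
    try:
        return response["data"]["attributes"]["download_url"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response does not contain a download_url") from exc


def is_url_expired(generated_at, now=None):
    """Return True once 60 minutes have passed since generated_at.

    Both datetimes must be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    return now - generated_at >= SIGNED_URL_LIFETIME


# Placeholder response; a real one comes from omixatlas.download_data(...).
sample_response = {"data": {"attributes": {"download_url": "https://example.com/signed"}}}
url = extract_download_url(sample_response)
generated_at = datetime.now(timezone.utc)
print(url, is_url_expired(generated_at))
```

Recording the time at which download_data was called makes it possible to request a fresh URL instead of retrying one that has already expired.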
download_metadata
This function downloads the dataset-level metadata into a JSON file.
The keys used in the JSON file are controlled by the metadata_key argument of the function.
Users should use original_name for data ingestion.
Parameters:
- repo_key (str) – repo_name/repo_id of the repository that the dataset belongs to.
- dataset_id (str) – dataset_id of the dataset for which metadata should be downloaded.
- file_path (str) – the system path where the JSON file should be stored.
- metadata_key (str, default: 'field_name') – Optional parameter. The metadata_key determines the key used in the JSON file.
Raises:
- InvalidParameterException – Invalid parameter passed
- InvalidPathException – Invalid file path passed
- InvalidDirectoryPathException – Invalid file path passed
download_dataset
This function downloads the data for each dataset ID in the provided list from the given repo into the given folder path.
Parameters:
- repo_key (int / str) – repo_id OR repo_name. This is a mandatory field.
- dataset_ids (list) – list of dataset_ids, from the given repo, whose data should be downloaded.
- folder_path (str, default: '') – folder path where the datasets will be downloaded to.
Raises:
- InvalidParameterException – invalid or missing parameter
- paramException – invalid or missing folder_path provided
get_metadata
This function is used to get the sample level metadata as a dataframe.
Parameters:
- repo_key (str) – repo_name/repo_id of the repository.
- dataset_id (str) – dataset_id of the dataset.
- table_name (str) – table name for the desired metadata; 'samples' and 'samples_singlecell' are supported for now.
Raises:
- paramException – invalid or missing parameter provided
- RequestFailureException – Request failed
Examples
# Install polly python
pip install polly-python
# Import libraries
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas
# Create omixatlas object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()
Download the data file
Release >= 0.2.9
download_dataset()
dataset_ids = ['GSE16219_GPL570', 'GSE16226_GPL570', 'GSE162408_GPL11180', 'GSE16246_GPL8600', 'GSE16249_GPL570']
repo_key = "geo"
folder_path = "output_dir/"
omixatlas.download_dataset(repo_key, dataset_ids, folder_path)
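Because download_dataset raises paramException for an invalid or missing folder_path, it can help to create the folder before the call and to check afterwards which datasets arrived. The sketch below uses two hypothetical helpers, prepare_output_dir and missing_datasets. It assumes that each downloaded file is named `<dataset_id>.<extension>`, which matches the manual download examples on this page but is not confirmed for download_dataset itself.

```python
from pathlib import Path


def prepare_output_dir(folder_path):
    """Create the download folder (and any parents) if needed; return it as a Path."""
    folder = Path(folder_path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def missing_datasets(folder_path, dataset_ids):
    """Return the dataset_ids that have no matching file in folder_path.

    Assumption: a downloaded file is named "<dataset_id>.<extension>".
    """
    names = [p.name for p in Path(folder_path).iterdir() if p.is_file()]
    return [d for d in dataset_ids if not any(n.startswith(d + ".") for n in names)]


# Usage with an authenticated OmixAtlas object (not executed here):
# folder = prepare_output_dir("output_dir/")
# omixatlas.download_dataset("geo", dataset_ids, str(folder))
# print(missing_datasets(folder, dataset_ids))
```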
Release < 0.2.9
Datasets on Polly are stored in GCT, h5ad, or VCF format.
The following examples show how to download and parse each of these formats.
Note: from polly-py version 0.2.9, datasets can be downloaded with the single function download_dataset, which accepts a list of dataset_id(s) belonging to one OmixAtlas (repository).
Downloading .gct and opening it in a data frame
import os
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "GSE100003_GPL15207"  # dataset which the user wants to download
repo_key = "geo"  # repo_name, or the repo_id ("7") in string format, of the OmixAtlas to download from
file_name = f"{dataset_id}.gct"
data = omixatlas.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")
# In order to parse the .gct data, a python package called cmapPy can be used in the following manner.
import pandas as pd
import cmapPy
from cmapPy.pandasGEXpress.parse_gct import parse
gct_obj = parse(file_name) # Parse the file to create a gct object
df_real = gct_obj.data_df # Extract the dataframe from the gct object
col_metadata = gct_obj.col_metadata_df # Extract the column metadata from the gct object
row_metadata = gct_obj.row_metadata_df # Extract the row metadata from the gct object
Downloading .h5ad file and opening it in a data frame
import os
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "GSE121001_GPL19057"  # dataset which the user wants to download
repo_key = "sc_data_lake"  # repo_name, or the repo_id in string format
file_name = f"{dataset_id}.h5ad"
data = omixatlas.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")
# In order to parse the .h5ad data, a python package called scanpy can be used in the following manner.
import pandas as pd
import scanpy as sc
data = sc.read_h5ad(file_name)
obs = data.obs.head()
var = data.var.head()
Downloading vcf files
import os
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "gnomad_v2.1.1_genome_TP53" #dataset which user wants to download.
repo_key = "1628836648493"  # repo_id in string format, or the repo_name "gnomad", of the OmixAtlas to download from
file_name = f"{dataset_id}.vcf"
data = omixatlas.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")
The downloaded vcf file can be further analysed using the docker environment containing Hail package on Polly.
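The download examples above call wget through os.system, which only works where wget is installed. As an alternative, a minimal sketch using only the Python standard library is given below. The helper name download_file and the 60-second timeout are assumptions for illustration, not part of polly-python.

```python
import shutil
import urllib.request


def download_file(url, file_name, timeout=60):
    """Stream the content at url into file_name and return file_name.

    urlopen raises an exception (for example urllib.error.HTTPError) when
    the request fails, so no separate status check is needed.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(file_name, "wb") as out:
        shutil.copyfileobj(response, out)
    return file_name


# Usage with the signed URL obtained from download_data (not executed here):
# download_file(url, f"{dataset_id}.vcf")
```

Streaming with shutil.copyfileobj avoids holding the whole file in memory, which matters for large datasets.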
Download dataset level metadata
The dataset-level metadata of a dataset can be downloaded as a JSON file, keyed in one of two ways:
1. For data ingestion related activities: users must pass metadata_key = original_name when running the function.
2. For visualisations or querying: users are recommended to pass metadata_key = field_name when running the function.
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "GSE12345_GPL123"  # dataset whose metadata the user wants to download
repo_key = "1628836648493"  # repo_id OR repo_name of the OmixAtlas that holds the dataset
output_folder_path = "/metadata_folder/"  # the system path where the JSON file should be stored
omixatlas.download_metadata(repo_key, dataset_id, output_folder_path)
The dataset level metadata for dataset = GSE12345_GPL123 has been downloaded at : = /metadata_folder/GSE12345_GPL123.json
Further, the json file can be viewed by loading into a json object.
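A minimal sketch of that loading step follows. It assumes the file is written as `<dataset_id>.json` inside the chosen folder, as the message above indicates. The helper name load_dataset_metadata is hypothetical, and nothing is assumed about the keys inside the file.

```python
import json
import os


def load_dataset_metadata(folder_path, dataset_id):
    """Load <folder_path>/<dataset_id>.json and return the parsed object."""
    json_path = os.path.join(folder_path, f"{dataset_id}.json")
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


# Usage after download_metadata has written the file (not executed here):
# metadata = load_dataset_metadata("/metadata_folder/", "GSE12345_GPL123")
# print(json.dumps(metadata, indent=2))
```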
Download entire sample level metadata
With polly-py versions later than 0.2.8, the sample-level metadata for datasets can be downloaded.
The function get_metadata can be used to download the sample-level metadata. The required parameters are repo_key, dataset_id, and table_name. Only sample-level metadata, i.e. 'samples' (for gct files) and 'samples_singlecell' (for h5ad files), is supported for now.
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()
For h5ad files
For h5ad files, the index of the sample-level table is samples_singlecell. It should be used as shown in the code below:
sample_df = omixatlas.get_metadata("single_cell_rnaseq_omixatlas", "GSE174577_GPL24247", "samples_singlecell")
sample_df
sample_id platform title characteristics_ch1 source_name_ch1 organism_ch1 umi_counts umi_counts_log gene_counts gene_counts_log ... kw_column version is_current id_key data_id name src_repo src_dataset_id ...
0 GSM5320047 GPL24247 mouse esophageal organoid tissue: Mouse esopahgus-derived organoid|||cel... mouse esophageal cell Mus musculus 15208.0 4.182100772857666 2839 3.4533183400470375 ... GSM5320047:AAACCCACAGTAGTTC 0 true kw_column gsm5320047_aaacccacagtagttc GSM5320047:AAACCCACAGTAGTTC single_cell_rnaseq_omixatlas GSE174577_GPL24247 ...
1 GSM5320047 GPL24247 mouse esophageal organoid tissue: Mouse esopahgus-derived organoid|||cel... mouse esophageal cell Mus musculus 10844.0 4.035229682922363 2141 3.330819466495837 ... GSM5320047:AAACCCAGTGAGAACC 0 true kw_column gsm5320047_aaacccagtgagaacc GSM5320047:AAACCCAGTGAGAACC single_cell_rnaseq_omixatlas GSE174577_GPL24247 ...
For gct files
Similarly, for gct files, the table name to use is samples.
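One way to keep the two table names straight is to select the table from the file type. The sketch below wraps get_metadata in a hypothetical helper, get_sample_metadata, which takes an authenticated OmixAtlas object as its first argument. The repository and dataset in the usage comment are borrowed from the .gct download example on this page purely for illustration; whether that dataset has sample-level metadata is not confirmed here.

```python
# Table names documented for get_metadata(); other file types are not supported.
SAMPLE_TABLE_BY_FILE_TYPE = {
    "gct": "samples",
    "h5ad": "samples_singlecell",
}


def get_sample_metadata(omixatlas, repo_key, dataset_id, file_type):
    """Call omixatlas.get_metadata() with the table name that matches file_type."""
    try:
        table_name = SAMPLE_TABLE_BY_FILE_TYPE[file_type]
    except KeyError:
        raise ValueError(f"unsupported file type: {file_type!r}") from None
    return omixatlas.get_metadata(repo_key, dataset_id, table_name)


# Usage with an authenticated OmixAtlas object (not executed here):
# sample_df = get_sample_metadata(omixatlas, "geo", "GSE100003_GPL15207", "gct")
# sample_df.head()
```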