Downloading Datasets

The OmixAtlas class lets users interact with the functional properties of an OmixAtlas: create and update an OmixAtlas, get a summary of its contents, add, insert, or update the schema, add, update, or delete datasets, query metadata, download data, save data to a workspace, and so on.

Parameters:

token (str, default: None ) –

The authentication token copied from Polly.

Usage

from polly.omixatlas import OmixAtlas

omixatlas = OmixAtlas(token)

download_data

download_data(repo_name, _id, internal_call=False)

To download a dataset, use this function to get the signed URL of the dataset. Opening this URL downloads the data. NOTE: The signed URL expires 60 minutes after it is generated.

The repo_name OR repo_id of an OmixAtlas can be identified by calling the get_all_omixatlas() function. The dataset_id can be obtained by querying the metadata at the dataset level using query_metadata().

This data can be parsed into a data frame for better accessibility using the code under the examples section.
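As a minimal sketch of working with the signed URL, the helpers below pull the URL out of a response and check the 60-minute lifetime mentioned above. The nested response layout is taken from the examples later in this document; the function names and the idea of tracking the generation time yourself are assumptions for illustration and are not part of polly-python.

```python
from datetime import datetime, timedelta, timezone

# Lifetime stated in the text above: the signed URL expires after 60 minutes.
SIGNED_URL_LIFETIME = timedelta(minutes=60)


def extract_download_url(response):
    """Return the signed URL from a download_data() response, or None if absent.

    Assumes the layout used by the examples in this document:
    response["data"]["attributes"]["download_url"].
    """
    attributes = (response or {}).get("data", {}).get("attributes", {})
    return attributes.get("download_url")


def is_url_expired(generated_at, now=None):
    """Hypothetical helper: True once 60 minutes have passed since generated_at."""
    now = now or datetime.now(timezone.utc)
    return now - generated_at >= SIGNED_URL_LIFETIME
```

Recording `datetime.now(timezone.utc)` right after calling download_data() gives a `generated_at` value that can be checked before a delayed download is attempted.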

Parameters:

repo_name (str) –

repo_id OR repo_name. This is a mandatory field.
_id (str) –

dataset_id of the dataset to be downloaded.
internal_call (bool, default: False ) –

True if being called internally by other functions. Default is False

Raises:

apiErrorException –

The parameters are empty or have an incorrect datatype; otherwise, see the error detail.

download_metadata

download_metadata(repo_key, dataset_id, file_path, metadata_key='field_name')

This function downloads the dataset-level metadata into a JSON file. The keys used in the JSON file are controlled by the metadata_key argument. Users should use original_name for data ingestion.

Parameters:

repo_key (str) –

repo_name or repo_id of the repository that the dataset belongs to.
dataset_id (str) –

dataset_id of the dataset for which metadata should be downloaded.
file_path (str) –

the system path where the json file should be stored.
metadata_key (str, default: 'field_name' ) –

Optional parameter. The metadata_key determines the key used in the JSON file.

Raises:

InvalidParameterException –

Invalid parameter passed
InvalidPathException –

Invalid file path passed
InvalidDirectoryPathException –

Invalid file path passed

download_dataset

download_dataset(repo_key, dataset_ids, folder_path='')

This function downloads the data for the given list of dataset IDs from the specified repo into the given folder path.

Parameters:

repo_key (int / str) –

repo_id OR repo_name. This is a mandatory field.
dataset_ids (list) –

List of dataset_ids, from the repo passed, whose data should be downloaded.
folder_path (str, default: '' ) –

folder path where the datasets will be downloaded to.

Raises:

InvalidParameterException –

invalid or missing parameter
paramException –

invalid or missing folder_path provided
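Since the function rejects invalid or missing parameters, it can help to tidy the inputs first. The following is a sketch under assumptions: `prepare_download_args` is a hypothetical helper, not part of polly-python, and whether download_dataset creates a missing folder on its own is not stated here, so the sketch creates it up front.

```python
from pathlib import Path


def prepare_download_args(dataset_ids, folder_path):
    """Hypothetical helper: clean the inputs before calling download_dataset().

    Drops empty and duplicate dataset IDs (keeping first-seen order) and makes
    sure the target folder exists. Returns (unique_ids, folder_path).
    """
    if not isinstance(dataset_ids, list):
        raise ValueError("dataset_ids must be a list")
    # dict.fromkeys preserves insertion order while removing duplicates.
    unique_ids = list(dict.fromkeys(i for i in dataset_ids if i))
    if not unique_ids:
        raise ValueError("dataset_ids must contain at least one dataset ID")
    Path(folder_path).mkdir(parents=True, exist_ok=True)
    return unique_ids, folder_path
```

The returned values can then be passed on as `omixatlas.download_dataset(repo_key, unique_ids, folder_path)`.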

get_metadata

get_metadata(repo_key, dataset_id, table_name)

This function is used to get the sample level metadata as a dataframe.

Parameters:

repo_key(str) –

repo_name/repo_id of the repository.
dataset_id(str) –

dataset_id of the dataset.
table_name(str) –

Table name for the desired metadata. Only 'samples' and 'samples_singlecell' are supported for now.

Raises:

paramException –

invalid or missing parameter provided
RequestFailureException –

Request failed

Examples

# Install polly python
pip install polly-python

# Import libraries
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas

# Authenticate and create the omixatlas object
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()

Download the data file

Release >= 0.2.9

download_dataset()

list_datasets = ['GSE16219_GPL570', 'GSE16226_GPL570', 'GSE162408_GPL11180', 'GSE16246_GPL8600', 'GSE16249_GPL570']
repo_key = "geo"
dataset_ids = list_datasets
folder_path = "output_dir/"
omixatlas.download_dataset(repo_key, dataset_ids, folder_path)

downloading data file:GSE16219_GPL570_curated.gct: 100%|██████████| 736k/736k [00:00<00:00, 83.5MiB/s]
downloading data file:GSE16246_GPL8600.gct: 100%|██████████| 174k/174k [00:00<00:00, 44.8MiB/s]
downloading data file:GSE16249_GPL570.gct: 100%|██████████| 1.33M/1.33M [00:00<00:00, 161MiB/s]
downloading data file:GSE162408_GPL11180_curated.gct: 100%|██████████| 7.16M/7.16M [00:00<00:00, 75.4MiB/s]
downloading data file:GSE16226_GPL570.gct: 100%|██████████| 5.35M/5.35M [00:00<00:00, 59.8MiB/s]
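After a batch download it can be useful to confirm which dataset IDs actually produced a file. The sketch below is a hypothetical helper, not part of polly-python. It assumes, based only on the sample output above, that a file is named either `<dataset_id>.<ext>` or `<dataset_id>_<suffix>.<ext>` (for example with `_curated`); other naming schemes would need a different match rule.

```python
from pathlib import Path


def find_downloaded_files(dataset_ids, folder_path):
    """Map each dataset ID to the file names in folder_path that match it.

    A file matches when its name starts with the dataset ID and the next
    character is "." or "_", so "GSE1_GPL5" does not match "GSE1_GPL57.gct".
    """
    files = [p.name for p in Path(folder_path).iterdir() if p.is_file()]
    found = {}
    for dataset_id in dataset_ids:
        matches = []
        for name in files:
            rest = name[len(dataset_id):]
            if name.startswith(dataset_id) and rest[:1] in (".", "_"):
                matches.append(name)
        found[dataset_id] = sorted(matches)
    return found
```

Dataset IDs that map to an empty list were not found in the folder and may need to be downloaded again.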

Release < v0.2.9

Datasets present on Polly can be in GCT, h5ad, or VCF format. The following shows how to download and parse each of these formats. Note: from polly-py version 0.2.9, datasets can be downloaded using a single function, download_dataset, which takes a list of dataset_id(s) belonging to an OA/repository.

Downloading .gct and opening it in a data frame

import os
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "GSE100003_GPL15207" #dataset which user wants to download.
repo_key = "geo" #repo_name or the repo_id ("7") in string format of the omixatlas from which dataset should be downloaded.
file_name = f"{dataset_id}.gct"
data = omixatlas.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

# In order to parse the .gct data, a python package called cmapPy can be used in the following manner.
import pandas as pd
import cmapPy
from cmapPy.pandasGEXpress.parse_gct import parse

gct_obj = parse(file_name) # Parse the file to create a gct object
df_real = gct_obj.data_df # Extract the dataframe from the gct object
col_metadata = gct_obj.col_metadata_df # Extract the column metadata from the gct object
row_metadata = gct_obj.row_metadata_df # Extract the row metadata from the gct object
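Before handing a file to a parser, a quick look at its header can confirm that the download is a plausible GCT file. The sketch below uses only the standard library and relies on the general GCT layout (a version line, then a tab-separated line of counts) rather than anything specific to Polly; `read_gct_header` is a hypothetical name, and the exact header of Polly's files is an assumption.

```python
def read_gct_header(path):
    """Return the version and dimension counts from the first two GCT lines.

    Line 1 holds the version (e.g. "#1.3"). Line 2 holds tab-separated counts:
    a #1.2 file lists 2 (rows, columns); a #1.3 file lists 4
    (rows, columns, row-metadata fields, column-metadata fields).
    """
    with open(path, "r", encoding="utf-8") as handle:
        version = handle.readline().strip().lstrip("#")
        counts = [int(value) for value in handle.readline().split()]
    if len(counts) not in (2, 4):
        raise ValueError(f"Unexpected GCT dimension line in {path}")
    names = ["rows", "columns", "row_metadata_fields", "column_metadata_fields"]
    header = {"version": version}
    header.update(zip(names, counts))  # zip stops at the shorter list
    return header
```

The reported row and column counts can then be compared against the shape of `gct_obj.data_df` after parsing.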

Downloading .h5ad file and opening it in a data frame

import os
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "GSE121001_GPL19057" #dataset which user wants to download.
repo_key = "sc_data_lake" # repo_id in string format
file_name = f"{dataset_id}.h5ad"
data = omixatlas.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

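# To parse the .h5ad data, the scanpy package can be used as follows.
import pandas as pd
import scanpy as sc  # the alias "sc" is required by the call below

data = sc.read_h5ad(file_name)  # Load the file into an AnnData object
obs = data.obs.head()  # First rows of the observation (cell) metadata
var = data.var.head()  # First rows of the variable (gene) metadata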

Downloading vcf files

import os
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "gnomad_v2.1.1_genome_TP53" #dataset which user wants to download.
repo_key = "1628836648493"  # repo_id in string format, OR repo_name "gnomad", from which the dataset should be downloaded.
file_name = f"{dataset_id}.vcf"
data = omixatlas.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

The downloaded vcf file can be further analysed using the docker environment containing Hail package on Polly.
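The three download snippets above repeat the same fetch step through a `wget` shell call. As an alternative sketch, the same step can be written with `urllib.request` from the Python standard library, which avoids depending on `wget` being installed. This swaps in a different technique from the one the snippets use, and `download_file` is a hypothetical helper name.

```python
import urllib.request
from pathlib import Path


def download_file(url, file_name):
    """Download url to file_name and return the number of bytes written.

    Uses urllib.request in place of the wget call shown above. Raises an
    error if the request fails or the resulting file is empty.
    """
    urllib.request.urlretrieve(url, file_name)
    size = Path(file_name).stat().st_size
    if size == 0:
        raise RuntimeError(f"Download produced an empty file: {file_name}")
    return size
```

With the variables from the snippets above, the call would be `download_file(url, file_name)`, made before the signed URL expires.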

Download dataset level metadata

The dataset level metadata of a dataset can be downloaded as JSON in two formats:
1. For data ingestion related activities: users must use metadata_key = original_name while running the function.
2. For visualisations or querying: users are recommended to use metadata_key = field_name while running the function.

from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
dataset_id = "GSE12345_GPL123" # dataset whose metadata the user wants to download.
repo_key = "1628836648493" # repo_id OR repo_name of the repository that contains the dataset.
output_folder_path = "/metadata_folder/" # the system path where the json file should be stored
omixatlas.download_metadata(repo_key, dataset_id, output_folder_path)

The dataset level metadata for dataset = GSE12345_GPL123 has been downloaded at : = /metadata_folder/GSE12345_GPL123.json

Further, the json file can be viewed by loading into a json object.

import json
with open('/metadata_folder/GSE12345_GPL123.json') as f:
    data = json.load(f)
data
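The choice of metadata_key described in this section can be captured in a small lookup so that callers name their purpose rather than remembering the key. This is an illustrative sketch: the purpose labels and the `metadata_key_for` helper are hypothetical, while the two key values come from the text above.

```python
# Mapping from intended use to the metadata_key value described above.
METADATA_KEY_BY_PURPOSE = {
    "ingestion": "original_name",   # must be used for data ingestion
    "visualisation": "field_name",  # recommended for visualisations
    "querying": "field_name",       # recommended for querying
}


def metadata_key_for(purpose):
    """Hypothetical helper: return the metadata_key suited to a purpose."""
    try:
        return METADATA_KEY_BY_PURPOSE[purpose.lower()]
    except KeyError:
        allowed = ", ".join(sorted(METADATA_KEY_BY_PURPOSE))
        raise ValueError(f"Unknown purpose {purpose!r}; expected one of: {allowed}") from None
```

The result can be passed as the fourth argument, for example `omixatlas.download_metadata(repo_key, dataset_id, output_folder_path, metadata_key_for("ingestion"))`.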

Download entire sample level metadata

With polly-py version >0.2.8 the sample level metadata for datasets can be downloaded.

The function get_metadata can be used to download the sample level metadata. The required parameters are the repo_key (repo_name or repo_id), dataset_id, and table_name. Only sample level metadata, i.e. 'samples' (for gct files) and 'samples_singlecell' (for h5ad files), is supported for now.

import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas
AUTH_TOKEN=(os.environ['POLLY_REFRESH_TOKEN'])
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()

For h5ad files

For h5ad files, the sample level table is named samples_singlecell. It should be used as shown in the code below:

sample_df = omixatlas.get_metadata("single_cell_rnaseq_omixatlas","GSE174577_GPL24247","samples_singlecell")
sample_df

    sample_id   platform    title   characteristics_ch1 source_name_ch1 organism_ch1    umi_counts  umi_counts_log  gene_counts gene_counts_log ... kw_column   version is_current  id_key  data_id name    src_repo    src_dataset_id ...
    0   GSM5320047  GPL24247    mouse esophageal organoid   tissue: Mouse esopahgus-derived organoid|||cel...   mouse esophageal cell   Mus musculus    15208.0 4.182100772857666   2839    3.4533183400470375  ... GSM5320047:AAACCCACAGTAGTTC 0   true    kw_column   gsm5320047_aaacccacagtagttc GSM5320047:AAACCCACAGTAGTTC single_cell_rnaseq_omixatlas    GSE174577_GPL24247  ...
    1   GSM5320047  GPL24247    mouse esophageal organoid   tissue: Mouse esopahgus-derived organoid|||cel...   mouse esophageal cell   Mus musculus    10844.0 4.035229682922363   2141    3.330819466495837   ... GSM5320047:AAACCCAGTGAGAACC 0   true    kw_column   gsm5320047_aaacccagtgagaacc GSM5320047:AAACCCAGTGAGAACC single_cell_rnaseq_omixatlas    GSE174577_GPL24247  ...

For gct files

Similarly, for gct files, the table name to use is samples, as shown in the code below:

import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas
AUTH_TOKEN=(os.environ['POLLY_REFRESH_TOKEN'])
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()
# omixatlas with gct files
dataframe = omixatlas.get_metadata("geo","GSE100053_GPL10558","samples")
dataframe
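The two calls above differ only in the table name, which follows from the file type of the dataset. A minimal sketch of that rule is shown below; `table_name_for_file` is a hypothetical helper, and mapping from the file extension is an assumption made for illustration.

```python
from pathlib import Path

# Table names stated above for each supported file type.
TABLE_NAME_BY_EXTENSION = {
    ".gct": "samples",
    ".h5ad": "samples_singlecell",
}


def table_name_for_file(file_name):
    """Hypothetical helper: pick the get_metadata table_name from a file name."""
    extension = Path(file_name).suffix.lower()
    if extension not in TABLE_NAME_BY_EXTENSION:
        raise ValueError(f"No sample level table is supported for {extension!r} files")
    return TABLE_NAME_BY_EXTENSION[extension]
```

For example, the returned value can be passed as the third argument of `omixatlas.get_metadata(repo_key, dataset_id, table_name)`.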