Downloading Datasets

Download

The OmixAtlas class provides functions to download .gct, .h5ad, and .vcf files, as well as the metadata of any dataset.

download_data(repo_id, _id)

To download any dataset, the following function can be used to obtain a signed URL for it. The data can then be downloaded by opening this URL. NOTE: the signed URL expires 60 minutes after it is generated.

The repo_name or repo_id of an OmixAtlas can be identified by calling the get_all_omixatlas() function. The dataset_id can be obtained by querying the metadata at the dataset level using query_metadata(). The results can be parsed into a data frame for easier handling, as shown in the examples section.
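Metadata rows returned by query_metadata() can be loaded into a pandas DataFrame. A minimal sketch, assuming the query result can be expressed as a list of records (the records and field names below are hypothetical; actual fields depend on the OmixAtlas schema):

```python
import pandas as pd

# Hypothetical records, as might be obtained from query_metadata();
# actual field names depend on the OmixAtlas schema.
records = [
    {"dataset_id": "GSE100003_GPL15207", "organism": "Homo sapiens"},
    {"dataset_id": "GSE121001_GPL19057", "organism": "Mus musculus"},
]

df = pd.DataFrame(records)
dataset_ids = df["dataset_id"].tolist()
print(dataset_ids)
```

The dataset_id values collected this way can then be passed to download_data().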

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| repo_id | str/int | repo_id of the OmixAtlas | required |
| _id | str | dataset_id of the dataset to be downloaded | required |

Raises:

| Type | Description |
| --- | --- |
| apiErrorException | Parameters are empty or of the wrong datatype; see the error detail. |

download_metadata(repo_key, dataset_id, file_path)

This function downloads the dataset-level metadata into a JSON file. The key of a given field in the downloaded JSON is the original_name attribute of that field in the schema.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| repo_key | str | repo_key (repo_name/repo_id) of the repository | required |
| dataset_id | str | dataset_id of the dataset whose metadata is to be downloaded | required |
| file_path | str | the system path where the JSON file is to be written | required |
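Since the examples below only cover download_data(), here is a hedged sketch of working with the output of download_metadata(): the call writes a JSON file whose keys are each field's original_name. The repo name, path, and field names below are hypothetical; to keep the sketch self-contained, a sample file is written in place of the real call.

```python
import json

# In practice, the file would be produced by something like:
# client.download_metadata("geo", "GSE100003_GPL15207", "./metadata.json")
# For illustration, write a sample file with hypothetical fields:
sample = {"dataset_id": "GSE100003_GPL15207", "description": "Example dataset"}
with open("metadata.json", "w") as f:
    json.dump(sample, f)

# The downloaded JSON can then be read back as a plain dict,
# keyed by each field's original_name.
with open("metadata.json") as f:
    metadata = json.load(f)
print(metadata["dataset_id"])
```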

Examples

download_data()

Downloading .gct and opening it in a data frame

import os
import pandas as pd
from cmapPy.pandasGEXpress.parse_gct import parse

dataset_id = "GSE100003_GPL15207"  # dataset to be downloaded
repo_key = "geo"  # repo_id (9) or repo_name ("geo") of the OmixAtlas
file_name = f"{dataset_id}.gct"
data = client.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

# The .gct file can be parsed with the cmapPy package.
gct_obj = parse(file_name)  # parse the file into a GCToo object
df_real = gct_obj.data_df  # extract the expression data frame
col_metadata = gct_obj.col_metadata_df  # column (sample) metadata
row_metadata = gct_obj.row_metadata_df  # row (feature) metadata

Downloading .h5ad file and opening it in a data frame

import os
import scanpy as sc

dataset_id = "GSE121001_GPL19057"  # dataset to be downloaded
repo_key = "sc_data_lake"  # repo_id (17) or repo_name ("sc_data_lake") of the OmixAtlas
file_name = f"{dataset_id}.h5ad"
data = client.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

# The .h5ad file can be parsed with the scanpy package.
data = sc.read_h5ad(file_name)
obs = data.obs.head()  # cell-level (observation) metadata
var = data.var.head()  # gene-level (variable) metadata

Downloading vcf files

import os

dataset_id = "gnomad_v2.1.1_genome_TP53"  # dataset to be downloaded
repo_key = "gnomad"  # repo_id (1628836648493) or repo_name ("gnomad") of the OmixAtlas
file_name = f"{dataset_id}.vcf"
data = client.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

The downloaded VCF file can be analysed further on Polly, using the Docker environment that contains the Hail package.