Downloading Datasets
Download
The OmixAtlas class contains functions to download .gct, .h5ad, and .vcf files, as well as the metadata of any dataset.
download_data(repo_id, _id)
To download any dataset, use this function to obtain a signed URL for the dataset; the data can then be downloaded from that URL. NOTE: the signed URL expires 60 minutes after it is generated.
The repo_name or repo_id of an OmixAtlas can be found by calling the get_all_omixatlas() function. The dataset_id can be obtained by querying the metadata at the dataset level using query_metadata(). The query results can be parsed into a data frame for easier access using the code in the examples section.
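For instance, the rows returned by query_metadata() can be loaded into a pandas data frame. A minimal sketch, assuming the results arrive as a list of dicts; the sample rows and the query string in the comment below are hypothetical:

```python
import pandas as pd

# Hypothetical rows; in practice these would come from something like
#   results = client.query_metadata("SELECT dataset_id, disease FROM geo.datasets")
rows = [
    {"dataset_id": "GSE100003_GPL15207", "disease": "carcinoma"},
    {"dataset_id": "GSE121001_GPL19057", "disease": "normal"},
]
df = pd.DataFrame(rows)
print(df["dataset_id"].tolist())
```

The resulting data frame can then be filtered and sorted with standard pandas operations to pick the dataset_id to download.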
Parameters:
Name | Type | Description | Default |
---|---|---|---|
repo_id | str/int | repo_id of the OmixAtlas | required |
_id | str | dataset_id of the dataset to be downloaded | required |
Raises:
Type | Description |
---|---|
apiErrorException | Parameters are empty or of the wrong datatype; see the error detail. |
download_metadata(repo_key, dataset_id, file_path)
This function downloads the dataset-level metadata into a JSON file. The key of a given field in the downloaded JSON is the original_name attribute of that field in the schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
repo_key | str | repo_key (repo_name/repo_id) of the repository | required |
dataset_id | str | dataset_id of the dataset whose metadata is to be downloaded | required |
file_path | str | the system path where the JSON file is to be written | required |
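A short sketch of reading the downloaded metadata back into Python. The download_metadata() call itself is shown only in a comment, and the JSON content written below is a hypothetical stand-in for the real file:

```python
import json
import os
import tempfile

# In practice the file would be written by:
#   client.download_metadata("geo", "GSE100003_GPL15207", file_path)
# Here we write a hypothetical stand-in to show how to read it back.
file_path = tempfile.gettempdir()
json_path = os.path.join(file_path, "GSE100003_GPL15207.json")
with open(json_path, "w") as f:
    json.dump({"dataset_id": "GSE100003_GPL15207", "disease": "carcinoma"}, f)

with open(json_path) as f:
    metadata = json.load(f)
print(metadata["dataset_id"])
```

Each key in the loaded dict corresponds to the original_name of a field in the schema, as described above.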
Examples
download_data()
Downloading .gct and opening it in a data frame
```python
import os

dataset_id = "GSE100003_GPL15207"   # dataset the user wants to download
repo_key = "geo"                    # repo_name; the repo_id (9) works equally well
file_name = f"{dataset_id}.gct"
data = client.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")
```

To parse the .gct data, the cmapPy package can be used in the following manner:

```python
from cmapPy.pandasGEXpress.parse_gct import parse

gct_obj = parse(file_name)               # parse the file into a GCToo object
df_real = gct_obj.data_df                # extract the data frame
col_metadata = gct_obj.col_metadata_df   # extract the column metadata
row_metadata = gct_obj.row_metadata_df   # extract the row metadata
```
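The extracted data_df is a regular pandas DataFrame (features as rows, samples as columns), so standard pandas operations apply. A quick sketch, using a small hypothetical expression matrix in place of gct_obj.data_df:

```python
import pandas as pd

# Hypothetical stand-in for gct_obj.data_df (features x samples)
df_real = pd.DataFrame(
    {"sample_1": [2.1, 0.5], "sample_2": [1.8, 0.9]},
    index=["TP53", "BRCA1"],
)
print(df_real.shape)                  # (n_features, n_samples)
per_sample_mean = df_real.mean(axis=0)  # mean expression per sample
print(per_sample_mean)
```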
Downloading .h5ad file and opening it in a data frame
```python
import os

dataset_id = "GSE121001_GPL19057"   # dataset the user wants to download
repo_key = "sc_data_lake"           # repo_name; the repo_id (17) works equally well
file_name = f"{dataset_id}.h5ad"
data = client.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")
```

To parse the .h5ad data, the scanpy package can be used in the following manner:

```python
import scanpy as sc

data = sc.read_h5ad(file_name)
obs = data.obs.head()   # observation (cell-level) metadata
var = data.var.head()   # variable (gene-level) metadata
```
Downloading vcf files
```python
import os

dataset_id = "gnomad_v2.1.1_genome_TP53"   # dataset the user wants to download
repo_key = "gnomad"                        # repo_name; the repo_id (1628836648493) works equally well
file_name = f"{dataset_id}.vcf"
data = client.download_data(repo_key, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")
```
The downloaded vcf file can be further analysed on Polly using the Docker environment that contains the Hail package.