Data Management (Legacy)
The OmixAtlas class enables users to interact with the functional properties of an OmixAtlas: create and update an OmixAtlas, get a summary of its contents, add, insert, or update the schema, add, update, or delete datasets, query metadata, download data, save data to a workspace, and more.
Parameters:
- token (str, default: None) – token copied from Polly.
Usage
from polly.OmixAtlas import OmixAtlas
omixatlas = OmixAtlas(token)
add_datasets
This function is used to add new data to an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log appears on the data ingestion monitoring dashboard. In order to add datasets to an OmixAtlas, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:
- repo_id (int / str) – repo_id of the OmixAtlas
- source_folder_path (dict) – source folder paths from which the data and metadata files are fetched. The dictionary should contain two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"}
- priority (str, default: 'low') – optional parameter (low/medium/high). Priority at which this data is to be ingested into the OmixAtlas. The default value is "low"; acceptable values are "medium" and "high".
- validation (bool, default: False) – optional parameter (True/False) to activate validation. False by default, i.e. validation is not active unless requested. Validation needs to be activated only when validated files are being ingested.
Raises:
- paramError – If params are not passed in the desired format or a value is not valid.
- RequestException – If there is an issue in data ingestion.
Returns:
- DataFrame – pd.DataFrame showing the upload status of the files
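A minimal call sketch (the repo_id and folder paths below are illustrative, not real values):
source_folder_path = {"data": "my_atlas/data/", "metadata": "my_atlas/metadata/"}
status_df = omixatlas.add_datasets("1654268055800", source_folder_path, priority="medium")
print(status_df)  # DataFrame showing the upload status of each file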
dataset_metadata_template
This function is used to fetch the template of dataset-level metadata in a given OmixAtlas. In order to ingest the dataset-level metadata appropriately into the OmixAtlas, the user needs to ensure that the metadata JSON file contains the keys required by the dataset-level schema.
Parameters:
- repo_id (str / int) – repo_name/repo_id of the OmixAtlas
- source (str, default: 'all') – source/sources present in the schema. Default value is "all"
- data_type (str, default: 'all') – datatype/datatypes present in the schema. Default value is "all"
Returns:
- dict – A dictionary with the dataset level metadata
Raises:
- invalidApiResponseException – attribute/key error
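For instance, a sketch of fetching the template and listing its keys (the repo_id "9" is illustrative):
template = omixatlas.dataset_metadata_template("9")
print(sorted(template.keys()))  # dataset-level metadata fields expected by the schema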
delete_datasets
This function is used to delete datasets from an OmixAtlas. Once the function runs successfully, the deletion status can be seen on the data ingestion monitoring dashboard within ~2 minutes. A DataFrame with the status of the operation for each file is displayed after the function executes.
In order to delete datasets from an OmixAtlas, the user must be a Data Admin at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Note: because this function takes a list as input, users must not run it in a loop.
Parameters:
- repo_id (int) – repo_id of the OmixAtlas
- dataset_ids (list) – list of dataset_ids that the user wants to delete. It is mandatory to pass the dataset_ids to be deleted from the repo in this list.
- dataset_file_path_dict (dict, optional) – optional parameter. In case a given dataset ID has multiple files associated with it, the user has to specify the file path(s) that need to be deleted. The function get_all_file_paths can be used to get the paths of all the files that correspond to the same dataset_id.
Raises:
- paramError – If params are not passed in the desired format or a value is not valid.
- RequestException – If there is an issue in data deletion.
Returns:
- None
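A minimal call sketch, with illustrative repo_id and dataset ids:
omixatlas.delete_datasets("1643359804137", ["GSE100009_GPL11154", "GSE145009_GPL11124"])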
save_to_workspace
Function to download a dataset from an OmixAtlas and save it to Workspaces.
Parameters:
- repo_id (str) – repo_id of the OmixAtlas
- dataset_id (str) – dataset id that needs to be saved
- workspace_id (int) – workspace id in which the dataset needs to be saved
- workspace_path (str) – path where the workspace resides
Returns:
- json – info about the workspace where the data is saved and the OmixAtlas it came from
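A minimal call sketch (all identifiers and the path are illustrative):
response = omixatlas.save_to_workspace("9", "GSE107280_GPL11154", 12345, "geo_GSE107280_GPL11154")
print(response)  # JSON describing where the data was saved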
update_datasets
This function is used to update existing data in an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log appears on the data ingestion monitoring dashboard. In order to update datasets, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:
- repo_id (int / str) – repo_id of the OmixAtlas
- source_folder_path (dict) – source folder paths from which the data and metadata files are fetched. The dictionary should contain two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"}
- priority (str, default: 'low') – optional parameter (low/medium/high). Priority at which this data is to be ingested into the OmixAtlas. The default value is "low"; acceptable values are "medium" and "high".
- file_mapping (dict, optional) – defaults to an empty dict. The dictionary should be in the format {"<dataset file name>": "<dataset id>"}; the full dataset file name should be provided as the key. Example entry: {"GSE2067_GPL96.gct": "GSE2067_GPL96", "GSE2067_GPL97.gct": "GSE2067_GPL97"}
- validation (bool, optional) – optional parameter (True/False) to activate validation. False by default, i.e. validation is not active unless requested. Validation needs to be activated only when validated files are being ingested.
Raises:
- paramError – If params are not passed in the desired format or a value is not valid.
- RequestException – If there is an issue in data ingestion.
Returns:
- DataFrame – pd.DataFrame showing the upload status of the files
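A minimal call sketch showing the optional file_mapping argument passed as a keyword (repo_id, folder paths, and file names are illustrative):
source_folder_path = {"data": "my_atlas/data/", "metadata": "my_atlas/metadata/"}
file_mapping = {"GSE2067_GPL96.gct": "GSE2067_GPL96"}  # full file name -> dataset_id
omixatlas.update_datasets("1654268055800", source_folder_path, priority="low", file_mapping=file_mapping)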
move_data
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the behaviour of the data in the destination atlas may not be the same, or the ingestion may fail. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:
- source_repo_key (str / int) – source repo key of the dataset ids. Only repo_id is supported for now.
- destination_repo_key (str / int) – destination repo key where the data needs to be transferred
- dataset_ids (list) – list of dataset ids to transfer
- priority (str, default: 'medium') – optional parameter (low/medium/high). Priority of ingestion. Defaults to "medium".
Returns:
- None
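A minimal call sketch (the repo keys and dataset id are illustrative):
omixatlas.move_data("9", "17", ["GSE12332_GPL123"], priority="medium")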
Examples
# Install polly-python
pip install polly-python
# Import libraries
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas
# Create an OmixAtlas object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()
Add data to a new OmixAtlas
Addition of data to a newly created OmixAtlas, also referred to as ingestion, can be done using the polly-python function add_datasets.
The OmixAtlas to which the data is to be added should have a schema that supports the metadata contained in the data.
Please see this FAQ to check whether the metadata and the schema of the OmixAtlas match. While adding a dataset, both the metadata file (json) and the data file (h5ad, gct, vcf) are required.
In order to use this function,
- the metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys metadata and data whose values are the metadata folder path and data folder path respectively.
- each metadata file should have a corresponding data file and vice versa. The metadata and data files of a dataset are expected to have the same name as the dataset_id. For example, GSE100009_GPL11154.json and GSE100009_GPL11154.gct are the metadata file and data file for the dataset id GSE100009_GPL11154 respectively. A quick way to verify this pairing is shown in the sketch below.
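Before uploading, the two folders can be cross-checked to confirm that every metadata file has a matching data file. This is a minimal sketch, assuming the demo folder paths used later in this section:
import os

# collect dataset_id stems (file names without the extension) from each folder
data_ids = {os.path.splitext(f)[0] for f in os.listdir("data_ingestion_demo/data/")}
metadata_ids = {os.path.splitext(f)[0] for f in os.listdir("data_ingestion_demo/metadata/")}

print("metadata files without a data file:", metadata_ids - data_ids)
print("data files without a metadata file:", data_ids - metadata_ids)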
Once the files are uploaded for ingestion, the ingestion progress and logs can be monitored and fetched on the ingestion monitoring dashboard.
data_source_folder_path = "data_ingestion_demo/data/"
metadata_source_folder_path = "data_ingestion_demo/metadata/"
source_data_folder = {"data": data_source_folder_path, "metadata": metadata_source_folder_path}
repo_id = "<repo_id>"  # repo_id of the destination OmixAtlas
omixatlas.add_datasets(repo_id, source_data_folder, priority="medium")
File Name Message
0 combined_metadata.json File Uploaded
1 GSE100009_GPL11154_raw.gct File Uploaded
2 GSE100013_GPL16791_raw.gct File Uploaded
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
Usage of priority and validation flags
- priority: an optional parameter stating the priority at which the data is to be ingested into the OmixAtlas. The default value is "low"; acceptable values are "low", "medium" and "high".
- validation: can be provided as True or False. By default, validation is inactive. To activate validation, pass the additional argument validation=True while running the function.
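For example, to ingest validated files at high priority (reusing repo_id and source_data_folder from the example above):
omixatlas.add_datasets(repo_id, source_data_folder, priority="high", validation=True)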
Template for dataset level metadata
To check whether the keys used in the dataset-level metadata are compatible with the schema of the OmixAtlas, the dataset-level metadata template can be fetched using the function shown below. In this example, after fetching the template, we load the keys in the dataset-level metadata and check whether they match the schema requirements.
# example: get the metadata template of the destination atlas. Here, we are looking at repo_id "9"
data_metadata_template_geo = omixatlas.dataset_metadata_template("9")
template_keys = set(data_metadata_template_geo.keys())
# getting the dataset level metadata from the dataset metadata json that is to be ingested
import json
with open('/import/data_ingestion_demo/metadata/GSE95448_GPL19057.json') as f:
    data = json.load(f)
keys_in_json = set(data.keys())
# comparing the keys in the destination atlas template vs the keys in the dataset metadata json
intersect = keys_in_json.intersection(template_keys)
print(template_keys.difference(intersect))  # template keys missing from the metadata json
Move data from source to destination
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the behaviour of the data in the destination atlas may not be the same, or the ingestion may fail.
# example: moving 3 datasets from source "geo_transcriptomics_omixatlas" to destination "rankine_atlas"
omixatlas.move_data(source_repo_key = "geo_transcriptomics_omixatlas", destination_repo_key = "rankine_atlas",
dataset_ids = ["GSE12332_GPL123", "GSE43234_GPL143", "GSE89768_GPL967"])
Update the data or metadata in OmixAtlas
Data already ingested into the OmixAtlas can be updated by re-ingesting the metadata file of a dataset, the data file, or both, depending on what needs to be updated. The update progress can also be seen on the ingestion monitoring dashboard. However, if there is no change in the files, the process will not be initiated and will not appear on the ingestion monitoring dashboard.
In order to use this function,
- the metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys "metadata" and "data" whose values are the metadata folder path and data folder path respectively, whichever is applicable.
- in case the data or metadata being updated has not been ingested before, an appropriate warning is shown, suggesting the use of the add_datasets function to add the data first.
Ex 1: Update both data and metadata file
metadata_folder_path = "repoid_1654268055800_files_test/metadata_2/"
data_folder_path = "repoid_1654268055800_files_test/data_2/"
repo_id = "1654268055800"
priority = "medium"
source_folder_path = {"metadata": metadata_folder_path, "data": data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, priority)
repoid_1654268055800_files_test/data_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct
repoid_1654268055800_files_test/metadata_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.json
Processing Metadata files: 100%|██████████| 1/1 [00:00<00:00, 357.57it/s]
Uploading data files: 100%|██████████| 1/1 [00:00<00:00, 6.36files/s]
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 combined_metadata.json File Uploaded
1 ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct File Uploaded
Ex 2: updating only data files
data_folder_path = "repoid_1654268055800_files_test/data"
repo_id= "1654268055800"
priority = "medium"
source_folder_path = {"data":data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, priority)
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 CCLE_Mutation_C3A_LIVER.gct File Uploaded
Delete data from OmixAtlas
Datasets ingested can be deleted from the OmixAtlas using the delete_datasets function. A list of dataset ids can be provided in order to delete multiple datasets. The status of the delete operation can be seen on the ingestion monitoring dashboard within 2 minutes after the function is run.
Please note that the user needs to have the relevant permissions/roles in order to delete datasets.
repo_id = "1643359804137"
dataset_ids = ["GSE100009_GPL11154", "GSE145009_GPL11124"]
omixatlas.delete_datasets(repo_id, dataset_ids)
With polly-python >= v0.3.0, while using the delete_datasets function, in case the datasets have been ingested at multiple file paths in the OmixAtlas, the user needs to pass the file path(s) from which the datasets should be deleted.
To fetch the list of file paths where a particular dataset has been ingested in the OmixAtlas, the get_all_file_paths function can be used.
repo_id = "1673847977346"
dataset_id = "GSE140509_GPL16791"
file_paths = omixatlas.get_all_file_paths(repo_id, dataset_id)
For deleting a file present at a specific folder path, use the dataset_file_path_dict parameter to pass a dictionary with the dataset_id as key and the list of folder paths from which the dataset_id's files are to be deleted as value. Users should provide the full paths, as shown in the example below.
Here the file GSE140509_GPL16791.gct in the folder path transcriptomics_209/ is being deleted for the dataset id GSE140509_GPL16791.
# dataset id present in multiple paths -> one of the paths passed for deletion
dataset_id = ["GSE140509_GPL16791"]
repo_id = "1673847977346"
dataset_file_path_dict = {"GSE140509_GPL16791":["transcriptomics_209/GSE140509_GPL16791.gct"]}
omixatlas.delete_datasets(repo_id, dataset_id, dataset_file_path_dict=dataset_file_path_dict)
            DatasetId                                     Message                                  Folder Path
0  GSE140509_GPL16791  Request Accepted. Dataset will be deleted    transcriptomics_209/GSE140509_GPL16791.gct
                       in the next version of OmixAtlas
Data ingestion monitoring dashboard
The data ingestion monitoring dashboard on the GUI allows users to monitor the progress of ingestion runs (add_datasets, update_datasets, delete_datasets). For each dataset undergoing ingestion (addition/update) or deletion, the logs are available there to be viewed and downloaded.
To know more about the ingestion monitoring dashboard, please refer to this section.
How and why to save data in workspace?
Workspaces allow users to download and save data from an analysis or an OmixAtlas so it can be reused, instead of being downloaded to the local system. Workspaces act as storage spaces with the additional capability of sharing and collaborating with other users. The workspace id can be fetched from the URL shown on opening a workspace on the GUI, and needs to be passed as an integer.
repo_id = "9"
dataset_id = "GSE107280_GPL11154"
workspace_id = 12345
workspace_path= "geo_GSE107280_GPL11154"
omixatlas.save_to_workspace(repo_id, dataset_id, workspace_id, workspace_path)
INFO:root:Data Saved to workspace=12345
{'data': {'type': 'workspace-jobs',
'id': '9/12345',
'attributes': {'destination-gct': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154_curated.gct',
'destination-json': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154.json'}}}
Import a file from workspace to Polly Notebooks
The files present in a workspace can be viewed from a notebook and also synced, so that the files in the workspace are available for use in the current analysis notebook. Please note that only those files that are present in the same workspace as the analysis/notebook can be synced.
# to list files in the folder "repoid_1654268055800_files_test/" in the current workspace
!polly files list --workspace-path "polly://repoid_1654268055800_files_test/" -y
# copy the files from the folder "repoid_1654268055800_files_test/" in the current workspace to the notebook under the folder "destination_repoid_1654268055800_files_test/"
!polly files sync -s "polly://repoid_1654268055800_files_test/" -d "destination_repoid_1654268055800_files_test/" -y