Data Management (Legacy)
The OmixAtlas class enables users to interact with the functional properties of an OmixAtlas: create and update an OmixAtlas, get a summary of its contents, add, insert, or update the schema, add, update, or delete datasets, query metadata, download data, save data to a workspace, and more.
Parameters:
- token (str, default: None) – token copied from Polly.
Usage
from polly.OmixAtlas import OmixAtlas
omixatlas = OmixAtlas(token)
add_datasets
This function is used to add new data to an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log appears on the data ingestion monitoring dashboard. In order to add datasets to an OmixAtlas, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:
- repo_id (int / str) – repo_id of the OmixAtlas
- source_folder_path (dict) – source folder paths from which the data and metadata files are fetched. The dictionary should contain two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"}
- priority (str, default: 'low') – optional parameter (low/medium/high). Priority at which this data is to be ingested into the OmixAtlas. The default value is "low"; acceptable values are "medium" and "high".
- validation (bool, default: False) – optional parameter (True/False) to activate validation. False by default, i.e. validation is not active unless requested. Validation needs to be activated only when validated files are being ingested.
Raises:
- paramError – If params are not passed in the desired format or a value is not valid.
- RequestException – If there is an issue in data ingestion.
Returns:
- DataFrame – pd.DataFrame showing the upload status of the files
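A minimal call sketch (the repo_id and folder paths below are illustrative, not real values):
source_folder_path = {"data": "my_atlas/data/", "metadata": "my_atlas/metadata/"}
status_df = omixatlas.add_datasets("1654268055800", source_folder_path, priority="medium")
print(status_df)  # DataFrame showing the upload status of each file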
dataset_metadata_template
This function is used to fetch the template of dataset-level metadata in a given OmixAtlas. In order to ingest the dataset-level metadata appropriately into the OmixAtlas, the user needs to ensure that the metadata JSON file contains the keys required by the dataset-level schema.
Parameters:
- repo_id (str / int) – repo_name/repo_id of the OmixAtlas
- source (str, default: 'all') – source/sources present in the schema. Default value is "all"
- data_type (str, default: 'all') – datatype/datatypes present in the schema. Default value is "all"
Returns:
- dict – A dictionary with the dataset level metadata
Raises:
- invalidApiResponseException – attribute/key error
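For instance, a sketch of fetching the template and listing its keys (the repo_id "9" is illustrative):
template = omixatlas.dataset_metadata_template("9")
print(sorted(template.keys()))  # dataset-level metadata fields expected by the schema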
delete_datasets
This function is used to delete datasets from an OmixAtlas. Once the function runs successfully, the deletion status can be seen on the data ingestion monitoring dashboard within ~2 minutes. A DataFrame with the status of the operation for each file is displayed after the function executes.
In order to delete datasets from an OmixAtlas, the user must be a Data Admin at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Note: because this function takes a list as input, users must not run it in a loop.
Parameters:
- repo_id (int) – repo_id of the OmixAtlas
- dataset_ids (list) – list of dataset_ids that the user wants to delete. It is mandatory to pass the dataset_ids to be deleted from the repo in this list.
- dataset_file_path_dict (dict, optional) – optional parameter. In case a given dataset ID has multiple files associated with it, the user has to specify the file path(s) that need to be deleted. The function get_all_file_paths can be used to get the paths of all the files that correspond to the same dataset_id.
Raises:
- paramError – If params are not passed in the desired format or a value is not valid.
- RequestException – If there is an issue in data deletion.
Returns:
- None
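A minimal call sketch, with illustrative repo_id and dataset ids:
omixatlas.delete_datasets("1643359804137", ["GSE100009_GPL11154", "GSE145009_GPL11124"])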
save_to_workspace
Function to download a dataset from an OmixAtlas and save it to Workspaces.
Parameters:
- repo_id (str) – repo_id of the OmixAtlas
- dataset_id (str) – dataset id that needs to be saved
- workspace_id (int) – workspace id in which the dataset needs to be saved
- workspace_path (str) – path where the workspace resides
Returns:
- json – info about the workspace where the data is saved and the OmixAtlas it came from
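A minimal call sketch (all identifiers and the path are illustrative):
response = omixatlas.save_to_workspace("9", "GSE107280_GPL11154", 12345, "geo_GSE107280_GPL11154")
print(response)  # JSON describing where the data was saved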
update_datasets
This function is used to update existing data in an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log appears on the data ingestion monitoring dashboard. In order to update datasets, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:
- repo_id (int / str) – repo_id of the OmixAtlas
- source_folder_path (dict) – source folder paths from which the data and metadata files are fetched. The dictionary should contain two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"}
- priority (str, default: 'low') – optional parameter (low/medium/high). Priority at which this data is to be ingested into the OmixAtlas. The default value is "low"; acceptable values are "medium" and "high".
- file_mapping (dict, optional) – defaults to an empty dict. The dictionary should be in the format {"<dataset file name>": "<dataset id>"}; the full dataset file name should be provided as the key. Example entry: {"GSE2067_GPL96.gct": "GSE2067_GPL96", "GSE2067_GPL97.gct": "GSE2067_GPL97"}
- validation (bool, optional) – optional parameter (True/False) to activate validation. False by default, i.e. validation is not active unless requested. Validation needs to be activated only when validated files are being ingested.
Raises:
- paramError – If params are not passed in the desired format or a value is not valid.
- RequestException – If there is an issue in data ingestion.
Returns:
- DataFrame – pd.DataFrame showing the upload status of the files
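A minimal call sketch showing the optional file_mapping argument passed as a keyword (repo_id, folder paths, and file names are illustrative):
source_folder_path = {"data": "my_atlas/data/", "metadata": "my_atlas/metadata/"}
file_mapping = {"GSE2067_GPL96.gct": "GSE2067_GPL96"}  # full file name -> dataset_id
omixatlas.update_datasets("1654268055800", source_folder_path, priority="low", file_mapping=file_mapping)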
move_data
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the behaviour of the data in the destination atlas may not be the same, or the ingestion may fail. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:
- source_repo_key (str / int) – source repo key of the dataset ids. Only repo_id is supported for now.
- destination_repo_key (str / int) – destination repo key where the data needs to be transferred
- dataset_ids (list) – list of dataset ids to transfer
- priority (str, default: 'medium') – optional parameter (low/medium/high). Priority of ingestion. Defaults to "medium".
Returns:
- None
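A minimal call sketch (the repo keys and dataset id are illustrative):
omixatlas.move_data("9", "17", ["GSE12332_GPL123"], priority="medium")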
Examples
# Install polly-python
pip install polly-python
# Import libraries
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas
# Create an OmixAtlas object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()
Add data to a new OmixAtlas
Addition of data to a newly created OmixAtlas, also referred to as ingestion, can be done using the polly-python function add_datasets.
The OmixAtlas to which the data is to be added should have a schema that supports the metadata contained in the data.
Please see this FAQ to check whether the metadata and the schema of the OmixAtlas match. While adding a dataset, both the metadata file (json) and the data file (h5ad, gct, vcf) are required.
In order to use this function,
- the metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys metadata and data whose values are the metadata folder path and data folder path respectively.
- each metadata file should have a corresponding data file and vice versa. The metadata and data files of a dataset are expected to have the same name as the dataset_id. For example, GSE100009_GPL11154.json and GSE100009_GPL11154.gct are the metadata file and data file for the dataset id GSE100009_GPL11154 respectively. A quick way to verify this pairing is shown in the sketch below.
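Before uploading, the two folders can be cross-checked to confirm that every metadata file has a matching data file. This is a minimal sketch, assuming the demo folder paths used later in this section:
import os

# collect dataset_id stems (file names without the extension) from each folder
data_ids = {os.path.splitext(f)[0] for f in os.listdir("data_ingestion_demo/data/")}
metadata_ids = {os.path.splitext(f)[0] for f in os.listdir("data_ingestion_demo/metadata/")}

print("metadata files without a data file:", metadata_ids - data_ids)
print("data files without a metadata file:", data_ids - metadata_ids)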
Once the files are uploaded for ingestion, the ingestion progress and logs can be monitored and fetched on the ingestion monitoring dashboard.
data_source_folder_path = "data_ingestion_demo/data/"
metadata_source_folder_path = "data_ingestion_demo/metadata/"
source_data_folder = {"data": data_source_folder_path, "metadata": metadata_source_folder_path}
repo_id = "<repo_id>"  # repo_id of the destination OmixAtlas
omixatlas.add_datasets(repo_id, source_data_folder, priority="medium")
File Name Message
0 combined_metadata.json File Uploaded
1 GSE100009_GPL11154_raw.gct File Uploaded
2 GSE100013_GPL16791_raw.gct File Uploaded
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
Usage of priority and validation flags
- priority: an optional parameter stating the priority at which the data is to be ingested into the OmixAtlas. The default value is "low"; acceptable values are "low", "medium" and "high".
- validation: can be provided as True or False. By default, validation is inactive. To activate validation, pass the additional argument validation=True while running the function.
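For example, to ingest validated files at high priority (reusing repo_id and source_data_folder from the example above):
omixatlas.add_datasets(repo_id, source_data_folder, priority="high", validation=True)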
Template for dataset level metadata
To check whether the keys used in the dataset-level metadata are compatible with the schema of the OmixAtlas, the dataset-level metadata template can be fetched using the function shown below. In this example, after fetching the template, we load the keys in the dataset-level metadata and check whether they match the schema requirements.
# example: get the metadata template of the destination atlas. Here, we are looking at repo_id "9"
data_metadata_template_geo = omixatlas.dataset_metadata_template("9")
template_keys = set(data_metadata_template_geo.keys())
# getting the dataset level metadata from the dataset metadata json that is to be ingested
import json
with open('/import/data_ingestion_demo/metadata/GSE95448_GPL19057.json') as f:
    data = json.load(f)
keys_in_json = set(data.keys())
# comparing the keys in the destination atlas template vs the keys in the dataset metadata json
intersect = keys_in_json.intersection(template_keys)
print(template_keys.difference(intersect))  # template keys missing from the metadata json
Move data from source to destination
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the behaviour of the data in the destination atlas may not be the same, or the ingestion may fail.
# example: moving 3 datasets from source "geo_transcriptomics_omixatlas" to destination "rankine_atlas"
omixatlas.move_data(source_repo_key = "geo_transcriptomics_omixatlas", destination_repo_key = "rankine_atlas",
dataset_ids = ["GSE12332_GPL123", "GSE43234_GPL143", "GSE89768_GPL967"])
Update the data or metadata in OmixAtlas
Data already ingested into the OmixAtlas can be updated by re-ingesting the metadata file of a dataset, the data file, or both, depending on what needs to be updated. The update progress can also be seen on the ingestion monitoring dashboard. However, if there is no change in the files, the process will not be initiated and will not appear on the ingestion monitoring dashboard.
In order to use this function,
- the metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys "metadata" and "data" whose values are the metadata folder path and data folder path respectively, whichever is applicable.
- in case the data or metadata being updated has not been ingested before, an appropriate warning is shown, suggesting the use of the add_datasets function to add the data first.
Ex 1: Update both data and metadata file
metadata_folder_path = "repoid_1654268055800_files_test/metadata_2/"
data_folder_path = "repoid_1654268055800_files_test/data_2/"
repo_id = "1654268055800"
priority = "medium"
source_folder_path = {"metadata": metadata_folder_path, "data": data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, priority)
repoid_1654268055800_files_test/data_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct
repoid_1654268055800_files_test/metadata_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.json
Processing Metadata files: 100%|██████████| 1/1 [00:00<00:00, 357.57it/s]
Uploading data files: 100%|██████████| 1/1 [00:00<00:00, 6.36files/s]
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 combined_metadata.json File Uploaded
1 ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct File Uploaded
Ex 2: updating only data files
data_folder_path = "repoid_1654268055800_files_test/data"
repo_id= "1654268055800"
priority = "medium"
source_folder_path = {"data":data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, priority)
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 CCLE_Mutation_C3A_LIVER.gct File Uploaded
Delete data from OmixAtlas
Datasets ingested can be deleted from the OmixAtlas using the delete_datasets function. A list of dataset ids can be provided in order to delete multiple datasets. The status of the delete operation can be seen on the ingestion monitoring dashboard within 2 minutes after the function is run.
Please note that the user needs to have the relevant permissions/roles in order to delete datasets.
repo_id = "1643359804137"
dataset_ids = ["GSE100009_GPL11154", "GSE145009_GPL11124"]
omixatlas.delete_datasets(repo_id, dataset_ids)
With polly-python >= v0.3.0, while using the delete_datasets function, in case the datasets have been ingested at multiple file paths in the OmixAtlas, the user needs to pass the file path(s) from which the datasets should be deleted.
To fetch the list of file paths where a particular dataset has been ingested in the OmixAtlas, the get_all_file_paths function can be used.
repo_id = "1673847977346"
dataset_id = "GSE140509_GPL16791"
file_paths = omixatlas.get_all_file_paths(repo_id, dataset_id)
For deleting a file present at a specific folder path, use the dataset_file_path_dict parameter to pass a dictionary with the dataset_id as key and the list of folder paths from which the dataset_id's files are to be deleted as value. Users should provide the full paths, as shown in the example below.
Here the file GSE140509_GPL16791.gct in the folder path transcriptomics_209/ is being deleted for the dataset id GSE140509_GPL16791.
# dataset id present in multiple paths -> one of the paths passed for deletion
dataset_id = ["GSE140509_GPL16791"]
repo_id = "1673847977346"
dataset_file_path_dict = {"GSE140509_GPL16791":["transcriptomics_209/GSE140509_GPL16791.gct"]}
omixatlas.delete_datasets(repo_id, dataset_id, dataset_file_path_dict=dataset_file_path_dict)
            DatasetId                                     Message                                  Folder Path
0  GSE140509_GPL16791  Request Accepted. Dataset will be deleted    transcriptomics_209/GSE140509_GPL16791.gct
                       in the next version of OmixAtlas
Data ingestion monitoring dashboard
The data ingestion monitoring dashboard on the GUI allows users to monitor the progress of ingestion runs (add_datasets, update_datasets, delete_datasets). For each dataset undergoing ingestion (addition/update) or deletion, the logs are available there to be viewed and downloaded.
To know more about the ingestion monitoring dashboard, please refer to this section.
How and why to save data in workspace?
Workspaces allow users to download and save data from an analysis or an OmixAtlas so it can be reused, instead of being downloaded to the local system. Workspaces act as storage spaces with the additional capability of sharing and collaborating with other users. The workspace id can be fetched from the URL shown on opening a workspace on the GUI, and needs to be passed as an integer.
repo_id = "9"
dataset_id = "GSE107280_GPL11154"
workspace_id = 12345
workspace_path= "geo_GSE107280_GPL11154"
omixatlas.save_to_workspace(repo_id, dataset_id, workspace_id, workspace_path)
INFO:root:Data Saved to workspace=12345
{'data': {'type': 'workspace-jobs',
'id': '9/12345',
'attributes': {'destination-gct': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154_curated.gct',
'destination-json': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154.json'}}}
Import a file from workspace to Polly Notebooks
The files present in a workspace can be viewed from a notebook and also synced, so that the files in the workspace are available for use in the current analysis notebook. Please note that only those files that are present in the same workspace as the analysis/notebook can be synced.
# to list files in the folder "repoid_1654268055800_files_test/" in the current workspace
!polly files list --workspace-path "polly://repoid_1654268055800_files_test/" -y
# copy the files from the folder "repoid_1654268055800_files_test/" in the current workspace to the notebook under the folder "destination_repoid_1654268055800_files_test/"
!polly files sync -s "polly://repoid_1654268055800_files_test/" -d "destination_repoid_1654268055800_files_test/" -y