Working with Cohorts
The Cohort class provides functions to create cohorts, add or remove samples, merge the metadata and data matrix of the samples/datasets in a cohort, and edit or delete a cohort.
Parameters:
- token (str, default: None) – Authentication token from Polly.
Usage
from polly.cohort import Cohort
cohort = Cohort(token)
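The token is a credential, so it is usually better not to hard-code it in a notebook. A minimal sketch of one way to supply it, assuming the token has been exported in an environment variable; the variable name `POLLY_TOKEN` and the helper `read_token` are hypothetical and not part of the library:

```python
import os


def read_token(env_var="POLLY_TOKEN"):
    """Return the token stored in env_var (hypothetical name), or raise if unset or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token


# Usage (needs the polly package and a real token, so it is left commented out):
# from polly.cohort import Cohort
# cohort = Cohort(read_token())
```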
add_to_cohort
This function is used to add datasets or samples to a cohort.
Parameters:
- repo_key (str) – repo_key (repo_name or repo_id) of the OmixAtlas to which the datasets or samples belong.
- dataset_id (list / str, default: None) – list of dataset_ids (for repositories where one dataset has 1 sample) or a single dataset_id as a str (for repositories where 1 dataset has many samples).
- sample_id (list, default: None) – list of samples to be added to the cohort; applicable only for repositories where 1 dataset has many samples.
Returns:
- None – A message is displayed with the status of the operation.
Raises:
- InvalidParameterException – Empty or invalid parameters.
- InvalidCohortOperationException – This operation is not valid as no cohort has been instantiated.
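For repositories where one dataset has many samples, dataset_id is a single str, so adding several datasets takes one call per dataset. A minimal sketch of that loop, assuming `cohort` is a Cohort object on which a cohort has already been created or loaded; the helper `add_datasets` is hypothetical, and the keyword name `dataset_id` is taken from the parameter list above:

```python
def add_datasets(cohort, repo_key, dataset_ids):
    """Add each dataset_id with its own add_to_cohort call.

    Returns a list of (dataset_id, error) pairs for the calls that raised,
    so that one bad id does not stop the remaining ids from being added.
    """
    failed = []
    for dataset_id in dataset_ids:
        try:
            cohort.add_to_cohort(repo_key, dataset_id=dataset_id)
        except Exception as err:  # for example InvalidParameterException
            failed.append((dataset_id, err))
    return failed
```

Collecting failures rather than stopping at the first one is a design choice of this sketch; re-raise instead if a partial cohort is not acceptable.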
create_cohort
This function is used to create a cohort. Call it after instantiating a Cohort object.
Parameters:
- local_path (str) – local path to instantiate the cohort.
- cohort_name (str) – identifier name for the cohort.
- description (str) – description of the cohort.
- repo_key (str, default: None) – repo_key (repo_name or repo_id) of the OmixAtlas from which datasets or samples are to be added.
- dataset_id (list / str, default: None) – list of dataset_ids (for repositories where one dataset has 1 sample) or a single dataset_id as a str (for repositories where 1 dataset has many samples).
- sample_id (list, default: None) – list of samples to be added to the cohort; applicable only for repositories where 1 dataset has many samples.
Returns:
- None – A message is displayed with the status of the operation.
Raises:
- InvalidParameterException – Empty or invalid parameters.
- InvalidCohortNameException – The cohort_name does not represent a valid cohort name.
- InvalidPathException – Provided path does not represent a file or a directory.
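Since create_cohort rejects a path that is neither a file nor a directory, it can help to check the target directory up front and fail with a plain Python error. A minimal sketch, assuming `cohort` is a Cohort object and that the optional arguments are passed by the keyword names listed above; the wrapper `create_cohort_checked` is hypothetical:

```python
import os


def create_cohort_checked(cohort, local_path, cohort_name, description, **optional):
    """Create a cohort only if local_path is an existing directory.

    `optional` may carry repo_key, dataset_id and sample_id, which are
    forwarded unchanged to create_cohort.
    """
    if not os.path.isdir(local_path):
        raise NotADirectoryError(f"{local_path} is not an existing directory")
    cohort.create_cohort(local_path, cohort_name, description, **optional)
```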
create_merged_gct
This function is used to merge all the gct files in a cohort into a single gct file.
Parameters:
- file_path (str) – the system path where the gct file is to be written.
- file_name (str, default: '') – identifier for the merged file name; the cohort name is used by default.
delete_cohort
This function is used to delete a cohort.
Returns:
- None – A confirmation message on deletion of the cohort.
edit_cohort
This function is used to edit cohort-level metadata such as the cohort name and description. At least one of the arguments must be provided.
Parameters:
- new_cohort_name (str) – new identifier name for the cohort.
- new_description (str) – new description of the cohort.
Returns:
- message – A confirmation message on update of the cohort.
Raises:
- InvalidCohortOperationException – This operation is not valid as no cohort has been instantiated.
- CohortEditException – No parameter specified for editing in cohort.
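The "at least one argument" rule can be enforced before the call is made. A minimal sketch, assuming edit_cohort accepts `new_cohort_name` and `new_description` as keyword arguments and that either one may be omitted; the wrapper `edit_cohort_fields` is hypothetical:

```python
def edit_cohort_fields(cohort, new_cohort_name=None, new_description=None):
    """Forward only the fields that were actually supplied to edit_cohort."""
    changes = {}
    if new_cohort_name:
        changes["new_cohort_name"] = new_cohort_name
    if new_description:
        changes["new_description"] = new_description
    if not changes:
        # Mirrors the documented CohortEditException case with a built-in error.
        raise ValueError("Provide new_cohort_name, new_description, or both")
    return cohort.edit_cohort(**changes)
```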
is_valid
This function is used to check the validity of a cohort.
Returns:
- bool – A boolean result based on the validity of the cohort.
Raises:
- InvalidPathException – Cohort path does not represent a file or a directory.
- InvalidCohortOperationException – This operation is not valid as no cohort has been instantiated.
load_cohort
This function loads an existing cohort into an object. Once loaded, the functions described in this documentation can be used on that object.
Parameters:
- local_path (str) – local path of the cohort.
Returns:
- None – A confirmation message on instantiation of the cohort.
Raises:
- InvalidPathException – This path does not represent a file or a directory.
- InvalidCohortPathException – This path does not represent a Cohort.
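Loading pairs naturally with is_valid: load the cohort, then confirm it passes the validity check before working with it. A minimal sketch, assuming `cohort` is a Cohort object; the helper `load_valid_cohort` is hypothetical:

```python
def load_valid_cohort(cohort, local_path):
    """Load the cohort at local_path and raise if is_valid() returns False."""
    cohort.load_cohort(local_path)
    if not cohort.is_valid():
        raise ValueError(f"Cohort at {local_path} failed the validity check")
    return cohort
```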
merge_data
This function merges the metadata (dataset, sample and feature level) or the data-matrix of all the samples/datasets in the cohort.
Parameters:
- data_level (str) – identifier to specify the data to be merged: "dataset", "sample", "feature" or "data_matrix".
Returns:
- Dataframe – A pandas dataframe containing the merged data, ready for analysis.
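When more than one level is needed, the four data_level values can be merged in a single pass. A minimal sketch, assuming `cohort` is a Cohort object with data already added; the helper `merge_levels` and the constant `DATA_LEVELS` are hypothetical, while the level names come from the parameter description above:

```python
DATA_LEVELS = ("dataset", "sample", "feature", "data_matrix")


def merge_levels(cohort, levels=DATA_LEVELS):
    """Return a dict mapping each requested level to the result of merge_data(level)."""
    unknown = [level for level in levels if level not in DATA_LEVELS]
    if unknown:
        raise ValueError(f"Unknown data_level value(s): {unknown}")
    return {level: cohort.merge_data(level) for level in levels}
```

For example, `merge_levels(cohort, ["sample", "data_matrix"])` would give the sample metadata and the data matrix keyed by level name.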
remove_from_cohort
This function is used for removing datasets or samples from a cohort.
Parameters:
- dataset_id (list / str, default: None) – list of dataset_ids (for repositories where one dataset has 1 sample) or a single dataset_id as a str (for repositories where 1 dataset has many samples).
- sample_id (list, default: []) – list of samples to be removed from the cohort; applicable only for repositories where 1 dataset has many samples.
Returns:
- None – A message is displayed with the status of the operation.
Raises:
- InvalidParameterException – Empty or invalid parameters.
- InvalidCohortOperationException – This operation is not valid as no cohort has been instantiated.
summarize_cohort
This function returns the cohort-level metadata and a dataframe listing the datasets or samples added to the cohort.
Returns:
- Tuple – A tuple whose first value is the cohort metadata (name, description and number of dataset(s) or sample(s) in the cohort) and whose second value is a dataframe containing the source, dataset_id/sample_id and data type available in the cohort.
Raises:
- InvalidCohortOperationException – This operation is not valid as no cohort has been instantiated.
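Because the return value is a two-element tuple, it can be unpacked directly. A minimal sketch, assuming `cohort` is a Cohort object with a cohort loaded; the helper `cohort_overview` is hypothetical and relies only on `len()` of the second element, which for a pandas dataframe is its number of rows:

```python
def cohort_overview(cohort):
    """Unpack summarize_cohort() into the metadata and a row count of its contents."""
    metadata, contents = cohort.summarize_cohort()
    return {"metadata": metadata, "n_rows": len(contents)}
```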
Examples
In TCGA
Query execution succeeded (time taken: 2.13 seconds, data scanned: 0.244 MB)
Fetched 123 rows
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","tcga_data","Proteomics datasets","tcga", dataset_ids)
INFO:root:Cohort Created !
Initializing process...
Verifying Data: 100%|██████████| 123/123 [00:11<00:00, 10.71it/s]
Adding data to cohort: 100%|██████████| 123/123 [00:14<00:00, 8.72it/s]
Adding metadata to cohort: 100%|██████████| 123/123 [00:11<00:00, 10.25it/s]
INFO:root:'123' dataset/s added to Cohort!
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())
In GEO
query = <someSQLquery>
results = omixatlas.query_metadata(query)
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","geo_data","Transcriptomics datasets","geo", dataset_ids[0])
for i in dataset_ids[1:]:
    cohort1.add_to_cohort("geo", i)
INFO:root:Cohort Created !
Initializing process...
Adding data to cohort...
Adding metadata to cohort...
INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE120746_GPL18573.gct
INFO:root:'18' sample/s added to Cohort!
Initializing process...
Adding data to cohort...
Adding metadata to cohort...
INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE62642_GPL16791.gct
INFO:root:'14' sample/s added to Cohort!
Initializing process...
Adding data to cohort...
Adding metadata to cohort...
INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE68719_GPL11154.gct
INFO:root:'73' sample/s added to Cohort!
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())