Working with Cohorts

The Cohort class contains functions to create cohorts, add or remove samples, merge the metadata and data matrix of the samples/datasets in a cohort, and edit or delete a cohort.

Args:
  token (str): Authentication token from Polly.

Usage:

from polly.cohort import Cohort

cohort = Cohort(token)

add_to_cohort(repo_key, dataset_id=None, sample_id=None)

This function is used to add datasets or samples to a cohort.

Parameters:

Name Type Description Default
repo_key str

repo_key (repo_name or repo_id) of the OmixAtlas to which the datasets or samples belong.

required
dataset_id list / str

A list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id string (for repositories where one dataset has many samples).

None
sample_id list

List of samples to be added to the cohort; applicable only for repositories where one dataset has many samples.

None

Returns:

Type Description
None

A message will be displayed on the status of the operation.

Raises:
  InvalidParameterException: Empty or Invalid Parameters.
  InvalidCohortOperationException: This operation is not valid as no cohort has been instantiated.
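
The two argument shapes described above (a list of dataset IDs, or one dataset ID string with an optional sample list) can be checked before calling add_to_cohort. The following is a minimal sketch under that reading of the parameter table. `build_add_args` is a hypothetical helper, not part of polly-python, and the library's own validation may differ.

```python
from typing import Any, Dict, List, Optional, Union


def build_add_args(
    repo_key: str,
    dataset_id: Union[str, List[str], None] = None,
    sample_id: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Check the documented argument shapes and return keyword arguments.

    Hypothetical helper, not part of polly-python.
    """
    if not isinstance(repo_key, str) or not repo_key:
        raise ValueError("repo_key must be a non-empty string")

    if isinstance(dataset_id, list):
        # A list of dataset IDs: repositories where one dataset has one sample.
        if not dataset_id:
            raise ValueError("dataset_id list must not be empty")
        if sample_id:
            raise ValueError("sample_id applies only with a single dataset_id string")
        return {"repo_key": repo_key, "dataset_id": dataset_id}

    if isinstance(dataset_id, str) and dataset_id:
        # A single dataset ID: repositories where one dataset has many samples.
        kwargs: Dict[str, Any] = {"repo_key": repo_key, "dataset_id": dataset_id}
        if sample_id is not None:
            if not isinstance(sample_id, list) or not sample_id:
                raise ValueError("sample_id must be a non-empty list")
            kwargs["sample_id"] = sample_id
        return kwargs

    raise ValueError("dataset_id must be a non-empty string or list")
```

With an instantiated cohort object, the result could then be passed along as `cohort.add_to_cohort(**build_add_args(repo_key, dataset_id, sample_id))`.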

create_cohort(local_path, cohort_name, description, repo_key=None, dataset_id=None, sample_id=None)

This function is used to create a cohort. It can be called once a Cohort object has been created.

Parameters:

Name Type Description Default
local_path str

local path to instantiate the cohort.

required
cohort_name str

identifier name for the cohort.

required
description str

description about the cohort.

required
repo_key str

repo_key (repo_name or repo_id) of the OmixAtlas from which datasets or samples are to be added.

None
dataset_id list / str

A list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id string (for repositories where one dataset has many samples).

None
sample_id list

List of samples to be added to the cohort; applicable only for repositories where one dataset has many samples.

None

Returns:

Type Description
None

A message will be displayed on the status of the operation.

Raises:

Type Description
InvalidParameterException

Empty or Invalid Parameters

InvalidCohortNameException

The cohort_name does not represent a valid cohort name.

InvalidPathException

Provided path does not represent a file or a directory.

create_merged_gct(file_path, file_name='')

This function is used to merge all the gct files in a cohort into a single gct file.

Parameters:

Name Type Description Default
file_path str

the system path where the gct file is to be written.

required
file_name str

Name for the merged file; the cohort name is used by default.

''

delete_cohort()

This function is used to delete a cohort.

Returns:
        A confirmation message on deletion of the cohort.

edit_cohort(new_cohort_name=None, new_description=None)

This function is used to edit cohort-level metadata such as the cohort name and description. At least one of the arguments must be provided.

Args:
        new_cohort_name (str): new identifier name for the cohort.
        new_description (str): new description about the cohort.

Returns:
        A confirmation message on update of the cohort.

Raises:
        InvalidCohortOperationException: This operation is not valid as no cohort has been instantiated.
        CohortEditException: No parameter specified for editing in cohort
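
The "at least one argument" rule can be enforced on the caller's side before edit_cohort is invoked. This is a minimal sketch. `build_edit_args` is a hypothetical helper, and it raises a plain ValueError rather than the library's CohortEditException.

```python
from typing import Dict, Optional


def build_edit_args(
    new_cohort_name: Optional[str] = None,
    new_description: Optional[str] = None,
) -> Dict[str, str]:
    """Return only the fields that were supplied; fail if none were.

    Hypothetical helper, not part of polly-python.
    """
    args: Dict[str, str] = {}
    if new_cohort_name is not None:
        args["new_cohort_name"] = new_cohort_name
    if new_description is not None:
        args["new_description"] = new_description
    if not args:
        raise ValueError("specify new_cohort_name, new_description, or both")
    return args
```

The returned dictionary can be unpacked into the call, for example `cohort.edit_cohort(**build_edit_args(new_description="Updated description"))`.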

is_valid()

This function is used to check the validity of a cohort.

Returns:

Type Description
bool

A boolean result based on the validity of the cohort.

Raises:

Type Description
InvalidPathException

Cohort path does not represent a file or a directory.

InvalidCohortOperationException

This operation is not valid as no cohort has been instantiated.

load_cohort(local_path)

Function to load an existing cohort into an object. Once the cohort is loaded, the other functions described in this documentation can be called on that object.

Parameters:

Name Type Description Default
local_path str

local path of the cohort.

required

Returns:

Type Description

A confirmation message on instantiation of the cohort.

Raises:
  InvalidPathException: This path does not represent a file or a directory.
  InvalidCohortPathException: This path does not represent a Cohort.
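
Both failure modes of load_cohort concern the path, so a quick local check can catch mistakes early. The sketch below is a hypothetical helper, not part of polly-python. It assumes cohorts are stored in a location ending in `.pco`, which is inferred only from the log paths in the Examples section (such as `/import/geo_data.pco/`) and may not hold in general.

```python
from pathlib import Path


def check_cohort_path(local_path: str) -> Path:
    """Return the path if it exists and looks like a cohort location.

    Assumption: cohort locations end in '.pco' (inferred from example logs).
    """
    path = Path(local_path)
    if not path.exists():
        raise FileNotFoundError(f"{local_path} does not exist")
    if path.suffix != ".pco":
        raise ValueError(f"{local_path} does not look like a cohort (.pco) path")
    return path
```

A checked path can then be handed to the library as `cohort.load_cohort(str(check_cohort_path(local_path)))`.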

merge_data(data_level)

Function to merge the metadata (dataset, sample and feature level) or the data matrix of all the samples/datasets in the cohort.

Parameters:

Name Type Description Default
data_level str

identifier to specify the data to be merged - "dataset", "sample", "feature" or "data_matrix"

required

Returns:

Type Description

A pandas dataframe containing the merged data which is ready for analysis
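
Since data_level accepts only four documented values, it can help to validate the value and collect several merges in one call. The sketch below assumes only that the object passed in exposes `merge_data(data_level)` as documented above. `merge_levels` and `DATA_LEVELS` are hypothetical names, not part of polly-python.

```python
from typing import Any, Dict

# The four data_level values listed in the parameter table above.
DATA_LEVELS = ("dataset", "sample", "feature", "data_matrix")


def merge_levels(cohort: Any, *levels: str) -> Dict[str, Any]:
    """Call cohort.merge_data once per level and return the results by level.

    With no levels given, all four documented levels are merged.
    """
    unknown = [level for level in levels if level not in DATA_LEVELS]
    if unknown:
        raise ValueError(f"unsupported data_level(s): {unknown}")
    selected = levels or DATA_LEVELS
    return {level: cohort.merge_data(level) for level in selected}
```

For example, `merged = merge_levels(cohort, "sample", "data_matrix")` would give `merged["sample"]` and `merged["data_matrix"]` as the two merged dataframes.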

remove_from_cohort(dataset_id=None, sample_id=[])

This function is used for removing datasets or samples from a cohort.

Parameters:

Name Type Description Default
dataset_id list / str

A list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id string (for repositories where one dataset has many samples).

None
sample_id list

List of samples to be removed from the cohort; applicable only for repositories where one dataset has many samples.

[]

Returns:

Type Description
None

A message will be displayed on the status of the operation.

Raises:

Type Description
InvalidParameterException

Empty or Invalid Parameters

InvalidCohortOperationException

This operation is not valid as no cohort has been instantiated.

summarize_cohort()

Function to return cohort level metadata and dataframe with datasets or samples added in the cohort.

Returns:

Type Description

A tuple whose first value is the cohort metadata (name, description and number of datasets or samples in the cohort) and whose second value is a dataframe containing the source, dataset_id/sample_id and data type available in the cohort.

Raises:
  InvalidCohortOperationException: This operation is not valid as no cohort has been instantiated.
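
The two-element return value can be unpacked directly, and is_valid offers a cheap check beforehand. The following is a minimal sketch that relies only on the documented `is_valid()` and `summarize_cohort()` methods. `summarize_if_valid` is a hypothetical helper, and the RuntimeError is a stand-in chosen for this example.

```python
from typing import Any, Tuple


def summarize_if_valid(cohort: Any) -> Tuple[Any, Any]:
    """Return (metadata, contents) from summarize_cohort for a valid cohort.

    Hypothetical helper, not part of polly-python.
    """
    if not cohort.is_valid():
        raise RuntimeError("cohort failed its validity check")
    # First value: cohort metadata; second value: dataframe of cohort contents.
    metadata, contents = cohort.summarize_cohort()
    return metadata, contents
```

Typical use would be `metadata, contents = summarize_if_valid(cohort)`, after which `contents` can be inspected like any other dataframe.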

Examples

In TCGA

query = <someSQLquery>
results=omixatlas.query_metadata(query)
Query execution succeeded (time taken: 2.13 seconds, data scanned: 0.244 MB)
Fetched 123 rows
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","tcga_data","Proteomics datasets","tcga", dataset_ids)
INFO:root:Cohort Created !


Initializing process...


Verifying Data: 100%|██████████| 123/123 [00:11<00:00, 10.71it/s]
Adding data to cohort: 100%|██████████| 123/123 [00:14<00:00,  8.72it/s]
Adding metadata to cohort: 100%|██████████| 123/123 [00:11<00:00, 10.25it/s]
INFO:root:'123' dataset/s added to Cohort!
dataset_metadata = cohort1.merge_data("dataset")
display(dataset_metadata.head())
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())
df_real = cohort1.merge_data("data_matrix")
print("\nData Matrix")
display(df_real.head())

In GEO

query = <someSQLquery>
results = omixatlas.query_metadata(query)
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","geo_data","Transcriptomics datasets","geo", dataset_ids[0])

for i in dataset_ids[1:]:
    cohort1.add_to_cohort("geo", i)
INFO:root:Cohort Created !


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE120746_GPL18573.gct
INFO:root:'18' sample/s added to Cohort!


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE62642_GPL16791.gct
INFO:root:'14' sample/s added to Cohort!


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE68719_GPL11154.gct
INFO:root:'73' sample/s added to Cohort!
dataset_metadata = cohort1.merge_data("dataset")
display(dataset_metadata.head())
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())
df_real = cohort1.merge_data("data_matrix")
print("\nData Matrix")
display(df_real.head())
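
The GEO pattern above (create the cohort with the first dataset, then add the remaining datasets one at a time) can be wrapped in a small function. This is a sketch that uses only the create_cohort and add_to_cohort calls shown in the example. `create_cohort_from_datasets` is a hypothetical helper, not part of polly-python.

```python
from typing import Any, Sequence


def create_cohort_from_datasets(
    cohort: Any,
    local_path: str,
    cohort_name: str,
    description: str,
    repo_key: str,
    dataset_ids: Sequence[str],
) -> int:
    """Create a cohort from the first dataset ID, then add the rest one by one.

    Returns the number of dataset IDs submitted.
    """
    if not dataset_ids:
        raise ValueError("dataset_ids must contain at least one dataset ID")
    first, rest = dataset_ids[0], dataset_ids[1:]
    cohort.create_cohort(local_path, cohort_name, description, repo_key, first)
    for dataset_id in rest:
        cohort.add_to_cohort(repo_key, dataset_id)
    return len(dataset_ids)
```

With the variables from the GEO example, the call would read `create_cohort_from_datasets(cohort1, "/import", "geo_data", "Transcriptomics datasets", "geo", dataset_ids)`.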

Tutorial Notebooks

  1. Creating Multiple Cohorts in TCGA

  2. Proteomics Data Analysis in TCGA using Cohorts