Working with Cohorts

The Cohort class provides functions to create cohorts, add or remove samples, merge the metadata and data matrix of the samples or datasets in a cohort, and edit or delete a cohort.

Parameters:

  • token (str, default: None ) –

    Authentication token from Polly.

Usage

from polly.cohort import Cohort

cohort = Cohort(token)

add_to_cohort

add_to_cohort(repo_key, dataset_id=None, sample_id=None)

This function is used to add datasets or samples to a cohort.

Parameters:

  • repo_key (str) –

    repo_key (repo_name or repo_id) of the omixatlas to which the datasets or samples belong.

  • dataset_id (list / str, default: None ) –

    A list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id string (for repositories where one dataset has many samples).

  • sample_id (list, default: None ) –

    List of samples to be added to the cohort; applicable only to repositories where one dataset has many samples.

Returns:

  • None

    A message will be displayed on the status of the operation.

Raises:

  • InvalidParameterException

    Empty or Invalid Parameters.

  • InvalidCohortOperationException

    This operation is not valid as no cohort has been instantiated.
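
For repositories where one dataset holds many samples, `dataset_id` is a single string, so adding several datasets means one call per dataset. A minimal sketch of that loop follows. `add_datasets_one_by_one` is a hypothetical helper, not part of the library, and it catches `Exception` broadly only because this page does not say where the documented exception classes are imported from.

```python
def add_datasets_one_by_one(cohort, repo_key, dataset_ids):
    """Call cohort.add_to_cohort once per dataset id.

    `cohort` is assumed to be a Cohort object on which create_cohort or
    load_cohort has already been called. Returns (added, failed), where
    `failed` maps each dataset id to the error message it produced.
    """
    added, failed = [], {}
    for dataset_id in dataset_ids:
        try:
            cohort.add_to_cohort(repo_key, dataset_id=dataset_id)
        except Exception as exc:  # e.g. InvalidParameterException
            failed[dataset_id] = str(exc)
        else:
            added.append(dataset_id)
    return added, failed


# Usage, assuming `cohort` already points at a cohort built from "geo":
# added, failed = add_datasets_one_by_one(cohort, "geo", dataset_ids[1:])
```

Collecting failures instead of stopping at the first one lets the remaining datasets still be added; whether that is preferable to failing fast depends on the analysis.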

create_cohort

create_cohort(local_path, cohort_name, description, repo_key=None, dataset_id=None, sample_id=None)

This function is used to create a cohort. Call it on a Cohort object after the object has been created.

Parameters:

  • local_path (str) –

    local path to instantiate the cohort.

  • cohort_name (str) –

    identifier name for the cohort.

  • description (str) –

    description about the cohort.

  • repo_key (str, default: None ) –

    repo_key (repo_name or repo_id) of the omixatlas from which the datasets or samples are to be added.

  • dataset_id (list / str, default: None ) –

    A list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id string (for repositories where one dataset has many samples).

  • sample_id (list, default: None ) –

    List of samples to be added to the cohort; applicable only to repositories where one dataset has many samples.

Returns:

  • None

    A message will be displayed on the status of the operation.

Raises:

  • InvalidParameterException

    Empty or Invalid Parameters

  • InvalidCohortNameException

    The cohort_name does not represent a valid cohort name.

  • InvalidPathException

    Provided path does not represent a file or a directory.
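
The two accepted shapes of `dataset_id` are easy to mix up, so a small client-side check before calling `create_cohort` can catch mistakes early. The sketch below encodes only the rules stated in the parameter descriptions above. `check_cohort_ids` is a hypothetical helper, and the library's own validation may be stricter or looser than this.

```python
def check_cohort_ids(dataset_id, sample_id=None):
    """Check dataset_id / sample_id against the documented shapes.

    Returns "one_sample_per_dataset" for a list of dataset ids, or
    "many_samples_per_dataset" for a single dataset id string.
    Raises ValueError for any other combination.
    """
    if isinstance(dataset_id, list):
        if sample_id:
            raise ValueError("sample_id applies only to a single dataset_id string")
        if not dataset_id or not all(isinstance(d, str) for d in dataset_id):
            raise ValueError("dataset_id list must be non-empty and contain strings")
        return "one_sample_per_dataset"
    if isinstance(dataset_id, str) and dataset_id:
        if sample_id is not None and not isinstance(sample_id, list):
            raise ValueError("sample_id must be a list of sample ids")
        return "many_samples_per_dataset"
    raise ValueError("dataset_id must be a list of ids or a single id string")


# Usage before creating a cohort (all argument values are placeholders):
# check_cohort_ids(dataset_ids)
# cohort.create_cohort(local_path, cohort_name, description, repo_key, dataset_ids)
```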

create_merged_gct

create_merged_gct(file_path, file_name='')

This function is used to merge all the gct files in a cohort into a single gct file.

Parameters:

  • file_path (str) –

    the system path where the gct file is to be written.

  • file_name (str, default: '' ) –

    Identifier for the merged file name, cohort name will be used by default.
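
As an illustration, the call can be wrapped so that the output location exists before the merged file is written. This sketch assumes `file_path` names a directory, which this page does not state explicitly; `export_merged_gct` is a hypothetical helper.

```python
import os


def export_merged_gct(cohort, out_dir, file_name=""):
    """Create out_dir if needed, then ask the cohort to write the merged GCT.

    Assumes `out_dir` is a directory path. With file_name left as "",
    the cohort name is used for the merged file, as documented above.
    """
    os.makedirs(out_dir, exist_ok=True)
    cohort.create_merged_gct(out_dir, file_name)
    return out_dir


# Usage, assuming `cohort` already holds datasets:
# export_merged_gct(cohort, "/import/merged")
```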

delete_cohort

delete_cohort()

This function is used to delete a cohort.

Returns:

  • None

    A confirmation message on deletion of the cohort.

edit_cohort

edit_cohort(new_cohort_name=None, new_description=None)

This function is used to edit cohort-level metadata such as the cohort name and description. At least one of the arguments must be provided.

Parameters:

  • new_cohort_name (str, default: None ) –

    new identifier name for the cohort.

  • new_description (str, default: None ) –

    new description about the cohort.

Returns:

  • message

    A confirmation message on update of the cohort.

Raises:

  • InvalidCohortOperationException

    This operation is not valid as no cohort has been instantiated.

  • CohortEditException

    No parameter specified for editing in cohort
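
Because at least one argument is required, a thin wrapper can forward only the fields that were actually supplied and fail early when none were. This is a sketch; `edit_fields` is a hypothetical helper, and it raises `ValueError` rather than the library's `CohortEditException`.

```python
def edit_fields(cohort, new_name=None, new_description=None):
    """Forward only the supplied fields to cohort.edit_cohort."""
    changes = {}
    if new_name is not None:
        changes["new_cohort_name"] = new_name
    if new_description is not None:
        changes["new_description"] = new_description
    if not changes:
        raise ValueError("supply new_name, new_description, or both")
    return cohort.edit_cohort(**changes)


# Usage, assuming `cohort` has been created or loaded:
# edit_fields(cohort, new_description="Updated description")
```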

is_valid

is_valid()

This function is used to check the validity of a cohort.

Returns:

  • bool

    A boolean result based on the validity of the cohort.

Raises:

  • InvalidPathException

    Cohort path does not represent a file or a directory.

  • InvalidCohortOperationException

    This operation is not valid as no cohort has been instantiated.

load_cohort

load_cohort(local_path)

Function to load an existing cohort into an object. Once the cohort is loaded, the functions described in this documentation can be used on that object.

Parameters:

  • local_path (str) –

    local path of the cohort.

Returns:

  • None

    A confirmation message on instantiation of the cohort.

Raises:

  • InvalidPathException

    This path does not represent a file or a directory.

  • InvalidCohortPathException

    This path does not represent a Cohort.
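
Loading and validating are often done together when reopening a cohort from an earlier session. A minimal sketch, with `load_and_check` as a hypothetical helper:

```python
def load_and_check(cohort, local_path):
    """Load the cohort stored at local_path, then report its validity.

    `cohort` is assumed to be a freshly constructed Cohort object.
    Any exception raised by load_cohort or is_valid propagates unchanged.
    """
    cohort.load_cohort(local_path)
    return cohort.is_valid()


# Usage (the path is a placeholder for wherever the cohort was created):
# if load_and_check(cohort, local_path):
#     summary = cohort.summarize_cohort()
```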

merge_data

merge_data(data_level)

Function to merge the metadata (dataset, sample and feature level) or the data matrix of all the samples/datasets in the cohort.

Parameters:

  • data_level (str) –

    identifier to specify the data to be merged - "dataset", "sample", "feature" or "data_matrix"

Returns:

  • Dataframe

    A pandas dataframe containing the merged data which is ready for analysis
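
When more than one level is needed, the four documented `data_level` values can be looped over in one pass. The sketch below rejects unknown levels before making any call; `merge_levels` and `DATA_LEVELS` are hypothetical names introduced for illustration.

```python
# The four data_level values listed in the parameter description above.
DATA_LEVELS = ("dataset", "sample", "feature", "data_matrix")


def merge_levels(cohort, levels=DATA_LEVELS):
    """Return {level: merged result} for each requested data level."""
    unknown = [lvl for lvl in levels if lvl not in DATA_LEVELS]
    if unknown:
        raise ValueError(f"unknown data_level(s): {unknown}")
    return {lvl: cohort.merge_data(lvl) for lvl in levels}


# Usage, assuming `cohort` already holds datasets:
# merged = merge_levels(cohort, ["sample", "data_matrix"])
# merged["sample"].head()
```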

remove_from_cohort

remove_from_cohort(dataset_id=None, sample_id=[])

This function is used for removing datasets or samples from a cohort.

Parameters:

  • dataset_id (list / str, default: None ) –

    A list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id string (for repositories where one dataset has many samples).

  • sample_id (list, default: [] ) –

    List of samples to be removed from the cohort; applicable only to repositories where one dataset has many samples.

Returns:

  • None

    A message will be displayed on the status of the operation.

Raises:

  • InvalidParameterException

    Empty or Invalid Parameters

  • InvalidCohortOperationException

    This operation is not valid as no cohort has been instantiated.

summarize_cohort

summarize_cohort()

Function to return cohort-level metadata and a dataframe of the datasets or samples added to the cohort.

Returns:

  • Tuple

    A tuple with the first value as cohort metadata information (name, description and number of dataset(s) or sample(s) in the cohort) and the second value as dataframe containing the source, dataset_id/sample_id and data type available in the cohort.

Raises:

  • InvalidCohortOperationException

    This operation is not valid as no cohort has been instantiated.
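
Since the return value is a two-element tuple, it is convenient to unpack it into named parts. A small sketch, with `describe_cohort` as a hypothetical helper:

```python
def describe_cohort(cohort):
    """Split summarize_cohort() into its two documented parts."""
    metadata, contents = cohort.summarize_cohort()
    return {
        "metadata": metadata,     # name, description, dataset/sample count
        "contents": contents,     # dataframe: source, ids, data type
        "n_rows": len(contents),  # number of rows in the contents dataframe
    }


# Usage, assuming `cohort` has been created or loaded:
# info = describe_cohort(cohort)
# info["contents"].head()
```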

Examples

In TCGA

# Assumes cohort1 = Cohort(token) and that `omixatlas` is a query object created beforehand.
query = <someSQLquery>
results = omixatlas.query_metadata(query)
Query execution succeeded (time taken: 2.13 seconds, data scanned: 0.244 MB)
Fetched 123 rows
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","tcga_data","Proteomics datasets","tcga", dataset_ids)
INFO:root:Cohort Created !


Initializing process...


Verifying Data: 100%|██████████| 123/123 [00:11<00:00, 10.71it/s]
Adding data to cohort: 100%|██████████| 123/123 [00:14<00:00,  8.72it/s]
Adding metadata to cohort: 100%|██████████| 123/123 [00:11<00:00, 10.25it/s]
INFO:root:'123' dataset/s added to Cohort!
dataset_metadata = cohort1.merge_data("dataset")
display(dataset_metadata.head())
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())
df_real = cohort1.merge_data("data_matrix")
print("\nData Matrix")
display(df_real.head())

In GEO

# Assumes cohort1 = Cohort(token) and that `omixatlas` is a query object created beforehand.
query = <someSQLquery>
results = omixatlas.query_metadata(query)
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","geo_data","Transcriptomics datasets","geo", dataset_ids[0])

for i in dataset_ids[1:]:
    cohort1.add_to_cohort("geo", i)
INFO:root:Cohort Created !


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE120746_GPL18573.gct
INFO:root:'18' sample/s added to Cohort!


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE62642_GPL16791.gct
INFO:root:'14' sample/s added to Cohort!


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE68719_GPL11154.gct
INFO:root:'73' sample/s added to Cohort!
dataset_metadata = cohort1.merge_data("dataset")
display(dataset_metadata.head())
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())
df_real = cohort1.merge_data("data_matrix")
print("\nData Matrix")
display(df_real.head())

Tutorial Notebooks

  1. Creating Multiple Cohorts in TCGA

  2. Proteomics Data Analysis in TCGA using Cohorts