
Data Curation

The Curation class contains wrapper functions around the models used for semantic annotation of strings/text.

Args:
    token (str): Token copied from Polly.


from polly.curation import Curation

curationObj = Curation(token)


annotate_with_ontology(text)

Tag a given piece of text. A "tag" is an ontology term from one of the ontologies supported by Polly. This function calls recognise_entity followed by normalize, so it both identifies the entities in a text and maps them to ontology terms. Each tag recognised in the text contains the name (the word identified in the text), the entity_type and the ontology_id.


Args:
    text (str): Input text.

Returns:
    Set of unique tags.
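
Because each tag carries a name, an ontology_id and an entity_type, the result is easy to regroup on the client side. The sketch below is only illustrative: it uses a local namedtuple called Tag as a stand-in for the objects the library returns (an assumption; the real class may differ), filled with the values from the usage examples later on this page. The helper group_by_entity_type is hypothetical and not part of polly-python.

```python
from collections import defaultdict, namedtuple

# Stand-in for the tag objects returned by annotate_with_ontology (assumption).
Tag = namedtuple("Tag", ["name", "ontology_id", "entity_type"])


def group_by_entity_type(tags):
    """Map each entity_type to the sorted ontology ids found for it."""
    grouped = defaultdict(set)
    for tag in tags:
        grouped[tag.entity_type].add(tag.ontology_id)
    return {entity_type: sorted(ids) for entity_type, ids in grouped.items()}


# Values copied from the example output on this page.
tags = [
    Tag("Mus musculus", "NCBI:txid10090", "species"),
    Tag("Adenocarcinoma", "MESH:D000230", "disease"),
]
grouped = group_by_entity_type(tags)
print(grouped)
```

Grouping into sets first keeps the result free of duplicates even if the same term is tagged more than once.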

assign_clinical_labels(repo_name, dataset_ids, sample_ids=None)

Returns a list of clinical or non clinical labels for the given datasets or samples.


Args:
    repo_name (str): Name of the repository from which the datasets are fetched.
    dataset_ids (List[str]): Dataset ids to be used for inference.

Other Parameters:
    sample_ids (List[str]): Optional. Sample ids, if needed.

Raises:
    API response exception
    Invalid parameters

Returns:
    Dataframe which is a list of clinical tags for the given ids.

assign_control_pert_labels(sample_metadata, columns_to_exclude=None)

Returns the sample metadata dataframe with 2 additional columns:

    is_control - whether the sample is a control sample
    control_prob - the probability that the sample is a control

Args:
    sample_metadata (DataFrame): Metadata table.
    columns_to_exclude (Set[str]): Any columns which don't play any role in determining the label, e.g. sample id.

Returns:
    DataFrame: Input data frame with 2 additional columns.

Raises:
    requestException: Invalid Request
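
Since control_prob is a probability, a caller may want a stricter cut-off than the is_control flag alone. The following is a minimal sketch under stated assumptions: the rows are copied from the example output later on this page, written as plain dictionaries in the shape that pandas produces with DataFrame.to_dict("records"). The helper split_controls and its min_prob argument are hypothetical, not part of the library.

```python
def split_controls(records, min_prob=0.5):
    """Split sample records into (controls, others) using control_prob."""
    controls = [r for r in records if r["control_prob"] >= min_prob]
    others = [r for r in records if r["control_prob"] < min_prob]
    return controls, others


# Rows copied from the example output on this page.
records = [
    {"sample_id": 1, "disease": "control1", "is_control": True, "control_prob": 1.00},
    {"sample_id": 2, "disease": "ctrl2", "is_control": True, "control_prob": 1.00},
    {"sample_id": 3, "disease": "healthy", "is_control": True, "control_prob": 0.96},
    {"sample_id": 4, "disease": "HCC", "is_control": False, "control_prob": 0.08},
]

controls, others = split_controls(records, min_prob=0.9)
control_ids = [r["sample_id"] for r in controls]
other_ids = [r["sample_id"] for r in others]
print(control_ids, other_ids)
```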


find_abbreviations(text)

Run abbreviation detection on its own. Internally calls a normaliser.


Args:
    text (str): The string to detect abbreviations in.

Returns:
    Dictionary with abbreviation as key and full form as value.

Raises:
    Invalid Request
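
One way to use the returned dictionary is to expand abbreviations in other text from the same document. This is a sketch under assumptions, not library behaviour: expand_abbreviations is a hypothetical helper, and the dictionary is copied from the example output later on this page.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace each whole-word abbreviation in text with its full form."""
    expanded = text
    for short, full in abbreviations.items():
        # \b keeps 'T1D' from matching inside longer tokens such as 'T1DM'.
        pattern = r"\b" + re.escape(short) + r"\b"
        # A function replacement inserts the full form literally.
        expanded = re.sub(pattern, lambda _match: full, expanded)
    return expanded


# Dictionary in the shape returned by find_abbreviations (copied from this page).
abbreviations = {"T1D": "Type 1 Diabetes"}
expanded = expand_abbreviations("T1D onset was earlier in the T1D cohort", abbreviations)
print(expanded)
```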

recognise_entity(text, threshold=None, normalize_output=False)

Run an NER model on the given text. The returned value is a list of entities along with span info. Users can simply recognise entities in a given text without any ontology standardisation (unlike the annotate_with_ontology function which normalises as well).


Args:
    text (str): Input text.
    threshold (float): Optional. All entities with a score < threshold are filtered out from the output. It's best not to specify a threshold and to use the default value instead.
    normalize_output (bool): Whether to normalize the keywords.

Returns:
    entities (List[dict]): List of spans containing the keyword, start/end index of the keyword and the entity type.

Raises:
    requestException: Invalid Request
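
The returned list can also be post-processed locally, for instance to keep only high-scoring entities and read them in text order. The sketch below is illustrative only: filter_entities is a hypothetical helper that works on the returned list and is separate from the function's own threshold argument. The entities are copied from the example output later on this page.

```python
def filter_entities(entities, min_score):
    """Keep entities scoring at least min_score, ordered by position in the text."""
    kept = [e for e in entities if e["score"] >= min_score]
    return sorted(kept, key=lambda e: e["span_begin"])


# Entities copied from the example output on this page.
entities = [
    {"keyword": "lungs", "entity_type": "tissue",
     "span_begin": 34, "span_end": 39, "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 52, "span_end": 55, "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species",
     "span_begin": 29, "span_end": 32, "score": 0.989605188369751},
]

confident = filter_entities(entities, min_score=0.99)
keywords = [e["keyword"] for e in confident]
print(keywords)
```

The helper sorts on span_begin only, so it does not depend on how span_end is counted.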

standardise_entity(mention, entity_type, context=None, threshold=None) cached

Map a given mention (keyword) to an ontology term. Given a mention and its entity type, users get the matching term from a Polly-compatible ontology, such as MeSH.


Args:
    mention (str): Mention of an entity, e.g. "Cardiac arrhythmia".
    entity_type (str): Should be one of ['disease', 'drug', 'tissue', 'cell_type', 'cell_line', 'species', 'gene'].
    context (str): The text where the mention occurs. This is used to resolve abbreviations.
    threshold (float, optional): All entities with a score < threshold are filtered out from the output. It's best not to specify a threshold and to use the default value instead.

Returns:
    dict: Dictionary containing keys and values of the entity type, ontology (such as NCBI, MeSH), ontology ID (such as the MeSH ID), the score (confidence score), and synonyms if any.

Raises:
    requestException: Invalid Request
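
Two small checks can be useful around this call: confirming the entity type before sending a request, and detecting a result with no match. The sketch below is written under assumptions. Both helpers are hypothetical, the allowed types are taken from the argument description above, and the result dictionaries are copied from the examples later on this page. The 'ONTOLOGY:ID' form mirrors the ontology_id shown in the annotate_with_ontology examples.

```python
# Entity types listed in the argument description above.
ENTITY_TYPES = {"disease", "drug", "tissue", "cell_type", "cell_line", "species", "gene"}


def check_entity_type(entity_type):
    """Raise ValueError early if entity_type is not one of the documented values."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unsupported entity_type: {entity_type!r}")
    return entity_type


def prefixed_id(result):
    """Return 'ONTOLOGY:ID' for a matched result, or None when nothing matched."""
    if result.get("ontology_id") is None:
        return None
    return f"{result['ontology']}:{result['ontology_id']}"


# Result dictionaries copied from the examples on this page.
matched = {"ontology": "MESH", "ontology_id": "D003876", "name": "Dermatitis, Atopic",
           "entity_type": "disease", "score": 196.61105346679688,
           "synonym": "atopic dermatitis"}
unmatched = {"ontology": "CUI-less", "ontology_id": None, "name": None,
             "entity_type": "disease", "score": None, "synonym": None}

print(prefixed_id(matched), prefixed_id(unmatched))
```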


# Install polly-python (notebook cell syntax)
!sudo pip3 install polly-python --quiet
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets
# Create the curation object and authenticate.
# AUTH_TOKEN must hold the token copied from Polly; it is not defined in this snippet.
curate = Curation(AUTH_TOKEN)


# Basic example
{'ontology': 'NCBI',
 'ontology_id': 'txid10090',
 'name': 'Mus musculus',
 'entity_type': 'species',
 'score': None,
 'synonym': None}
# Without 'context'
curate.standardise_entity("AD", "disease")
{'ontology': 'MESH',
 'ontology_id': 'C564330',
 'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
 'entity_type': 'disease',
 'score': 202.1661376953125,
 'synonym': 'ad'}
# With context, returns the desired keyword in case of abbreviation
curate.standardise_entity("AD", "disease", 
                context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")
{'ontology': 'MESH',
 'ontology_id': 'D003876',
 'name': 'Dermatitis, Atopic',
 'entity_type': 'disease',
 'score': 196.61105346679688,
 'synonym': 'atopic dermatitis'}
# Using a non-matching 'entity_type' returns None values
{'ontology': 'CUI-less',
 'ontology_id': None,
 'name': None,
 'entity_type': 'disease',
 'score': None,
 'synonym': None}


# Basic example with two entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")
[{'keyword': 'lungs',
  'entity_type': 'tissue',
  'span_begin': 34,
  'span_end': 39,
  'score': 0.9985597729682922},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 52,
  'span_end': 55,
  'score': 0.9900580048561096},
 {'keyword': 'mice',
  'entity_type': 'species',
  'span_begin': 29,
  'span_end': 32,
  'score': 0.989605188369751}]
# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")


# Basic example
curate.annotate_with_ontology("Mouse model shows presence of Adeno carcinoma")
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
 Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]
# Spelling errors
curate.annotate_with_ontology("Mouse model shows presence of Adino carcinoma")
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]


# Full form is not mentioned in the text
curate.find_abbreviations("Patient is diagnosed with T1D")
# '-' in the text is not understood
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")
# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")
{'T1D': 'Type 1 Diabetes'}
# Abbreviation does not match the full text
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")


sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"],})
   sample_id   disease
0          1  control1
1          2     ctrl2
2          3   healthy
3          4       HCC
curate.assign_control_pert_labels(sample_metadata, columns_to_exclude=["sample_id"])
   sample_id   disease  is_control  control_prob
0          1  control1        True          1.00
1          2     ctrl2        True          1.00
2          3   healthy        True          0.96
3          4       HCC       False          0.08

Tutorial Notebooks

  1. Basic Usage Examples

  2. Custom Curation with GEO Datasets from Polly