Data Curation

Curation

The Curation class provides wrapper functions around the models used for semantic annotation of text.

Curation functions recognise entities in a given text and normalise them to a standard nomenclature, such as Polly-compatible ontologies. Supported entity types are: "disease", "drug", "species", "tissue", "cell_type", "cell_line", "gene".

Parameters:

Name | Type | Description | Default
token | str | Token copied from Polly. | required

Usage

from polly.curation import Curation

curationObj = Curation(token)

annotate_with_ontology(text)

Tag a given piece of text. A "tag" is just an ontology term. Annotates with Polly supported ontologies.

This function calls recognise_entity with normalisation enabled, so users can identify and tag entities in a text in one step. Each tag recognised in the text contains the name (the ontology term matched to the word identified in the text), the entity_type and the ontology_id.

Parameters:

Name | Type | Description | Default
text | str | Input text | required

Returns:

Type | Description
List[Tag] | List of unique tags
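
The returned tags can be post-processed like ordinary named tuples. The sketch below groups them by entity type. It does not call Polly: `Tag` here is a hypothetical stand-in modelled on the `Tag(name=..., ontology_id=..., entity_type=...)` objects shown in the examples on this page, and `group_tags_by_type` is an illustrative helper, not part of polly-python.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the Tag objects returned by annotate_with_ontology;
# the real class lives inside polly.curation.
Tag = namedtuple("Tag", ["name", "ontology_id", "entity_type"])


def group_tags_by_type(tags):
    """Return {entity_type: sorted list of (name, ontology_id) pairs}."""
    grouped = defaultdict(set)
    for tag in tags:
        grouped[tag.entity_type].add((tag.name, tag.ontology_id))
    return {etype: sorted(items) for etype, items in grouped.items()}


# Values copied from the annotate_with_ontology example output on this page
tags = [
    Tag("Mus musculus", "NCBI:txid10090", "species"),
    Tag("Adenocarcinoma", "MESH:D000230", "disease"),
]
by_type = group_tags_by_type(tags)
print(by_type["species"])
```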

assign_control_pert_labels(sample_metadata, columns_to_exclude=None)

Returns the sample metadata DataFrame with 2 additional columns:

is_control - whether the sample is a control sample
control_prob - the probability that the sample is a control

Parameters:

Name | Type | Description | Default
sample_metadata | pandas.DataFrame | Metadata table | required
columns_to_exclude | Set[str] | Any columns which don't play any role in determining the label, e.g. any arbitrary sample identifier | None

Returns:

Type | Description
pandas.DataFrame | The input DataFrame with 2 additional columns

Raises:

Type | Description
RequestException | Invalid Request

find_abbreviations(text)

Run abbreviation detection on its own. Internally calls a normaliser.

Parameters:

Name | Type | Description | Default
text | str | The string to detect abbreviations in | required

Returns:

Type | Description
Dict[str, str] | Dictionary with abbreviation as key and full form as value

Raises:

Type | Description
RequestException | Invalid Request
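
One possible use of the returned dictionary is to expand abbreviations before further processing. The sketch below assumes only the documented return shape (abbreviation as key, full form as value); `expand_abbreviations` is a hypothetical helper, not part of polly-python.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace each standalone abbreviation in text with its full form."""
    for short, full in abbreviations.items():
        # \b keeps "AD" from matching inside a longer token such as "ADHD"
        pattern = r"\b" + re.escape(short) + r"\b"
        # A function replacement avoids backslash escapes in the full form
        text = re.sub(pattern, lambda _match, full=full: full, text)
    return text


# Same shape as the dictionary returned by find_abbreviations
abbreviations = {"T1D": "Type 1 Diabetes"}
expanded = expand_abbreviations("T1D cohort versus non-T1D cohort", abbreviations)
print(expanded)
```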

recognise_entity(text, threshold=None, normalize_output=False)

Run an NER model on the given text. The returned value is a list of entities along with span info.

Users can simply recognise entities in a given text without any ontology standardisation (unlike the annotate_with_ontology function which normalises as well).

Parameters:

Name | Type | Description | Default
text | str | Input text | required
normalize_output | bool | Whether to normalize the keywords | False

Returns:

Name | Type | Description
entities | List[dict] | A list of spans containing the keyword, start and end index of the keyword and the entity type

Raises:

Type | Description
RequestException | Invalid Request
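
The list of spans can be filtered and summarised on the client side. The sketch below keeps only high-scoring spans and collects unique keywords per entity type. It assumes the dictionary keys seen in the example output on this page (`keyword`, `entity_type`, `score`); `confident_keywords` and its `min_score` cutoff are hypothetical and unrelated to the function's own `threshold` parameter.

```python
def confident_keywords(entities, min_score=0.99):
    """Return {entity_type: [unique keywords]} for spans scoring >= min_score."""
    result = {}
    for span in entities:
        if span["score"] < min_score:
            continue  # drop low-confidence spans
        keywords = result.setdefault(span["entity_type"], [])
        if span["keyword"] not in keywords:  # collapse repeated mentions
            keywords.append(span["keyword"])
    return result


# Values copied from the recognise_entity example output on this page
entities = [
    {"keyword": "lungs", "entity_type": "tissue",
     "span_begin": 34, "span_end": 39, "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 52, "span_end": 55, "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species",
     "span_begin": 29, "span_end": 32, "score": 0.989605188369751},
]
summary = confident_keywords(entities)
print(summary)
```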

standardise_entity(mention, entity_type, context=None, threshold=None)

Map a given mention (keyword) to an ontology term.

Given a mention and its entity type, users can get the matching term from a Polly-compatible ontology, such as MeSH.

Parameters:

Name | Type | Description | Default
mention | str | Mention of an entity, e.g. "Cardiac arrhythmia" | required
entity_type | str | Should be one of | required
context | str | The text where the mention occurs. | None

Returns:

Name | Type | Description
dict | dict | Dictionary containing keys and values of the entity type, ontology (such as NCBI, MeSH), ontology ID (such as the MeSH ID), the score (confidence score), and synonyms if any

Raises:

Type | Description
RequestException | Invalid Request
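
When a mention cannot be mapped, the examples on this page show a result with ontology 'CUI-less' and ontology_id None. The sketch below handles both cases by building a combined identifier in the 'ONTOLOGY:ID' form that annotate_with_ontology output uses. `format_ontology_id` is a hypothetical helper, and the result dictionaries are copied from the examples rather than fetched from Polly.

```python
def format_ontology_id(result):
    """Return 'ONTOLOGY:ID' for a resolved result, or None if unresolved."""
    if result.get("ontology_id") is None:
        return None  # e.g. the 'CUI-less' result for a non-matching entity_type
    return f"{result['ontology']}:{result['ontology_id']}"


# Result shapes copied from the standardise_entity examples on this page
resolved = {"ontology": "NCBI", "ontology_id": "txid10090",
            "name": "Mus musculus", "entity_type": "species",
            "score": None, "synonym": None}
unresolved = {"ontology": "CUI-less", "ontology_id": None,
              "name": None, "entity_type": "disease",
              "score": None, "synonym": None}

print(format_ontology_id(resolved))
print(format_ontology_id(unresolved))
```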

Examples

# Install polly-python
!sudo pip3 install polly-python --quiet
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets
# Create the Curation object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
curate = Curation(AUTH_TOKEN)

standardise_entity()

# Basic example
curate.standardise_entity("Mouse","species")
{'ontology': 'NCBI',
 'ontology_id': 'txid10090',
 'name': 'Mus musculus',
 'entity_type': 'species',
 'score': None,
 'synonym': None}
# Without 'context'
curate.standardise_entity("AD", "disease")
{'ontology': 'MESH',
 'ontology_id': 'C564330',
 'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
 'entity_type': 'disease',
 'score': 202.1661376953125,
 'synonym': 'ad'}
# With context, returns the desired keyword in case of abbreviation
curate.standardise_entity("AD", "disease", 
                context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")
{'ontology': 'MESH',
 'ontology_id': 'D003876',
 'name': 'Dermatitis, Atopic',
 'entity_type': 'disease',
 'score': 196.61105346679688,
 'synonym': 'atopic dermatitis'}
# A non-matching 'entity_type' returns None values
curate.standardise_entity("Mouse","disease")
{'ontology': 'CUI-less',
 'ontology_id': None,
 'name': None,
 'entity_type': 'disease',
 'score': None,
 'synonym': None}
# Usage of non-supported 'entity_type' returns error -> Here, it is supposed to be "species" and not "specie"
curate.standardise_entity("Mouse","specie")
-----------------------------------------------------------------------

RequestException                          Traceback (most recent call last)

Input In [8], in <cell line: 2>()
      1 # Usage of non-supported 'entity_type' returns error -> Here, it is supposed to be "species" and not "specie"
----> 2 curate.standardise_entity("Mouse","specie")


File /usr/local/lib/python3.10/site-packages/polly/curation.py:145, in Curation.standardise_entity(self, mention, entity_type, context, threshold)
    143 if output.get("errors", []):
    144     title, detail = self._handle_errors(output)
--> 145     raise RequestException(title, detail)
    147 if "term" not in output:
    148     return {
    149         "ontology": "CUI-less",
    150         "ontology_id": None,
    151         "name": None,
    152         "entity_type": entity_type,
    153     }


RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'mention', 'entity_type'], 'msg': "value is not a valid enumeration member; permitted: 'disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite'", 'type': 'type_error.enum', 'ctx': {'enum_values': ['disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite']}}]})
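
The error message above lists the permitted entity_type values, so a typo can be caught locally before any request is sent. The sketch below is a minimal client-side check. It assumes the permitted list matches the one in this error message, which the service may change. `PERMITTED_ENTITY_TYPES` and `check_entity_type` are hypothetical names, not part of polly-python.

```python
import difflib

# Copied from the error message above; an assumption, not a fixed contract
PERMITTED_ENTITY_TYPES = (
    "disease", "drug", "drug_chebi", "species", "tissue",
    "cell_type", "cell_line", "gene", "metabolite",
)


def check_entity_type(entity_type):
    """Return entity_type unchanged if permitted, otherwise raise ValueError."""
    if entity_type in PERMITTED_ENTITY_TYPES:
        return entity_type
    # Suggest the closest permitted value when one is reasonably similar
    close = difflib.get_close_matches(entity_type, PERMITTED_ENTITY_TYPES, n=1)
    hint = f"; did you mean '{close[0]}'?" if close else ""
    raise ValueError(f"Unsupported entity_type '{entity_type}'{hint}")


print(check_entity_type("species"))
```

A call such as `curate.standardise_entity("Mouse", check_entity_type("specie"))` would then fail early with a suggestion instead of a RequestException from the service.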

recognise_entity()

# Basic example with two entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")
[{'keyword': 'lungs',
  'entity_type': 'tissue',
  'span_begin': 34,
  'span_end': 39,
  'score': 0.9985597729682922},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 52,
  'span_end': 55,
  'score': 0.9900580048561096},
 {'keyword': 'mice',
  'entity_type': 'species',
  'span_begin': 29,
  'span_end': 32,
  'score': 0.989605188369751}]
# Multiple entities of the same type
curate.recognise_entity("Batch effects were observed between ductal carcinoma and lobular carcinoma")
[{'keyword': 'ductal carcinoma',
  'entity_type': 'disease',
  'span_begin': 36,
  'span_end': 51,
  'score': 0.9999971389770508},
 {'keyword': 'lobular carcinoma',
  'entity_type': 'disease',
  'span_begin': 57,
  'span_end': 73,
  'score': 0.9999983906745911}]
# Repeating entities
curate.recognise_entity("The study showed ACE2 upregulation and ACE2 downregulation")
[{'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 17,
  'span_end': 20,
  'score': 0.9962862730026245},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 39,
  'span_end': 42,
  'score': 0.990687906742096}]
# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")
[]

annotate_with_ontology()

# Basic example
curate.annotate_with_ontology("Mouse model shows presence of Adeno carcinoma")
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
 Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]
# Spelling errors
curate.annotate_with_ontology("Mouse model shows presence of Adino carcinoma")
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]
# incorrect input format -> here, list instead of string
curate.annotate_with_ontology(["Mouse model shows presence", "adeno carcinoma"])
---------------------------------------------------------------------------

RequestException                          Traceback (most recent call last)

Input In [23], in <cell line: 2>()
      1 # incorrect input format -> here, list instead of string
----> 2 curate.annotate_with_ontology(["Mouse model shows presence", "adeno carcinoma"])


File /usr/local/lib/python3.10/site-packages/polly/curation.py:227, in Curation.annotate_with_ontology(self, text)
    209 def annotate_with_ontology(
    210     self,
    211     text: str,
    212 ) -> List[Tag]:
    214     """
    215     Tag a given piece of text. A "tag" is just an ontology term.
    216     Annotates with Polly supported ontologies.
   (...)
    224         tags (set of tuples): set of unique tags
    225     """
--> 227     entities = self.recognise_entity(text, normalize_output=True)
    228     res = {
    229         self.Tag(
    230             e.get("name", []), e.get("ontology_id", []), e.get("entity_type", [])
   (...)
    233         if e.get("name")
    234     }
    235     return list(res)


File /usr/local/lib/python3.10/site-packages/polly/curation.py:184, in Curation.recognise_entity(self, text, threshold, normalize_output)
    182 if "errors" in response:
    183     title, detail = self._handle_errors(response)
--> 184     raise RequestException(title, detail)
    185 try:
    186     entities = response.get("entities", [])


RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'text'], 'msg': 'str type expected', 'type': 'type_error.str'}]})

find_abbreviations()

# Full form is not mentioned in the text
curate.find_abbreviations("Patient is diagnosed with T1D")
{}
# A '-' between the abbreviation and its full form is not recognised
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")
{}
# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")
{'T1D': 'Type 1 Diabetes'}
# Abbreviation does not match the full form
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")
{}

assign_control_pert_labels()

sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"]})
sample_metadata
   sample_id   disease
0          1  control1
1          2     ctrl2
2          3   healthy
3          4       HCC
curate.assign_control_pert_labels(sample_metadata, columns_to_exclude=["sample_id"])
   sample_id   disease  is_control  control_prob
0          1  control1        True          1.00
1          2     ctrl2        True          1.00
2          3   healthy        True          0.96
3          4       HCC       False          0.08
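
The control_prob column can be used to separate confident calls from borderline ones. The sketch below works on plain row dictionaries, as produced by pandas `DataFrame.to_dict("records")`, so that it runs without Polly or pandas. The `split_by_control_prob` helper, the 0.9 cutoff, and the reading of a low control probability as a perturbation sample are all assumptions made for illustration.

```python
def split_by_control_prob(rows, min_prob=0.9):
    """Group sample_ids into control, perturbation and uncertain buckets."""
    groups = {"control": [], "perturbation": [], "uncertain": []}
    for row in rows:
        prob = row["control_prob"]
        if prob >= min_prob:
            label = "control"
        elif prob <= 1 - min_prob:
            label = "perturbation"  # assumed: low control probability
        else:
            label = "uncertain"  # candidates for manual review
        groups[label].append(row["sample_id"])
    return groups


# Rows copied from the assign_control_pert_labels output above
rows = [
    {"sample_id": 1, "disease": "control1", "is_control": True, "control_prob": 1.00},
    {"sample_id": 2, "disease": "ctrl2", "is_control": True, "control_prob": 1.00},
    {"sample_id": 3, "disease": "healthy", "is_control": True, "control_prob": 0.96},
    {"sample_id": 4, "disease": "HCC", "is_control": False, "control_prob": 0.08},
]
groups = split_by_control_prob(rows)
print(groups)
```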

Tutorial Notebooks

  1. Basic Usage Examples

  2. Custom Curation with GEO Datasets from Polly