Data Curation

`Curation`

The Curation class provides wrapper functions around the models used for semantic annotation of text.

Curation functions can recognise entities in a given text and normalise them against a standard nomenclature, such as the Polly-compatible ontologies. The supported entity types are: "disease", "drug", "species", "tissue", "cell_type", "cell_line", "gene".

Parameters:

Name	Type	Description	Default
`token`	`str`	Token copied from Polly.	required

Usage

from polly.curation import Curation

curationObj = Curation(token)

`annotate_with_ontology(text)`

Tags a given piece of text. A "tag" is an ontology term from one of the Polly-supported ontologies.

This function calls recognise_entity and normalises the recognised entities, so users can identify and tag the entities in a text in a single call. Each tag contains the name of the matched ontology term (for example 'Mus musculus' for the word "Mouse"), the entity_type and the ontology_id.

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text	required

Returns:

Type	Description
`List[Tuples]`	List of unique tags, each holding a name, an ontology_id and an entity_type
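Because the result is a flat list of tags, a common next step is to group the ontology IDs by entity type. The sketch below is illustrative only: `Tag` is modelled as a local namedtuple with the three fields shown in this page's example output (that the library's own tag objects expose the same attributes is an assumption), and `group_tags` is a hypothetical helper, not part of polly-python.

```python
from collections import defaultdict, namedtuple

# Stand-in for the tag objects returned by annotate_with_ontology (assumed shape).
Tag = namedtuple("Tag", ["name", "ontology_id", "entity_type"])


def group_tags(tags):
    """Group ontology IDs by entity type, in first-seen order, without duplicates."""
    grouped = defaultdict(list)
    for tag in tags:
        if tag.ontology_id not in grouped[tag.entity_type]:
            grouped[tag.entity_type].append(tag.ontology_id)
    return dict(grouped)


# Sample values copied from the annotate_with_ontology example output on this page.
tags = [
    Tag("Mus musculus", "NCBI:txid10090", "species"),
    Tag("Adenocarcinoma", "MESH:D000230", "disease"),
]
grouped = group_tags(tags)
print(grouped)
```

With a real client, the `tags` list would presumably come from `curationObj.annotate_with_ontology(text)` instead of being built by hand.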

`assign_control_pert_labels(sample_metadata, columns_to_exclude=None)`

Returns the sample metadata dataframe with two additional columns: is_control, which states whether the sample is a control sample, and control_prob, the probability that the sample is a control.

Parameters:

Name	Type	Description	Default
`sample_metadata`	`pandas.DataFrame`	Metadata table	required
`columns_to_exclude`	`Set[str]`	Any columns which don't play any role in determining the label, e.g. any arbitrary sample identifier	`None`

Returns:

Type	Description
`pandas.DataFrame`	The input DataFrame with 2 additional columns

Raises:

Type	Description
`RequestException`	Invalid Request
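A minimal sketch of how the two added columns might be used downstream. To stay self-contained it works on plain row dictionaries (the shape produced by pandas `DataFrame.to_dict("records")`) rather than calling Polly. `split_by_control` and the 0.5 cut-off are hypothetical choices made for illustration; Polly does not define them.

```python
def split_by_control(rows, min_prob=0.5):
    """Split rows into (controls, perturbations) using control_prob.

    min_prob is an illustrative cut-off, not a value defined by Polly.
    """
    controls = [row for row in rows if row["control_prob"] >= min_prob]
    perturbations = [row for row in rows if row["control_prob"] < min_prob]
    return controls, perturbations


# Rows mirroring the assign_control_pert_labels example output on this page.
rows = [
    {"sample_id": 1, "disease": "control1", "is_control": True, "control_prob": 1.00},
    {"sample_id": 2, "disease": "ctrl2", "is_control": True, "control_prob": 1.00},
    {"sample_id": 3, "disease": "healthy", "is_control": True, "control_prob": 0.96},
    {"sample_id": 4, "disease": "HCC", "is_control": False, "control_prob": 0.08},
]
controls, perturbations = split_by_control(rows)
print([row["sample_id"] for row in controls])
print([row["sample_id"] for row in perturbations])
```

Raising `min_prob` trades recall for precision: samples with a borderline probability move to the perturbation group and can then be reviewed by hand.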

`find_abbreviations(text)`

Runs abbreviation detection on its own, without entity recognition. Internally calls a normaliser.

Parameters:

Name	Type	Description	Default
`text`	`str`	The string to detect abbreviations in	required

Returns:

Type	Description
`Dict[str, str]`	Dictionary with abbreviation as key and full form as value

Raises:

Type	Description
`RequestException`	Invalid Request
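One way the returned dictionary might be used is to expand abbreviations in other sentences of the same document. The following is a sketch under that assumption; `expand_abbreviations` is a hypothetical helper and the mapping is copied from the example output on this page rather than fetched from Polly.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace each whole-word abbreviation in text with its full form."""
    for short, full in abbreviations.items():
        pattern = r"\b" + re.escape(short) + r"\b"
        # A function replacement avoids backslash escapes in `full` being interpreted.
        text = re.sub(pattern, lambda _match, full=full: full, text)
    return text


# Mapping in the shape returned by find_abbreviations (abbreviation -> full form).
mapping = {"T1D": "Type 1 Diabetes"}
expanded = expand_abbreviations("T1D patients were enrolled", mapping)
print(expanded)
```

The word-boundary anchors keep the helper from rewriting longer tokens that merely contain the abbreviation.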

`recognise_entity(text, threshold=None, normalize_output=False)`

Run an NER model on the given text. The returned value is a list of entities along with span info.

Users can simply recognise entities in a given text without any ontology standardisation (unlike the annotate_with_ontology function which normalises as well).

Parameters:

Name	Type	Description	Default
`text`	`str`	input text	required
`normalize_output`	`bool`	whether to normalize the keywords	`False`

Returns:

Name	Type	Description
`entities`	`List[dict]`	returns a list of spans containing the keyword, start and end index of the keyword and the entity type

Raises:

Type	Description
`RequestException`	Invalid Request
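The example outputs on this page are not ordered by position in the text, so a small post-processing step can help. The sketch below filters the returned spans by score on the client side and sorts them by `span_begin`. `confident_entities` and its default cut-off are hypothetical; this is separate from the function's own `threshold` argument, whose semantics are not described here.

```python
def confident_entities(entities, min_score=0.99):
    """Keep entities scoring at least min_score, ordered by position in the text."""
    kept = [entity for entity in entities if entity["score"] >= min_score]
    return sorted(kept, key=lambda entity: entity["span_begin"])


# Sample values copied from the recognise_entity example output on this page.
entities = [
    {"keyword": "lungs", "entity_type": "tissue",
     "span_begin": 34, "span_end": 39, "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 52, "span_end": 55, "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species",
     "span_begin": 29, "span_end": 32, "score": 0.989605188369751},
]
in_order = [entity["keyword"] for entity in confident_entities(entities, min_score=0.0)]
high_confidence = [entity["keyword"] for entity in confident_entities(entities)]
print(in_order)
print(high_confidence)
```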

`standardise_entity(mention, entity_type, context=None, threshold=None)`

Map a given mention (keyword) to an ontology term.

Given a mention and its entity type, users can get the matching term from a Polly-compatible ontology, such as MeSH.

Parameters:

Name	Type	Description	Default
`mention`	`str`	mention of an entity e.g. "Cardiac arrhythmia"	required
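`entity_type`	`str`	Should be one of the supported entity types: "disease", "drug", "species", "tissue", "cell_type", "cell_line", "gene"	required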
`context`	`str`	The text where the mention occurs.	`None`

Returns:

Name	Type	Description
`dict`	`dict`	Dictionary containing keys and values of the entity type, ontology (such as NCBI, MeSH), ontology ID (such as the MeSH ID), the score (confidence score), and synonyms if any

Raises:

Type	Description
`RequestException`	Invalid Request
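When no ontology term is found, the examples on this page show a result with 'ontology' set to 'CUI-less' and 'ontology_id' set to None. A caller may want to check for that before using the ID. The sketch below assumes that result shape; `to_curie` is a hypothetical helper that joins ontology and ID in the same 'ONTOLOGY:ID' form that annotate_with_ontology prints.

```python
def to_curie(result):
    """Return 'ONTOLOGY:ID' for a resolved result, or None when nothing matched."""
    if result.get("ontology_id") is None:
        return None
    return f"{result['ontology']}:{result['ontology_id']}"


# Sample values copied from the standardise_entity example outputs on this page.
resolved = {"ontology": "MESH", "ontology_id": "D003876",
            "name": "Dermatitis, Atopic", "entity_type": "disease",
            "score": 196.61105346679688, "synonym": "atopic dermatitis"}
unresolved = {"ontology": "CUI-less", "ontology_id": None, "name": None,
              "entity_type": "disease", "score": None, "synonym": None}

print(to_curie(resolved))
print(to_curie(unresolved))
```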

Examples

# Install polly-python (notebook shell command; drop the leading "!" in a terminal)
!sudo pip3 install polly-python --quiet
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets

# Create curation object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
curate = Curation(AUTH_TOKEN)

standardise_entity()

# Basic example
curate.standardise_entity("Mouse","species")

{'ontology': 'NCBI',
 'ontology_id': 'txid10090',
 'name': 'Mus musculus',
 'entity_type': 'species',
 'score': None,
 'synonym': None}

# Without 'context'
curate.standardise_entity("AD", "disease")

{'ontology': 'MESH',
 'ontology_id': 'C564330',
 'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
 'entity_type': 'disease',
 'score': 202.1661376953125,
 'synonym': 'ad'}

# With context, returns the desired keyword in case of abbreviation
curate.standardise_entity("AD", "disease", 
                context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")

{'ontology': 'MESH',
 'ontology_id': 'D003876',
 'name': 'Dermatitis, Atopic',
 'entity_type': 'disease',
 'score': 196.61105346679688,
 'synonym': 'atopic dermatitis'}

# Usage of non-matching 'entity_type' returns none values
curate.standardise_entity("Mouse","disease")

{'ontology': 'CUI-less',
 'ontology_id': None,
 'name': None,
 'entity_type': 'disease',
 'score': None,
 'synonym': None}

# Usage of non-supported 'entity_type' returns error -> Here, it is supposed to be "species" and not "specie"
curate.standardise_entity("Mouse","specie")

-----------------------------------------------------------------------

RequestException                          Traceback (most recent call last)

Input In [8], in <cell line: 2>()
      1 # Usage of non-supported 'entity_type' returns error -> Here, it is supposed to be "species" and not "specie"
----> 2 curate.standardise_entity("Mouse","specie")


File /usr/local/lib/python3.10/site-packages/polly/curation.py:145, in Curation.standardise_entity(self, mention, entity_type, context, threshold)
    143 if output.get("errors", []):
    144     title, detail = self._handle_errors(output)
--> 145     raise RequestException(title, detail)
    147 if "term" not in output:
    148     return {
    149         "ontology": "CUI-less",
    150         "ontology_id": None,
    151         "name": None,
    152         "entity_type": entity_type,
    153     }


RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'mention', 'entity_type'], 'msg': "value is not a valid enumeration member; permitted: 'disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite'", 'type': 'type_error.enum', 'ctx': {'enum_values': ['disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite']}}]})
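A typo such as "specie" can be caught locally before the request is sent. The sketch below is one possible approach, not part of polly-python: the permitted values are copied from the error message above, and `check_entity_type` is a hypothetical helper that uses the standard library's `difflib` to suggest the closest valid value.

```python
import difflib

# Values copied from the 'permitted' list in the error message above.
PERMITTED_ENTITY_TYPES = (
    "disease", "drug", "drug_chebi", "species", "tissue",
    "cell_type", "cell_line", "gene", "metabolite",
)


def check_entity_type(entity_type):
    """Return entity_type if permitted, otherwise raise ValueError with a hint."""
    if entity_type in PERMITTED_ENTITY_TYPES:
        return entity_type
    close = difflib.get_close_matches(entity_type, PERMITTED_ENTITY_TYPES, n=1)
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"Unsupported entity_type {entity_type!r}.{hint}")


try:
    check_entity_type("specie")
except ValueError as err:
    message = str(err)
    print(message)
```

The server remains the source of truth for the permitted values, so a hard-coded list like this one can go stale and should be treated as a convenience check only.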

recognise_entity()

# Basic example with two entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")

[{'keyword': 'lungs',
  'entity_type': 'tissue',
  'span_begin': 34,
  'span_end': 39,
  'score': 0.9985597729682922},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 52,
  'span_end': 55,
  'score': 0.9900580048561096},
 {'keyword': 'mice',
  'entity_type': 'species',
  'span_begin': 29,
  'span_end': 32,
  'score': 0.989605188369751}]

# Multiple entities of the same type
curate.recognise_entity("Batch effects were observed between ductal carcinoma and lobular carcinoma")

[{'keyword': 'ductal carcinoma',
  'entity_type': 'disease',
  'span_begin': 36,
  'span_end': 51,
  'score': 0.9999971389770508},
 {'keyword': 'lobular carcinoma',
  'entity_type': 'disease',
  'span_begin': 57,
  'span_end': 73,
  'score': 0.9999983906745911}]

# Repeating entities
curate.recognise_entity("The study showed ACE2 upregulation and ACE2 downregulation")

[{'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 17,
  'span_end': 20,
  'score': 0.9962862730026245},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 39,
  'span_end': 42,
  'score': 0.990687906742096}]

# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")

[]

annotate_with_ontology()

# Basic example
curate.annotate_with_ontology("Mouse model shows presence of Adeno carcinoma")

[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
 Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]

# Spelling errors
curate.annotate_with_ontology("Mouse model shows presence of Adino carcinoma")

[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]

# incorrect input format -> here, list instead of string
curate.annotate_with_ontology(["Mouse model shows presence", "adeno carcinoma"])

---------------------------------------------------------------------------

RequestException                          Traceback (most recent call last)

Input In [23], in <cell line: 2>()
      1 # incorrect input format -> here, list instead of string
----> 2 curate.annotate_with_ontology(["Mouse model shows presence", "adeno carcinoma"])


File /usr/local/lib/python3.10/site-packages/polly/curation.py:227, in Curation.annotate_with_ontology(self, text)
    209 def annotate_with_ontology(
    210     self,
    211     text: str,
    212 ) -> List[Tag]:
    214     """
    215     Tag a given piece of text. A "tag" is just an ontology term.
    216     Annotates with Polly supported ontologies.
   (...)
    224         tags (set of tuples): set of unique tags
    225     """
--> 227     entities = self.recognise_entity(text, normalize_output=True)
    228     res = {
    229         self.Tag(
    230             e.get("name", []), e.get("ontology_id", []), e.get("entity_type", [])
   (...)
    233         if e.get("name")
    234     }
    235     return list(res)


File /usr/local/lib/python3.10/site-packages/polly/curation.py:184, in Curation.recognise_entity(self, text, threshold, normalize_output)
    182 if "errors" in response:
    183     title, detail = self._handle_errors(response)
--> 184     raise RequestException(title, detail)
    185 try:
    186     entities = response.get("entities", [])


RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'text'], 'msg': 'str type expected', 'type': 'type_error.str'}]})
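Since the function expects a single string, several texts have to be annotated one call at a time. A minimal sketch of such a loop follows. `annotate_each` is a hypothetical helper, and `fake_annotate` is a stand-in so the example runs offline; with a real client one would presumably pass `curate.annotate_with_ontology` instead.

```python
def annotate_each(annotate, texts):
    """Call annotate once per string and collect the results keyed by input text."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single string as well as a list of strings
    return {text: annotate(text) for text in texts}


# Offline stand-in for curate.annotate_with_ontology (assumption, for illustration only).
def fake_annotate(text):
    if "Mouse" in text:
        return [("Mus musculus", "NCBI:txid10090", "species")]
    return []


results = annotate_each(fake_annotate, ["Mouse model shows presence", "adeno carcinoma"])
print(results)
```

Note that splitting a sentence into fragments changes what the model sees, so per-fragment results may differ from annotating the full sentence.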

find_abbreviations()

# Full form is not mentioned in the text
curate.find_abbreviations("Patient is diagnosed with T1D")

{}

# A '-' between the abbreviation and its full form is not recognised
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")

{}

# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")

{'T1D': 'Type 1 Diabetes'}

# Abbreviation does not match the full text
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")

{}

assign_control_pert_labels()

sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"],})
sample_metadata

	sample_id	disease
0	1	control1
1	2	ctrl2
2	3	healthy
3	4	HCC

curate.assign_control_pert_labels(sample_metadata, columns_to_exclude=["sample_id"])

	sample_id	disease	is_control	control_prob
0	1	control1	True	1.00
1	2	ctrl2	True	1.00
2	3	healthy	True	0.96
3	4	HCC	False	0.08
