Data Curation
Curation
The Curation class contains wrapper functions around the models used for semantic annotations of string/text.
Curation functions recognise entities in a given text and normalise them to a standard nomenclature, such as Polly compatible ontologies. The supported entity types are: "disease", "drug", "species", "tissue", "cell_type", "cell_line", "gene".
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | Token copied from Polly. | required |
Usage
from polly.curation import Curation
curationObj = Curation(token)
annotate_with_ontology(text)
Tag a given piece of text. A "tag" is just an ontology term. Annotates with Polly supported ontologies.
This function calls recognise_entity followed by normalize, so users can identify and tag the entities in a text in one step. Each tag recognised in the text contains the name (the word identified in the text), the entity_type and the ontology_id.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Input text | required |
Returns:
Type | Description |
---|---|
List[Tuples] | List of unique tags |
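A returned list of tags is often easier to work with once grouped by entity type. The following is a minimal sketch, not part of the library: `Tag` is a local stand-in for the tuples the function returns, and `group_tags_by_type` is a hypothetical helper. With the real library, the list would come from `annotate_with_ontology(text)`.

```python
from collections import defaultdict, namedtuple

# Local stand-in (assumption) for the Tag tuples returned by annotate_with_ontology.
Tag = namedtuple("Tag", ["name", "ontology_id", "entity_type"])


def group_tags_by_type(tags):
    """Return {entity_type: sorted list of (name, ontology_id)} (hypothetical helper)."""
    grouped = defaultdict(set)
    for tag in tags:
        grouped[tag.entity_type].add((tag.name, tag.ontology_id))
    return {entity_type: sorted(items) for entity_type, items in grouped.items()}


# Sample tags in the shape the function returns.
tags = [
    Tag("Mus musculus", "NCBI:txid10090", "species"),
    Tag("Adenocarcinoma", "MESH:D000230", "disease"),
]
grouped = group_tags_by_type(tags)
print(grouped)
```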
assign_control_pert_labels(sample_metadata, columns_to_exclude=None)
Returns the sample metadata dataframe with two additional columns: is_control (whether the sample is a control sample) and control_prob (the probability that the sample is a control).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sample_metadata | pandas.DataFrame | Metadata table | required |
columns_to_exclude | Set[str] | Any columns which don't play any role in determining the label, e.g. any arbitrary sample identifier | None |
Returns:
Type | Description |
---|---|
pandas.DataFrame | The input DataFrame with 2 additional columns |
Raises:
Type | Description |
---|---|
RequestException | Invalid Request |
find_abbreviations(text)
Runs abbreviation detection on its own. Internally calls a normaliser.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The string to detect abbreviations in | required |
Returns:
Type | Description |
---|---|
Dict[str, str] | Dictionary with abbreviation as key and full form as value |
Raises:
Type | Description |
---|---|
RequestException | Invalid Request |
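One possible use of the returned dictionary is to expand abbreviations in other text from the same dataset. The sketch below assumes a dictionary in the documented shape (abbreviation as key, full form as value). `expand_abbreviations` is a hypothetical helper, not a library function.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace each whole-word abbreviation in text with its full form."""
    if not abbreviations:
        return text
    # Longest keys first, so a longer abbreviation wins over a shorter prefix.
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: abbreviations[match.group(1)], text)


# Dictionary in the shape find_abbreviations returns.
abbreviations = {"T1D": "Type 1 Diabetes"}
expanded = expand_abbreviations(
    "T1D patients were compared with non-T1D controls", abbreviations
)
print(expanded)
```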
recognise_entity(text, threshold=None, normalize_output=False)
Run an NER model on the given text. The returned value is a list of entities along with span info.
Users can simply recognise entities in a given text without any ontology standardisation (unlike the annotate_with_ontology function which normalises as well).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Input text | required |
normalize_output | bool | Whether to normalize the keywords | False |
Returns:
Name | Type | Description |
---|---|---|
entities | List[dict] | List of spans, each containing the keyword, the start and end index of the keyword, and the entity type |
Raises:
Type | Description |
---|---|
RequestException | Invalid Request |
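The returned list can be post-processed on the client side, for example to keep only confident entities and group their keywords by type. This is a minimal sketch under stated assumptions: `keywords_by_type` and its `min_score` argument are hypothetical, and the sample dictionaries follow the shape of the function's output (keyword, entity_type, span_begin, span_end, score).

```python
from collections import defaultdict


def keywords_by_type(entities, min_score=0.0):
    """Group keywords by entity_type, keeping entities that score at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        score = entity.get("score")
        if score is not None and score >= min_score:
            grouped[entity["entity_type"]].append(entity["keyword"])
    return dict(grouped)


# Sample entities in the shape recognise_entity returns.
entities = [
    {"keyword": "lungs", "entity_type": "tissue",
     "span_begin": 34, "span_end": 39, "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 52, "span_end": 55, "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species",
     "span_begin": 29, "span_end": 32, "score": 0.989605188369751},
]
confident = keywords_by_type(entities, min_score=0.99)
print(confident)
```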
standardise_entity(mention, entity_type, context=None, threshold=None)
Map a given mention (keyword) to an ontology term.
Given a mention and its entity type, users can get the matching term from a Polly compatible ontology, such as MeSH.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mention | str | Mention of an entity, e.g. "Cardiac arrhythmia" | required |
entity_type | str | Should be one of | required |
context | str | The text where the mention occurs. | None |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | Dictionary containing the entity type, the ontology (such as NCBI, MeSH), the ontology ID (such as the MeSH ID), the confidence score, and synonyms if any |
Raises:
Type | Description |
---|---|
RequestException | Invalid Request |
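Callers usually need to tell a matched term apart from an unmatched one, where `ontology_id` is `None`. The sketch below works on result dictionaries in the documented shape. `to_tag_id` is a hypothetical helper, and the combined `ONTOLOGY:ID` form is only an assumption modelled on the tag IDs that annotate_with_ontology produces.

```python
def to_tag_id(result):
    """Return 'ONTOLOGY:ID' for a matched term, or None when nothing matched."""
    if result.get("ontology_id") is None:
        return None
    return f"{result['ontology']}:{result['ontology_id']}"


# Sample results in the shape standardise_entity returns.
matched = {
    "ontology": "MESH",
    "ontology_id": "D003876",
    "name": "Dermatitis, Atopic",
    "entity_type": "disease",
    "score": 196.61105346679688,
    "synonym": "atopic dermatitis",
}
unmatched = {
    "ontology": "CUI-less",
    "ontology_id": None,
    "name": None,
    "entity_type": "disease",
    "score": None,
    "synonym": None,
}
print(to_tag_id(matched))
print(to_tag_id(unmatched))
```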
Examples
# Install polly python
!sudo pip3 install polly-python --quiet
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets
# Create curation object and authenticate
AUTH_TOKEN=(os.environ['POLLY_REFRESH_TOKEN'])
curate = Curation(AUTH_TOKEN)
standardise_entity()
{'ontology': 'NCBI',
'ontology_id': 'txid10090',
'name': 'Mus musculus',
'entity_type': 'species',
'score': None,
'synonym': None}
{'ontology': 'MESH',
'ontology_id': 'C564330',
'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
'entity_type': 'disease',
'score': 202.1661376953125,
'synonym': 'ad'}
# With context, returns the desired keyword in case of abbreviation
curate.standardise_entity("AD", "disease",
context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")
{'ontology': 'MESH',
'ontology_id': 'D003876',
'name': 'Dermatitis, Atopic',
'entity_type': 'disease',
'score': 196.61105346679688,
'synonym': 'atopic dermatitis'}
# Usage of non-matching 'entity_type' returns none values
curate.standardise_entity("Mouse","disease")
{'ontology': 'CUI-less',
'ontology_id': None,
'name': None,
'entity_type': 'disease',
'score': None,
'synonym': None}
# Usage of non-supported 'entity_type' returns error -> Here, it is supposed to be "species" and not "specie"
curate.standardise_entity("Mouse","specie")
-----------------------------------------------------------------------
RequestException Traceback (most recent call last)
Input In [8], in <cell line: 2>()
1 # Usage of non-supported 'entity_type' returns error -> Here, it is supposed to be "species" and not "specie"
----> 2 curate.standardise_entity("Mouse","specie")
File /usr/local/lib/python3.10/site-packages/polly/curation.py:145, in Curation.standardise_entity(self, mention, entity_type, context, threshold)
143 if output.get("errors", []):
144 title, detail = self._handle_errors(output)
--> 145 raise RequestException(title, detail)
147 if "term" not in output:
148 return {
149 "ontology": "CUI-less",
150 "ontology_id": None,
151 "name": None,
152 "entity_type": entity_type,
153 }
RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'mention', 'entity_type'], 'msg': "value is not a valid enumeration member; permitted: 'disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite'", 'type': 'type_error.enum', 'ctx': {'enum_values': ['disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite']}}]})
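A typo such as "specie" can be caught locally before any request is sent. The sketch below copies the permitted values from the error message above. `check_entity_type` is a hypothetical helper, and the list may change between library versions, so treat it as an assumption rather than the library's own validation.

```python
import difflib

# Permitted values as listed in the error message above (may differ by version).
PERMITTED_ENTITY_TYPES = (
    "disease", "drug", "drug_chebi", "species", "tissue",
    "cell_type", "cell_line", "gene", "metabolite",
)


def check_entity_type(entity_type):
    """Return entity_type if permitted, otherwise raise ValueError with a suggestion."""
    if entity_type in PERMITTED_ENTITY_TYPES:
        return entity_type
    close = difflib.get_close_matches(entity_type, PERMITTED_ENTITY_TYPES, n=1)
    hint = f"; did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"unsupported entity_type {entity_type!r}{hint}")


try:
    check_entity_type("specie")
except ValueError as error:
    message = str(error)
    print(message)
```

The check could run just before `curate.standardise_entity(mention, check_entity_type(entity_type))`, so that a misspelt type fails fast with a readable message.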
recognise_entity()
# Basic example with two entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")
[{'keyword': 'lungs',
'entity_type': 'tissue',
'span_begin': 34,
'span_end': 39,
'score': 0.9985597729682922},
{'keyword': 'ACE2',
'entity_type': 'gene',
'span_begin': 52,
'span_end': 55,
'score': 0.9900580048561096},
{'keyword': 'mice',
'entity_type': 'species',
'span_begin': 29,
'span_end': 32,
'score': 0.989605188369751}]
# Multiple entities of the same type
curate.recognise_entity("Batch effects were observed between ductal carcinoma and lobular carcinoma")
[{'keyword': 'ductal carcinoma',
'entity_type': 'disease',
'span_begin': 36,
'span_end': 51,
'score': 0.9999971389770508},
{'keyword': 'lobular carcinoma',
'entity_type': 'disease',
'span_begin': 57,
'span_end': 73,
'score': 0.9999983906745911}]
# Repeating entities
curate.recognise_entity("The study showed ACE2 upregulation and ACE2 downregulation")
[{'keyword': 'ACE2',
'entity_type': 'gene',
'span_begin': 17,
'span_end': 20,
'score': 0.9962862730026245},
{'keyword': 'ACE2',
'entity_type': 'gene',
'span_begin': 39,
'span_end': 42,
'score': 0.990687906742096}]
# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")
[]
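Because repeated mentions come back as separate entries, a caller may want to count them. The following is a minimal sketch over the "Repeating entities" output above. `count_mentions` is a hypothetical helper, not part of the library.

```python
from collections import Counter


def count_mentions(entities):
    """Count how often each (entity_type, keyword) pair occurs."""
    return Counter((e["entity_type"], e["keyword"]) for e in entities)


# Output of the "Repeating entities" example above.
entities = [
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 17, "span_end": 20, "score": 0.9962862730026245},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 39, "span_end": 42, "score": 0.990687906742096},
]
counts = count_mentions(entities)
print(counts)
```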
annotate_with_ontology()
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]
# incorrect input format -> here, list instead of string
curate.annotate_with_ontology(["Mouse model shows presence", "adeno carcinoma"])
---------------------------------------------------------------------------
RequestException Traceback (most recent call last)
Input In [23], in <cell line: 2>()
1 # incorrect input format -> here, list instead of string
----> 2 curate.annotate_with_ontology(["Mouse model shows presence", "adeno carcinoma"])
File /usr/local/lib/python3.10/site-packages/polly/curation.py:227, in Curation.annotate_with_ontology(self, text)
209 def annotate_with_ontology(
210 self,
211 text: str,
212 ) -> List[Tag]:
214 """
215 Tag a given piece of text. A "tag" is just an ontology term.
216 Annotates with Polly supported ontologies.
(...)
224 tags (set of tuples): set of unique tags
225 """
--> 227 entities = self.recognise_entity(text, normalize_output=True)
228 res = {
229 self.Tag(
230 e.get("name", []), e.get("ontology_id", []), e.get("entity_type", [])
(...)
233 if e.get("name")
234 }
235 return list(res)
File /usr/local/lib/python3.10/site-packages/polly/curation.py:184, in Curation.recognise_entity(self, text, threshold, normalize_output)
182 if "errors" in response:
183 title, detail = self._handle_errors(response)
--> 184 raise RequestException(title, detail)
185 try:
186 entities = response.get("entities", [])
RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'text'], 'msg': 'str type expected', 'type': 'type_error.str'}]})
find_abbreviations()
{}
# A '-' separator in the text is not recognised
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")
{}
# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")
{'T1D': 'Type 1 Diabetes'}
# Abbreviation does not match the full text
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")
{}
assign_control_pert_labels()
sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"],})
sample_metadata
| | sample_id | disease |
---|---|---|
0 | 1 | control1 |
1 | 2 | ctrl2 |
2 | 3 | healthy |
3 | 4 | HCC |
| | sample_id | disease | is_control | control_prob |
---|---|---|---|---|
0 | 1 | control1 | True | 1.00 |
1 | 2 | ctrl2 | True | 1.00 |
2 | 3 | healthy | True | 0.96 |
3 | 4 | HCC | False | 0.08 |
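Samples labelled as control with a lower probability may deserve a manual check. The sketch below works on plain row dictionaries, the shape that pandas `DataFrame.to_dict("records")` produces, so it runs without pandas. `flag_uncertain_controls` and the 0.99 cutoff are hypothetical choices, not library behaviour.

```python
def flag_uncertain_controls(records, min_prob=0.99):
    """Return sample_ids labelled control whose control_prob is below min_prob."""
    return [
        row["sample_id"]
        for row in records
        if row["is_control"] and row["control_prob"] < min_prob
    ]


# Rows copied from the labelled table above.
records = [
    {"sample_id": 1, "disease": "control1", "is_control": True, "control_prob": 1.00},
    {"sample_id": 2, "disease": "ctrl2", "is_control": True, "control_prob": 1.00},
    {"sample_id": 3, "disease": "healthy", "is_control": True, "control_prob": 0.96},
    {"sample_id": 4, "disease": "HCC", "is_control": False, "control_prob": 0.08},
]
to_review = flag_uncertain_controls(records)
print(to_review)
```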