Data Curation

The Curation class contains wrapper functions around the models used for semantic annotations of string/text.

Parameters:

token (str, default: None ) –

Token copied from Polly.

Usage

from polly.curation import Curation

curationObj = Curation(token)

annotate_with_ontology

annotate_with_ontology(text)

Tag a given piece of text with terms from the Polly-supported ontologies. A "tag" is simply an ontology term. This function calls recognise_entity followed by normalize, so entities in the text are both identified and standardised. Each tag recognised in the text contains the name (the word identified in the text), the entity_type and the ontology_id.

Parameters:

text (str) –

Input text

Returns:

List[Tag] –

set of unique tags

assign_control_pert_labels

assign_control_pert_labels(sample_metadata, columns_to_exclude=None)

Returns the sample metadata dataframe with 2 additional columns:

is_control - whether the sample is a control sample

control_prob - the probability that the sample is a control

Parameters:

sample_metadata (DataFrame) –

Metadata table
columns_to_exclude (Set[str], default: None ) –

Any columns which don't play any role in determining the label, e.g. sample id

Returns:

DataFrame ( DataFrame ) –

Input data frame with 2 additional columns

Raises:

requestException –

Invalid Request

find_abbreviations

find_abbreviations(text)

Run abbreviation detection on its own. Internally calls a normaliser.

Parameters:

text (str) –

The string to detect abbreviations in.

Returns:

Dict[str, str] –

Dictionary with abbreviation as key and full form as value

Raises:

requestException –

Invalid Request

recognise_entity

recognise_entity(text, threshold=None, normalize_output=False)

Run an NER model on the given text. The returned value is a list of entities along with span info. Users can simply recognise entities in a given text without any ontology standardisation (unlike the annotate_with_ontology function which normalises as well).

Parameters:

text (str) –

input text
threshold (float, default: None ) –

Optional. All entities with a score below the threshold are filtered out of the output. It is best not to specify a threshold and to use the default value instead.
normalize_output (bool, default: False ) –

whether to normalize the keywords

Returns:

entities ( List[dict] ) –

List of spans containing the keyword, start/end index of the keyword and the entity type

Raises:

requestException –

Invalid Request
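
The threshold behaviour described above can be pictured as a plain score filter. The following is a minimal offline sketch, assuming entity dicts shaped like the documented return value; `filter_by_score` is a hypothetical helper for illustration and is not part of polly-python.

```python
from typing import Dict, List


def filter_by_score(entities: List[Dict], threshold: float) -> List[Dict]:
    """Keep entities whose score is at least `threshold` (hypothetical helper)."""
    return [entity for entity in entities if entity["score"] >= threshold]


# Sample entities shaped like the recognise_entity output (span fields omitted).
entities = [
    {"keyword": "lungs", "entity_type": "tissue", "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene", "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species", "score": 0.989605188369751},
]

kept = filter_by_score(entities, threshold=0.99)
print([entity["keyword"] for entity in kept])  # ['lungs', 'ACE2']
```

Filtering on the client side like this leaves the service default untouched, which is in line with the advice not to pass a threshold unless needed.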

standardise_entity `cached`

standardise_entity(mention, entity_type, context=None, threshold=None)

Map a given mention (keyword) to an ontology term. Given a mention and its entity type, users get the matching term from a Polly-compatible ontology, such as MeSH.

Parameters:

mention (str) –

mention of an entity e.g. "Cardiac arrhythmia"
entity_type (str) –

Should be one of ['disease', 'drug', 'tissue', 'cell_type', 'cell_line', 'species', 'gene']
context (str, default: None ) –

The text where the mention occurs. This is used to resolve abbreviations.
threshold (float, default: None ) –

Optional. All entities with a score below the threshold are filtered out of the output. It is best not to specify a threshold and to use the default value instead.

Returns:

dict ( dict ) –

Dictionary containing keys and values of the entity type, ontology (such as NCBI, MeSH), ontology ID (such as the MeSH ID), the score (confidence score), and synonyms if any

Raises:

requestException –

Invalid Request

Examples

# Install polly-python
!sudo pip3 install polly-python --quiet
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets

# Create curation object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
curate = Curation(AUTH_TOKEN)

standardise_entity()

# Basic example
curate.standardise_entity("Mouse","species")

{'ontology': 'NCBI',
 'ontology_id': 'txid10090',
 'name': 'Mus musculus',
 'entity_type': 'species',
 'score': None,
 'synonym': None}

# Without 'context'
curate.standardise_entity("AD", "disease")

{'ontology': 'MESH',
 'ontology_id': 'C564330',
 'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
 'entity_type': 'disease',
 'score': 202.1661376953125,
 'synonym': 'ad'}

# With 'context', the abbreviation is resolved to the intended term
curate.standardise_entity("AD", "disease", 
                context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")

{'ontology': 'MESH',
 'ontology_id': 'D003876',
 'name': 'Dermatitis, Atopic',
 'entity_type': 'disease',
 'score': 196.61105346679688,
 'synonym': 'atopic dermatitis'}

# A non-matching 'entity_type' returns None values
curate.standardise_entity("Mouse","disease")

{'ontology': 'CUI-less',
 'ontology_id': None,
 'name': None,
 'entity_type': 'disease',
 'score': None,
 'synonym': None}
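
Downstream code usually needs to tell a mapped result from an unmapped one. The sketch below is one way to do that offline, assuming result dicts shaped like the outputs shown above; `is_mapped` is a hypothetical helper, not a polly-python function.

```python
from typing import Any, Dict


def is_mapped(result: Dict[str, Any]) -> bool:
    """Return True when the result carries an ontology ID (hypothetical helper)."""
    return result.get("ontology_id") is not None


# Results copied from the examples above.
mouse_as_species = {
    "ontology": "NCBI",
    "ontology_id": "txid10090",
    "name": "Mus musculus",
    "entity_type": "species",
    "score": None,
    "synonym": None,
}
mouse_as_disease = {
    "ontology": "CUI-less",
    "ontology_id": None,
    "name": None,
    "entity_type": "disease",
    "score": None,
    "synonym": None,
}

print(is_mapped(mouse_as_species))  # True
print(is_mapped(mouse_as_disease))  # False
```

Checking `ontology_id` rather than `score` is deliberate here, since the mapped "Mouse" example above also has a score of None.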

recognise_entity()

# Basic example with multiple entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")

[{'keyword': 'lungs',
  'entity_type': 'tissue',
  'span_begin': 34,
  'span_end': 39,
  'score': 0.9985597729682922},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 52,
  'span_end': 55,
  'score': 0.9900580048561096},
 {'keyword': 'mice',
  'entity_type': 'species',
  'span_begin': 29,
  'span_end': 32,
  'score': 0.989605188369751}]
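
In the output above the entities appear in descending score order rather than in the order they occur in the text. If text order is needed, sorting on `span_begin` is one option. This is a small offline sketch using values copied from that output, not a feature of the library.

```python
# Entities copied from the example output above (only the fields used here).
entities = [
    {"keyword": "lungs", "entity_type": "tissue", "span_begin": 34},
    {"keyword": "ACE2", "entity_type": "gene", "span_begin": 52},
    {"keyword": "mice", "entity_type": "species", "span_begin": 29},
]

# Sort by the start offset so entities follow their position in the text.
in_text_order = sorted(entities, key=lambda entity: entity["span_begin"])
print([entity["keyword"] for entity in in_text_order])  # ['mice', 'lungs', 'ACE2']
```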

# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")

[]

annotate_with_ontology()

# Basic example
curate.annotate_with_ontology("Mouse model shows presence of Adeno carcinoma")

[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
 Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]

# Misspelled terms are not tagged
curate.annotate_with_ontology("Mouse model shows presence of Adino carcinoma")

[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]
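
The tags above carry the ontology name and the identifier in a single `ontology_id` string such as 'NCBI:txid10090', whereas standardise_entity returns them as separate fields. A minimal sketch for splitting the combined form, assuming the 'PREFIX:ID' layout seen in these examples; `split_ontology_id` is a hypothetical helper.

```python
from typing import Tuple


def split_ontology_id(ontology_id: str) -> Tuple[str, str]:
    """Split 'PREFIX:ID' at the first colon (hypothetical helper).

    If there is no colon, the whole string is returned as the prefix
    and the ID part is empty.
    """
    prefix, _, local_id = ontology_id.partition(":")
    return prefix, local_id


print(split_ontology_id("NCBI:txid10090"))  # ('NCBI', 'txid10090')
print(split_ontology_id("MESH:D000230"))    # ('MESH', 'D000230')
```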

find_abbreviations()

# Full form is not mentioned in the text
curate.find_abbreviations("Patient is diagnosed with T1D")

{}

# A '-' between the abbreviation and its full form is not recognised
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")

{}

# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")

{'T1D': 'Type 1 Diabetes'}

# Abbreviation does not match the full form
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")

{}
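
A common follow-up is to expand the detected abbreviations in other text. The sketch below works offline on a dictionary shaped like the find_abbreviations return value; `expand_abbreviations` is a hypothetical helper and the second sentence is made up for illustration.

```python
import re
from typing import Dict


def expand_abbreviations(text: str, abbreviations: Dict[str, str]) -> str:
    """Replace whole-word abbreviations with their full forms (hypothetical helper)."""
    for short, full in abbreviations.items():
        pattern = rf"\b{re.escape(short)}\b"
        # A function replacement keeps `full` literal, even if it contains backslashes.
        text = re.sub(pattern, lambda _match: full, text)
    return text


# Dictionary copied from the "Abbreviation is recognized" example above.
abbreviations = {"T1D": "Type 1 Diabetes"}

expanded = expand_abbreviations("T1D onset differs between cohorts", abbreviations)
print(expanded)  # Type 1 Diabetes onset differs between cohorts
```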

assign_control_pert_labels()

sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"],})
sample_metadata

	sample_id	disease
0	1	control1
1	2	ctrl2
2	3	healthy
3	4	HCC

curate.assign_control_pert_labels(sample_metadata, columns_to_exclude=["sample_id"])

	sample_id	disease	is_control	control_prob
0	1	control1	True	1.00
1	2	ctrl2	True	1.00
2	3	healthy	True	0.96
3	4	HCC	False	0.08
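
Once the labels are assigned, the samples can be split into control and perturbation groups. The sketch below uses plain dicts copied from the table above so that it runs without pandas; with a real result, such rows could be obtained through pandas' `DataFrame.to_dict("records")`. The variable names are illustrative only.

```python
# Rows copied from the labelled table above.
records = [
    {"sample_id": 1, "disease": "control1", "is_control": True, "control_prob": 1.00},
    {"sample_id": 2, "disease": "ctrl2", "is_control": True, "control_prob": 1.00},
    {"sample_id": 3, "disease": "healthy", "is_control": True, "control_prob": 0.96},
    {"sample_id": 4, "disease": "HCC", "is_control": False, "control_prob": 0.08},
]

# Partition the sample IDs on the is_control label.
control_ids = [row["sample_id"] for row in records if row["is_control"]]
perturbation_ids = [row["sample_id"] for row in records if not row["is_control"]]

print(control_ids)       # [1, 2, 3]
print(perturbation_ids)  # [4]
```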
