
Data Curation

The Curation class provides wrapper functions around the models used for semantic annotation of text.

Parameters:

  • token (str, default: None ) –

Authentication token copied from Polly.

Usage

from polly.curation import Curation

curationObj = Curation(token)

annotate_with_ontology

annotate_with_ontology(text)

Tag a given piece of text with terms from the ontologies Polly supports. A "tag" is an ontology term. This function calls recognise_entity followed by normalize, so entities in the text are both identified and mapped to an ontology. Each tag contains the name, the entity_type and the ontology_id of a recognised entity.

Parameters:

  • text (str) –

    Input text

Returns:

  • List[Tag]

List of unique tags

assign_control_pert_labels

assign_control_pert_labels(sample_metadata, columns_to_exclude=None)

Returns the sample metadata dataframe with 2 additional columns:

  • is_control: whether the sample is a control sample

  • control_prob: the probability that the sample is a control

Parameters:

  • sample_metadata (DataFrame) –

    Metadata table

  • columns_to_exclude (Set[str], default: None ) –

    Any columns which don't play any role in determining the label, e.g. sample id

Returns:

  • DataFrame ( DataFrame ) –

    Input data frame with 2 additional columns

Raises:

  • requestException

    Invalid Request

find_abbreviations

find_abbreviations(text)

Runs abbreviation detection on its own, without the rest of the annotation pipeline. Internally calls a normaliser.

Parameters:

  • text (str) –

    The string to detect abbreviations in.

Returns:

  • Dict[str, str]

    Dictionary with abbreviation as key and full form as value

Raises:

  • requestException

    Invalid Request

recognise_entity

recognise_entity(text, threshold=None, normalize_output=False)

Run an NER model on the given text. The returned value is a list of entities along with span info. Users can simply recognise entities in a given text without any ontology standardisation (unlike the annotate_with_ontology function which normalises as well).

Parameters:

  • text (str) –

    input text

  • threshold (float, default: None ) –

Optional parameter. All entities with a score < threshold are filtered out from the output. It's best not to specify a threshold and to use the default value instead.

  • normalize_output (bool, default: False ) –

    whether to normalize the keywords

Returns:

  • entities ( List[dict] ) –

    List of spans containing the keyword, start/end index of the keyword and the entity type

Raises:

  • requestException

    Invalid Request
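
The threshold rule described above (entities with a score below the threshold are dropped) can be illustrated offline. The sketch below applies that rule to a list shaped like the output of recognise_entity. `filter_by_score` is a hypothetical helper, not part of polly-python, and the sample entities are copied from the Examples section of this page.

```python
from typing import Dict, List


def filter_by_score(entities: List[Dict], threshold: float) -> List[Dict]:
    """Keep entities whose score is at least `threshold`.

    Hypothetical helper: it mirrors the documented rule
    ("score < threshold is filtered out"), not Polly's internal code.
    """
    return [entity for entity in entities if entity["score"] >= threshold]


# Sample output copied from the recognise_entity example on this page.
entities = [
    {"keyword": "lungs", "entity_type": "tissue",
     "span_begin": 34, "span_end": 39, "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 52, "span_end": 55, "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species",
     "span_begin": 29, "span_end": 32, "score": 0.989605188369751},
]

kept = filter_by_score(entities, threshold=0.99)
kept_keywords = [entity["keyword"] for entity in kept]
print(kept_keywords)
```

Filtering on the client side like this leaves the default server-side threshold untouched, which is in line with the recommendation not to pass a threshold to the function itself.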

standardise_entity cached

standardise_entity(mention, entity_type, context=None, threshold=None)

Map a given mention (keyword) to an ontology term. Given a mention and its entity type, users get the matching term from a Polly-compatible ontology, such as MeSH.

Parameters:

  • mention (str) –

Mention of an entity, e.g. "Cardiac arrhythmia"

  • entity_type (str) –

    Should be one of ['disease', 'drug', 'tissue', 'cell_type', 'cell_line', 'species', 'gene']

  • context (str, default: None ) –

    The text where the mention occurs. This is used to resolve abbreviations.

  • threshold (float, default: None ) –

    Optional parameter. All entities with a score < threshold are filtered out from the output. It's best not to specify a threshold and to use the default value instead.

Returns:

  • dict ( dict ) –

    Dictionary containing keys and values of the entity type, ontology (such as NCBI, MeSH), ontology ID (such as the MeSH ID), the score (confidence score), and synonyms if any

Raises:

  • requestException

    Invalid Request
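
Because entity_type must be one of a fixed set of strings, it can help to validate it locally before making a request. The sketch below is a minimal check, assuming the list of allowed values given in the parameter description above; `check_entity_type` and `ALLOWED_ENTITY_TYPES` are hypothetical names, not part of polly-python.

```python
# Allowed values copied from the entity_type parameter description.
ALLOWED_ENTITY_TYPES = (
    "disease", "drug", "tissue", "cell_type", "cell_line", "species", "gene",
)


def check_entity_type(entity_type: str) -> str:
    """Return entity_type unchanged if it is allowed, else raise ValueError.

    Hypothetical helper for local validation; the library may perform
    its own checks and report errors differently.
    """
    if entity_type not in ALLOWED_ENTITY_TYPES:
        allowed = ", ".join(ALLOWED_ENTITY_TYPES)
        raise ValueError(
            f"Unknown entity_type {entity_type!r}; expected one of: {allowed}"
        )
    return entity_type


checked = check_entity_type("species")
print(checked)
```

The checked value could then be passed on, for example as `curationObj.standardise_entity("Mouse", check_entity_type("species"))`.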

Examples

# Install polly-python
!sudo pip3 install polly-python --quiet
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets
# Create curation object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
curate = Curation(AUTH_TOKEN)

standardise_entity()

# Basic example
curate.standardise_entity("Mouse","species")
{'ontology': 'NCBI',
 'ontology_id': 'txid10090',
 'name': 'Mus musculus',
 'entity_type': 'species',
 'score': None,
 'synonym': None}
# Without 'context'
curate.standardise_entity("AD", "disease")
{'ontology': 'MESH',
 'ontology_id': 'C564330',
 'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
 'entity_type': 'disease',
 'score': 202.1661376953125,
 'synonym': 'ad'}
# With 'context', the abbreviation is resolved to the intended term
curate.standardise_entity("AD", "disease", 
                context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")
{'ontology': 'MESH',
 'ontology_id': 'D003876',
 'name': 'Dermatitis, Atopic',
 'entity_type': 'disease',
 'score': 196.61105346679688,
 'synonym': 'atopic dermatitis'}
# A non-matching 'entity_type' returns None values
curate.standardise_entity("Mouse","disease")
{'ontology': 'CUI-less',
 'ontology_id': None,
 'name': None,
 'entity_type': 'disease',
 'score': None,
 'synonym': None}
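
The last example above shows that an unmatched mention still returns a dictionary, with ontology set to 'CUI-less' and ontology_id set to None. A caller can test for this before using the result. The sketch below is an assumption-laden illustration: `is_resolved` and `prefixed_id` are hypothetical helpers, and the "ONTOLOGY:ID" format is inferred from the Tag output of annotate_with_ontology on this page rather than from any documented guarantee.

```python
from typing import Optional


def is_resolved(result: dict) -> bool:
    """True when the result carries an ontology ID (hypothetical helper)."""
    return result.get("ontology_id") is not None


def prefixed_id(result: dict) -> Optional[str]:
    """Build an 'ONTOLOGY:ID' string, or None for an unresolved mention.

    The prefixed format is an assumption based on the Tag examples
    shown elsewhere on this page.
    """
    if not is_resolved(result):
        return None
    return f"{result['ontology']}:{result['ontology_id']}"


# Result dictionaries copied from the examples above.
mouse_as_species = {"ontology": "NCBI", "ontology_id": "txid10090",
                    "name": "Mus musculus", "entity_type": "species",
                    "score": None, "synonym": None}
mouse_as_disease = {"ontology": "CUI-less", "ontology_id": None,
                    "name": None, "entity_type": "disease",
                    "score": None, "synonym": None}

print(prefixed_id(mouse_as_species))
print(prefixed_id(mouse_as_disease))
```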

recognise_entity()

# Basic example with three entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")
[{'keyword': 'lungs',
  'entity_type': 'tissue',
  'span_begin': 34,
  'span_end': 39,
  'score': 0.9985597729682922},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 52,
  'span_end': 55,
  'score': 0.9900580048561096},
 {'keyword': 'mice',
  'entity_type': 'species',
  'span_begin': 29,
  'span_end': 32,
  'score': 0.989605188369751}]
# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")
[]

annotate_with_ontology()

# Basic example
curate.annotate_with_ontology("Mouse model shows presence of Adeno carcinoma")
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
 Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]
# Misspelled terms may not be recognised
curate.annotate_with_ontology("Mouse model shows presence of Adino carcinoma")
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]
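
Downstream code often needs the ontology name and the identifier separately, or the tags grouped by entity type. The sketch below shows one way to do both. It does not import the real Tag class: the `Tag` NamedTuple is a stand-in that assumes the same three fields as the printed output above, and `split_ontology_id` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Tag(NamedTuple):
    """Stand-in for the Tag objects printed above (assumed fields)."""
    name: str
    ontology_id: str
    entity_type: str


def split_ontology_id(ontology_id: str) -> Tuple[str, str]:
    """Split 'MESH:D000230' into ('MESH', 'D000230') at the first colon."""
    source, _, local_id = ontology_id.partition(":")
    return source, local_id


# Tags copied from the basic annotate_with_ontology example.
tags = [
    Tag(name="Mus musculus", ontology_id="NCBI:txid10090", entity_type="species"),
    Tag(name="Adenocarcinoma", ontology_id="MESH:D000230", entity_type="disease"),
]

# Group tag names by entity type.
by_type: Dict[str, List[str]] = defaultdict(list)
for tag in tags:
    by_type[tag.entity_type].append(tag.name)

print(dict(by_type))
print([split_ontology_id(tag.ontology_id) for tag in tags])
```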

find_abbreviations()

# Full form is not mentioned in the text
curate.find_abbreviations("Patient is diagnosed with T1D")
{}
# A '-' separator in the text is not understood
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")
{}
# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")
{'T1D': 'Type 1 Diabetes'}
# Abbreviation does not match the full form
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")
{}
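
Once a dictionary of abbreviations is available, it can be used to expand the short forms elsewhere in a text. The sketch below is one possible post-processing step, not something polly-python provides: `expand_abbreviations` is a hypothetical helper, and the dictionary is the one returned in the recognised example above.

```python
import re
from typing import Dict


def expand_abbreviations(text: str, abbreviations: Dict[str, str]) -> str:
    """Replace whole-word occurrences of each abbreviation with its full form.

    Hypothetical helper; matching is case-sensitive and limited to
    whole words so that 'T1D' does not match inside a longer token.
    """
    for short, full in abbreviations.items():
        pattern = rf"\b{re.escape(short)}\b"
        # A function replacement avoids interpreting backslashes in `full`.
        text = re.sub(pattern, lambda _match, full=full: full, text)
    return text


# Dictionary shaped like the find_abbreviations output shown above.
abbreviations = {"T1D": "Type 1 Diabetes"}

expanded = expand_abbreviations(
    "T1D patients were compared with controls", abbreviations
)
print(expanded)
```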

assign_control_pert_labels()

sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"]})
sample_metadata
   sample_id   disease
0          1  control1
1          2     ctrl2
2          3   healthy
3          4       HCC
curate.assign_control_pert_labels(sample_metadata, columns_to_exclude=["sample_id"])
   sample_id   disease  is_control  control_prob
0          1  control1        True          1.00
1          2     ctrl2        True          1.00
2          3   healthy        True          0.96
3          4       HCC       False          0.08

Tutorial Notebooks

  1. Basic Usage Examples

  2. Custom Curation with GEO Datasets from Polly