Introduction

With Polly you can access public databases which have been curated and stored in the form of data lakes or make data lakes with your own data. These data lakes can be explored and analyzed either through Polly Notebooks or the several Data Lake Applications available.

Polly Discover consists of the following major components:

  • Data Lake Curation

    Data lakes are reservoirs of information that contain multi-omics data, annotations from publicly available databases, publications, etc. Reservoirs are further segregated into two parts: public data repositories which are curated by Elucidata using public data sources, and private data repositories, where you can add your proprietary data.

  • Data Lake Exploration

    To explore a data lake, Polly provides tools such as applications and Polly Notebooks. These tools enable you to find relevant data by searching for keywords associated with the file name, file metadata, or the contents of a file.

  • On-the-fly Analysis

    Once you have narrowed down relevant omics datasets, you can analyze the dataset(s) on the fly using various statistical analyses while displaying intuitive visualizations.

Available public data repositories

Public data repositories on Polly consist of processed and curated datasets from various sources. They can be readily used for searching for new datasets or running an analysis on one or more datasets.

  • AML: Microarray and RNA Sequencing datasets for Acute Myeloid Leukemia.

  • GBM: Microarray and RNA Sequencing datasets for Glioblastoma Multiforme.

  • IBD: Microarray and RNA Sequencing datasets for Inflammatory Bowel Disease.

  • GEO: Microarray and RNA Sequencing datasets from Gene Expression Omnibus.

  • Single cell Atlas: Single cell RNA Sequencing datasets from Gene Expression Omnibus.

  • GTEX: Normal tissue RNA Sequencing datasets from the Genotype-Tissue Expression project.

  • TCGA: Tumor RNA Sequencing datasets from The Cancer Genome Atlas.

  • COVID-19: Transcriptional datasets for SARS viruses, viral infections, and therapeutics for novel coronavirus.

Additionally, the public data repositories include publicly available databases that have been curated for annotations. The following databases are currently part of these repositories:

  • HMDB: Pathway information from Human Metabolome Database.

  • KEGG: Pathway information from Kyoto Encyclopedia of Genes and Genomes.

  • Reactome: Pathway information from Reactome.

  • GWAS: Phenotypic data from Genome-Wide Association Studies Catalogue.

Dataset Filtering

Dataset Filtering Dashboard

For smaller data lakes, we provide a dataset filtering interface. It allows you to explore and filter the relevant datasets present in the data lake.

Filters and Columns

Filters and columns

The filtering interface provides 4 parameters that you can use to filter the datasets within the selected repository. The parameters are:

  • Disease: This option gives you an overview of all the disease types covered by the datasets in the repository. You can choose to work on any of the listed diseases or on the normal datasets. To make a selection, mark the checkbox beside the disease of interest.

  • Organism: This option lists the organisms associated with the datasets of the data lake. Mark a selection to keep only the datasets of the desired organism.

  • Tissue: This section shows the distribution of tissues across the repository. Click on Load More to look at the entire list, or use the search option to find the tissue type you are looking for. Select the required tissue type to filter the datasets specific to it.

  • Data Type: This option lists the data types available in the repository. Choose the data type for your study by selecting the checkbox beside it.

When the selections are marked, you can find the filtered datasets on the right panel.

Note:

  • You can select multiple entries at the same time.

  • To clear your filters at any point in time, click on the clear option present beside all the parameters.

Dataset Selection

Dataset selection

The right panel displays the datasets present in the repository. It shows the following fields for each dataset:

  • Dataset ID: Unique identifier associated with the dataset.

  • Description: It encompasses the title of the paper.

  • Organism: Organism associated with the dataset

  • Datatype: Datatype of the dataset, e.g. Transcriptomics, Metabolomics, Single Cell, etc.

  • Disease: Disease studied with the selected dataset

  • Tissue: Type of tissue the dataset is from

  • Source: Provides the link to the publication

Once you have narrowed down relevant omics datasets, you can mark a selection on the checkbox present beside the desired dataset.

On-the-fly Analysis

You can analyze the selected dataset on the fly using various applications on Polly. They enable you to perform various statistical analyses, display intuitive visualizations, and create a hitlist while analyzing multiple datasets simultaneously.

To choose a tool for your analysis, select a dataset from the list and click on the Select the Application option at the bottom of the screen. Then choose the analysis platform of interest and click on Open.

Selecting an application

Select the workspace where you would like to store the analysis and click on Launch to open the selected application/notebook.

Launching workspace

Data Lake Applications

Data Lake Applications are built on top of data lakes to query and explore relevant datasets. The following data lake applications are a part of the current platform:

  • Polly Discover Application:

    It is a platform for visualization, analytics, and exploration of bulk transcriptomics data curated from GEO. It offers users an interactive dashboard for analysis and visualization of transcriptomics data. Currently, the platform handles gene expression microarray and RNA-seq data and supports three species: human, mouse, and rat.

  • Single Cell Visualization:

    It is a comprehensive visualization platform for single-cell transcriptomics data. The app is helpful in visualizing cells and the association of different genes with the metadata.

  • Cellxgene:

    It is an interactive data explorer for single-cell transcriptomics datasets.

  • DepMap CCLE:

    Exploration application for cell line dependency and gene expression data from DepMap and CCLE.

  • GTEx Application

    GTEx Application is a platform for visualization, analytics, and exploration of transcriptomics data from GTEx.

  • Dual Mode Data Visualization (Metabolomics App):

    This app allows you to perform downstream analysis on untargeted unlabeled metabolomics data along with insightful visualizations. It provides a variety of normalization methods, scaling options, and data visualization functionalities, thereby allowing an efficient analysis of the data to get actionable insights.

  • Discover Notebook:

    This app allows you to perform downstream analysis on untargeted unlabeled metabolomics data along with insightful visualizations. It provides a variety of normalization methods, scaling options, and data visualization functionalities, thereby allowing an efficient analysis of the data to get actionable insights.

Polly Notebooks                     Docker                    Machine Configuration
Discover Notebook Single-cell       Single Cell Downstream    Memory-optimized 32GB, Polly 2x-large
Discover Notebook Transcriptomics   RNA-Seq Downstream        RNA-Seq Downstream
Discover Notebook Proteomics        RNA-Seq Downstream        Polly medium 4GB
Discover Notebook Metabolomics      Metabolomics              Polly medium 4GB

Polly Discover Application

Opening the app

Upon opening the Discover portal on Polly, choose a data repository that you would like to explore. The page should look something like this.

Polly Discover

After selecting a repository, you’ll be able to view a filtering interface which provides parameters that you can use to filter the datasets within the selected repository. Once you select a dataset, you can access the integrated tools attached to the repository. For transcriptomics data, you can use the Discover application for further analysis.

Discover App Icon

The app shows an overview page, which contains a brief description of the application, its scope, and its usage, as shown below.

App Description

Exploring the data lake

Search for relevant datasets by navigating to the Dataset Search tab in the navigation pane to the left. Keyword search can be applied to the following fields:

  • Data Set ID

  • Data Set Source

  • Description

  • Diseases

  • Is Public

  • Organisms

  • Platform

  • Tissue

  • Year

Search options

The search will return all datasets that are associated with your search. The result should look like the image below.

Search results

By default, the table shown above displays only a few columns. To view the other columns, select the fields from Available Columns and click on the Show! button. The Download Selected Dataset button downloads the selected dataset to your local system. The Export results to CSV button downloads the search result table as a .csv file. Once you have narrowed down the relevant datasets, you can analyze one or more datasets on the fly within the app.

Analyzing a single dataset

You can analyze a single dataset by selecting the checkbox to the left of the entry in the table. Once you’ve selected the checkbox, click on the Analyze Data button below the table description.

Select a data set

After clicking the Analyze Data button, the app will read the selected dataset and take you to the Dataset Analysis tab. Here, you can perform the following analyses:

  • Principal Component Analysis (PCA)

    Principal Component Analysis (PCA) is used to see the overall differences between cohorts of interest. A strong separation along the X axis (PC1) indicates strong biological differences between the cohorts. You can also increase the number of genes considered in the PCA plot; as the number of genes increases, the contribution of the PC1 component is expected to decrease.

  • Boxplot Visualization

    A boxplot is useful for understanding the distribution of expression within a dataset. Downstream analyses such as differential expression or pathway analysis use tests that assume a normal distribution, so the distribution of the data has to be normal before running them.

  • Plots

    A box and whisker plot (a boxplot) is a graph that presents information from a five-number summary namely lower extreme, lower quartile, median, upper quartile, and upper extreme. In this plot, the median is marked by a vertical line inside the box; the ends of the box are upper and lower quartiles; the two lines outside the box extend to the highest and lowest observations. It is useful for knowing the nature of distribution (i.e., skewed) and potential unusual observations.

  • Heatmap

    A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. This heatmap shows the cohort wise mean expression of a particular gene. The samples are aggregated on the basis of a given cohort and the mean is calculated based on the cohort information.

  • Differential Expression

    Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, we use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.

  • X2K Analysis

    X2K infers upstream regulatory networks from signatures of differentially expressed genes. By combining transcription factor enrichment analysis, protein-protein interaction network expansion, and kinase enrichment analysis, X2K produces inferred networks of transcription factors, proteins, and kinases predicted to regulate the expression of the input gene list.

  • Gene Ontology Plot

    Gene Ontology Annotation Plot is a simple but useful tool for visualizing, comparing and plotting GO (Gene Ontology) annotation results.

  • Enrichr

    Enrichr includes new gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results using the JavaScript library Data Driven Documents (D3).

  • GSEA

    Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

  • Specific Pathway Visualization using Pathview

    Pathview maps, integrates and renders a wide variety of biological data on relevant pathway graphs.
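Two of the analyses listed above reduce to simple summary statistics. The following is a minimal sketch, not the app's implementation: it computes the five-number summary that a boxplot draws and the cohort-wise mean that the heatmap displays. The function names, the sample values, and the choice of the "inclusive" quartile convention are assumptions made for this illustration.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # Quartile conventions differ between tools; "inclusive" is one common choice.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


def cohort_means(expression, cohorts):
    """Average one gene's expression over the samples of each cohort."""
    grouped = defaultdict(list)
    for sample_id, value in expression.items():
        grouped[cohorts[sample_id]].append(value)
    return {cohort: statistics.mean(vals) for cohort, vals in grouped.items()}


# Hypothetical expression values for one gene in five samples.
expression = {"s1": 2.0, "s2": 4.0, "s3": 6.0, "s4": 8.0, "s5": 10.0}
cohorts = {"s1": "control", "s2": "control",
           "s3": "treated", "s4": "treated", "s5": "treated"}

print(five_number_summary(expression.values()))
print(cohort_means(expression, cohorts))
```

A strongly asymmetric five-number summary (for example, a median much closer to the lower quartile than to the upper one) is a quick hint that the distribution is skewed.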

Analyses possible

GTEx

Opening the app

The GTEx repository can be accessed using the GTEx card on Discover.

Polly GTEx

After selecting the repository, you’ll be able to see a dashboard with different tissues. Select a dataset and use GTEx application to explore the dataset.

Repository Dashboard

The app will open and you should see the overview page which contains a brief overview of the application, scope and caveats as shown below.

App Description

Analyzing a dataset

As the application starts, it will load the requested dataset. Once it is loaded, it can be explored.

  • Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is used to see the overall differences between cohorts of interest. A strong separation along the X axis (PC1) indicates strong biological differences between the cohorts. The plot also serves as a quality control check across samples.

Metadata Table

The tab provides a metadata table to check different characteristics of the samples. Furthermore, various parameters of the PCA can be adjusted.

PCA Parameters

Both a publication-quality version and an interactive version of the PCA plot are available to explore.

PCA Plot

  • Bar Plot

Barplot provides exploration of different genes either standalone or as a part of different pathways. The distribution can be grouped by different metadata cohorts such as tissue type or tissue-subtype.

Using the Gene Expression toggle, different genes can be queried for different samples.

Gene Expression

Upon selecting the Pathway Visualization option, pathway-specific genes can be selected. More than one pathway can be selected at a time.

Pathway Visualization Pathway Visualization Plot

  • GTEx Expression Map

GTEx Expression Map can be used to explore the distribution of selected genes in different GTEx tissues. After you have explored the selected tissue and found a list of genes of interest, it presents visualization methods such as GTEx Expression Violin and GTEx Expression Heatmap to study the distribution of genes across different tissues.

A single gene can be selected to draw a violin plot of its expression across different tissues.

GTEX Expression Violin

Multiple genes can be used to make a heatmap for different tissues.

GTEx Expression Heatmap

Single Cell Visualization

Opening the app

Upon opening the Discover application on Polly, choose a relevant data repository which hosts single cell data.

Polly Discover

After selecting a repository, you’ll be able to view a filtering interface which provides parameters that you can use to filter the datasets within the selected repository. Once you select a dataset you can access the integrated tools attached with the repository. You can use Single Cell Visualization application for further analysis of single cell data.

single cell App

The app will open and you should see the overview page which contains a brief overview of the application, scope and caveats as shown below.

App Description

Exploring the data lake

Search for relevant datasets by navigating to the Dataset Search tab in the navigation panel to the left. Keyword search can be applied to the following fields:

  • DatasetID

  • Platform

  • Title

  • Description

  • Disease

  • Pubmed ID

  • Organism

  • Cell Types

  • Tissue

Search options

The search will return all datasets that are associated with your search. The result should look like the image below.

Search results

By default, the table shown above displays only a few columns. To view the other columns, select the fields from Available Columns and click on the Show button. Once you have narrowed down the relevant datasets, you can analyze one dataset on the fly within the app.

Analyzing a dataset

You can analyze a single dataset by selecting the checkbox to the left of the entry in the table. Once you’ve selected the checkbox, click on the Load button below the table description.

Select a data set

After clicking the Load button, the app will read the selected dataset. Once the loading finishes, you can check the further tabs to explore the dataset:

  • Dataset Summary

This tab provides a quick summary of the selected dataset. It shows the number of cell types/clusters, genes, and cells, the available metadata, and quality control metrics for the selected dataset. The value boxes at the top show the number of genes, cells, and cell types/clusters.

Value Boxes

Below the value boxes is the metadata summary table, which contains the different metadata fields and their categories. The table is searchable, and clicking on a particular metadata field shows its distribution. For instance, to see the distribution of cell types in a study, search for the keyword 'cell_type' in the name search box. Clicking on the result opens a table describing the distribution of cell types.

Metadata Table

Quality Control (QC) metrics help in understanding the processing of the dataset. The application lets you examine quality control using QC distribution and QC scatter plots. A QC distribution plot shows the distribution of a single quality control metric within a particular metadata field. For instance, to check the distribution of gene counts across cell types, select 'gene_counts' as the QC metric and 'cell_type' as the cluster.

QC Distribution

A QC scatter plot shows the association between two different quality control metrics. It can be useful for understanding the relationship between gene counts and UMI counts.

QC Scatter

  • Cell Visualization

Cell visualization provides exploration of cells using dimensionality reduction methods. The tab offers dimensionality reduction methods such as tSNE, UMAP, PCA, and others to visualize the distribution of the cells. The Visualize cells panel on the right shows the distribution of the cells in an interactive way.

Visualize cells

Using the Feature selection panel, a metadata field or a gene feature can be selected for plotting.

Feature selection

The Customize visualization panel lets you adjust visualization features such as highlighting of non-zero cells, point size, and the method used, based on personal preferences.

Customize visualization

  • Marker

The Marker tab provides exploration of the distribution of markers. It presents visualization methods such as dot plots and violin plots to study the distribution of genes across different metadata. The Marker selection panel provides options to choose different genes and metadata for plotting the marker distribution.

Marker selection

The Marker dot plot panel is the area for interactively exploring the average expression and distribution of a marker using a dot plot.

Marker Dot plot

The Marker Violin plot panel is the area for exploring the marker distribution using violin plot in an interactive fashion. Using Customize Violin slider, the range of values used for plotting violin can be adjusted. It is useful for observing a section of data such as the non-zero values of the expression.

Marker Violin plot

  • Cell-Type Aggregation

After gaining insights about different markers in a single cell study, it is important to look at the expression of the selected markers across different cell types in different studies. Single Cell Visualization provides pan-dataset exploratory analysis in the form of Cell Type Aggregation. It lets you query the respective repository for the median expression of a selected gene across the top 20 cell types based on expression. This makes it possible to study the expression of a gene in a particular cell type in a given biological context.

cell-type aggregation

Potential use-case example: What is the expression of the SARS-CoV-2 entry-specific host protein 'ACE2' across different cell types in different studies?

Use case

The Cell Type Aggregation tab provides a text field in which you can query a gene. Using our internal discover services, the input gene is queried across all the datasets in the repository in which it is expressed. To check for ACE2 expression, simply enter ACE2 in the search field. After entering the gene, click on Search gene to generate a bar plot showing the median expression and the distribution of gene expression across different studies.

Cell-type aggregation plot
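The aggregation described above can be pictured as a group-by followed by a median and a ranking. The sketch below is only an illustration under stated assumptions: the function name, the record layout, and the made-up values are hypothetical, and ranking cell types by their median is one plausible reading of "top 20 cell types based on expression", not a description of the app's internals.

```python
import statistics
from collections import defaultdict


def top_cell_types_by_median(records, top_n=20):
    """Rank cell types by the median expression of a single gene.

    `records` is an iterable of (cell_type, expression_value) pairs.
    """
    by_cell_type = defaultdict(list)
    for cell_type, value in records:
        by_cell_type[cell_type].append(value)
    medians = {ct: statistics.median(vals) for ct, vals in by_cell_type.items()}
    # Highest median first, then keep only the requested number of cell types.
    ranked = sorted(medians.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up values standing in for one gene's expression in individual cells.
records = [
    ("enterocyte", 3.0), ("enterocyte", 5.0), ("enterocyte", 4.0),
    ("T cell", 0.0), ("T cell", 1.0),
    ("club cell", 2.0), ("club cell", 2.5),
]
print(top_cell_types_by_median(records, top_n=2))
```

The median is less sensitive than the mean to a handful of cells with very high counts, which is one reason it is a common choice for summarizing sparse single-cell expression.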

Access through Polly Notebook interface

The Polly Notebook Dockers on Polly have an internal Python package called ‘discoverpy’ pre-installed, which can be used to search for datasets in the various data repositories.

Structure of a data repository

A data repository is a collection of files of different types. To ensure easy access to all datasets at a granular level, a data repository is organized in the following manner. Under this schema, each repository can be considered a collection of indices which can be used for querying. The discoverpy package can access all indices of a data repository using API endpoints.

Structure of a data repository

Click here for a detailed documentation about Polly Notebooks.

Usage

  • Initialize a discover object

This discover object is used to interact with a dataset repository.

from discoverpy import Discover
discover = Discover() 
discover

Discover object

  • List all available data repositories along with their indices.
discover.get_repositories()

Endpoints for public data

  • Set a repository for fetching the different endpoints.

    Choose a repository from the list of repositories and use its corresponding id to set the discover object to point to that repository.

    • For single cell repositories use mode='single_cell'.

    • For bulk data repositories use mode='bulk' (default)

For geo repository repo_id is 16.

discover.set_repo('16')

For sc_data_lake repository repo_id is 17.

discover.set_repo("17", mode="single_cell")

After you’ve added the indices for a repository, you can view the discover object

discover

View the discover object

Note that the ‘annotation_repo’ index is added automatically for each repository.

Querying at the dataset level

To search for datasets, the ‘_files’ index can be searched using the metadata fields present in it.

  • Get fields present in the index
discover.dataset_repo.get_all_fields()

Fields present in the index

  • To get a sense of what values are present in each field, one can view the top n entries. Some generic fields are present for each file.

    • __bucket__: S3 bucket name

    • __filetype__: Type of file such as pdf, gct etc.

    • __key__: S3 key of the file

    • __location__: Location of the file within data repository

discover.dataset_repo.get_top_n_examples(n = 30)

Values present in each field

  • Search for a dataset by keyword in a particular field. Searching for “mll” in the field “description” here.
dataset_query_df = discover.dataset_repo.query_dataset_by_field("description", "mll")
dataset_query_df

Search for a dataset
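The dataset-level steps above (point the discover object at a repository, then search one field) can be wrapped in a small helper. This is a sketch only: `search_datasets` is a hypothetical name, and the helper uses just the `set_repo` and `query_dataset_by_field` calls shown in this section, with the discover object passed in so that the function itself does not depend on how the object was created.

```python
def search_datasets(discover, repo_id, field, keyword, mode="bulk"):
    """Point `discover` at a repository, then search one metadata field.

    `discover` is expected to be a discoverpy Discover object; `mode` is
    "bulk" (the default) or "single_cell", as described above.
    """
    # The examples in this section pass the repository id as a string.
    discover.set_repo(str(repo_id), mode=mode)
    return discover.dataset_repo.query_dataset_by_field(field, keyword)


# Usage inside a Polly Notebook (requires the pre-installed package):
# from discoverpy import Discover
# mll_df = search_datasets(Discover(), "16", "description", "mll")
```

Keeping the repository id and mode as arguments makes it harder to run a query against whichever repository happened to be set last.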

Querying at the sample level

GCT File Format

The datasets in the public repositories are saved as .gct files. This is a file format in which data can be stored along with the sample metadata. The data values in the actual matrix, along with the features (genes), are indexed in the ‘_gct_data’ index of the repository, and the sample metadata is indexed in the ‘_gct_metadata’ index of the repository.

GCT file structure
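To make the split between the data matrix and the sample metadata concrete, here is a minimal parsing sketch. It assumes the GCT version 1.3 text layout (a "#1.3" line, a line of four counts, a header row, the sample metadata rows, then the data rows); whether every file on Polly follows exactly this layout is an assumption, and the tiny file content and the `parse_gct13` name are made up for illustration.

```python
# A made-up two-gene, three-sample file in the assumed GCT 1.3 layout.
SAMPLE_GCT = "\n".join([
    "#1.3",
    "2\t3\t1\t2",  # data rows, data columns, row-metadata fields, sample-metadata fields
    "id\tgene_symbol\tS1\tS2\tS3",
    "disease\tna\tAML\tAML\tnormal",
    "tissue\tna\tblood\tblood\tblood",
    "g1\tNRAS\t5.1\t4.8\t2.0",
    "g2\tHHEX\t1.2\t1.0\t3.3",
])


def parse_gct13(text):
    """Split GCT 1.3 text into sample ids, sample metadata, and the data matrix."""
    lines = text.strip().split("\n")
    if lines[0].strip() != "#1.3":
        raise ValueError("expected a GCT 1.3 file")
    n_rows, n_cols, n_row_meta, n_col_meta = (int(x) for x in lines[1].split("\t"))
    header = lines[2].split("\t")
    sample_ids = header[1 + n_row_meta:]
    if len(sample_ids) != n_cols:
        raise ValueError("header does not match the declared number of samples")

    # Sample metadata: one line per field, one value per sample.
    sample_metadata = {sid: {} for sid in sample_ids}
    for line in lines[3:3 + n_col_meta]:
        cells = line.split("\t")
        for sid, value in zip(sample_ids, cells[1 + n_row_meta:]):
            sample_metadata[sid][cells[0]] = value

    # Data matrix: one line per feature, numeric values after the row metadata.
    matrix = {}
    for line in lines[3 + n_col_meta:3 + n_col_meta + n_rows]:
        cells = line.split("\t")
        matrix[cells[0]] = [float(v) for v in cells[1 + n_row_meta:]]
    return sample_ids, sample_metadata, matrix


sample_ids, sample_metadata, matrix = parse_gct13(SAMPLE_GCT)
print(sample_ids)
print(sample_metadata["S3"])
print(matrix["g1"])
```

In this picture, the `matrix` part corresponds to what the data index holds and `sample_metadata` to what the metadata index holds.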

H5AD File Format

The single cell datasets in the public repositories are saved as a .h5ad file. This is a file format in which data can be stored along with the sample metadata.

H5AD file structure

  • Get fields present in the index
discover.sample_repo.get_all_fields()

Fields present in the index

  • To get a sense of what values are present in each field, one can view the top n entries.
discover.sample_repo.get_top_n_examples(n =30)

Values present in field

  • Search for samples by keyword in a particular field. Searching for “M1” in the field “fab_classification_ch1” here.
fab_df = discover.sample_repo.query_samples_by_field("fab_classification_ch1", "M1", n = 100)
fab_df

Search for samples using keywords

  • Search for samples by keywords in all fields. This can be used if the field to search for is not known beforehand.
fab_all_fields = discover.sample_repo.query_samples_by_all_fields("M1", n = 100)
fab_all_fields

Search for samples using keywords
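The two sample-level searches above differ only in whether the field is known. A hypothetical convenience wrapper (the name `search_samples` is not part of discoverpy) could choose between them; it relies only on the two calls shown in this section and takes the discover object as an argument.

```python
def search_samples(discover, keyword, field=None, n=100):
    """Search one field when it is known, otherwise search every field."""
    if field is not None:
        return discover.sample_repo.query_samples_by_field(field, keyword, n=n)
    return discover.sample_repo.query_samples_by_all_fields(keyword, n=n)


# Usage inside a Polly Notebook, after discover.set_repo(...):
# fab_df = search_samples(discover, "M1", field="fab_classification_ch1")
# anywhere_df = search_samples(discover, "M1")
```

Searching a single field is the more precise option, since a short keyword such as "M1" may also match unrelated fields when all of them are searched.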

Querying at the feature level

The matrix of a .gct/.h5ad file contains the actual values for the different features (genes/metabolites). The ‘_gct_index’ or ‘_h5ad_index’ index of a repository can be queried for features.

  • Get fields present in the index
discover.feature_repo.get_all_fields()

Fields present in the index

  • To get a sense of what values are present in each field, one can view the top n entries.
discover.feature_repo.get_top_n_examples(n=20)

Values present in field

The ‘__index__’ column contains the feature name

  • Get values for a particular feature across all samples in all datasets of the repository. Getting values for “NRAS” gene here.
nras_df = discover.feature_repo.get_feature_values("NRAS", n = 1000)
nras_df

Values for a feature

  • To get features from all single cell datasets, use the variant get_feature_values_sc. See the following example.
hhex_df = discover.feature_repo.get_feature_values_sc("HHEX", n = 1000)
hhex_df

Values for a feature in single cell datasets

Access annotation repositories

The various gene annotation databases can also be accessed through discoverpy. These can be used to get information about a particular gene or a set of genes.

  • Get all annotation databases
discover.annotation_repo.get_annotation_databases()

Get all annotation databases

  • Get annotations for a list of genes from a particular database. Getting Reactome pathways for the genes.
discover.annotation_repo.get_feature_annotation('reactome', ['ACTA1', 'AHCTF1', 'AKAP13', 'ATP2C1', 'CDK7'])

Get annotations for a list of genes

Advanced queries

You can also perform more complex queries on multiple fields combining them with boolean logic. Some examples are shown here.

  • Get microarray stem cell datasets which did not involve a knockdown experiment
discover.dataset_repo.query_dataset_by_field_combination(and_fields={"platform":"Microarray", "tissue":"stem cells"}, not_fields={"description":"knockdown"}, n = 50)

Advanced query example 1

  • Get samples containing CD34 cells or mononuclear cells, excluding de novo samples
discover.sample_repo.query_samples_by_field_combination(or_fields = {"cell_type_ch1":"CD34","cell_type_ch1":"mononuclear"}, not_fields = {"treatment_protocol_ch1":"de novo"}, n = 300)

Advanced query example 2
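One general Python caveat applies to the second example, independently of discoverpy: a dictionary literal cannot hold the same key twice, so when "cell_type_ch1" appears twice in or_fields, the later value silently replaces the earlier one before the function is ever called. The snippet below demonstrates this with plain Python; how query_samples_by_field_combination expects several alternatives for one field to be passed is not stated in this document.

```python
# The same key written twice in a dict literal: the last value wins.
or_fields = {"cell_type_ch1": "CD34", "cell_type_ch1": "mononuclear"}

print(or_fields)       # {'cell_type_ch1': 'mononuclear'}
print(len(or_fields))  # 1
```

It is therefore worth printing the dictionary before running a combined query to confirm that it contains every condition you intended.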

Downloading datasets

  • You can use the get_file(key, repo_id, file_name) function to download a dataset from a data lake repository. The function has the following three parameters:

    • key: S3 key of the file

    • repo_id: Repository id

    • file_name: Name of the file with a file extension such as gct, h5ad, etc.

discover.get_file('AML_data_lake/data/Microarray/GSE76320/GCT/GSE76320_GPL8321_curated.gct', '1', 'GSE76320_GPL8321_curated.gct')
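In the call above, the file name is simply the last component of the S3 key. A small, hypothetical helper (`download_by_key` is not part of discoverpy) can derive it instead of typing it twice; it uses only the get_file call shown above and the standard library.

```python
import posixpath


def download_by_key(discover, key, repo_id):
    """Download a file, naming the local copy after the last part of its S3 key."""
    # S3 keys always use "/" as the separator, so posixpath is used
    # rather than os.path, whose separator depends on the platform.
    file_name = posixpath.basename(key)
    discover.get_file(key, repo_id, file_name)
    return file_name


# Usage inside a Polly Notebook:
# download_by_key(
#     discover,
#     "AML_data_lake/data/Microarray/GSE76320/GCT/GSE76320_GPL8321_curated.gct",
#     "1",
# )
```

The key itself can be taken from the `__key__` field returned by a dataset-level query.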

Videos