Curating OmixAtlas - Single Cell
Single Cell RNASeq Data
1. Dataset Level Metadata
Field | Description | Ontology | GUI Display Name | Polly-Python Display Name |
---|---|---|---|---|
Organism | This field represents the organism from which the samples originated. Organism labels already present in the source metadata are normalized using a normalization model. In case the organism labels are missing, related texts and abstracts are processed and further normalized to get the organism metadata label. | NCBI Taxonomy | Organism | curated_organism |
Tissue | This field represents the tissue(s) from which the samples in the dataset are derived. Tissue labels already present in the source metadata are normalized using a normalization model. In cases where tissue labels are missing, related texts and abstracts are processed and further normalized to get the tissue metadata label. Tissue labels for a dataset consist of all tissue names from which the samples are derived. Tissue labels are annotated for samples extracted from a healthy tissue or a diseased tissue. Key specifications for tissue metadata annotations are as follows:
|
Brenda Tissue Ontology | Tissue | curated_tissue |
Drug | This field represents the drug(s) that have been used in the treatment of the samples or relate to the experiment in some other way. Drug labels already present in the source metadata are normalized using a normalization model. In cases where drug labels are missing, related texts and abstracts are processed and further normalized to get the drug metadata label. Drug labels are annotated for the following types of sample treatments:
Drug labels are not annotated for the following types of sample treatments:
Note: Any mention of the drug in the text is included as a drug label irrespective of whether it is being used in the experiment or not. |
PubChem | Drug | curated_drug |
Disease | This field represents the disease(s) being studied in the experiment. Disease labels already present in the source metadata are normalized using a normalization model. In case the disease labels are missing, related texts and abstracts are processed and further normalized to get the disease metadata label. Disease labels are annotated when the samples have been collected from diseased tissue or organism. Examples of such cases are as follows:
Note: In studies where the cell lines are extracted from a healthy tissue/organism and then conditioned to induce disease, the disease label for such a dataset will be "normal", since the sample is not extracted from any diseased tissue or organism. Key specifications for disease metadata annotations are:
|
MeSH | Disease | curated_disease |
Cell type | This field represents the cell type of the samples within the study. Cell type labels are annotated in cases where the authors have cultured a particular cell type, either extracted from tissues or developmental organs or generated in the lab, and then used it in further experiments. Key specifications for cell type metadata annotations are as follows:
|
Cell Ontology | Cell Type | curated_cell_type |
Cell line | This field represents the cell line from which the samples were extracted, that is, the population of modified cells used for the study. Cell line labels already present in the source metadata are normalized using synonyms present in the cell line ontology we use. Cell line labels are annotated in cases where the authors have cultured a particular cell line, or bought it from organizations such as ATCC, and then used it in further experiments, e.g. MDA-MB-231, HEK-293. Key specifications for cell line metadata annotations are as follows:
|
The Cellosaurus | Cell Line | curated_cell_line |
Other metadata fields | ||||
Field | Description | GUI Display Name | Polly-Python Display Name | |
Abstract | This field provides the abstract of the publication associated with the dataset. | NA | abstract | |
Year | This field provides the year in which the dataset or study is published. | Year | year | |
Gene | This field provides the gene(s) studied in the dataset. | Gene | curated_gene | |
Single cell chemistry | This field represents the sequencing method/platform used for sequencing the single cell genome. | Single cell chemistry | curated_single_cell_chemistry | |
Sampling technique | This field represents the method/procedure used for collecting samples. | Sampling technique | curated_sampling_technique | |
Sampling storage technique | This field represents the method/technique used for preservation or storage of sample for analysis. | Sampling storage technique | curated_storage_technique | |
Summary | This field provides a detailed summary of the publication (can be the abstract) or a summary of the experiment. | Summary (Available for datasets from GEO only) | summary | |
Overall design | This field provides information on the overall design of the experiment as given by the author. | Overall Design (Available for datasets from GEO only) | overall_design | |
Publication | This field provides the link to the publication associated with the dataset. If the associated publication information is not available, then this field provides the link to the data source providing more information on the dataset. | NA | publication | |
Source | This field provides the name of the source repository from where the dataset is fetched. | Source | dataset_source | |
Description | This field provides a brief description of the experiment or the study. | | description | |
Data Type | This field provides the type of biomolecular data represented/studied in the dataset. | NA | data_type | |
Dataset ID | This field provides the unique id for the dataset/study to represent a group of samples. | Dataset ID | dataset_id | |
Number of Cells | This field represents the number of cells/observations in the dataset. | Number of Cells | total_num_cells | |
Samples | This field represents the total number of samples in a dataset. | Samples | total_num_samples |
2. Sample Level Metadata
Field | Description | Ontology | GUI Display Name | Polly-Python Display Name |
---|---|---|---|---|
Ontology-driven Fields | ||||
Tissue | This field represents the tissue(s) from which the samples originated. Tissue labels already present in the source metadata are normalized using a normalization model. In cases where tissue labels are missing, related texts and abstracts are processed and further normalized to get the tissue metadata label. Tissue labels are annotated for samples extracted from healthy or diseased tissue. All labels are harmonized with Brenda Tissue Ontology. |
Brenda Tissue Ontology | Tissue | curated_tissue |
Disease | At the sample level, this field represents the disease associated with a particular sample. Disease labels already present in the source metadata are normalized using a normalization model. In case the disease labels are missing, related texts and abstracts are processed and further normalized to get the disease metadata label. Disease labels are annotated for a sample when the samples have been collected from diseased tissue or organism. Examples of such cases are as follows:
At the sample level, disease labels are annotated for the following sample type:
|
MeSH | Disease | curated_disease |
Drug | This field represents the drugs that have been used in the treatment of a sample. Drug labels already present in the source metadata are normalized using a normalization model. In cases where drug labels are missing, related texts and abstracts are processed and further normalized to get the drug metadata label. Drug labels are annotated for the following types of sample treatments:
Drug labels are not annotated for the following types of sample treatments:
Note: Any mention of the drug in the text is included as a drug label irrespective of whether it is being used in the experiment or not. |
PubChem | Drug | curated_drug |
Cell line | This field represents the cell line from which the sample was derived. Cell line labels already present in the source metadata are normalized using synonyms present in the cell line ontology we use. The cell line field is curated for a sample if the authors have cultured a particular cell line, or bought it from organizations such as ATCC, and then used it in further experiments. The names of the cell lines are harmonized with the Cellosaurus ontology. | The Cellosaurus | Cell Line | curated_cell_line |
Cell Type | This field represents the cell type of the sample. Cell type labels are annotated where the authors have cultured a particular cell type, either extracted from tissues or developmental organs or generated in the lab, and then used it in further experiments. The cell type field provides the closest cell type name as per the Cell Ontology. This cell type label can be either derived from the source or assigned by manual cell type annotation. | Cell Ontology | Cell Type | curated_cell_type |
Other Metadata Fields | ||||
Field | Description | GUI Display Name | Polly-Python Display Name | |
Gene | Gene of interest in the sample | Gene | curated_gene | |
Genetic Modification | This field represents the kind of genetic modification done on the sample. | Genetic Modification | curated_genetic_modification_type | |
Modified Genes | This field represents the gene(s) modified in the sample under study. | NA | curated_gene_modified | |
Donor Type | This field represents the type/clinical condition of the donor | Donor Type | curated_donor_type | |
Donor Sample Type | This field represents the location/area from where the tumor samples are collected | Donor Sample Type | curated_donor_sample_type | |
Gender | This field represents the gender of the organism from which the sample was derived. | Gender | curated_gender | |
Minimum age | This field provides the lower limit of the age range of the organism from which the samples have been obtained for the study. | Minimum age | curated_min_age | |
Maximum age | This field provides the upper limit of the age range of the organism from which the samples have been obtained for the study. | Maximum age | curated_max_age | |
Age unit | This field provides the age unit of the organism from which samples have been obtained. It is years for samples from humans and weeks for samples from mice. | Age unit | curated_age_unit | |
Sampling site | This field provides the location/area from where the tumor samples are collected. | Sampling site | curated_sampling_site | |
Treatment | Name of the treatment given to the samples i.e. name of the chemical/drug/therapy | Treatment Name | curated_treatment_name | |
Treatment type | The type of treatment given to the sample | Treatment Type | curated_treatment_type | |
Response to Treatment | This field indicates the type/extent of response to the treatment. | Treatment Response | curated_treatment_response | |
Author Cell Type | This field represents the author cell type as mentioned in the publication associated with the dataset. | Author cell type | curated_raw_cell_type | |
Marker present | This field represents the gene name/names that are differentially expressed in a cluster based on which the cell type of the cluster is annotated. | Marker Present | curated_marker_present | |
Marker absent | This field represents the gene name/names that are absent in a cluster based on which the cell type of the cluster is annotated. | Marker Absent | curated_marker_absent | |
Cell ontology ID | This field represents the unique ID for the cell type according to the Cell Ontology. | NA | curated_cell_ontology_id | |
Clusters | This field provides the cluster number to which each cell belongs after subjecting the dataset to the clustering process (Using Leiden or other algorithms). | Cell Type Cluster | clusters | |
UMI Counts | This field represents the Unique Molecular Identifier count per cell. It represents an absolute number of observed transcripts. The number should be higher than 500 in a cell. | UMI Counts | umi_counts | |
Sample ID | This field represents the unique ID of the sample. | Sample ID | sample_id | |
Cell ID | This field represents the unique ID associated with every cell | NA | cell_id | |
Gene Counts | This field represents the number of genes detected per cell (defined as genes with cpm > 1); cpm=counts per million | Gene Counts | gene_counts | |
Mitochondrial count | This field represents the percentage of mitochondrial counts in total counts for a cell. | NA | percent_mito | |
Title | This field represents the title of the sample, representing the type/genotype/origin/experimental condition of the sample | NA | title |
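The per-cell quality fields in the table above (umi_counts, gene_counts, percent_mito) can be illustrated with a minimal sketch. It assumes that a cell is given as a plain dictionary of raw counts per gene and that mitochondrial genes carry the human `MT-` prefix. The function name `cell_qc`, the input layout, and the `passes_umi_filter` key are hypothetical and are not part of Polly.

```python
def cell_qc(counts, mito_prefix="MT-", min_umi=500):
    """Compute per-cell QC values from raw gene counts.

    counts: dict mapping gene symbol -> raw UMI count for one cell
            (hypothetical input layout, for illustration only).
    """
    umi_counts = sum(counts.values())
    if umi_counts == 0:
        return {"umi_counts": 0, "gene_counts": 0,
                "percent_mito": 0.0, "passes_umi_filter": False}

    # A gene counts as detected when its counts per million (cpm) exceed 1.
    gene_counts = sum(1 for c in counts.values() if c / umi_counts * 1e6 > 1)

    # Share of the cell's counts that come from mitochondrial genes.
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))

    return {
        "umi_counts": umi_counts,
        "gene_counts": gene_counts,
        "percent_mito": 100.0 * mito / umi_counts,
        # The table states that the UMI count should be higher than 500.
        "passes_umi_filter": umi_counts > min_umi,
    }


example_cell = {"CD3E": 300, "MS4A1": 0, "MT-CO1": 60, "ACTB": 240}
qc = cell_qc(example_cell)
```

For a cell with fewer than one million total counts, every gene with at least one count has cpm > 1, so in this sketch gene_counts equals the number of genes with a nonzero count.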
3. Feature Level Metadata
Field | Description | Polly-Python Display Name |
---|---|---|
Feature ID | This field represents the ID of the feature (gene, metabolite, protein etc) being measured. | feature_id |
Highly variable gene | This field indicates whether the gene is highly variable: True for highly variable genes, False otherwise. | highly_variable |
Number of cells | This field represents the number of cells in which the gene is detected. | n_cells |
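As a rough illustration of the n_cells field, the sketch below counts, for each gene, the cells in which it has a nonzero count. It assumes that a cell "contains" a gene when the raw count is greater than zero; the function name `cells_per_gene` and the list-of-rows matrix layout are hypothetical.

```python
def cells_per_gene(matrix, genes):
    """Count the cells with a nonzero count for each gene.

    matrix: list of rows, one row of raw counts per cell,
            with columns in the same order as `genes`.
    """
    return {
        gene: sum(1 for row in matrix if row[j] > 0)
        for j, gene in enumerate(genes)
    }


genes = ["CD3E", "MS4A1", "ACTB"]
matrix = [
    [3, 0, 5],  # cell 1
    [0, 0, 2],  # cell 2
    [1, 4, 0],  # cell 3
]
n_cells = cells_per_gene(matrix, genes)
```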
Cell Type Annotation for single cell datasets: Manual Curation Process
Cell-type labels are assigned at the cell cluster level based on expression signatures, using ontologies or controlled vocabularies. For datasets where cell-type annotations are not available from the source (mainly GEO datasets), we manually curate the cell-type information based on the differential marker expression for clusters. Where cell type annotations are already available at the source, datasets are not manually re-curated.
I) Manual identification of the cell types and markers from Publications -
Internal curators determine whether a particular dataset can be curated for cell type by going through the publications associated with the dataset. In publications, the information on cell types and the corresponding markers is present in the figures (UMAP, t-SNE plots), the text or the supplementary files. If this information is not present, the dataset is marked as 'Not Curatable'. The following types of studies fall under the category of 'Not Curatable' datasets.
- Single Cell Type Study - The whole study was done on only one cell type.
- Lineages - The study included the lineage of one cell type. Example - T helper cells, T memory cells etc.
- Cell Cycle Studies - In this study the differential markers were studied for the G1, S, G2 and M phases of the cell cycle for a particular cell type.
- Methods - Different analysis methods were studied
- Publications having < 2000 cells - These publications used methods which could not be reproduced. Therefore, these were not curated.
- Publication not available
- Marker information not available in the publication
- Marker Cell Type Info Absent
- Marker Info Absent
- Time Point
- Cell lines
- Transitional Cells
- Cell Type Info Absent
- Embryonic Development
- Organoids
- Others
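A few of the categories above are objective enough to be checked mechanically. The sketch below is only an illustration of such a pre-screen, not the curators' actual procedure: the function `screen_dataset` and its arguments are hypothetical, and the remaining categories (lineages, cell cycle studies, organoids and so on) still require reading the publication.

```python
def screen_dataset(has_publication, has_marker_info, n_cells, n_cell_types,
                   min_cells=2000):
    """Return (status, reasons) using a few objective screening rules."""
    reasons = []
    if not has_publication:
        reasons.append("Publication not available")
    elif not has_marker_info:
        reasons.append("Marker information not available in the publication")
    if n_cells < min_cells:
        reasons.append(f"Fewer than {min_cells} cells")
    if n_cell_types == 1:
        reasons.append("Single Cell Type Study")
    status = "Not Curatable" if reasons else "Curatable"
    return status, reasons


small_study = screen_dataset(True, True, n_cells=1500, n_cell_types=5)
large_study = screen_dataset(True, True, n_cells=12000, n_cell_types=8)
```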
II) Metadata addition- Annotation of clusters
The process of cell-type cluster annotation for curatable datasets is based on the general scRNA-seq workflow using the Scanpy library, with steps as shown in the figure below:
UMAP/t-SNE plots are generated as a result of single-cell raw count processing. Cell type cluster annotation is done by visualizing the clusters on the UMAP/t-SNE plots.
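A general Scanpy workflow of this kind might look like the sketch below. It is an assumption-laden outline, not the pipeline used on Polly: the function name `cluster_cells`, the parameter values, and the choice of Wilcoxon ranking are illustrative, and the exact steps in the figure are not reproduced here. Leiden clustering in Scanpy also needs the `leidenalg` package.

```python
def cluster_cells(adata, min_umi=500, n_top_genes=2000, resolution=1.0):
    """Cluster raw single-cell counts and embed them with UMAP.

    adata: an AnnData object holding raw counts (cells x genes).
    All parameter values are illustrative defaults, not Polly settings.
    """
    # Imported here so that this sketch can be loaded without Scanpy installed.
    import scanpy as sc

    # Quality control: keep cells with more than `min_umi` UMI counts
    # and genes detected in at least 3 cells.
    sc.pp.filter_cells(adata, min_counts=min_umi + 1)
    sc.pp.filter_genes(adata, min_cells=3)

    # Normalize each cell to the same total count, then log-transform.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Flag highly variable genes, reduce dimensions, build the neighbor graph.
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)

    # Leiden clustering stored under the "clusters" key, then UMAP embedding.
    sc.tl.leiden(adata, resolution=resolution, key_added="clusters")
    sc.tl.umap(adata)

    # Rank genes per cluster to obtain candidate marker genes.
    sc.tl.rank_genes_groups(adata, groupby="clusters", method="wilcoxon")
    return adata
```

After running it, `sc.pl.umap(adata, color="clusters")` draws the clusters on the UMAP so they can be compared with the figures in the publication.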
1. Cluster annotation with raw cell type (cell type terminology used in publication):
Based on the marker expression value for each cluster, a cell type is assigned to the cluster. This annotation is added as a field named curated_raw_cell_type. The raw cell type cluster annotation is then compared with the cell type annotation from the publication, checking points such as:
- All or most cell types are annotated
- UMAP is structurally similar to the one in the publication
- Relative proportions of cell types are matching
- Relative positions of cell types are matching
- Ontological terms and marker information are added for the cell type
2. Ontological terms and marker information:
- Cell type ontology + ontology ID: Cell-type annotation corresponding to Cell Ontology. This is given as the field named: curated_cell_type
- Marker information: Gene name/names that are differentially expressed in the cluster
- curated_marker_present: Gene name/names that are differentially expressed in the cluster
- curated_marker_absent: Gene name/names that are absent in the cluster
NOTE : For manual cell type curation, datasets should be available on Polly.
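The marker-based annotation of a cluster described above can be sketched as follows. This is a simplified illustration, not the curators' procedure: it assumes a marker table taken from the publication and uses a fixed threshold on mean cluster expression in place of a differential expression test. The function `annotate_cluster`, the threshold, and the input layouts are hypothetical.

```python
def annotate_cluster(mean_expr, marker_table, threshold=1.0):
    """Pick the cell type whose marker pattern best matches one cluster.

    mean_expr:    dict gene -> mean expression of that gene in the cluster.
    marker_table: dict cell type -> {"present": [...], "absent": [...]}.
    """
    best = None
    for cell_type, markers in marker_table.items():
        expected_present = markers.get("present", [])
        expected_absent = markers.get("absent", [])
        # Markers that behave as expected in this cluster.
        present = [g for g in expected_present
                   if mean_expr.get(g, 0.0) >= threshold]
        absent = [g for g in expected_absent
                  if mean_expr.get(g, 0.0) < threshold]
        total = len(expected_present) + len(expected_absent)
        score = (len(present) + len(absent)) / total if total else 0.0
        if best is None or score > best[0]:
            best = (score, cell_type, present, absent)

    if best is None or best[0] == 0.0:
        # No marker pattern matched; leave the cluster unannotated.
        return {"curated_raw_cell_type": None,
                "curated_marker_present": [],
                "curated_marker_absent": []}
    return {"curated_raw_cell_type": best[1],
            "curated_marker_present": best[2],
            "curated_marker_absent": best[3]}


markers = {
    "T cell": {"present": ["CD3E", "CD3D"], "absent": ["MS4A1"]},
    "B cell": {"present": ["MS4A1", "CD79A"], "absent": ["CD3E"]},
}
cluster_0 = {"CD3E": 2.5, "CD3D": 1.8, "MS4A1": 0.1, "CD79A": 0.0}
annotation = annotate_cluster(cluster_0, markers)
```

The returned keys mirror the field names used in this document; mapping the raw label to a Cell Ontology term and ID (curated_cell_type, curated_cell_ontology_id) is a separate step.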