Skip to content

SQL Query Syntax

Complete Syntax:

The complete syntax for searching and aggregating data is as follows:

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ OFFSET count [ ROW | ROWS ] ]
[ LIMIT [ count | ALL ] ]

Querying the dataset level metadata:

 query = "SELECT [column_name] FROM [files] WHERE [column_name]='[value]'"
 query = "SELECT [column_name] FROM [repo_name].datasets WHERE [column_name]='[value]'"
 query = "SELECT [column_name] FROM [repo_id].datasets WHERE [column_name]='[value]'"

Querying the sample level metadata:

For all samples except Single Cell
query = "SELECT [column_name] FROM [gct_metadata] WHERE [column_name]='[value]'"
query = "SELECT [column_name] FROM [repo_name].samples WHERE [column_name]='[value]'"
query = "SELECT [column_name] FROM [repo_id].samples WHERE [column_name]='[value]'"
For samples in Single Cell
query = "SELECT [column_name] FROM [h5ad_metadata] WHERE [column_name]='[value]'"

Querying the feature level metadata:

For all features except Single Cell
query = "SELECT [column_name] FROM [gct_data] WHERE [column_name]='[value]'"
query = "SELECT [column_name] FROM [repo_name].features WHERE [column_name]='[value]'"
query = "SELECT [column_name] FROM [repo_id].features WHERE [column_name]='[value]'"
For features in Single Cell
query = "SELECT [column_name] FROM [h5ad_data] WHERE [column_name]='[value]'"

Query specific to source and datatype in an OmixAtlas

If the schema of the repository has multiple source and data types then querying for the specific source and/or datatype is enabled in this release.

from polly import Omixatlas
omixatlas = Omixatlas()
query = """SELECT * FROM repo_name.source_name_in_schema.datatype_name_in_schema.datasets WHERE CONTAINS(curated_disease, 'Multiple Myeloma')
results-omixatlas.query_metadata(query)

results

For clearer understanding of this feature, let’s look at a case:-

In the schema shown below for repo_id: 1659450268526, we can see there are 3 different sources. Out of these sources, the source lincs has 2 datatypes - mutation and transcriptomics

dataset level schema of repo_id 1659450268526

Query examples which will be supported in the above context:-

1 query = """SELECT * FROM 1659450268526.geo.datasets WHERE CONTAINS(curated_disease, 'Multiple Myeloma') """

2 query = """SELECT * FROM 1659450268526.lincs.datasets WHERE CONTAINS(curated_disease, 'Multiple Myeloma') """

3 query = """SELECT * FROM 1659450268526.lincs.transcriptomics.datasets WHERE CONTAINS(curated_disease, 'Multiple Myeloma') """

The response to these queries will have a dataframe with columns which are specific to the source and datatype as per the schema shown above.

A detailed notebook and technical note describing these queries will be updated in the documentation.

Writing conditions with operators

The following operators can be used to define the conditions in the above mentioned queries:

Operators Functions performed
= Equal to operator which can be used to find matching strings with values in the columns
<> Not equal to operator which can be used to find non-matching strings with values in the columns
> Greater than operator which can be used ONLY for integer based columns
< Less than operator which can be used ONLY for integer based columns
>= Greater than or equal to operator which can be used ONLY for integer based columns
<= Less than or equal to operator which can be used ONLY for integer based columns
IS NULL Check if the field value is NULL.
IS NOT NULL Check if the field value is NOT NULL.
AND All values across the parameters searched for have to be present in a dataset for it to be returned as a match when the AND operator is used.

e.g. “organism = ‘Homo sapiens' AND disease = 'Carcinoma, Hepatocellular’” would only return datasets that belong to homo sapiens and have the disease as hepatocellular carcinoma.
OR Atleast any one value across the parameters searched for have to be present in a dataset for it to be returned as a match when the OR operator is used.

e.g. organism = 'Homo sapiens' OR disease = 'Carcinoma, Hepatocellular' would return datasets that belong to homo sapiens or have the disease as hepatocellular carcinoma or match both criteria.
MATCH QUERY(,'value') It works like a fuzzy search. If you add a string for a parameter with this operator, it would return all possible results matching each word in the string. The search output is returned with a “Score” using which the output is sorted.

e.g. MATCH_QUERY(description,'Transcriptomics profiling') would return all datasets having transcriptomics profiling , Transcriptomics and profiling as possible terms within their description. Each dataset would be scored on the basis of matching of the searched string with the information present within the dataset.
MATCH PHRASE(,'value') This can be used for exact phrase match with the information being searched for.

e.g. MATCH_PHRASE(description,'Transcriptomics profiling') would only return the datasets that have Transcriptomics profiling within their description.
MULTI MATCH('query'='value', 'column_name'='value) This can be used to search for text in multiple fields, use MULTI MATCH('query'='value', 'column_name'='value).

e.g. MULTI MATCH('query'='Stem Cells', 'fields'='tissue','description') would return datasets that have "Stem Cell" in either tissue OR description fields.
GROUP BY The GROUP BY operator groups rows that have the same values into summary rows. The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.
HAVING Use the HAVING clause to aggregate inside each bucket based on aggregation functions (COUNT, AVG, SUM, MIN, and MAX). The HAVING clause filters results from the GROUP BY clause
COUNT(*) This counts each row present in a table/index being queried.

NOTE: The output of this query would return a JSON stating the total number of rows in the table
LIMIT NOTE: The response of any query returns 200 entries by default.
You can extend this by defining the LIMIT of the results you want to query to be able to return.
ORDER BY Can only be used to sort the search results using integer based parameters in the schema. Sorting on the basis of dataset_id, number of samples, _score of the data is available at the dataset-level metadata. ASC or DESC can be used to define whether you want to order the rows in ascending or descending order respectively