Data Workflow - Bulk RNAseq
Bulk RNASeq Data
All GEO RNA-Seq datasets on Polly are processed using the Kallisto pipeline. The data is processed using the following reference genome, annotation, and complementary DNA (cDNA) sequence data from Ensembl release 107 for each organism. However, approximately 12% of the datasets were processed with Ensembl release 90. These will be reprocessed in the future with Ensembl release 107.
- Homo sapiens Ensembl release 107, 90
    - Genome sequence (fasta)
    - Gene annotation set (GTF)
    - cDNA sequences (fasta)
- Mus musculus Ensembl release 107, 90
    - Genome sequence (fasta)
    - Gene annotation set (GTF)
    - cDNA sequences (fasta)
- Rattus norvegicus Ensembl release 107, 90
    - Genome sequence (fasta)
    - Gene annotation set (GTF)
    - cDNA sequences (fasta)
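
As an illustration of how the three reference files for an organism and release could be located, here is a minimal Python sketch. The URL layout is an assumption based on the public Ensembl FTP site and is not taken from the Polly pipeline itself; the function name `reference_urls` and the assembly name passed in by the caller (`GRCh38` in the example) are likewise illustrative.

```python
# Assumed layout of the public Ensembl FTP site; not confirmed by the pipeline docs.
ENSEMBL_FTP = "https://ftp.ensembl.org/pub"


def reference_urls(species, assembly, release):
    """Build genome, GTF, and cDNA download URLs for one organism and release."""
    directory = species.lower().replace(" ", "_")     # e.g. homo_sapiens
    prefix = species.capitalize().replace(" ", "_")   # e.g. Homo_sapiens
    base = f"{ENSEMBL_FTP}/release-{release}"
    return {
        "genome": f"{base}/fasta/{directory}/dna/{prefix}.{assembly}.dna.toplevel.fa.gz",
        "gtf": f"{base}/gtf/{directory}/{prefix}.{assembly}.{release}.gtf.gz",
        "cdna": f"{base}/fasta/{directory}/cdna/{prefix}.{assembly}.cdna.all.fa.gz",
    }


# The assembly name is supplied by the caller; GRCh38 is used here as an example.
urls = reference_urls("Homo sapiens", "GRCh38", 107)
for kind, url in urls.items():
    print(kind, url)
```

The assembly name can differ between releases for the same organism, so it is kept as an explicit argument rather than derived from the release number.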
Process flow
Details of the processing steps:
- Detecting organisms and fetching relevant genome, annotation, and complementary DNA sequence data from Ensembl.
- Downloading the transcriptome sequencing data (.sra files) from SRA using sratoolkit prefetch, or from an AWS S3 URI if publicly available.
- Validating the downloaded .sra file using vdb-validate.
- Identifying whether the SRA data is single-end (SE) or paired-end (PE) using fastq-dump. Both SE and PE sequencing data are processed; color-space sequence data is excluded.
- Extracting fastq files with parallel-fastq-dump.
- Performing basic quality control checks on the .fastq reads using FastQC (diagnosing basespace / colorspace, quality encoding, and read length).
- Trimming bases with Phred quality below 10 from the 3′ ends and discarding reads shorter than 18 nucleotides using Skewer.
- Detecting adapter sequences at the 3′ end using Minion.
- If the predicted adapter sequence is not present in the genome and exceeds a frequency of 2.5%, the adapter sequence is clipped using Skewer.
- Detecting adapter contamination using Bowtie2 and clipping using Skewer.
- Transcript-level expression counts are generated with Kallisto (command: "kallisto quant") by mapping all reads that pass quality control to the transcriptome (cDNA sequences). Counts are then reported at the gene level as a simple sum of the transcript-level counts. (NOTE: Kallisto pseudo counts are rounded to integer values.)
- For every SRR accession, the generated counts are collected into a single (.gct) file and multiple SRR counts per GSM ID (sample) are aggregated.
- At the feature level, the Ensembl gene IDs are mapped to the respective HGNC symbol, MGI symbol, or RGD symbol. Counts for duplicate genes are dropped using the Mean Average Deviation score.
- Each sample is then annotated with relevant metadata using our custom curation models for fields like disease, tissue, cell line, drug etc.
- If requested, the counts matrix is normalised using DESeq2 VST (Variance Stabilizing Transformation).
- The GCT file containing the raw counts is pushed to the Bulk RNASeq OmixAtlas.
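
A few of the rules in the steps above are simple enough to state as code. The following Python sketch illustrates the adapter clipping rule, the transcript-to-gene summation, and the aggregation of several SRR runs into one GSM sample. It is not the pipeline's actual implementation: the function names are hypothetical, and two details are assumptions, namely that runs belonging to the same sample are combined by summation and that rounding to integers happens after aggregation.

```python
from collections import defaultdict

ADAPTER_FREQUENCY_THRESHOLD = 0.025  # 2.5%, as stated in the adapter step


def should_clip_adapter(frequency, found_in_genome):
    """Clip only if the adapter is absent from the genome and frequent enough."""
    return (not found_in_genome) and frequency > ADAPTER_FREQUENCY_THRESHOLD


def sum_transcripts_to_genes(transcript_counts, transcript_to_gene):
    """Report gene-level counts as a simple sum of transcript-level counts."""
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_counts[transcript_to_gene[transcript_id]] += count
    return dict(gene_counts)


def aggregate_runs_by_sample(run_gene_counts, run_to_sample):
    """Combine gene counts of all runs (SRR) that belong to one sample (GSM).

    Assumption: runs are combined by summation, and the result is rounded
    to integers at the end.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for run_id, gene_counts in run_gene_counts.items():
        sample_id = run_to_sample[run_id]
        for gene_id, count in gene_counts.items():
            totals[sample_id][gene_id] += count
    return {
        sample_id: {gene_id: round(count) for gene_id, count in genes.items()}
        for sample_id, genes in totals.items()
    }


# Toy data with made-up identifiers, for illustration only.
transcript_to_gene = {"ENST_A": "ENSG_1", "ENST_B": "ENSG_1", "ENST_C": "ENSG_2"}
run_transcript_counts = {
    "SRR_1": {"ENST_A": 10.4, "ENST_B": 5.3, "ENST_C": 2.0},
    "SRR_2": {"ENST_A": 1.2, "ENST_C": 3.1},
}
run_to_sample = {"SRR_1": "GSM_1", "SRR_2": "GSM_1"}

run_gene_counts = {
    run_id: sum_transcripts_to_genes(counts, transcript_to_gene)
    for run_id, counts in run_transcript_counts.items()
}
sample_counts = aggregate_runs_by_sample(run_gene_counts, run_to_sample)
print(sample_counts)
```

In this toy example both runs map to the same sample, so the resulting matrix has a single column for `GSM_1`.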
Tools Used for the processing
Tool | Task | Usage |
---|---|---|
GEOparse | Query GEO and fetch sample IDs (GSMs). | |
pySRAdb | Query SRA and fetch run IDs corresponding to sample IDs (GSMs) and create GSM: SRR mappings. | |
SRA toolkit | Download SRA files | prefetch SRRXXXXXX |
SRA toolkit | Validate downloaded SRA files | vdb-validate |
SRA toolkit | Diagnose single-end or paired-end | fastq-dump |
SRA toolkit | Dump fastq | parallel-fastq-dump |
FastQC | Diagnose basespace / colorspace, quality encoding, read length | fastqc |
parallel-fastq-dump | Rapid decompression of sequence data from .sra files | parallel-fastq-dump |
Minion | 3' adapter detection | minion search-adapter |
Bowtie2 | Adapter contamination detection | bowtie2 |
Skewer | 3' quality trimming and adapter clipping | skewer |
FASTX-Toolkit | Progressive 5' trimming | fastx_trimmer |
Kallisto | Transcript-level mapping | kallisto quant |
Custom script (make GCT) | Collect transcript counts, sample metadata and make a counts matrix then make a GCT file | |
GEOtron | Curate sample and data-set level information and attach it to the GCT file | |
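
To show how the commands in the Usage column might be chained for a single run, here is a hedged Python sketch that only assembles the command lines and does not execute them. The function name `build_run_commands`, the directory layout, the thread count, and the single-end fragment length values (`-l 200 -s 20`) are assumptions for illustration, as are the Skewer output file names. The adapter detection steps (Minion, Bowtie2) and FASTX-Toolkit trimming are omitted because their exact arguments are not given in this document.

```python
import shlex


def build_run_commands(srr, index, paired, outdir="work", threads=4):
    """Return the command lines for one run as argument lists; nothing is executed."""
    mates = ["_1", "_2"] if paired else ["_1"]
    raw = [f"{outdir}/{srr}{mate}.fastq" for mate in mates]

    # Assumed Skewer output names for the prefix given with -o.
    if paired:
        trimmed = [f"{outdir}/{srr}-trimmed-pair1.fastq",
                   f"{outdir}/{srr}-trimmed-pair2.fastq"]
    else:
        trimmed = [f"{outdir}/{srr}-trimmed.fastq"]

    quant = ["kallisto", "quant", "-i", index,
             "-o", f"{outdir}/{srr}_kallisto", "-t", str(threads)]
    if not paired:
        # Single-end mode needs a fragment length mean and sd; values are placeholders.
        quant += ["--single", "-l", "200", "-s", "20"]

    return [
        ["prefetch", srr],
        ["vdb-validate", srr],
        ["parallel-fastq-dump", "--sra-id", srr, "--threads", str(threads),
         "--outdir", outdir, "--split-files"],
        ["fastqc", *raw],
        # -q 10: 3' quality trimming threshold; -l 18: minimum read length kept.
        ["skewer", "-q", "10", "-l", "18", "-t", str(threads),
         "-o", f"{outdir}/{srr}", *raw],
        quant + trimmed,
    ]


# SRRXXXXXX is the placeholder accession used in the table above.
for command in build_run_commands("SRRXXXXXX", "transcriptome.idx", paired=True):
    print(shlex.join(command))
```

Returning argument lists rather than shell strings keeps the sketch easy to inspect, and the lists can be passed to `subprocess.run` directly if the tools are installed.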