# Introduction

## Overview

The *TraceFinder Downstream Analysis* application performs downstream analysis of LC-MS data exported from TraceFinder. It lets the user run several statistical analyses to draw insights from the data.

### Scope of the Application

- Accepts LC-MS data from TraceFinder as input for downstream analysis.
- Performs quality checks using PCA.
- Visualizes data distribution using boxplots to observe the median and quartiles for each feature.
- Performs statistical analysis using limma to get actionable insights.
- Provides barplot visualization of the observed values of individual features.

# Getting Started

## User Input

The user needs to upload the following files to the workspace:

- *Intensity Matrix:* Raw intensity file with metabolite names.
- *Metadata:* Metadata file containing the cohort and other information.

## Tutorial

Select the *TraceFinder Downstream analysis* application from the *Studio Applications* section.

Select the workspace from the *Select the workspace* drop-down, provide the *Name of the session*, and click on *Launch*.

## Pre-processing

### Upload

Select the intensity file and the metadata file. Select *transform the data by 1* to apply that transformation, or click *Proceed* to continue without any transformation.

## Normalization

Select the type of normalization to apply: sum, median, or mean of the internal standards.

**Normalization by sum**: The sum of all the internal standards is used for normalization. Each sample's data is divided by that sample's sum of all the internal standards.

**Normalization by mean**: The average of all the internal standards is used for normalization. Each sample's data is divided by that sample's average of all the internal standards.

**Normalization by median**: The median of all the internal standards is used for normalization. Each sample's data is divided by that sample's median of all the internal standards.

This step produces a *boxplot* of the normalized intensities for all the samples.
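The three normalization options can be sketched as follows. This is a minimal illustration rather than the application's actual code: the data layout (a dictionary of per-sample intensity lists), the internal standard names `IS_1` and `IS_2`, and the `normalize` function are all assumptions made for the example.

```python
from statistics import mean, median

# Hypothetical layout: feature name -> intensities, one value per sample.
intensities = {
    "glucose": [200.0, 300.0, 120.0],
    "lactate": [100.0, 150.0, 60.0],
    "IS_1": [10.0, 20.0, 4.0],
    "IS_2": [30.0, 40.0, 8.0],
}
internal_standards = ["IS_1", "IS_2"]  # assumed internal standard names

# Each option reduces a sample's internal standards to a single factor.
SUMMARIES = {"sum": sum, "mean": mean, "median": median}


def normalize(intensities, internal_standards, method="sum"):
    """Divide every sample's values by that sample's internal-standard summary."""
    summarize = SUMMARIES[method]
    n_samples = len(next(iter(intensities.values())))
    factors = [
        summarize([intensities[name][i] for name in internal_standards])
        for i in range(n_samples)
    ]
    return {
        feature: [value / factor for value, factor in zip(values, factors)]
        for feature, values in intensities.items()
    }


by_sum = normalize(intensities, internal_standards, "sum")
by_mean = normalize(intensities, internal_standards, "mean")
by_median = normalize(intensities, internal_standards, "median")
```

In this sketch the internal standards stay in the output and are normalized like every other feature; whether the application keeps or drops them is not stated here.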

## Quality Check

### PCA

This component allows you to simplify the complexity of high-dimensional data while retaining the trends and patterns in it. It projects the data onto a lower dimension with an objective to find the best summary of the data using a limited number of principal components that help in understanding the clustering pattern between biologically grouped and ungrouped samples.

- *Cohort Column*: Dropdown to select one of the metadata columns.
- *Top N Variants*: The top N most variable entities are used for the PCA calculation. Enter the number in this box. The default is 1500.

It generates two outputs:

**PCA Plot**: A plot is created where the samples are labeled based on the cohort selected in the metadata column. When you hover over the points, sample ID and percentage of variance explained by each principal component are displayed along with the cohort.

**PCA Score**: Table of the first 10 PC values and metadata columns.
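To make the *Top N Variants* filter and the "percentage of variance explained" concrete, here is a small pure-Python sketch. It is an illustration under assumptions, not the application's implementation: the feature names, the use of sample variance for ranking, and the power-iteration solver are all choices made for this example, and the application may scale or transform the data differently.

```python
import math
from statistics import variance


def top_n_variable(features, n):
    """Names of the n features with the largest sample variance."""
    ranked = sorted(features, key=lambda name: variance(features[name]), reverse=True)
    return ranked[:n]


def covariance_matrix(columns):
    """Sample covariance between feature columns (each a list over samples)."""
    n_samples = len(columns[0])
    centered = []
    for col in columns:
        col_mean = sum(col) / n_samples
        centered.append([v - col_mean for v in col])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n_samples - 1) for cj in centered]
        for ci in centered
    ]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    size = len(matrix)
    vector = [float(i + 1) for i in range(size)]
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:  # nothing left to explain
            return 0.0, vector
        vector = [p / norm for p in product]
    product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    return sum(p * v for p, v in zip(product, vector)), vector


def pca(columns, n_components=2):
    """Return principal axes and the fraction of variance each one explains."""
    cov = covariance_matrix(columns)
    size = len(cov)
    total_variance = sum(cov[i][i] for i in range(size))
    components, ratios = [], []
    for _ in range(n_components):
        eigenvalue, vector = leading_eigenpair(cov)
        components.append(vector)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the component just found before searching again.
        cov = [
            [cov[i][j] - eigenvalue * vector[i] * vector[j] for j in range(size)]
            for i in range(size)
        ]
    return components, ratios


# Hypothetical intensities: feature name -> one value per sample.
features = {
    "m1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "m2": [2.0, 4.1, 5.9, 8.2, 9.8],
    "m3": [5.0, 5.1, 4.9, 5.0, 5.0],
}
selected = top_n_variable(features, n=2)  # the nearly constant m3 is dropped
components, ratios = pca([features[name] for name in selected])
```

Here `ratios` holds the fraction of variance explained by each principal component, which corresponds to the percentages shown when hovering over the PCA plot.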

### Boxplot

A *boxplot* is a graph that presents information from a five-number summary: lower extreme, lower quartile, median, upper quartile, and upper extreme. In this plot, the median is marked by a line inside the box, the ends of the box are the upper and lower quartiles, and the two lines outside the box extend to the highest and lowest observations. It is useful for assessing the shape of the distribution (e.g., skewness) and spotting potentially unusual observations.

- *Select gene/metabolites*: Dropdown to select the metabolites to plot.
- *Select metadata column*: Dropdown to select one of the metadata columns.

The *boxplot* is generated based on the selected metabolites and the selected cohort column from metadata.
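The five-number summary behind each box can be computed per cohort as in the sketch below. The sample values and cohort labels are hypothetical, and the quartile convention (Python's `inclusive` method) is an assumption; the application's plotting library may use a slightly different one.

```python
from collections import defaultdict
from statistics import quantiles


def five_number_summary(values):
    """Lower extreme, quartiles, median, and upper extreme of one group."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values)}


# Hypothetical data: one metabolite's intensity per sample, plus the
# cohort label of each sample taken from the selected metadata column.
intensity = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 3.0, 6.0, 9.0]
cohort = ["A"] * 7 + ["B"] * 3

# Group the intensities by cohort, then summarize each group.
grouped = defaultdict(list)
for value, label in zip(intensity, cohort):
    grouped[label].append(value)

summaries = {label: five_number_summary(values) for label, values in grouped.items()}
```

Each entry of `summaries` describes one box: `q1` and `q3` are the box ends, `median` is the line inside it, and `min` and `max` are where the outer lines stop.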

### Differential Expression

This component allows the search for differentially expressed (DE) genes, that is, genes that show differences in expression level between conditions or in other ways are associated with given predictors or responses.

- *Cohort Column*: Dropdown to select one of the metadata columns.
- *Cohort A*: Dropdown to select a cohort from the metadata column selected.
- *Cohort B*: Dropdown to select another cohort from the metadata column selected.
- *Normalization*: Perform log2 normalization if the data is not normalized.
- *Algorithm*: You can select either of the two algorithms, *limma* or *Unpaired t-test*.

*Limma* is an R package for the analysis of gene expression microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression. Limma provides the ability to analyze comparisons between many RNA targets simultaneously in arbitrary complicated designed experiments.

An *unpaired t-test* (also known as an independent t-test) is a statistical procedure that compares the averages/means of two independent or unrelated groups to determine if there is a significant difference between the two.

- *P-Value Correction*: You can select the *Benjamini-Hochberg* method, which adjusts the *p-value* to control the False Discovery Rate, or the *Bonferroni* method, which adjusts the *p-value* to control the family-wise rate of Type I errors.
- *P-Value Metric*: Plot and calculate significance using the selected metric. The *p-value* is the value returned by the algorithm, while the *Adjusted p-value* is the corrected value after applying one of the correction methods above.
- *P-value threshold*: You can select the appropriate threshold for the selected *p-value* metric. *p-values* lower than this threshold will be marked as significant.
- *Absolute Log2FC Threshold*: You can select the appropriate fold change threshold. Absolute log2 fold change values higher than this will be marked as significant.

Once all the parameters are selected, execute the component by clicking on *Run Task*.

It will generate two outputs:

**Differential Expression**: Table with Differential Expression results with *p-value* and fold change.

**Volcano Plot**: Based on the parameters specified, a volcano plot is displayed. The volcano plot helps in visualizing metabolites that are significantly dysregulated between two cohorts.
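The two p-value corrections and the significance thresholds described above can be sketched in plain Python. This is an illustrative sketch using the standard textbook definitions of both corrections, not the application's code; the feature names, p-values, and log2 fold changes are made up, and `mark_significant` is a hypothetical helper.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def mark_significant(names, p_metric, log2_fc, p_threshold=0.05, fc_threshold=1.0):
    """Features passing both the p-value and the absolute log2FC thresholds."""
    return [
        name
        for name, p, fc in zip(names, p_metric, log2_fc)
        if p < p_threshold and abs(fc) > fc_threshold
    ]


# Hypothetical results for four metabolites.
names = ["m1", "m2", "m3", "m4"]
p_values = [0.01, 0.04, 0.03, 0.20]
log2_fc = [1.8, -0.3, -2.1, 2.5]

bh = benjamini_hochberg(p_values)
bonf = bonferroni(p_values)

significant_raw = mark_significant(names, p_values, log2_fc)
significant_adjusted = mark_significant(names, bh, log2_fc)
```

With these made-up numbers, switching the *P-Value Metric* from the raw to the adjusted p-value removes `m3` from the significant set, which is the kind of difference the volcano plot makes visible.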

### ANOVA

This component allows the user to run an analysis of variance (ANOVA), a statistical test of whether the means of two or more populations are equal.

- *Cohort Column*: Dropdown to select one of the metadata columns.
- *Cohorts*: Dropdown to select two or more cohorts from the cohort column selected.

It generates the following outputs:

**ANOVA Barplot**

**ANOVA Boxplot**

**ANOVA Table**
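For reference, the F statistic of a one-way ANOVA compares the variation between cohort means with the variation within cohorts. The sketch below uses the standard formula on made-up values for three cohorts; it is not the application's code, and it stops at the F statistic because converting F to a p-value needs the F distribution, which the Python standard library does not provide.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = mean(value for group in groups for value in group)

    # Variation of the cohort means around the grand mean.
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Variation of each observation around its own cohort mean.
    ss_within = sum((value - mean(group)) ** 2 for group in groups for value in group)

    df_between = k - 1
    df_within = n_total - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical intensities of one metabolite in three cohorts.
cohort_a = [1.0, 2.0, 3.0]
cohort_b = [2.0, 3.0, 4.0]
cohort_c = [5.0, 6.0, 7.0]

f_statistic, df_between, df_within = one_way_anova_f([cohort_a, cohort_b, cohort_c])
```

A large F statistic indicates that the cohort means differ more than the within-cohort scatter would suggest.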

## Studio Dashboard

Data Studio lets you visualize your data with a number of highly configurable charts and tables, which you can save, add to dashboards, and then customize as needed. The Visualization Dashboard provides an at-a-glance view of the selected visualization charts. The dashboard is customizable and can be organized in the way that best helps you understand complex relationships in your data, and it can be used to create engaging and easy-to-understand reports. A template of the report can also be defined to generate the output if required.

The generated reports are interactive and can be shared with collaborators. All the members of your team can compare, filter, and organize the exact data they need on the fly in one report, which makes it easy to communicate and act on the customized data.