Workflow Overview

Introduction

Txseq is designed to efficiently parallelise the processing of bulk RNA-sequencing data on compute clusters. The workflow can start from either FASTQ or BAM inputs.

1. Assessment of read quality

Read quality can be assessed with the FastQC quality control tool using pipeline_fastqc.py.

2. Mapping and Quantitation

Txseq supports the following workflows:

Salmon and tximeta for fast, sensitive and accurate pseudo-alignment-based quantitation (see pipeline_salmon.py for more details).
Hisat2 and featureCounts for more traditional mapping-based quantitation (see pipeline_hisat.py and pipeline_feature_counts for more details).

It is recommended to run both workflows. The BAM files generated by Hisat2 are needed to generate post-mapping QC statistics with Picard. These statistics give essential insight into library quality that cannot be obtained from analysis with Salmon alone.

3. Post-mapping QC

Useful insight can be gained from examining read mapping statistics. Txseq can compute a suite of metrics using the ‘CollectRnaSeqMetrics’, ‘EstimateLibraryComplexity’, ‘CollectAlignmentSummaryMetrics’ and ‘CollectInsertSizeMetrics’ tools from the Picard toolkit. It also computes the fraction of spliced reads. This functionality is implemented in pipeline_bam_qc.
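As an illustration of the spliced-read fraction, here is a minimal Python sketch. It is not the pipeline_bam_qc implementation: it assumes that a read counts as spliced when its CIGAR string contains a skipped-region (N) operation, as defined in the SAM format, and that unmapped reads (CIGAR "*") are excluded from the denominator. The function names and example CIGAR strings are hypothetical.

```python
import re

# Matches one CIGAR operation, e.g. "76M" or "1200N".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_spliced(cigar):
    """Return True if the CIGAR string contains a skipped-region (N) operation."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


def spliced_fraction(cigars):
    """Fraction of mapped reads whose alignment spans at least one N operation.

    Assumption: unmapped reads (CIGAR "*") are ignored.
    """
    mapped = [c for c in cigars if c != "*"]
    if not mapped:
        return 0.0
    return sum(is_spliced(c) for c in mapped) / len(mapped)


# Hypothetical CIGAR strings: two of the four mapped reads are spliced.
example = ["76M", "30M1200N46M", "*", "10S66M", "20M500N30M800N26M"]
print(spliced_fraction(example))
```

In practice the CIGAR strings would be read from the Hisat2 BAM files rather than typed in by hand.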

4. Downstream analysis

For statistical and exploratory data analysis, it is recommended to use the tximeta length-corrected count estimates from Salmon. For visualising gene expression levels, it is recommended to use Salmon TPMs after applying an inter-sample normalisation routine such as upper-quartile normalisation.
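To make the idea of upper-quartile normalisation concrete, here is a minimal Python sketch. It is one common variant, not necessarily the routine used in the txseq notebooks. It assumes that the upper quartile of each sample is taken over its non-zero values, that every sample has at least one non-zero value, and that samples are rescaled to the mean upper quartile across samples. The function names and input values are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def upper_quartile_normalise(tpm):
    """Scale each sample so that all samples share the same upper quartile.

    tpm maps sample name -> list of per-gene TPM values (same gene order).
    Assumption: the quartile is computed over non-zero values only.
    """
    uq = {s: percentile([v for v in vals if v > 0], 75) for s, vals in tpm.items()}
    target = sum(uq.values()) / len(uq)  # mean upper quartile across samples
    return {s: [v * target / uq[s] for v in vals] for s, vals in tpm.items()}


# Hypothetical data: sample "b" is sample "a" measured at twice the scale.
tpm = {"a": [0, 1, 2, 3, 4, 5], "b": [0, 2, 4, 6, 8, 10]}
normalised = upper_quartile_normalise(tpm)
print(normalised)
```

After normalisation the two hypothetical samples have identical profiles, because they differed only by a global scaling factor.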

Examples of how the outputs can be used to assess read quality, perform exploratory analysis and perform differential expression analysis are provided as R Markdown notebooks for the mouse HSC example.