Workflow Overview

Introduction

Txseq is designed to efficiently parallelise the processing of bulk RNA-sequencing data on compute clusters. The workflow can start from either FASTQ or BAM inputs.

1. Assessment of read quality

Read quality can be assessed with the FastQC quality control tool by running pipeline_fastqc.py.

2. Mapping and Quantitation

Txseq supports the following workflows:

  1. Salmon and tximeta for fast, sensitive and accurate pseudo-alignment-based quantitation (see pipeline_salmon.py for more details).

  2. Hisat2 and featureCounts for more traditional mapping-based quantitation (see pipeline_hisat.py and pipeline_feature_counts for more details).

It is recommended to run both workflows. The BAM files generated by Hisat2 are required to compute post-mapping QC statistics with Picard. These statistics give essential insight into library quality that cannot be obtained from analysis with Salmon alone.

3. Post-mapping QC

Useful insight can be gained from examining read mapping statistics. Txseq can compute a suite of metrics using the ‘CollectRnaSeqMetrics’, ‘EstimateLibraryComplexity’, ‘CollectAlignmentSummaryMetrics’ and ‘CollectInsertSizeMetrics’ tools from the Picard toolkit. It also computes the fraction of spliced reads. This functionality is implemented in pipeline_bam_qc.
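As an illustration of the spliced-read metric mentioned above, the following is a minimal sketch, not the code used by pipeline_bam_qc. It assumes that a read counts as spliced when its CIGAR string contains at least one skipped-region (N) operation, and that the CIGAR strings have already been extracted from the BAM file. The function names are hypothetical.

```python
import re

# Matches one CIGAR operation, e.g. "76M" or "1200N".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_spliced(cigar):
    """Return True if the CIGAR string contains a skipped-region (N) operation."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


def spliced_fraction(cigars):
    """Fraction of mapped reads whose alignment spans at least one splice junction.

    Assumption: reads with CIGAR "*" (no alignment available) are excluded
    from the denominator.
    """
    mapped = [c for c in cigars if c != "*"]
    if not mapped:
        return 0.0
    return sum(is_spliced(c) for c in mapped) / len(mapped)


# Hypothetical CIGAR strings: two unspliced, two spliced, one unmapped.
example = ["76M", "30M1200N46M", "*", "10S66M", "20M500N30M800N26M"]
print(spliced_fraction(example))
```

A low spliced fraction in a poly(A)-selected library can point to genomic DNA or pre-mRNA contamination, which is why it is useful alongside the Picard metrics.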

4. Downstream analysis

For statistical and exploratory data analysis, it is recommended to use the tximeta length-corrected count estimates from Salmon. For visualising gene expression levels, it is recommended to use Salmon TPMs after applying an inter-sample normalisation routine such as upper-quartile normalisation.
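To make the idea of upper-quartile normalisation concrete, here is a minimal sketch of one common variant. It is not part of Txseq. The assumptions are that each sample's upper quartile is taken over its non-zero values only, and that scaled values are multiplied by the mean upper quartile so they stay on a TPM-like scale. The function names and example values are hypothetical.

```python
from statistics import quantiles


def upper_quartile(values):
    """75th percentile of the non-zero values in one sample."""
    nonzero = sorted(v for v in values if v > 0)
    if len(nonzero) < 2:
        raise ValueError("need at least two non-zero values per sample")
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalise(tpm_by_sample):
    """Scale each sample so that all samples share the same upper quartile.

    tpm_by_sample maps a sample name to a list of TPM values, with genes
    in the same order for every sample.
    """
    uq = {name: upper_quartile(vals) for name, vals in tpm_by_sample.items()}
    mean_uq = sum(uq.values()) / len(uq)
    return {
        name: [v * (mean_uq / uq[name]) for v in vals]
        for name, vals in tpm_by_sample.items()
    }


# Hypothetical TPMs for five genes in two samples; s2 is uniformly 2x s1.
tpms = {
    "s1": [0.0, 10.0, 20.0, 30.0, 40.0],
    "s2": [0.0, 20.0, 40.0, 60.0, 80.0],
}
normalised = upper_quartile_normalise(tpms)
print(normalised)
```

In this toy example the two samples differ only by a global factor, so they become identical after normalisation. Zeros are left at zero because the scaling is multiplicative.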

Examples of how the outputs can be used to assess read quality, perform exploratory analysis and perform differential expression analysis are provided as R Markdown notebooks for the mouse HSC example.