Workflow Overview
=================

Introduction
------------

Txseq is designed to efficiently parallelise the processing of bulk RNA-sequencing data on compute clusters. The workflow can start from either FASTQ or BAM inputs.


1. Assessment of read quality
------------------------------

Read quality can be assessed using the `FastQC quality control tool <https://www.bioinformatics.babraham.ac.uk/projects/fastqc/>`_ with :doc:`pipeline_fastqc.py <pipelines/pipeline_fastqc>`.

2. Mapping and Quantitation
---------------------------

Txseq supports the following workflows:

#. `Salmon <https://github.com/COMBINE-lab/salmon>`_ and `tximeta <https://bioconductor.org/packages/release/bioc/html/tximeta.html>`_ for fast, sensitive and accurate pseudoalignment-based quantitation (see :doc:`pipeline_salmon.py <pipelines/pipeline_salmon>` for more details).

#. `Hisat2 <http://daehwankimlab.github.io/hisat2/>`_ and `featureCounts <https://subread.sourceforge.net>`_ for more traditional mapping-based quantitation (see :doc:`pipeline_hisat.py <pipelines/pipeline_hisat>` and :doc:`pipeline_feature_counts <pipelines/pipeline_feature_counts>` for more details).

Running both workflows is recommended. The BAM files generated by Hisat2 are required to compute post-mapping QC statistics with Picard. These statistics give essential insight into library quality that cannot be obtained from analysis with Salmon alone.


3. Post-mapping QC
------------------

Useful insight can be gained from examining read mapping statistics. Txseq can compute a suite of metrics using the ``CollectRnaSeqMetrics``, ``EstimateLibraryComplexity``, ``CollectAlignmentSummaryMetrics`` and ``CollectInsertSizeMetrics`` tools from the `Picard toolkit <https://broadinstitute.github.io/picard/>`_. It also computes the fraction of spliced reads. This functionality is implemented in :doc:`pipeline_bam_qc <pipelines/pipeline_bam_qc>`.
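
To illustrate what a spliced-read fraction measures, the following Python sketch counts primary mapped alignments whose CIGAR string contains an ``N`` (skipped region) operation, which is how spliced alignments are typically represented in SAM/BAM records. It is a minimal example that assumes SAM-formatted text lines as input. The function names are hypothetical, and this is not necessarily how ``pipeline_bam_qc`` computes the value.

.. code-block:: python

    import re

    # Matches one CIGAR operation, e.g. "20M" or "500N".
    CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

    # SAM flag bits: unmapped (0x4), secondary (0x100), supplementary (0x800).
    SKIP_FLAGS = 0x4 | 0x100 | 0x800


    def is_spliced(cigar):
        """Return True if the CIGAR string contains a skipped region (N)."""
        return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


    def spliced_fraction(sam_lines):
        """Fraction of primary mapped alignments that span a splice junction."""
        mapped = 0
        spliced = 0
        for line in sam_lines:
            if line.startswith("@"):  # header line
                continue
            fields = line.rstrip("\n").split("\t")
            flag, cigar = int(fields[1]), fields[5]
            if flag & SKIP_FLAGS or cigar == "*":
                continue
            mapped += 1
            if is_spliced(cigar):
                spliced += 1
        return spliced / mapped if mapped else 0.0

In this sketch, secondary and supplementary alignments are excluded so that each read contributes at most once per mate. Whether to count reads or read pairs is a design choice that the sketch leaves open.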

4. Downstream analysis
----------------------

For statistical and exploratory data analysis, it is recommended to use the tximeta length-corrected count estimates from Salmon. For visualising gene expression levels, it is recommended to use Salmon TPMs after applying an inter-sample normalisation routine such as upper-quartile normalisation.
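
As a rough illustration of upper-quartile normalisation, the Python sketch below rescales each sample so that the 75th percentile of its non-zero values equals the mean upper quartile across samples. Several variants of the method exist (for example, in how zeros are handled and which reference value is used), so this is one possible formulation under those assumptions rather than the routine used in the example notebooks. The function names and the input layout (a dictionary mapping sample names to lists of TPM values in a shared gene order) are hypothetical.

.. code-block:: python

    from statistics import quantiles


    def upper_quartile(values):
        """75th percentile of the non-zero values of one sample."""
        nonzero = [v for v in values if v > 0]
        # quantiles(..., n=4) returns the three cut points [Q1, Q2, Q3].
        return quantiles(nonzero, n=4, method="inclusive")[2]


    def upper_quartile_normalise(tpm):
        """Scale each sample so that all samples share the same upper quartile.

        tpm: dict mapping sample name -> list of TPM values (same gene order).
        """
        uq = {sample: upper_quartile(values) for sample, values in tpm.items()}
        target = sum(uq.values()) / len(uq)  # mean upper quartile across samples
        return {
            sample: [v * target / uq[sample] for v in values]
            for sample, values in tpm.items()
        }


    example = {
        "sample_a": [0, 1, 2, 3, 4, 5],
        "sample_b": [0, 2, 4, 6, 8, 10],
    }
    normalised = upper_quartile_normalise(example)

Here ``sample_b`` has exactly twice the values of ``sample_a``, so after normalisation the two samples have identical values.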

Examples of how the outputs can be used to assess read quality and to perform exploratory and differential expression analysis are provided as R Markdown notebooks for the :doc:`mouse hsc example <mouse_hscs_example>`.