Workflow Overview
=================
Introduction
------------
Txseq is designed to efficiently parallelise the processing of bulk RNA-sequencing data on compute clusters. The workflow can start from either FASTQ or BAM inputs.
1. Assessment of read quality
------------------------------
Read quality can be assessed using the `FASTQC quality control tool `_ with :doc:`pipeline_fastqc.py `.
2. Mapping and Quantitation
---------------------------
Txseq supports the following workflows:
#. `Salmon `_ and `tximeta `_ for fast, sensitive and accurate pseudo-alignment-based quantitation (see :doc:`pipeline_salmon.py ` for more details).
#. `Hisat2 `_ and `featureCounts `_ for more traditional mapping-based quantitation (see :doc:`pipeline_hisat.py ` and :doc:`pipeline_feature_counts ` for more details).
It is recommended to run both workflows. The BAM files generated by Hisat2 are required to compute post-mapping QC statistics with Picard. These statistics give essential insight into library quality that cannot be obtained from Salmon alone.
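
As a rough illustration of what the two workflows involve, the sketch below assembles the kind of command lines that the underlying tools accept for a single paired-end sample. It is not taken from the Txseq pipelines: the function names, file names, thread count and the automatic library type (``-l A``) are assumptions made for the example, and the pipelines themselves may pass different options.

.. code-block:: python

    def salmon_quant_cmd(index, fastq_1, fastq_2, out_dir, threads=4):
        """Pseudo-alignment-based quantitation with Salmon (hypothetical helper)."""
        return ["salmon", "quant",
                "-i", index,          # Salmon index directory
                "-l", "A",            # let Salmon infer the library type
                "-1", fastq_1, "-2", fastq_2,
                "-p", str(threads),
                "-o", out_dir]

    def hisat2_cmd(index, fastq_1, fastq_2, out_sam, threads=4):
        """Spliced alignment with Hisat2, written as SAM (hypothetical helper)."""
        return ["hisat2",
                "-x", index,          # basename of the Hisat2 index
                "-1", fastq_1, "-2", fastq_2,
                "-p", str(threads),
                "-S", out_sam]

    def feature_counts_cmd(gtf, bam, out_counts):
        """Gene-level read counting with featureCounts (hypothetical helper)."""
        return ["featureCounts", "-a", gtf, "-o", out_counts, bam]

    # Example: print the commands for one sample (all paths are placeholders).
    for cmd in (
        salmon_quant_cmd("salmon.index", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "salmon/s1"),
        hisat2_cmd("hisat2.index", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1.sam"),
        feature_counts_cmd("genes.gtf", "s1.bam", "s1.counts.txt"),
    ):
        print(" ".join(cmd))

Each list can be passed to ``subprocess.run`` once the tools are installed. The Hisat2 output would still need to be converted to a sorted BAM file (for example with ``samtools sort``) before counting or QC.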
3. Post-mapping QC
------------------
Useful insight can be gained from examining read mapping statistics. Txseq can compute a suite of metrics using the 'CollectRnaSeqMetrics', 'EstimateLibraryComplexity', 'CollectAlignmentSummaryMetrics' and 'CollectInsertSizeMetrics' tools from the `Picard toolkit `_. It also computes the fraction of spliced reads. This functionality is implemented in :doc:`pipeline_bam_qc`.
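
To make the spliced-read metric concrete, here is a minimal sketch of one common way to define it: a read counts as spliced when its CIGAR string contains a skipped-region (``N``) operation. This definition and the function names are assumptions for illustration, and the calculation in :doc:`pipeline_bam_qc` may differ, for example in how secondary alignments are handled.

.. code-block:: python

    import re

    # One CIGAR operation: a length followed by an operation code.
    CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

    def is_spliced(cigar):
        """Return True if the alignment skips a reference region (an 'N' operation)."""
        return any(op == "N" for _, op in CIGAR_OP.findall(cigar))

    def fraction_spliced(cigars):
        """Fraction of mapped reads whose alignment spans at least one intron.

        Unmapped reads (CIGAR '*') are excluded from the denominator.
        """
        mapped = [c for c in cigars if c != "*"]
        if not mapped:
            return 0.0
        return sum(is_spliced(c) for c in mapped) / len(mapped)

    # Toy input: two spliced reads, two unspliced reads and one unmapped read.
    cigars = ["76M", "30M1200N46M", "*", "10S66M", "20M500N30M800N26M"]
    print(fraction_spliced(cigars))  # 0.5

In practice the CIGAR strings would be read from the Hisat2 BAM files with a library such as pysam rather than supplied as a list.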
4. Downstream analysis
----------------------
For statistical and exploratory data analysis, it is recommended to use tximeta length-corrected count estimates from Salmon. For visualising gene expression levels, it is recommended to use Salmon TPMs after applying an inter-sample normalisation routine such as upper-quartile normalisation.
Examples of how the outputs can be used to assess read quality, perform exploratory analysis and perform differential expression analysis are provided as R markdown notebooks for the :doc:`mouse hsc example `.
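
The idea behind upper-quartile normalisation can be sketched as follows. This is one simple variant, assumed here for illustration: each sample is rescaled so that the 75th percentile of its non-zero TPMs equals the mean upper quartile across samples. The function names and toy values are hypothetical, and established implementations (for example in R packages) differ in details such as which genes are included.

.. code-block:: python

    def percentile(values, q):
        """Linearly interpolated q-th percentile (0-100) of a non-empty list."""
        ordered = sorted(values)
        pos = (len(ordered) - 1) * q / 100.0
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

    def upper_quartile_normalise(tpm_by_sample):
        """Rescale each sample to a common upper quartile of non-zero TPMs.

        Assumes every sample has at least one non-zero value.
        """
        uq = {sample: percentile([v for v in values if v > 0], 75)
              for sample, values in tpm_by_sample.items()}
        target = sum(uq.values()) / len(uq)
        return {sample: [v * target / uq[sample] for v in values]
                for sample, values in tpm_by_sample.items()}

    # Toy data: sample_b has exactly twice the signal of sample_a.
    tpms = {
        "sample_a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        "sample_b": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0],
    }
    normalised = upper_quartile_normalise(tpms)
    print(normalised["sample_a"])
    print(normalised["sample_b"])

After rescaling, the two toy samples have identical values, because the only difference between them was a global scale factor.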