Mouse HSCs example

Introduction

For this example we use paired end RNA-seq data from Regan-Komito et al ‘GM-CSF drives dysregulated hematopoietic stem cell activity and pathogenic extramedullary myelopoiesis in experimental spondyloarthritis’ Nature Communicaitons 2020. The data can be retrieved from the European Nucleotide Archive (ENA) under accession PRJNA521342.

Note

If you are working in the Kennedy workspace on the Oxford BMRC cluster, these data are already available in the “/well/kir/projects/mirror/ena/PRJNA521342/” folder.

1. Getting the configuration files and report templates

Make a suitable local folder and copy the samples.tsv and libraries.tsv for the “mouse_hscs” example into it. Here we choose to run the example in a folder with the path “~/work/hscs_example”

mkdir ~/work/txseq_hscs_example
cd ~/work/txseq_hscs_example
cp -r /path/to/txseq/examples/mouse_hscs/* .
cp -r /path/to/txseq/reports .

Edit the txseq/libraries.tsv file to point to the location of the FASTQ files retrieved from the ENA on your system.

2. Retrieving and preparing annotations

Note

If you are working in the Kennedy workspace on the Oxford BMRC cluster, the required txseq annotations are available in the mirror folder and this step can be skipped.

Follow the instructions on the Genomes and Annotations page to:

  1. Retrieve the required sequence and annotation files.

  2. Run the “txseq ensembl” pipeline to prepare santitised mouse genome and annotation files for RNA-seq analysis.

3 Building the Salmon and Hisat2 transcriptome indexes

Note

If you are working in the Kennedy workspace on the Oxford BMRC cluster, txseq mouse transcriptome indexes have already been built in the mirror folder and this step can be skipped.

The example requires Salmon and Hisat2 indexes. To build these, follow the Building transcriptome indexes instructions.

4. Checking FASTQ read quality

First we run the FastQC tool using pipeline_fastqc.py. Get a copy of the configuration file and edit appropriately to ensure that the correct “libraries.tsv” and “samples.tsv” files paths are set

cd ~/work/txseq_hscs_example
mkdir fastqc
cd fastqc
txseq fastqc config # get a copy of the default configuration file
emacs pipeline_fastqc.yml # edit the configuration file as appropriate

The pipeline can then be run as follows

txseq fastqc make full -v5 -p20

The pipeline parses and store the FastQC output into sqlite database in a file called “csvdb”. To visualise the results, open the associated R Markdown Report (“reports/fastqc.Rmd”) in Rstudio, assign the location of the database to the “fastqc_sqlite_database” variable and knit the report.

5. Mapping with Hisat2

Next we map the data using pipeline_hisat.py. Fetch and edit a copy of the configuration file to set the paths to the “libraries.tsv” and “samples.tsv” files and hisat index

cd ~/work/txseq_hscs_example
mkdir hisat
cd hisat
txseq hisat config # get a copy of the default configuration file
emacs pipeline_hisat.yml # edit the configuration file as appropriate

The pipeline can then be run as follows

txseq hisat make full -v5 -p20

The output BAM files are located in the “hisat.dir” sub-directory.

6. Generating post-mapping QC statistics

After mapping with Hisat2, post-mapping QC statistics are computed using pipeline_bamqc.py. This pipeline runs several Picard tools including CollectRnaSeqMetrics, EstimateLibraryComplexity, AlignmentSummaryMetrics and CollectInsertSizeMetrics as well as some custom scripts.

cd ~/work/txseq_hscs_example
mkdir bamqc
cd bamqc
txseq bamqc config # get a copy of the default configuration file
emacs pipeline_bamqc.yml # edit the configuration file as appropriate

The pipeline can then be run as follows

txseq bamqc make full -v5 -p20

The results are saved in an sqlite database in the “csvdb” file.

7. Quantitation with FeatureCounts

Count tables can be extracted from the BAM file using pipeline_feature_counts.py.

cd ~/work/txseq_hscs_example
mkdir feature_counts
cd feature_counts
txseq feature_counts config # get a copy of the default configuration file
emacs pipeline_feature_counts.yml # edit the configuration file as appropriate

The pipeline can then be run as follows

txseq feature_counts make full -v5 -p20

The results are saved in an sqlite database in the “csvdb” file.

8. Quantitation with Salmon

To quantitate the data using pipeline_salmon.py, we begin by fetching and edit a copy of the configuration file to set the paths to the “libraries.tsv” and “samples.tsv” files and salmon index

cd ~/work/txseq_hscs_example
mkdir salmon
cd salmon
txseq salmon config # get a copy of the default configuration file
emacs pipeline_salmon.yml # edit the configuration file as appropriate

The pipeline can then be run as follows

txseq salmon make full -v5 -p20

The results of the pipeline are stored in the “csvdb” sqlite database and as a tximeta object in the “tximeta.dir/tximeta.RDS” for downstream analysis. Flat tables of TPMs can be retrieved from the database or from the “salmon.dir/salmon.transcripts.tpms.txt.gz” file.

9. Post-mapping QC analysis

After running pipeline_bamqc.py and pipeline_salmon.py post-mapping QC can be performed using the “post_mapping_qc.Rmd” report template.

Make a copy of the Rmd template file and open it in Rstudio to perform the analysis. The report visualises the individual QC statistics and performs a correlation analysis of the QC statistics with gene-expression space principle-components.

This analysis helps to identify confounding technical sources of variation.

10. Exploratory analysis

After running pipeline_salmon.py the similarity between the samples in gene-expression space can be explored using the “exploratory_data_analysis.Rmd” R Markdown report template.

Make a copy of this file and open it in Rstudio to perform the analysis. The report produces plots showing hierarchical clustering of the samples by correlation of their expression profiles, the results of principle components analysis and a UMAP project of the samples.

Together with the post-mapping QC report this analysis is useful for the identification of outliers.

11. DESeq2 analysis

After running pipeline_salmon.py differential expression analysis can be performed using the “differential_expression.Rmd” R Markdown report template.

Make a copy of this file and open it in Rstudio to perform the analysis.