Mouse HSCs example ================== Introduction ------------ For this example we use paired end RNA-seq data from `Regan-Komito et al 'GM-CSF drives dysregulated hematopoietic stem cell activity and pathogenic extramedullary myelopoiesis in experimental spondyloarthritis' Nature Communicaitons 2020 `_. The data can be retrieved from the European Nucleotide Archive (ENA) under accession `PRJNA521342 `_. .. note:: If you are working in the Kennedy workspace on the Oxford BMRC cluster, these data are already available in the "/well/kir/projects/mirror/ena/PRJNA521342/" folder. 1. Getting the configuration files and report templates ------------------------------------------------------- Make a suitable local folder and copy the samples.tsv and libraries.tsv for the "mouse_hscs" example into it. Here we choose to run the example in a folder with the path "~/work/hscs_example" :: mkdir ~/work/txseq_hscs_example cd ~/work/txseq_hscs_example cp -r /path/to/txseq/examples/mouse_hscs/* . cp -r /path/to/txseq/reports . Edit the txseq/libraries.tsv file to point to the location of the FASTQ files retrieved from the ENA on your system. 2. Retrieving and preparing annotations ---------------------------------------- .. note:: If you are working in the Kennedy workspace on the Oxford BMRC cluster, the required txseq annotations are available in the mirror folder and this step can be skipped. Follow the instructions on the :doc:`Genomes and Annotations ` page to: #. Retrieve the required sequence and annotation files. #. Run the "txseq ensembl" pipeline to prepare santitised mouse genome and annotation files for RNA-seq analysis. 3 Building the Salmon and Hisat2 transcriptome indexes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. note:: If you are working in the Kennedy workspace on the Oxford BMRC cluster, txseq mouse transcriptome indexes have already been built in the mirror folder and this step can be skipped. The example requires Salmon and Hisat2 indexes. To build these, follow the :doc:`Building transcriptome indexes ` instructions. 4. Checking FASTQ read quality ------------------------------ First we run the `FastQC tool `_ using :doc:`pipeline_fastqc.py`. Get a copy of the configuration file and edit appropriately to ensure that the correct "libraries.tsv" and "samples.tsv" files paths are set :: cd ~/work/txseq_hscs_example mkdir fastqc cd fastqc txseq fastqc config # get a copy of the default configuration file emacs pipeline_fastqc.yml # edit the configuration file as appropriate The pipeline can then be run as follows :: txseq fastqc make full -v5 -p20 The pipeline parses and store the FastQC output into sqlite database in a file called "csvdb". To visualise the results, open the associated R Markdown Report ("reports/fastqc.Rmd") in Rstudio, assign the location of the database to the "fastqc_sqlite_database" variable and knit the report. 5. Mapping with Hisat2 ---------------------- Next we map the data using :doc:`pipeline_hisat.py`. Fetch and edit a copy of the configuration file to set the paths to the "libraries.tsv" and "samples.tsv" files and hisat index :: cd ~/work/txseq_hscs_example mkdir hisat cd hisat txseq hisat config # get a copy of the default configuration file emacs pipeline_hisat.yml # edit the configuration file as appropriate The pipeline can then be run as follows :: txseq hisat make full -v5 -p20 The output BAM files are located in the "hisat.dir" sub-directory. 6. Generating post-mapping QC statistics ---------------------------------------- After mapping with Hisat2, post-mapping QC statistics are computed using :doc:`pipeline_bamqc.py`. This pipeline runs several `Picard `_ tools including *CollectRnaSeqMetrics*, *EstimateLibraryComplexity*, *AlignmentSummaryMetrics* and *CollectInsertSizeMetrics* as well as some custom scripts. :: cd ~/work/txseq_hscs_example mkdir bamqc cd bamqc txseq bamqc config # get a copy of the default configuration file emacs pipeline_bamqc.yml # edit the configuration file as appropriate The pipeline can then be run as follows :: txseq bamqc make full -v5 -p20 The results are saved in an sqlite database in the "csvdb" file. 7. Quantitation with FeatureCounts ---------------------------------- Count tables can be extracted from the BAM file using :doc:`pipeline_feature_counts.py`. :: cd ~/work/txseq_hscs_example mkdir feature_counts cd feature_counts txseq feature_counts config # get a copy of the default configuration file emacs pipeline_feature_counts.yml # edit the configuration file as appropriate The pipeline can then be run as follows :: txseq feature_counts make full -v5 -p20 The results are saved in an sqlite database in the "csvdb" file. 8. Quantitation with Salmon --------------------------- To quantitate the data using :doc:`pipeline_salmon.py`, we begin by fetching and edit a copy of the configuration file to set the paths to the "libraries.tsv" and "samples.tsv" files and salmon index :: cd ~/work/txseq_hscs_example mkdir salmon cd salmon txseq salmon config # get a copy of the default configuration file emacs pipeline_salmon.yml # edit the configuration file as appropriate The pipeline can then be run as follows :: txseq salmon make full -v5 -p20 The results of the pipeline are stored in the "csvdb" sqlite database and as a tximeta object in the "tximeta.dir/tximeta.RDS" for downstream analysis. Flat tables of TPMs can be retrieved from the database or from the "salmon.dir/salmon.transcripts.tpms.txt.gz" file. 9. Post-mapping QC analysis --------------------------- After running :doc:`pipeline_bamqc.py` and :doc:`pipeline_salmon.py` post-mapping QC can be performed using the "post_mapping_qc.Rmd" report template. Make a copy of the Rmd template file and open it in Rstudio to perform the analysis. The report visualises the individual QC statistics and performs a correlation analysis of the QC statistics with gene-expression space principle-components. This analysis helps to identify confounding technical sources of variation. 10. Exploratory analysis ----------------------- After running :doc:`pipeline_salmon.py` the similarity between the samples in gene-expression space can be explored using the "exploratory_data_analysis.Rmd" R Markdown report template. Make a copy of this file and open it in Rstudio to perform the analysis. The report produces plots showing hierarchical clustering of the samples by correlation of their expression profiles, the results of principle components analysis and a UMAP project of the samples. Together with the post-mapping QC report this analysis is useful for the identification of outliers. 11. DESeq2 analysis ------------------- After running :doc:`pipeline_salmon.py` differential expression analysis can be performed using the "differential_expression.Rmd" R Markdown report template. Make a copy of this file and open it in Rstudio to perform the analysis.