Pipeline feature_counts.py

Overview

This pipeline counts the reads mapped to transcript/gene models. It uses the featureCounts program from the Subread package.
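As a rough illustration of what a single counting step involves, the sketch below assembles a featureCounts command line for one BAM file. The flags shown (-a, -o, -T, -s, -p) are standard featureCounts options, but the helper name, file names and default values are hypothetical and are not taken from the pipeline's own code.

```python
import shlex


def build_featurecounts_command(bam, gtf, out, threads=4, strandedness=0, paired=False):
    """Return a featureCounts argument list.

    Hypothetical helper for illustration only; the pipeline's actual
    command and options are defined by its own code and yml file.
    """
    if strandedness not in (0, 1, 2):
        raise ValueError("strandedness must be 0 (unstranded), 1 or 2")
    cmd = [
        "featureCounts",
        "-a", gtf,                 # annotation file (GTF)
        "-o", out,                 # output counts file
        "-T", str(threads),        # number of threads
        "-s", str(strandedness),   # strand-specific counting mode
    ]
    if paired:
        # Marks the input as paired-end; depending on the Subread version,
        # additional options may be needed to count fragments rather than reads.
        cmd.append("-p")
    cmd.append(bam)
    return cmd


# Assumed example file names
cmd = build_featurecounts_command(
    "sample1.bam", "genes.gtf", "sample1.counts.txt", paired=True
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps file names with spaces intact if the list is later passed to subprocess.run.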

Configuration

The pipeline requires a configured pipeline_feature_counts.yml file.

A default configuration file can be generated by executing:

txseq salmon feature_counts

Inputs

The pipeline requires the following inputs:

samples.tsv: see Configuration files
txseq annotations: the location where pipeline_ensembl.py was run to prepare the annotations.
bam files: the location of a folder containing the BAM files, named by “sample_id”.
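Before launching the pipeline it can be useful to confirm that every sample has a matching BAM file. The following is a minimal sketch of such a check. It assumes that samples.tsv is tab-separated with a "sample_id" column and that the BAM files are named "<sample_id>.bam"; both are assumptions made for illustration, and the function names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def expected_bam_paths(samples_tsv, bam_dir):
    """Map each sample_id in samples.tsv to its expected BAM path.

    Assumes a tab-separated file with a 'sample_id' column and BAM
    files named '<sample_id>.bam' (illustrative assumptions).
    """
    with open(samples_tsv, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {
            row["sample_id"]: Path(bam_dir) / f"{row['sample_id']}.bam"
            for row in reader
        }


def missing_bams(samples_tsv, bam_dir):
    """Return the sorted sample_ids that have no BAM file in bam_dir."""
    paths = expected_bam_paths(samples_tsv, bam_dir)
    return sorted(sid for sid, path in paths.items() if not path.exists())


# Demonstration with throwaway files: s1 has a BAM file, s2 does not.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "samples.tsv").write_text(
        "sample_id\tcondition\ns1\tcontrol\ns2\ttreated\n"
    )
    (tmp / "s1.bam").touch()
    missing = missing_bams(tmp / "samples.tsv", tmp)

print(missing)
```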

Requirements

The following software is required:

Subread

Output files

The pipeline produces the following outputs:

per-sample results: in the “feature.counts.dir” subdirectory
An SQLite database: in a file named “csvdb”, which contains the per-gene counts.
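The “csvdb” file can be inspected with Python's standard sqlite3 module. Because the table names are not listed here, the sketch below simply enumerates whatever tables a database contains. The demonstration creates a throwaway database with a hypothetical table name ("demo_counts"); it does not reflect the pipeline's real schema.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in an SQLite database, sorted by name."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demonstration on a temporary database; 'demo_counts' is a made-up table name.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "csvdb")
    with closing(sqlite3.connect(demo_db)) as conn:
        conn.execute(
            "CREATE TABLE demo_counts (gene_id TEXT, sample_id TEXT, counts INTEGER)"
        )
        conn.commit()
    tables = list_tables(demo_db)

print(tables)
```

Note that sqlite3.connect creates an empty database file if the path does not exist, so check the path before pointing list_tables at a real project directory.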

Code

txseq.pipeline_feature_counts.count(infile, sentinel): Run featureCounts.

txseq.pipeline_feature_counts.loadCounts(infiles, outfile): Combine the count data and load it into the project database.

txseq.pipeline_feature_counts.geneCounts(infile, outfile): Prepare a gene-by-sample table of featureCounts counts.

txseq.pipeline_feature_counts.loadGeneCounts(infile, outfile): Load the gene-by-sample matrix of count data into the project database.

txseq.pipeline_feature_counts.loadTranscriptInfo(infile, outfile): Load the annotations for salmon into the project database.

txseq.pipeline_feature_counts.nGenesDetected(infile, outfile): Count the genes detected by featureCounts (counts > 0) in each sample.

txseq.pipeline_feature_counts.loadNGenesDetected(infile, outfile): Load the number of genes detected in each sample into the project database.
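The gene-by-sample table and the per-sample count of detected genes described above can be illustrated with a small, self-contained sketch. It is not the pipeline's implementation: the input is assumed to be a list of (gene_id, sample_id, count) records, and the function names and example values are hypothetical.

```python
from collections import defaultdict


def gene_by_sample(records):
    """Pivot (gene_id, sample_id, count) records into {gene_id: {sample_id: count}}."""
    table = defaultdict(dict)
    for gene_id, sample_id, count in records:
        table[gene_id][sample_id] = count
    return dict(table)


def n_genes_detected(table):
    """Count, for each sample, the genes with counts > 0."""
    detected = defaultdict(int)
    for per_sample in table.values():
        for sample_id, count in per_sample.items():
            detected[sample_id] += int(count > 0)
    return dict(detected)


# Made-up counts for three genes in two samples
records = [
    ("geneA", "s1", 10), ("geneA", "s2", 0),
    ("geneB", "s1", 3),  ("geneB", "s2", 7),
    ("geneC", "s1", 0),  ("geneC", "s2", 0),
]

table = gene_by_sample(records)
detected = n_genes_detected(table)
print(detected)  # {'s1': 2, 's2': 1}
```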