Configuration Files
Defining samples and sequencing files
Txseq uses two tab-separated files to specify the sample and sequencing library information. It is not necessary to rename or merge FASTQ files before running txseq: the pipelines automatically combine sequence data for the same sample and label outputs by the given “sample_id”.
(1) “samples.tsv”
A tab-separated text file with the following mandatory columns:
“sample_id”: a unique identifier for the sample
“type”: either ‘SE’ for single end or ‘PE’ for paired end
“strand”: either ‘none’, ‘forward’ or ‘reverse’ (see note below).
Sample metadata can also be stored in this table for downstream analysis, for example in columns such as:
“condition”
“replicate”
“age”
“sex”
“genotype”
“batch”
Note
strand values of ‘none’, ‘forward’ and ‘reverse’ will be used to set parameter values in txseq pipelines as follows:
“none”: data is treated as unstranded. This is appropriate for e.g. Illumina TruSeq and most single-cell protocols.
hisat: default, i.e. --rna-strandness not set
cufflinks: fr-unstranded
HT-seq: no
PICARD: NONE
SALMON: (I)U
“forward”: The first read (if paired) or read (if single end) corresponds to the transcript strand e.g. Directional Illumina, Standard SOLiD.
hisat: SE: F, PE: FR
cufflinks: fr-secondstrand
HT-seq: yes
PICARD: FIRST_READ_TRANSCRIPTION_STRAND
SALMON: (I)SF
“reverse”: The first read (if paired) or read (if single end) corresponds to the reverse complement of the transcript strand e.g. dUTP, NSR, NNSR
hisat: SE: R, PE: RF
cufflinks: fr-firststrand
HT-seq: reverse
PICARD: SECOND_READ_TRANSCRIPTION_STRAND
SALMON: (I)SR
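The mapping in this note can be written down as a lookup table. The following is a minimal sketch, not txseq's own code: the dictionary layout and the function name strand_params are illustrative assumptions. The sketch also assumes that cufflinks uses fr-unstranded for unstranded data, and it reads “(I)” in the SALMON values as the “I” (inward) prefix that salmon uses for paired-end libraries.

```python
# Lookup table transcribed from the note on strand values.
# The structure and names here are hypothetical, for illustration only.
STRAND_PARAMS = {
    "none": {
        "hisat": {"SE": None, "PE": None},  # --rna-strandness left unset
        "cufflinks": "fr-unstranded",       # assumption: unstranded library type
        "htseq": "no",
        "picard": "NONE",
        "salmon": "U",
    },
    "forward": {
        "hisat": {"SE": "F", "PE": "FR"},
        "cufflinks": "fr-secondstrand",
        "htseq": "yes",
        "picard": "FIRST_READ_TRANSCRIPTION_STRAND",
        "salmon": "SF",
    },
    "reverse": {
        "hisat": {"SE": "R", "PE": "RF"},
        "cufflinks": "fr-firststrand",
        "htseq": "reverse",
        "picard": "SECOND_READ_TRANSCRIPTION_STRAND",
        "salmon": "SR",
    },
}


def strand_params(strand, library_type):
    """Return the per-tool strand settings for one sample."""
    if strand not in STRAND_PARAMS:
        raise ValueError(f"unknown strand value: {strand!r}")
    if library_type not in ("SE", "PE"):
        raise ValueError(f"unknown library type: {library_type!r}")
    params = STRAND_PARAMS[strand]
    # Paired-end salmon library types carry an "I" (inward) prefix.
    salmon = ("I" if library_type == "PE" else "") + params["salmon"]
    return {
        "hisat": params["hisat"][library_type],
        "cufflinks": params["cufflinks"],
        "htseq": params["htseq"],
        "picard": params["picard"],
        "salmon": salmon,
    }
```

For a paired-end sample with strand set to “reverse”, strand_params("reverse", "PE") gives RF for hisat and ISR for salmon, in line with the list above.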
An example samples.tsv file is shown below:
sample_id tissue cell_type replicate type strand condition
curdlan_bm_gmp_R1 bm gmp R1 PE none curdlan
curdlan_bm_gmp_R2 bm gmp R2 PE none curdlan
curdlan_bm_gmp_R3 bm gmp R3 PE none curdlan
curdlan_bm_lthsc_R2 bm lthsc R2 PE none curdlan
curdlan_bm_lthsc_R3 bm lthsc R3 PE none curdlan
curdlan_bm_sthsc_R1 bm sthsc R1 PE none curdlan
sscontrol_bm_gmp_R1 bm gmp R1 PE none sscontrol
sscontrol_bm_lthsc_R2 bm lthsc R2 PE none sscontrol
sscontrol_bm_mpp_R2 bm mpp R2 PE none sscontrol
sscontrol_bm_mpp_R3 bm mpp R3 PE none sscontrol
sscontrol_bm_sthsc_R2 bm sthsc R2 PE none sscontrol
curdlan_bm_mpp_R1 bm mpp R1 PE none curdlan
curdlan_bm_mpp_R2 bm mpp R2 PE none curdlan
curdlan_bm_mpp_R3 bm mpp R3 PE none curdlan
curdlan_bm_sthsc_R3 bm sthsc R3 PE none curdlan
sscontrol_bm_mpp_R1 bm mpp R1 PE none sscontrol
sscontrol_bm_sthsc_R1 bm sthsc R1 PE none sscontrol
sscontrol_bm_sthsc_R3 bm sthsc R3 PE none sscontrol
curdlan_bm_lthsc_R1 bm lthsc R1 PE none curdlan
curdlan_bm_sthsc_R2 bm sthsc R2 PE none curdlan
sscontrol_bm_gmp_R2 bm gmp R2 PE none sscontrol
sscontrol_bm_gmp_R3 bm gmp R3 PE none sscontrol
sscontrol_bm_lthsc_R1 bm lthsc R1 PE none sscontrol
sscontrol_bm_lthsc_R3 bm lthsc R3 PE none sscontrol
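Before launching a pipeline it can be useful to check a samples.tsv file against the rules given above. The following is a minimal sketch of such a check using only the Python standard library; the function name check_samples is hypothetical and the check is not part of txseq.

```python
import csv
import io

MANDATORY = ("sample_id", "type", "strand")
TYPES = {"SE", "PE"}
STRANDS = {"none", "forward", "reverse"}


def check_samples(handle):
    """Return a list of problems found in an open samples.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in MANDATORY if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing mandatory column(s): {', '.join(missing)}"]
    problems = []
    seen = set()
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = row["sample_id"]
        if sample_id in seen:
            problems.append(f"line {n}: duplicate sample_id {sample_id!r}")
        seen.add(sample_id)
        if row["type"] not in TYPES:
            problems.append(f"line {n}: type must be SE or PE, got {row['type']!r}")
        if row["strand"] not in STRANDS:
            problems.append(
                f"line {n}: strand must be none, forward or reverse, "
                f"got {row['strand']!r}"
            )
    return problems


# Small in-memory example; the second data row repeats a sample_id and
# uses a strand value that is not allowed.
example = (
    "sample_id\ttype\tstrand\tcondition\n"
    "curdlan_bm_gmp_R1\tPE\tnone\tcurdlan\n"
    "curdlan_bm_gmp_R1\tPE\tunstranded\tcurdlan\n"
)
for problem in check_samples(io.StringIO(example)):
    print(problem)
```

Extra metadata columns such as “condition” are ignored by the check, so they can be added freely.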
(2) “libraries.tsv” (optional)
Required when starting from FASTQ files.
A tab-separated text file with the following mandatory columns:
“sample_id”: these values must match the sample_id values in samples.tsv
“lane”: an integer representing the sequencing lane/unit.
“flow_cell”: an integer representing the flow cell.
“fastq_path”: for SE libraries, the path to the FASTQ file. For PE libraries, the path to the read 1 FASTQ file; the path for read 2 is inferred by the pipelines.
Note
When samples have been sequenced across multiple lanes, use one line per lane. Comma-separated lane and fastq_path values are not supported. Quality control analysis is performed at lane level; lanes will be aggregated for quantitation.
Note
Paired-end FASTQ file names must end with “1.fastq.gz” or “2.fastq.gz” (alternatively “fastq.1.gz” or “fastq.2.gz”) for read 1 and read 2 respectively. For paired-end samples, the read 1 and read 2 FASTQ files for the same lane must be located in the same folder.
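Given these naming rules, the read 2 path can be derived from the read 1 path by swapping the read number in the file name suffix. The following is a minimal sketch of that idea; the function name read2_path is hypothetical and the pipelines' own logic may differ.

```python
def read2_path(read1_path):
    """Derive the read 2 FASTQ path from a read 1 path.

    Handles the two supported suffixes, "1.fastq.gz" and "fastq.1.gz".
    The directory part is left unchanged because both reads must be
    located in the same folder.
    """
    for read1_suffix, read2_suffix in (
        ("1.fastq.gz", "2.fastq.gz"),
        ("fastq.1.gz", "fastq.2.gz"),
    ):
        if read1_path.endswith(read1_suffix):
            return read1_path[: -len(read1_suffix)] + read2_suffix
    raise ValueError(f"not a recognised read 1 FASTQ name: {read1_path!r}")


print(read2_path("/well/kir/projects/mirror/ena/PRJNA521342/SRR8545249_1.fastq.gz"))
```

Only the suffix is rewritten, so a digit 1 elsewhere in the path (for example in the accession number) is left untouched.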
An example libraries.tsv file is shown below:
sample_id lane flow_cell fastq_path
curdlan_bm_gmp_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545249_1.fastq.gz
curdlan_bm_gmp_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545250_1.fastq.gz
curdlan_bm_gmp_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545251_1.fastq.gz
curdlan_bm_lthsc_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545253_1.fastq.gz
curdlan_bm_lthsc_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545254_1.fastq.gz
curdlan_bm_sthsc_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545258_1.fastq.gz
sscontrol_bm_gmp_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545261_1.fastq.gz
sscontrol_bm_lthsc_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545265_1.fastq.gz
sscontrol_bm_mpp_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545268_1.fastq.gz
sscontrol_bm_mpp_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545269_1.fastq.gz
sscontrol_bm_sthsc_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545271_1.fastq.gz
curdlan_bm_mpp_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545255_1.fastq.gz
curdlan_bm_mpp_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545256_1.fastq.gz
curdlan_bm_mpp_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545257_1.fastq.gz
curdlan_bm_sthsc_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545260_1.fastq.gz
sscontrol_bm_mpp_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545267_1.fastq.gz
sscontrol_bm_sthsc_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545270_1.fastq.gz
sscontrol_bm_sthsc_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545272_1.fastq.gz
curdlan_bm_lthsc_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545252_1.fastq.gz
curdlan_bm_sthsc_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545259_1.fastq.gz
sscontrol_bm_gmp_R2 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545262_1.fastq.gz
sscontrol_bm_gmp_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545263_1.fastq.gz
sscontrol_bm_lthsc_R1 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545264_1.fastq.gz
sscontrol_bm_lthsc_R3 1 1 /well/kir/projects/mirror/ena/PRJNA521342/SRR8545266_1.fastq.gz
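Because libraries.tsv has one line per lane, the files that belong to a sample are found by grouping rows on “sample_id”. The following is a minimal sketch of that grouping, with a check that every sample_id is also present in samples.tsv; the function name lanes_by_sample is hypothetical and this is not the pipelines' own implementation.

```python
import csv
import io
from collections import defaultdict


def lanes_by_sample(libraries_handle, sample_ids):
    """Group libraries.tsv rows by sample_id.

    Returns a dict mapping each sample_id to a list of
    (flow_cell, lane, fastq_path) tuples, one per line of the file.
    """
    reader = csv.DictReader(libraries_handle, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        sample_id = row["sample_id"]
        if sample_id not in sample_ids:
            raise ValueError(f"sample_id {sample_id!r} is not in samples.tsv")
        # lane and flow_cell are documented as integers.
        grouped[sample_id].append(
            (int(row["flow_cell"]), int(row["lane"]), row["fastq_path"])
        )
    return dict(grouped)


# In-memory example with made-up paths: sample s1 was sequenced on two lanes.
libraries = (
    "sample_id\tlane\tflow_cell\tfastq_path\n"
    "s1\t1\t1\t/data/s1_L1_1.fastq.gz\n"
    "s1\t2\t1\t/data/s1_L2_1.fastq.gz\n"
    "s2\t1\t1\t/data/s2_L1_1.fastq.gz\n"
)
for sample_id, lanes in lanes_by_sample(io.StringIO(libraries), {"s1", "s2"}).items():
    print(sample_id, len(lanes))
```

In the example file shown above every sample has a single lane, so each group would contain exactly one entry.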
Configuring and running pipelines
Run the txseq --help command to view the help documentation and find available pipelines to run.
The txseq pipelines are written using the cgat-core pipelining system. For more information please see the CGAT-core paper. Here we illustrate how the pipelines can be run using the salmon pipeline as an example.
Following installation, to find the available pipelines run:
txseq -h
Next, generate a configuration yml file:
txseq salmon config -v5
To fully run e.g. the txseq salmon pipeline the following command is used:
txseq salmon make full -v5 -p20
The “-v5” flag sets the verbosity to the maximum level and the “-p20” flag tells the pipeline to launch up to 20 jobs in parallel: this number should be set according to the number of samples and the availability of compute resources.
It is also possible to run individual pipeline tasks to get a feel for what each one is doing. Individual tasks can be executed by name, e.g.
txseq salmon make quant -v5 -p20
Note
If any upstream tasks are out of date they will automatically be run before the named task is executed.
Getting Started
To get started please see the Mouse hscs example.