Configuration Files

Defining samples and sequencing files

Txseq requires two tab-separated files that specify the sample and sequencing library information. It is not necessary to rename or merge FASTQ files before running txseq: the pipelines automatically combine sequence data for the same sample and label outputs by the given “sample_id”.

(1) “samples.tsv”

A tab-separated text file with the following mandatory columns:

“sample_id”: a unique identifier for the sample
“type”: either ‘SE’ for single end or ‘PE’ for paired end
“strand”: either ‘none’, ‘forward’ or ‘reverse’ (see note below).

Sample metadata can also be stored in this table for downstream analysis, for example in columns such as:

“condition”
“replicate”
“age”
“sex”
“genotype”
“batch”

Note

The strand values ‘none’, ‘forward’ and ‘reverse’ are used to set parameter values in the txseq pipelines as follows:

“none”: data is treated as unstranded. This is appropriate for e.g. Illumina TruSeq and most single-cell protocols.
- hisat: default, i.e. --rna-strandness not set
- cufflinks: fr-unstranded
- HT-seq: no
- PICARD: NONE
- SALMON: (I)U
“forward”: The first read (if paired) or read (if single end) corresponds to the transcript strand e.g. Directional Illumina, Standard Solid.
- hisat: SE: F, PE: FR
- cufflinks: fr-secondstrand
- HT-seq: yes
- PICARD: FIRST_READ_TRANSCRIPTION_STRAND
- SALMON: (I)SF
“reverse”: The first read (if paired) or read (if single end) corresponds to the reverse complement of the transcript strand e.g. dUTP, NSR, NNSR
- hisat: SE: R, PE: RF
- cufflinks: fr-firststrand
- HT-seq: reverse
- PICARD: SECOND_READ_TRANSCRIPTION_STRAND
- SALMON: (I)SR
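
The mapping in this note can be read as a lookup keyed on the “strand” and “type” columns. The following is a minimal sketch of that lookup for HISAT, HTSeq, Picard and Salmon only; the names STRAND_PARAMS and tool_params are hypothetical and are not part of the txseq API, and the Salmon entries assume that the “(I)” prefix applies to paired-end libraries.

```python
# Illustrative lookup only: STRAND_PARAMS and tool_params are hypothetical
# names, not txseq functions. Values are copied from the note above.
STRAND_PARAMS = {
    "none": {
        "hisat": {"SE": None, "PE": None},  # None: strandness option left unset
        "htseq": "no",
        "picard": "NONE",
        "salmon": {"SE": "U", "PE": "IU"},
    },
    "forward": {
        "hisat": {"SE": "F", "PE": "FR"},
        "htseq": "yes",
        "picard": "FIRST_READ_TRANSCRIPTION_STRAND",
        "salmon": {"SE": "SF", "PE": "ISF"},
    },
    "reverse": {
        "hisat": {"SE": "R", "PE": "RF"},
        "htseq": "reverse",
        "picard": "SECOND_READ_TRANSCRIPTION_STRAND",
        "salmon": {"SE": "SR", "PE": "ISR"},
    },
}


def tool_params(strand, library_type):
    """Return the per-tool strand settings for one sample."""
    if strand not in STRAND_PARAMS:
        raise ValueError(f"unknown strand value: {strand!r}")
    if library_type not in ("SE", "PE"):
        raise ValueError(f"unknown type value: {library_type!r}")
    params = {}
    for tool, value in STRAND_PARAMS[strand].items():
        # Some tools need a different value for single-end and paired-end data.
        params[tool] = value[library_type] if isinstance(value, dict) else value
    return params


print(tool_params("reverse", "PE"))
```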

An example samples.tsv file is shown below:

sample_id	tissue	cell_type	replicate	type	strand	condition
curdlan_bm_gmp_R1	bm	gmp	R1	PE	none	curdlan
curdlan_bm_gmp_R2	bm	gmp	R2	PE	none	curdlan
curdlan_bm_gmp_R3	bm	gmp	R3	PE	none	curdlan
curdlan_bm_lthsc_R2	bm	lthsc	R2	PE	none	curdlan
curdlan_bm_lthsc_R3	bm	lthsc	R3	PE	none	curdlan
curdlan_bm_sthsc_R1	bm	sthsc	R1	PE	none	curdlan
sscontrol_bm_gmp_R1	bm	gmp	R1	PE	none	sscontrol
sscontrol_bm_lthsc_R2	bm	lthsc	R2	PE	none	sscontrol
sscontrol_bm_mpp_R2	bm	mpp	R2	PE	none	sscontrol
sscontrol_bm_mpp_R3	bm	mpp	R3	PE	none	sscontrol
sscontrol_bm_sthsc_R2	bm	sthsc	R2	PE	none	sscontrol
curdlan_bm_mpp_R1	bm	mpp	R1	PE	none	curdlan
curdlan_bm_mpp_R2	bm	mpp	R2	PE	none	curdlan
curdlan_bm_mpp_R3	bm	mpp	R3	PE	none	curdlan
curdlan_bm_sthsc_R3	bm	sthsc	R3	PE	none	curdlan
sscontrol_bm_mpp_R1	bm	mpp	R1	PE	none	sscontrol
sscontrol_bm_sthsc_R1	bm	sthsc	R1	PE	none	sscontrol
sscontrol_bm_sthsc_R3	bm	sthsc	R3	PE	none	sscontrol
curdlan_bm_lthsc_R1	bm	lthsc	R1	PE	none	curdlan
curdlan_bm_sthsc_R2	bm	sthsc	R2	PE	none	curdlan
sscontrol_bm_gmp_R2	bm	gmp	R2	PE	none	sscontrol
sscontrol_bm_gmp_R3	bm	gmp	R3	PE	none	sscontrol
sscontrol_bm_lthsc_R1	bm	lthsc	R1	PE	none	sscontrol
sscontrol_bm_lthsc_R3	bm	lthsc	R3	PE	none	sscontrol
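
Before launching a pipeline it can be useful to check a samples.tsv file against the rules listed above. The following is a minimal sketch of such a check using only the Python standard library; the function name check_samples is hypothetical, and txseq may perform its own, different validation.

```python
import csv
import io

MANDATORY = ("sample_id", "type", "strand")
ALLOWED = {
    "type": {"SE", "PE"},
    "strand": {"none", "forward", "reverse"},
}


def check_samples(handle):
    """Return a list of problems found in an open samples.tsv file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in MANDATORY if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing mandatory column(s): " + ", ".join(missing)]
    problems = []
    seen = set()
    # Data rows start on line 2; line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_id = row["sample_id"]
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate sample_id {sample_id!r}")
        seen.add(sample_id)
        for column, allowed in ALLOWED.items():
            if row[column] not in allowed:
                problems.append(
                    f"line {line_no}: {column}={row[column]!r} "
                    f"is not one of {sorted(allowed)}"
                )
    return problems


# A deliberately broken two-row example: repeated sample_id and invalid type.
example = io.StringIO(
    "sample_id\ttissue\ttype\tstrand\n"
    "curdlan_bm_gmp_R1\tbm\tPE\tnone\n"
    "curdlan_bm_gmp_R1\tbm\tpaired\tnone\n"
)
for problem in check_samples(example):
    print(problem)
```

For a real file, pass an open handle instead, e.g. `check_samples(open("samples.tsv", newline=""))`.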

(2) “libraries.tsv” (optional)

Required when starting from FASTQ files.

A tab-separated text file with the following mandatory columns:

“sample_id”: these values must match the sample_id values in samples.tsv.
“lane”: an integer representing the sequencing lane/unit.
“flow_cell”: an integer representing the flow cell.
“fastq_path”: for SE libraries, the path to the FASTQ file. For PE libraries, the path to the read 1 FASTQ file; the path for read 2 is inferred by the pipelines.

Note

When samples have been sequenced across multiple lanes, use one line per lane. Comma-separated lane and fastq_path values are not supported. Quality control analysis is performed at lane level; lanes will be aggregated for quantitation.

Note

Paired-end fastq files must end with “1|2.fastq.gz” or “fastq.1|2.gz”. For paired end samples the Read 1 and Read 2 FASTQ files for the same lane must be located in the same folder.
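
The naming rule above means the read 2 path can be derived from the read 1 path by swapping the read number in the suffix. The following is a minimal sketch of that derivation; the function name infer_read2_path is hypothetical and this is not necessarily how the txseq pipelines implement it.

```python
def infer_read2_path(read1_path):
    """Derive the read 2 FASTQ path from a read 1 path (illustrative only)."""
    # Each pair is (read 1 suffix, matching read 2 suffix).
    for read1_suffix, read2_suffix in (
        ("1.fastq.gz", "2.fastq.gz"),
        ("fastq.1.gz", "fastq.2.gz"),
    ):
        if read1_path.endswith(read1_suffix):
            # Only the suffix changes, so read 2 stays in the same folder.
            return read1_path[: -len(read1_suffix)] + read2_suffix
    raise ValueError(f"not a recognised read 1 file name: {read1_path!r}")


print(infer_read2_path("/well/kir/projects/mirror/ena/PRJNA521342/SRR8545249_1.fastq.gz"))
```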

An example libraries.tsv file is shown below:

sample_id	lane	flow_cell	fastq_path
curdlan_bm_gmp_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545249_1.fastq.gz
curdlan_bm_gmp_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545250_1.fastq.gz
curdlan_bm_gmp_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545251_1.fastq.gz
curdlan_bm_lthsc_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545253_1.fastq.gz
curdlan_bm_lthsc_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545254_1.fastq.gz
curdlan_bm_sthsc_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545258_1.fastq.gz
sscontrol_bm_gmp_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545261_1.fastq.gz
sscontrol_bm_lthsc_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545265_1.fastq.gz
sscontrol_bm_mpp_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545268_1.fastq.gz
sscontrol_bm_mpp_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545269_1.fastq.gz
sscontrol_bm_sthsc_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545271_1.fastq.gz
curdlan_bm_mpp_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545255_1.fastq.gz
curdlan_bm_mpp_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545256_1.fastq.gz
curdlan_bm_mpp_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545257_1.fastq.gz
curdlan_bm_sthsc_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545260_1.fastq.gz
sscontrol_bm_mpp_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545267_1.fastq.gz
sscontrol_bm_sthsc_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545270_1.fastq.gz
sscontrol_bm_sthsc_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545272_1.fastq.gz
curdlan_bm_lthsc_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545252_1.fastq.gz
curdlan_bm_sthsc_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545259_1.fastq.gz
sscontrol_bm_gmp_R2	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545262_1.fastq.gz
sscontrol_bm_gmp_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545263_1.fastq.gz
sscontrol_bm_lthsc_R1	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545264_1.fastq.gz
sscontrol_bm_lthsc_R3	1	1	/well/kir/projects/mirror/ena/PRJNA521342/SRR8545266_1.fastq.gz
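
To illustrate how one-line-per-lane entries relate to samples, the following minimal sketch groups libraries.tsv rows by sample_id and rejects comma-separated values. The function name fastqs_by_sample, the sample name sampleA and the /path/to/... paths are hypothetical; the pipelines' own aggregation logic may differ.

```python
import csv
import io
from collections import defaultdict


def fastqs_by_sample(libraries_handle, known_sample_ids):
    """Group (flow_cell, lane, fastq_path) tuples by sample_id."""
    reader = csv.DictReader(libraries_handle, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        sample_id = row["sample_id"]
        if sample_id not in known_sample_ids:
            raise ValueError(f"sample_id {sample_id!r} is not in samples.tsv")
        if "," in row["lane"] or "," in row["fastq_path"]:
            raise ValueError("comma-separated lane/fastq_path values are not supported")
        grouped[sample_id].append(
            (int(row["flow_cell"]), int(row["lane"]), row["fastq_path"])
        )
    # Sort so that lanes are listed in a stable order for each sample.
    return {sample_id: sorted(rows) for sample_id, rows in grouped.items()}


# Hypothetical sample sequenced across two lanes of one flow cell.
libraries = io.StringIO(
    "sample_id\tlane\tflow_cell\tfastq_path\n"
    "sampleA\t2\t1\t/path/to/sampleA_L2_1.fastq.gz\n"
    "sampleA\t1\t1\t/path/to/sampleA_L1_1.fastq.gz\n"
)
print(fastqs_by_sample(libraries, {"sampleA"}))
```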

Configuring and running pipelines

Run the txseq --help command to view the help documentation and find available pipelines to run.

The txseq pipelines are written using the cgat-core pipelining system. For more information please see the CGAT-core paper. Here we illustrate how the pipelines can be run using the salmon pipeline as an example.

Following installation, to find the available pipelines run:

txseq -h

Next, generate a configuration YAML file:

txseq salmon config -v5

To run a pipeline in full, e.g. the txseq salmon pipeline, use the following command:

txseq salmon make full -v5 -p20

The “-v5” flag sets the verbosity to the maximum level and the “-p20” flag tells the pipeline to launch up to 20 jobs in parallel: this number should be set according to the number of samples and the availability of compute resources.

It is also possible to run individual pipeline tasks to get a feel for what each one is doing. Individual tasks are executed by name, e.g.

txseq salmon make quant -v5 -p20

Note

If any upstream tasks are out of date they will automatically be run before the named task is executed.

Getting Started

To get started please see the Mouse hscs example.