pipeline_ensembl.py

Overview

This pipeline post-processes Ensembl annotation files to prepare a set of reference files suitable for the analysis of RNA-sequencing data. Ensembl sequences are used rather than Gencode sequences to avoid the problem described in Salmon issue 214. Gencode and Ensembl GTF files are in any case equivalent.

This pipeline prepares genome and annotation files that (a) only include records from PRIMARY contigs, i.e. chromosomes, mitochondrial sequences, unlocalised and unplaced scaffolds, but not alternative (ALT) contigs, and (b) exclude multi-placed transcript sequences by hard masking the Y chromosome PAR regions in the genome sequence and excluding records from those regions in the annotation.

It performs the following tasks:

Makes a version of the primary assembly FASTA file in which the Y chromosome PAR regions are hard masked.
Makes a FASTA file containing the coding (cDNA) and non-coding (ncRNA) Ensembl transcripts from PRIMARY contigs that are not multi-placed transcript sequences.
Makes a similarly filtered version of the Ensembl geneset GTF file.
Makes a transcript -> gene map for use with Salmon.
Makes a transcript information table that contains information on transcript and gene names and biotypes.

Configuration

The pipeline requires a configured pipeline_ensembl.yml file.

A default configuration file can be generated by executing:

txseq ensembl config

Input files

The pipeline requires the following inputs:

The Ensembl primary assembly FASTA sequences
The Ensembl geneset in GTF format
The Ensembl cDNA FASTA sequences
The Ensembl ncRNA FASTA sequences
PAR region definitions in BED format

The locations of these files must be specified in the ‘pipeline_ensembl.yml’ file.

Output files

The pipeline creates an “api” folder with the following files for use by downstream pipelines:

api.dir/txseq.genome.fa.gz: a copy of the Ensembl primary assembly in which Y PAR regions are hard masked
api.dir/txseq.transcript.fa.gz: all records from the Ensembl cDNA and ncRNA transcript FASTA files that are on primary contigs and not in the Y PAR region
api.dir/txseq.geneset.gtf.gz: all records from the Ensembl GTF file that are on primary contigs and not in the Y PAR region
api.dir/txseq.transcript.to.gene.map: a tab-separated list of transcript_id -> gene_id mappings for use with Salmon
api.dir/txseq.transcript.info.tsv.gz: a tab-separated table of transcript information (including transcript_name, transcript_biotype, gene_name and gene_biotype)

Code

txseq.pipeline_ensembl.extractYPAR(infile, outfile): Make a BED file containing the coordinates of the PAR regions on the Y chromosome

txseq.pipeline_ensembl.hardMaskYPAR(infile, sentinel): Hard mask the chromosome Y PAR region
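To illustrate what hard masking means here, the following is a minimal sketch, not the pipeline's own implementation. It assumes the PAR regions are given as BED intervals (0-based, half-open) and that masking replaces the bases in each interval with N. The helper names hard_mask and parse_bed_line are hypothetical.

```python
def parse_bed_line(line):
    """Parse the first three BED columns into (contig, start, end)."""
    contig, start, end = line.rstrip("\n").split("\t")[:3]
    return contig, int(start), int(end)


def hard_mask(sequence, intervals):
    """Return `sequence` with every 0-based, half-open (start, end) interval set to N.

    Hypothetical helper: the pipeline may well delegate this step to an
    external tool instead.
    """
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so an over-long interval is harmless.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy sequence and toy interval, not real PAR coordinates.
print(hard_mask("ACGTACGTAC", [(2, 5)]))
```

Hard masking keeps the sequence length and coordinates unchanged, so annotation on the rest of the contig stays valid.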

txseq.pipeline_ensembl.contigs(infile, sentinel): Get a list of the contigs present in the primary assembly
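One way to obtain such a contig list is to read the FASTA header lines. The sketch below assumes that the contig name is the first whitespace-delimited token after ">" in each header; the function name contig_names is hypothetical and the pipeline may obtain the list differently.

```python
import io


def contig_names(handle):
    """Collect contig names from the header lines of a FASTA text handle."""
    names = []
    for line in handle:
        if line.startswith(">"):
            # Assumption: the name is the first token after ">".
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


# Toy FASTA content standing in for the primary assembly.
toy_fasta = io.StringIO(">1 toy description\nACGT\n>MT toy description\nAC\n")
print(contig_names(toy_fasta))
```

For a gzip-compressed FASTA file, a text handle can be opened with gzip.open(path, "rt") and passed to the same function.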

txseq.pipeline_ensembl.filteredTranscriptFasta(infiles, sentinel): Filter the Ensembl cDNA and ncRNA FASTA files to exclude genes on non-primary contigs and genes in the Y PAR region.

txseq.pipeline_ensembl.filteredGTF(infiles, sentinel): Filter the Ensembl geneset to exclude genes on non-primary contigs and genes in the Y PAR region.
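The filtering rule shared by the two steps above can be sketched for GTF records as follows. This is an illustration under stated assumptions rather than the pipeline's code: PAR regions are supplied as BED-style tuples (0-based, half-open), GTF coordinates are 1-based and inclusive, and a record is dropped if it overlaps a PAR interval at all. The sketch decides per record, whereas the pipeline is described as excluding whole genes. The names overlaps and keep_gtf_line are hypothetical.

```python
def overlaps(start, end, other_start, other_end):
    """True if two 1-based, inclusive intervals share at least one base."""
    return start <= other_end and other_start <= end


def keep_gtf_line(line, primary_contigs, par_regions):
    """Decide whether a GTF line survives the contig and PAR filters.

    primary_contigs: set of contig names from the primary assembly.
    par_regions: iterable of (contig, start, end) in BED coordinates.
    """
    if line.startswith("#"):
        return True  # keep header/comment lines
    fields = line.rstrip("\n").split("\t")
    contig, start, end = fields[0], int(fields[3]), int(fields[4])
    if contig not in primary_contigs:
        return False
    for par_contig, par_start, par_end in par_regions:
        # Convert BED (0-based, half-open) to 1-based inclusive: start + 1.
        if contig == par_contig and overlaps(start, end, par_start + 1, par_end):
            return False
    return True


# Toy values only; these are not real PAR coordinates.
primary = {"1", "Y"}
par = [("Y", 10, 20)]
record = 'Y\ttoy\tgene\t15\t30\t.\t+\t.\tgene_id "G1";'
print(keep_gtf_line(record, primary, par))
```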

txseq.pipeline_ensembl.transcriptToGeneMap(infile, sentinel): Make a map of transcripts to genes for use by Salmon
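As a rough sketch of how such a map can be derived from a GTF file, the example below reads the transcript_id and gene_id attributes from "transcript" records and emits one tab-separated pair per transcript. The function names are hypothetical, and the pipeline's actual implementation may differ, for example in how identifiers are versioned.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def transcript_to_gene(gtf_lines):
    """Map transcript_id -> gene_id using the 'transcript' records of a GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


# Toy GTF records with made-up identifiers.
toy_gtf = [
    '#!toy header',
    '1\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
    '1\ttoy\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\ttoy\ttranscript\t120\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
]
for transcript_id, gene_id in transcript_to_gene(toy_gtf).items():
    print(f"{transcript_id}\t{gene_id}")
```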

txseq.pipeline_ensembl.transcriptInfo(infile, sentinel): Extract transcript information from the GTF
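A transcript information table of this kind could be assembled roughly as follows. The sketch assumes that the GTF "transcript" records carry transcript_name, transcript_biotype, gene_name and gene_biotype attributes and leaves a field empty when an attribute is absent. The column order and the helper names are illustrative assumptions, not the pipeline's definitions.

```python
import csv
import io
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')
# Assumed column order for illustration only.
COLUMNS = ["transcript_id", "gene_id", "transcript_name",
           "transcript_biotype", "gene_name", "gene_biotype"]


def transcript_info_rows(gtf_lines):
    """Yield one dict per 'transcript' record, restricted to COLUMNS."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        yield {col: attrs.get(col, "") for col in COLUMNS}


def write_table(rows, handle):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Toy record with made-up identifiers.
toy = ('1\ttoy\ttranscript\t100\t200\t.\t+\t.\t'
       'gene_id "G1"; transcript_id "T1"; gene_name "GENE1"; '
       'gene_biotype "protein_coding"; transcript_name "GENE1-201"; '
       'transcript_biotype "protein_coding";')
out = io.StringIO()
write_table(transcript_info_rows([toy]), out)
print(out.getvalue(), end="")
```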

txseq.pipeline_ensembl.full(): Target to run the full pipeline