Genomes and Annotations
Introduction
In general, for RNA-seq analysis it is best to use genome sequences and gene annotations that:
Include sequences and gene annotations from PRIMARY contigs, i.e. chromosomes, mitochondrial sequences, unlocalised and unplaced scaffolds
Do not include sequences and gene annotations from alternative (ALT) contigs
Exclude multi-placed transcript sequences by masking the sequence of, and excluding annotation records from, the Y chromosome pseudoautosomal regions (PARs).
The txseq repository uses genome sequences and gene annotations retrieved from Ensembl. These are pre-processed using the “txseq ensembl” command which addresses all the issues above and outputs sanitised genome sequence and annotation files.
Note
If you are using the KIR BMRC workspace, sanitised genome sequences and gene annotations can be found in the “/well/kir/projects/mirror/txseq/” directory.
Retrieving genome sequences and gene annotations
The following files are required:
The Ensembl primary assembly FASTA sequences
The Ensembl geneset in GTF format
The Ensembl cDNA FASTA sequences
The Ensembl ncRNA FASTA sequences
PAR region definitions in BED format
The current Ensembl genome and annotation files can be retrieved from the Ensembl FTP website.
PAR region locations can be retrieved from, for example, the Genome Reference Consortium.
Example 1: obtaining genome sequences and gene annotation files for analysis of human data
The human Ensembl genome and annotation files (for Ensembl release 110) can be retrieved using the following commands:
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
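Interrupted transfers can leave truncated archives that only fail much later in the pipeline. As a minimal sketch, the downloaded files can be checked with gzip's built-in integrity test. The helper name check_gz is hypothetical and is not part of txseq.

```shell
# Hypothetical helper (not part of txseq): report whether each gzip file
# passes gzip's built-in integrity test. Returns non-zero if any file fails.
# Usage: check_gz Homo_sapiens.GRCh38.*.gz
check_gz() {
    status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            status=1
        fi
    done
    return "$status"
}
```

Any file reported as CORRUPT should be downloaded again before continuing.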
The human PAR coordinates should then be used to prepare a BED file. For example, for the GRCh38.p14 release of the human genome, the (tab-separated) file should look like this:
X 10001 2781479 PAR.1
X 155701383 156030895 PAR.2
Y 10001 2781479 PAR.1
Y 56887903 57217415 PAR.2
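Because the file must be tab-separated, typing it by hand in an editor that converts tabs to spaces can silently break it. One way to avoid this, sketched here with an assumed output file name (human.GRCh38.par.bed), is to write the rows listed above with printf:

```shell
# Write the GRCh38.p14 PAR definitions listed above as a tab-separated file.
# The output file name is an assumption; use whichever path you give the pipeline.
par_bed="human.GRCh38.par.bed"
printf '%s\t%s\t%s\t%s\n' \
    X 10001 2781479 PAR.1 \
    X 155701383 156030895 PAR.2 \
    Y 10001 2781479 PAR.1 \
    Y 56887903 57217415 PAR.2 \
    > "$par_bed"

# Sanity check: every line must have exactly four tab-separated fields.
awk -F '\t' 'NF != 4 { bad = 1 } END { exit bad }' "$par_bed" \
    && echo "$par_bed has four tab-separated columns on every line"
```

printf reuses its format string for each group of four arguments, so each row is written with real tab characters.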
Example 2: obtaining genome sequences and gene annotation files for analysis of mouse data
The mouse Ensembl genome and annotation files (for Ensembl release 110) can be retrieved using the following commands:
wget https://ftp.ensembl.org/pub/release-110/gtf/mus_musculus/Mus_musculus.GRCm39.110.gtf.gz
wget https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/cdna/Mus_musculus.GRCm39.cdna.all.fa.gz
wget https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/ncrna/Mus_musculus.GRCm39.ncrna.fa.gz
The mouse PAR locations should then be used to prepare a BED file. For example, for the GRCm39 release of the mouse genome, the (tab-separated) file should look like this:
X 168752755 169376592 PAR
Y 90757114 91355967 PAR
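Ensembl FASTA headers name the sex chromosomes X and Y (without a chr prefix), so the first column of the BED file must use the same names. As a rough sketch, the following hypothetical helper (missing_contigs, not part of txseq) lists any BED contig names that do not appear in a gzipped FASTA file:

```shell
# Hypothetical helper (not part of txseq): print BED contig names that are
# absent from the sequence headers of a gzipped FASTA file.
# Usage: missing_contigs par.bed Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
missing_contigs() {
    bed="$1"
    fasta_gz="$2"
    contigs=$(mktemp)
    # Take the first word of each header line and drop the leading ">".
    gzip -dc "$fasta_gz" \
        | awk '/^>/ { sub(/^>/, "", $1); print $1 }' \
        | sort -u > "$contigs"
    # comm -23 keeps lines found only in the first input (the BED contigs).
    cut -f1 "$bed" | sort -u | comm -23 - "$contigs"
    rm -f "$contigs"
}
```

Empty output means every contig named in the BED file is present in the genome.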
Preparing the txseq-sanitised genome and annotations
It is recommended to prepare txseq-sanitised genome sequences and annotations in a central location for use in all of your RNA-seq projects.
In a suitable directory, obtain a copy of the pipeline_ensembl.py configuration file:
txseq ensembl config
After editing the .yml file to provide the locations of the Ensembl genome, Ensembl annotations and the PAR bed file, execute the pipeline with the following command:
txseq ensembl make full -v5 -p20
The output of the pipeline is an “api.dir” folder that contains the following files that can be used to build indexes for RNA-seq mapping and quantification tools:
txseq.geneset.gtf.gz - the sanitised geneset
txseq.genome.fa.gz - the sanitised and PAR-masked genome
txseq.transcript.fa.gz - the sanitised transcripts
txseq.transcript.info.tsv.gz - a flat tsv table of transcript information (for the sanitised transcript set)
txseq.transcript.to.gene.map - a flat 2-column tsv table containing a map of transcripts -> genes (for the sanitised transcript set)
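Before building indexes, it can be worth confirming that the run produced every file listed above. The following is a minimal sketch; the helper name check_api_dir is hypothetical and not part of txseq, and it assumes the files sit directly inside the api.dir folder.

```shell
# Hypothetical helper (not part of txseq): confirm that the output files
# listed above are present and non-empty in the given folder
# (defaults to ./api.dir). Returns non-zero if any file is missing.
check_api_dir() {
    api_dir="${1:-api.dir}"
    status=0
    for f in txseq.geneset.gtf.gz txseq.genome.fa.gz txseq.transcript.fa.gz \
             txseq.transcript.info.tsv.gz txseq.transcript.to.gene.map; do
        if [ -s "$api_dir/$f" ]; then
            echo "found   $f"
        else
            echo "missing $f"
            status=1
        fi
    done
    return "$status"
}
```

Running check_api_dir with no arguments from the pipeline directory checks ./api.dir.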