Software development in the Hammell lab focuses on improving quality control and maximizing data recovery from high throughput sequencing projects. In particular, the lab is interested in improving bioinformatics analysis of repetitive sequences, especially transposable elements, in order to gain novel (and previously ignored) biological insights into their functions in development and disease.
TEToolkit is a software package that utilizes both unambiguously (uniquely) and ambiguously (multi-) mapped reads to perform differential enrichment analyses from high throughput sequencing experiments. Currently, most expression analysis software packages are not optimized for handling the complexities involved in quantifying highly repetitive regions of the genome, especially transposable elements (TEs), from short sequencing reads. Although transposable elements make up between 20 and 80% of many eukaryotic genomes and contribute significantly to the cellular transcriptome output, the difficulty in quantifying their abundances from high throughput sequencing experiments has led them to be largely ignored in most studies. TEToolkit provides a noticeable improvement in the recovery of TE transcripts from RNA-Seq experiments and in the identification of peaks associated with repetitive regions of the genome.
You can download the software package from PyPI and GitHub. The transposable element GTF files required by TEtranscripts (see tool description below) and example data files (BAM) are available at this location.
The two tools, TEtranscripts and TEcount, quantify both gene and transposable element (TE) transcript abundances from RNA-Seq experiments, utilizing both uniquely and ambiguously mapped short read sequences. They process the short read alignments (BAM files) and proportionally assign read counts to the corresponding gene or TE based on the user-provided annotation files (GTF files). In addition, TEtranscripts combines multiple libraries and performs differential analysis using DESeq2.
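The idea of proportional assignment can be illustrated with a minimal Python sketch. It assumes the simplest possible scheme, in which a read with n candidate alignments contributes 1/n to each feature it hits. The function name `fractional_counts`, the input dictionary, and the feature names are hypothetical and chosen for illustration only; the procedure actually used by TEtranscripts and TEcount may be more involved.

```python
from collections import defaultdict


def fractional_counts(read_to_features):
    """Split each read evenly across the features its alignments overlap.

    read_to_features maps a read ID to a list with one feature name per
    alignment. This is an illustrative simplification, not the tool's code.
    """
    counts = defaultdict(float)
    for features in read_to_features.values():
        if not features:
            continue  # read overlaps no annotated feature
        weight = 1.0 / len(features)
        for feature in features:
            counts[feature] += weight
    return dict(counts)


# Hypothetical input: one uniquely mapped read and two multi-mapped reads.
alignments = {
    "read1": ["GeneA"],
    "read2": ["L1Md_Gf_dup1", "L1Md_Gf_dup2"],
    "read3": ["L1Md_Gf_dup1", "L1Md_Gf_dup2", "L1Md_Gf_dup3", "L1Md_Gf_dup4"],
}
counts = fractional_counts(alignments)
for feature in sorted(counts):
    print(feature, counts[feature])
```

Under this scheme every read contributes a total weight of one, so the summed counts equal the number of reads that overlap at least one feature.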
GTF files for gene annotation can be obtained from UCSC RefSeq, Ensembl, iGenomes or other annotation databases. GTF files for TE annotations are custom-generated from UCSC RepeatMasker or other annotation databases. They contain two custom attributes, class_id and family_id, corresponding to the class (e.g. LINE) and family (e.g. L1) of the corresponding transposable element. A unique ID (e.g. L1Md_Gf_dup1) is also assigned to each TE annotation in the transcript_id attribute. Pre-generated TE GTF files are available for a number of organisms, and can be downloaded here. If the organism or genome build you are interested in is not available, please contact us and provide a curated annotation of the transposable elements (e.g. genomic location and TE name/type). We will do our best to help you generate a suitable TE GTF file.
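As a rough illustration of how these attributes might be read, the Python sketch below parses a single TE GTF line and extracts transcript_id, family_id and class_id. The example line is made up: the coordinates, the source column and the gene_id value are assumptions, and only the attribute names and the values LINE, L1 and L1Md_Gf_dup1 come from the description above. The helper names are hypothetical.

```python
import re

# Matches 'key "value"' pairs in the ninth (attribute) column of a GTF line.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(attr_field):
    """Return the attribute column as a dict of key -> value."""
    return dict(ATTR_RE.findall(attr_field))


def parse_te_gtf_line(line):
    """Extract location and TE attributes from one tab-separated GTF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    attrs = parse_gtf_attributes(fields[8])
    return {
        "chrom": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "transcript_id": attrs.get("transcript_id"),
        "family_id": attrs.get("family_id"),
        "class_id": attrs.get("class_id"),
    }


# Illustrative line only; real TE GTF files may differ in column contents.
example = "\t".join([
    "chr1", "hypothetical_source", "exon", "1000", "2000", ".", "+", ".",
    'gene_id "L1Md_Gf"; transcript_id "L1Md_Gf_dup1"; '
    'family_id "L1"; class_id "LINE";',
])
record = parse_te_gtf_line(example)
print(record)
```

A check of this kind can be useful for confirming that a custom TE annotation carries the expected attributes before it is used for quantification.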
TEtranscripts analysis workflow
TEToolkit is open-source software released under the GNU General Public License version 3 (GPLv3).
Note: TEToolkit will not work with Python 3.
- Jin Y., Tam O.H., Paniagua E. and Hammell M. (2015). TEtranscripts: A package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics 31: 3593-3599. Pubmed ID: 26206304
For more information about how to use TEtranscripts:
ezBAMQC is a tool to check the quality of one or many mapped next-generation sequencing datasets. It conducts comprehensive evaluations of aligned sequencing data from multiple aspects, including: clipping profile, mapping quality distribution, mapped read length distribution, genomic/transcriptomic mapping distribution, inner distance distribution (for paired-end reads), ribosomal RNA contamination, transcript 5’ and 3’ end bias, transcription dropout rate, sample correlations, sample reproducibility, and sample variations. It outputs a set of tables and plots and one HTML page that contains a summary of the results. Many metrics are designed specifically for RNA-seq data, but ezBAMQC can be applied to any mapped sequencing dataset, such as RNA-seq, CLIP-seq, GRO-seq, ChIP-seq or DNA-seq.
The ezBAMQC software package is written for Python 2.7.x. Installing ezBAMQC from a pre-compiled package requires pysam (v0.8.3 or higher), R (2.15.x or greater) and the corrplot R package. If compiling ezBAMQC from source, you will need a compiler with C++11 support. This is provided by GNU GCC (version 4.8.1 or greater) on Linux, or Xcode (version 4.2 or greater) on Mac OS X.
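The minimum versions listed above could be checked with a small helper like the one sketched below. This is not part of ezBAMQC; the function names are hypothetical, and the installed version strings in the example are invented for illustration rather than queried from a real system. Versions are compared as tuples of integers, since a plain string comparison would wrongly rank 0.10.0 below 0.8.3.

```python
import re


def version_tuple(version):
    """Turn the leading dotted-numeric part of a version into a tuple.

    For example '0.8.3' becomes (0, 8, 3); anything after the leading
    numeric part (such as 'rc1') is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("no numeric version found in %r" % version)
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Minimum versions taken from the requirements above.
MINIMUMS = {"pysam": "0.8.3", "R": "2.15", "gcc": "4.8.1"}

# Hypothetical installed versions, for illustration only.
installed = {"pysam": "0.10.0", "R": "2.14.2", "gcc": "4.8.1"}

results = {
    name: meets_minimum(installed[name], minimum)
    for name, minimum in MINIMUMS.items()
}
for name in sorted(results):
    print(name, "OK" if results[name] else "too old")
```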
ezBAMQC is open-source software released under the GNU General Public License version 3 (GPLv3).
Note: ezBAMQC will not work with Python 3.
Please see our GitHub page for more details.