Software development in the Hammell lab focuses on improving quality control and maximizing data recovery from high throughput sequencing projects. The lab is especially interested in improving bioinformatics analysis of repetitive sequences, particularly transposable elements, in order to uncover novel (and previously ignored) biological insights into their functions in development and disease.

We have developed the following tools to address various bioinformatics needs:

TEToolkit

Summary

TEToolkit is a software package that utilizes both unambiguously (uniquely) and ambiguously (multi-) mapped reads to perform differential enrichment analyses from high throughput sequencing experiments. Currently, most expression analysis software packages are not optimized for handling the complexities involved in quantifying highly repetitive regions of the genome, especially transposable elements (TEs), from short sequencing reads. Although transposable elements make up between 20 and 80% of many eukaryotic genomes and contribute significantly to the cellular transcriptome output, the difficulty in quantifying their abundances from high throughput sequencing experiments has led them to be largely ignored in most studies. TEToolkit provides a noticeable improvement in the recovery of TE transcripts from RNA-Seq experiments and in the identification of peaks associated with repetitive regions of the genome.

If you encounter any issues or have any questions about TEToolkit, please refer to our FAQ page, check out our GitHub page, or contact us.

Download instructions

You can download the software package from PyPI and GitHub. The transposable element GTF files required by TEtranscripts (see tool description below) and example data files (BAM) are available at this location.

Tool Description

The two tools, TEtranscripts and TEcount, quantify both gene and transposable element (TE) transcript abundances from RNA-Seq experiments, utilizing both uniquely and ambiguously mapped short read sequences. They process the short-read alignments (BAM files) and proportionally assign read counts to the corresponding gene or TE based on the user-provided annotation files (GTF files). In addition, TEtranscripts combines multiple libraries and performs differential analysis using DESeq2.
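To illustrate what proportional assignment of ambiguously mapped reads can look like, here is a minimal sketch in plain Python. It is not taken from the TEToolkit source: it assumes the simplest possible scheme, in which a read with n candidate features contributes 1/n to each, and the function name `assign_counts` and the example read and feature names are hypothetical. The actual tools may weight alignments differently.

```python
from collections import defaultdict


def assign_counts(read_hits):
    """Distribute each read evenly across the features it aligns to.

    read_hits maps a read name to the list of features (genes or TEs)
    overlapped by its alignments. A uniquely mapped read has one entry;
    a multi-mapped read has several. This equal-split rule is an
    assumption made for illustration only.
    """
    counts = defaultdict(float)
    for features in read_hits.values():
        if not features:
            continue  # read overlaps no annotated feature
        weight = 1.0 / len(features)
        for feature in features:
            counts[feature] += weight
    return dict(counts)


# Hypothetical input: one unique read, two multi-mapped reads, one unannotated read.
example_hits = {
    "read1": ["GeneA"],
    "read2": ["L1Md_Gf", "L1Md_T"],
    "read3": ["L1Md_Gf", "L1Md_T", "L1Md_A"],
    "read4": [],
}

counts = assign_counts(example_hits)
for feature in sorted(counts):
    print("%s\t%.3f" % (feature, counts[feature]))
```

Under this scheme the total count equals the number of reads that overlap at least one feature, so multi-mapped reads are neither discarded nor counted more than once.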

GTF files for gene annotation can be obtained from UCSC RefSeq, Ensembl, iGenomes or other annotation databases. GTF files for TE annotations are custom-generated from UCSC RepeatMasker or other annotation databases. They contain two custom attributes, class_id and family_id, corresponding to the class (e.g. LINE) and family (e.g. L1) of the corresponding transposable element. A unique ID (e.g. L1Md_Gf_dup1) is also assigned to each TE annotation in the transcript_id attribute. Pre-generated TE GTF files are available for a number of organisms, and can be downloaded here. If the organism or genome build you are interested in is not available, please contact us and provide a curated annotation of the transposable elements (e.g. genomic location and TE name/type). We will do our best to help you generate a suitable TE GTF file.
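As a rough illustration of the attributes described above, the following sketch reads class_id, family_id and transcript_id from a single TE GTF line using only the Python standard library. The example line is made up for demonstration: the coordinates, the source column and the presence of a gene_id attribute are assumptions, not a copy of a distributed file. The helper name `parse_te_gtf_line` is likewise hypothetical.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";?')


def parse_te_gtf_line(line):
    """Return the location and TE attributes from one tab-delimited GTF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns, got %d" % len(fields))
    attrs = dict(ATTR_RE.findall(fields[8]))
    return {
        "chrom": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "transcript_id": attrs.get("transcript_id"),  # unique ID per TE copy
        "family_id": attrs.get("family_id"),          # e.g. L1
        "class_id": attrs.get("class_id"),            # e.g. LINE
    }


# Hypothetical example line (values invented for illustration).
example_line = (
    "chr1\tmm10_rmsk\texon\t3000001\t3000097\t.\t-\t.\t"
    'gene_id "L1Md_Gf"; transcript_id "L1Md_Gf_dup1"; '
    'family_id "L1"; class_id "LINE";'
)

record = parse_te_gtf_line(example_line)
print(record["transcript_id"], record["family_id"], record["class_id"])
```

A check like this can help confirm that a custom TE annotation carries the expected attributes before it is passed to the tools.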

Workflow implemented in TEtranscripts

TEtranscripts analysis workflow

The TEToolkit software package is written for Python (2.6.x or 2.7.x), and requires pysam (v0.9.x or higher), R (2.15.x or greater) and DESeq2 (1.10.x or greater).

TEToolkit is open-source software released under the GNU General Public License version 3 (GPLv3).

Note: TEToolkit will not work with Python 3.

Citation

Please cite the following article when using TEtranscripts:

For more information about how to use TEtranscripts:

ezBAMQC

Summary

ezBAMQC is a tool to check the quality of one or many mapped next-generation sequencing datasets. It conducts comprehensive evaluations of aligned sequencing data from multiple aspects, including: clipping profile, mapping quality distribution, mapped read length distribution, genomic/transcriptomic mapping distribution, inner distance distribution (for paired-end reads), ribosomal RNA contamination, transcript 5’ and 3’ end bias, transcription dropout rate, sample correlations, sample reproducibility, and sample variations. It outputs a set of tables and plots, plus one HTML page that contains a summary of the results. Many metrics are designed specifically for RNA-seq data, but ezBAMQC can be applied to any mapped sequencing dataset, such as RNA-seq, CLIP-seq, GRO-seq, ChIP-seq or DNA-seq.
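To give a sense of one of the metrics listed above, here is a minimal sketch of a mapping quality (MAPQ) distribution computed from SAM-formatted text lines. It is not ezBAMQC's implementation, which works on BAM files and reports many more metrics. It relies only on the SAM column layout (FLAG in column 2, MAPQ in column 5) and skips reads flagged as unmapped; the function name `mapq_distribution` and the example records are hypothetical.

```python
from collections import Counter


def mapq_distribution(sam_lines):
    """Count mapped alignments per MAPQ value from SAM text lines."""
    dist = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        dist[int(fields[4])] += 1
    return dict(dist)


# Hypothetical SAM records, invented for illustration.
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr2\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

distribution = mapq_distribution(example_sam)
for mapq in sorted(distribution):
    print("MAPQ %d: %d alignments" % (mapq, distribution[mapq]))
```

A distribution dominated by low MAPQ values typically indicates a large share of ambiguously mapped reads, which is worth knowing before quantifying repetitive regions.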

If you encounter any issues or have any questions about ezBAMQC, please contact us or check out our GitHub page.

Download instructions

You can download the source code from PyPI or GitHub, or a pre-compiled version of the software package from GitHub.

Tool Description

The ezBAMQC software package is written for Python 2.7.x. Installing ezBAMQC from a pre-compiled package requires pysam (v0.8.3 or higher), R (2.15.x or greater) and the corrplot R package. If compiling ezBAMQC from source, you will need a compiler with C++11 support. This is provided by GNU GCC (version 4.8.1 or greater) on Linux, or Xcode (version 4.2 or greater) on Mac OS X.

ezBAMQC is open-source software released under the GNU General Public License version 3 (GPLv3).

Note: ezBAMQC will not work with Python 3.

Citation

Please see our GitHub page for more details.