Software development in the Hammell lab focuses on improving quality control and maximizing data recovery from high throughput sequencing projects. In particular, the lab is interested in improving bioinformatics analysis of repetitive sequences, especially transposable elements, in order to elucidate novel (and previously ignored) biological insights into their functions in development and disease.

We have developed the following tools to address various bioinformatics needs:

TEToolkit

Summary

TEToolkit is a software package that utilizes both unambiguously (uniquely) and ambiguously (multi-) mapped reads to perform differential enrichment analyses from high throughput sequencing experiments. Currently, most expression analysis software packages are not optimized for handling the complexities involved in quantifying highly repetitive regions of the genome, especially transposable elements (TEs), from short sequencing reads. Although transposable elements make up between 20 and 80% of many eukaryotic genomes and contribute significantly to the cellular transcriptome output, the difficulty in quantifying their abundances from high throughput sequencing experiments has led them to be largely ignored in most studies. TEToolkit improves the recovery of TE transcripts from RNA-Seq experiments and the identification of peaks associated with repetitive regions of the genome.

If you encounter any issues or have any questions about TEToolkit, please refer to our FAQ page or contact us.

Download instructions

You can download the software package from PyPI and GitHub. The transposable element GTF files required by TEtranscripts (see tool description below) and example data files (BAM) are available at this location.

Tool Description

There are two tools provided within the TEToolkit:

TEtranscripts quantifies both gene and transposable element (TE) transcript abundances from RNA-Seq experiments, utilizing both uniquely and ambiguously mapped short read sequences. It processes the short-read alignments (BAM files) and proportionally assigns read counts to the corresponding gene or TE based on the user-provided annotation files (GTF files). GTF files for gene annotation can be obtained from UCSC RefSeq, Ensembl, iGenomes or other annotation databases. GTF files for TE annotation are custom-generated from UCSC RepeatMasker or other annotation databases. They contain two custom attributes, class_id and family_id, corresponding to the class (e.g. LINE) and family (e.g. L1) of the corresponding transposable element. A unique ID (e.g. L1Md_Gf_dup1) is also assigned to each TE annotation in the transcript_id attribute. Pre-generated TE GTF files are available for a number of organisms, and can be downloaded here. If the organism or genome build you are interested in is not available, please contact us, and we’ll be happy to generate the files for you.
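To make the TE annotation attributes described above concrete, here is a minimal Python sketch that reads class_id, family_id and transcript_id from one GTF line. It is an illustration only and not part of TEToolkit: the function name parse_te_gtf_line, the example coordinates, the source column and the gene_id attribute are assumptions; only the three attribute names and the example values (LINE, L1, L1Md_Gf_dup1) come from the description above.

```python
import re

# Matches `key "value"` pairs in the ninth (attribute) column of a GTF line.
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_te_gtf_line(line):
    """Return the coordinates and TE attributes of a single GTF line.

    Hypothetical helper for illustration; not part of TEToolkit.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns, got %d" % len(fields))
    attrs = dict(_ATTR_RE.findall(fields[8]))
    return {
        "chrom": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "transcript_id": attrs.get("transcript_id"),  # unique ID of this TE copy
        "family_id": attrs.get("family_id"),          # TE family, e.g. L1
        "class_id": attrs.get("class_id"),            # TE class, e.g. LINE
    }


# Made-up example line: coordinates, source and gene_id are assumptions.
example = (
    "chr1\trmsk\texon\t3000001\t3000097\t.\t-\t.\t"
    'gene_id "L1Md_Gf"; transcript_id "L1Md_Gf_dup1"; '
    'family_id "L1"; class_id "LINE";'
)
record = parse_te_gtf_line(example)
print(record["transcript_id"], record["family_id"], record["class_id"])
```

A check of this kind can be useful for confirming that a custom TE GTF file carries all three attributes before it is passed to the tool.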

Figure: Workflow implemented in TEtranscripts
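The idea of proportionally assigning read counts can be illustrated with a simplified Python sketch. It assumes the simplest possible weighting, in which a read aligned to n annotated features contributes 1/n to each of them. This is a stand-in chosen for clarity and may differ from the weighting TEtranscripts actually applies. The function name fractional_counts and the feature names other than L1Md_Gf are hypothetical.

```python
from collections import defaultdict


def fractional_counts(read_hits):
    """Distribute each read equally across the features it aligns to.

    read_hits: iterable of lists, one list of feature names per read.
    Equal 1/n weighting is an assumption made for illustration only.
    """
    counts = defaultdict(float)
    for features in read_hits:
        if not features:
            continue  # read overlaps no annotated feature
        weight = 1.0 / len(features)
        for name in features:
            counts[name] += weight
    return dict(counts)


# Four hypothetical reads: two map uniquely, two map to two TE annotations.
reads = [
    ["GeneA"],
    ["L1Md_Gf", "L1Md_T"],
    ["L1Md_Gf", "L1Md_T"],
    ["L1Md_Gf"],
]
counts = fractional_counts(reads)
print(counts)
```

Under this scheme the counts always sum to the number of assigned reads, so multi-mapped reads are retained without being counted more than once.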

TEpeaks identifies regions enriched for protein binding to, or modification of, repetitive DNA and RNA sequences. It has been used in a variety of high throughput sequencing experiments, such as ChIP-Seq, CLIP-Seq and RIP-Seq. The tool performs peak calling with a method that extends the approach implemented by MACS, using ambiguously mapped reads and bin-correlation normalization to identify narrow enriched repetitive regions typically missed by standard approaches. Differential peak enrichment can also be performed to identify regions of differential protein association in high throughput sequencing experiments. Note: TEpeaks is currently in alpha, and may not be fully functional.

The TEToolkit software package is written for Python (2.6.x or 2.7.x), and requires pysam (v0.8.2.1 or higher), R (2.15.x or greater) and DESeq (1.5.x or greater).

TEToolkit is open-source software released under the GNU General Public License version 3 (GPLv3).

Note: TEToolkit will not work with Python 3.
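The minimum versions listed above can be checked before installation with a small Python sketch like the one below. It is an illustration and not part of TEToolkit: the helper names are hypothetical, and the installed version strings are assumed to have been collected beforehand, since how they are obtained depends on the system. Versions are compared as tuples of integers because plain string comparison would rank "0.10.0" below "0.8.2.1".

```python
import re

# Minimum versions taken from the requirements stated above.
MINIMUMS = {"pysam": "0.8.2.1", "R": "2.15", "DESeq": "1.5"}


def version_tuple(text):
    """Turn a version string such as '0.8.2.1' into (0, 8, 2, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def outdated(installed_versions, minimums=MINIMUMS):
    """Return the sorted names of required packages that are missing or too old."""
    return sorted(
        name
        for name, minimum in minimums.items()
        if name not in installed_versions
        or not meets_minimum(installed_versions[name], minimum)
    )


# Hypothetical installed versions, for illustration only.
installed = {"pysam": "0.10.0", "R": "2.14.1", "DESeq": "1.5.2"}
print(outdated(installed))
```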

Citation

Please cite the following article when using TEtranscripts:

BAMQC

Summary

BAMQC is a software package that performs quality control on alignment files from high throughput sequencing experiments and assesses their suitability for downstream analysis.

If you encounter any issues or have any questions about BAMQC, please contact us.

Download instructions

You can download the source code from PyPI or GitHub, or a pre-compiled version of the software package from GitHub.

Tool Description

The BAMQC software package is written for Python 2.7.x. Installing BAMQC from a pre-compiled package requires pysam (v0.8.3 or higher), R (2.15.x or greater) and the corrplot R package. If you compile BAMQC from source, you will also need a compiler with C++11 support. This is provided by GNU GCC (version 4.8.1 or greater) on Linux, or Xcode (version 4.2 or greater) on Mac OS X.

BAMQC is open-source software released under the GNU General Public License version 3 (GPLv3).

Note: BAMQC will not work with Python 3.

Citation