Sep 4, 2015

Research Adventure with ENCODE Data

At NYU, first-year PhD students in the Sackler Institute start their first semester with a week-long, full-time "Research Adventure" workshop. I was asked (at short notice) to mentor a group of students for something in Bioinformatics. Since I had recently attended the 2015 ENCODE Users Meeting, I decided to make the workshop all about working with ENCODE data.

I included tutorials on accessing ENCODE data, an Intro to Linux for complete computing novices (quite a few of our students), Genomic Intervals in the UCSC Genome Browser, use of BEDTools to compare genomic intervals for various factors, and a tutorial in R for data display. Later in the week we looked at gene expression with RNA-seq using TopHat and Cufflinks. The general plan for the 5-day workshop (for 6 students) was as follows:

Day 1
9:00-11:00 am        Lecture (2 hr): Introduction to Gene Regulation and Epigenetics
11:00 am-12:00 pm    Lecture (1 hr): Use of the High Performance Computing Cluster
12:00-2:00 pm        Working Lunch with HPC System Manager (2 hr): Set up HPC account for each student, practice Linux commands, move files from laptop to HPC account
2:00-4:00 pm         Exercise 1: Tutorials for Accessing ENCODE data through the ENCODE Portal, UCSC Genome Browser and ENSEMBL Browser

Day 2
9:00-11:00 am        Lecture & Demo (2 hr): The UCSC Genome Browser, BED file format, and BEDTools software
11:00 am-12:00 pm    Exercise 2: BEDTools Tutorial
12:00-1:00 pm        Lunch
1:00-3:00 pm         Exercise 3: Use of ENCODE Data and BEDTools to compute the Intersection of DNase hypersensitive sites with promoters of all RefSeq genes

Day 3
9:00-10:30 am        Lecture (1.5 hr): Computing Gene Expression with RNA-Seq
10:30 am-12:00 pm    Exercise 4: Align ENCODE RNA-seq data to hg19 reference genome with TopHat
12:00-1:00 pm        Lunch
1:00-4:00 pm         Continue work on Exercise 4

Day 4
9:00-10:00 am        Lecture (1 hr): Intro to data visualization with R
10:00 am-12:00 pm    Exercise 5: Try R tutorial (Code School)
12:00-1:00 pm        Lunch
1:00-2:00 pm         Lecture (1 hr): Differential Gene Expression with Cufflinks
2:00-4:00 pm         Planning for Research Project: choose ENCODE data for transcription factors, gene expression, and epigenetic markers. Literature search.

Day 5
9:00 am-12:00 pm     Work on Research Project
12:00-1:00 pm        Lunch
1:00-4:00 pm         Work on Data analysis and prepare presentation

I had six students in our research team: Elaine Fisher, Reuben Moncada, Shushan Sargsian, Beny Shapiro, Jong Shin, and Bo Xia. I have pasted images from their final presentation below (can't upload PowerPoint or PDF in this Blogger).

My overall impression of the week was that the students learned a huge amount of computing skills, but it was a bit bumpy when we got to the RNA-seq methods. They had really good success comparing various Transcription Factor binding sites to known gene features (promoter region, TSS, 5'UTR, exons, introns, 3'UTR) and finding interactions between TFs by looking for overlapping or nearby binding sites. We also found nice overlaps between ChIP-seq TF binding sites and DNase sensitive sites, histone modification sites, and computationally predicted TF binding sites. The students also did a nice job of measuring overlapping vs. nearby binding sites (bedtools slop), and measuring the significance of intersections using bedtools shuffle to create a statistical model of random intersections as a control.
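To make the overlap-vs-nearby and shuffle-as-control ideas concrete, here is a minimal pure-Python sketch. It is not what the students ran (they used the bedtools command-line tools) and it is not how bedtools is implemented; the function names and the tuple layout (chrom, start, end) are my own assumptions for illustration. The shuffle here keeps each interval on its original chromosome, which is one of several possible null models.

```python
import random

def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def slop(interval, pad):
    """Extend an interval by `pad` bases on each side (clipped at 0 only)."""
    chrom, start, end = interval
    return (chrom, max(0, start - pad), end + pad)

def count_hits(query, targets):
    """Number of query intervals that overlap at least one target interval."""
    return sum(any(overlaps(q, t) for t in targets) for q in query)

def shuffle(intervals, chrom_sizes, rng):
    """Move each interval to a random position on the same chromosome, keeping its length."""
    out = []
    for chrom, start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, chrom_sizes[chrom] - length + 1)
        out.append((chrom, new_start, new_start + length))
    return out

def empirical_p(query, targets, chrom_sizes, n_perm=1000, seed=0):
    """Observed hit count and the fraction of shuffles with at least as many hits."""
    rng = random.Random(seed)
    observed = count_hits(query, targets)
    as_extreme = sum(
        count_hits(shuffle(query, chrom_sizes, rng), targets) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)
```

Calling count_hits on the raw TF sites gives the "overlapping" count, and calling it on [slop(s, 100) for s in sites] gives the "within 100 bp" count. The all-pairs comparison is fine for toy data but would be far too slow for real ENCODE peak files, which is a good reason to use bedtools for the real thing.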

FASTQ data download and alignment is slow and error-prone (we had a lot of trouble making SGE scripts that would run correctly on our compute cluster). I should have shown TopHat just as a demo and used a small local FASTQ data file as an example rather than downloading and re-aligning ENCODE data. Using Cufflinks/Cuffdiff to compare gene expression from different cell lines was feasible with real ENCODE BAM files, but we needed to start on this earlier in the week and spend more time creating SGE scripts that would run nicely with multithreading (to complete in a reasonable amount of time).
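For anyone wondering what a multithreaded SGE script looks like, here is a rough sketch that builds one as text in Python. This is not the script we used. The parallel environment name ("threaded") is an assumption and differs from cluster to cluster, and the job name, index prefix, and file names in the usage note are hypothetical placeholders. The idea is to request a number of slots with -pe and pass the same number to TopHat's -p option through the NSLOTS variable that SGE sets for the job.

```python
def sge_tophat_script(job_name, index_prefix, fastq, out_dir,
                      threads=8, pe_name="threaded"):
    """Return the text of an SGE job script that runs TopHat with several threads.

    pe_name is site-specific (an assumption here); ask your cluster
    administrator which parallel environment to use for threaded jobs.
    """
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",             # job name shown in the queue
        "#$ -cwd",                       # run from the submission directory
        f"#$ -pe {pe_name} {threads}",   # request `threads` slots
        # NSLOTS is set by SGE to the number of slots actually granted,
        # so TopHat never uses more threads than were requested.
        f"tophat -p $NSLOTS -o {out_dir} {index_prefix} {fastq}",
    ]
    return "\n".join(lines) + "\n"
```

Something like print(sge_tophat_script("align1", "hg19", "sample.fastq", "tophat_out", threads=4)) writes out a script that could be saved to a file and submitted with qsub. Generating the scripts from one function at least keeps the thread count and the slot request from drifting apart.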

If I did this sort of tutorial again, I would figure out a way for the students to measure differential gene expression between cell lines from pre-computed ENCODE RNA-seq quantified data (wig files).  
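As a very rough idea of what that exercise might look like, here is a small Python sketch that ranks genes by log2 fold change between two cell lines. It assumes the per-gene expression values (for example FPKM) have already been pulled out of the pre-computed files into two dictionaries; that extraction step, the function name, and the parameter names are all assumptions of mine. Without replicates this is only a fold-change ranking, not a statistical test like Cuffdiff.

```python
import math

def log2_fold_changes(expr_a, expr_b, pseudocount=1.0, min_abs_log2fc=1.0):
    """Rank genes by log2(B/A) between two cell lines.

    expr_a, expr_b: dicts mapping gene name -> expression value.
    Only genes present in both dicts are compared. The pseudocount
    avoids division by zero and damps ratios between tiny values.
    """
    results = []
    for gene in sorted(set(expr_a) & set(expr_b)):
        ratio = (expr_b[gene] + pseudocount) / (expr_a[gene] + pseudocount)
        lfc = math.log2(ratio)
        if abs(lfc) >= min_abs_log2fc:
            results.append((gene, lfc))
    # Largest changes (up or down) first.
    results.sort(key=lambda item: abs(item[1]), reverse=True)
    return results
```

The default threshold of 1.0 keeps genes that change at least two-fold in either direction, which is an arbitrary but common starting point for students to play with.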

Jul 31, 2015

Coffee Berry Borer genome published

Our paper on the de novo genome sequence and annotation of the Coffee Berry Borer (a beetle) is published today in Nature Scientific Reports. This was a really fun project, where I was pushed to do a lot more in-depth study of insect biology (such as antimicrobial and cytochrome P450 proteins). We also discovered that this beetle has captured a bunch of bacterial proteins into its genome (horizontal gene transfer) - which seems odd, but was actually previously reported for this insect and many others. Interestingly, most of these captured bacterial proteins provide starch digesting enzymes, which support the beetle's lifestyle of living entirely inside of the coffee bean and eating nothing but coffee! We are of course hoping that these genes can be used as some sort of target for control of the pest, which causes something like a billion $$ of annual damage worldwide to our beloved coffee. 

Jul 29, 2015

I am writing new lectures and organizing a lot of teaching material to teach 4 (!) classes this fall at two different universities (NYU and Fordham). I would like to keep the teaching materials in a nice easily accessible online location, and easily share with my students without a lot of hassle to sign them all up or whatever. I had a fairly good experience with Google Drive for a short course this Spring, so I'm trying it out now. Here is the master link to all of my 2015 teaching material:

Stuff will appear, change, and possibly disappear from this location as I keep sorting and rewriting, up to and during the classes. Most of the material is my own, along with some journal articles that I provide as readings to my students, and some shameless theft of good lectures, exercises, and tutorials from other folks smarter or better at explaining stuff than I am.

We are also planning to make Screencast type videos of most of the lectures, which get dumped on YouTube. I will try to find some sensible way of organizing them and sharing via this NGS blog.

Jul 16, 2015

CSHL Press has made the RNA-seq chapter of my Next-Gen Seq book available free from their website: RNA Sequencing with Next-Generation Sequencing.

May 28, 2015

New 'Next-Gen Seq 2' book is at the printer

The second edition of the Next-Generation Sequencing Informatics book (that I edit) is at the printer and available for pre-order at Cold Spring Harbor Press and Amazon. We think it will ship on June 30th, maybe a bit sooner.

[James Hadfield at CoreGenomics blog has posted a review: ]

We have added new chapters on the latest sequencing technology, QC, de novo transcript assembly, and proteogenomics, plus lots of updates and expansion in areas such as RNA-seq and ChIP-seq. It has a beautiful cover and it's not too expensive.

Here is the official publication blurb:

Next-generation DNA sequencing (NGS) technology has revolutionized biomedical research, making genome and RNA sequencing an affordable and frequently used tool for a wide variety of research applications including variant (mutation) discovery, gene expression, transcription factor analysis, metagenomics, and epigenetics. Bioinformatics methods to support DNA sequencing have become and remain a critical bottleneck for many researchers and organizations wishing to make use of NGS technology. Next-Generation DNA Sequencing Bioinformatics, Second edition, provides a thorough, plain-language introduction to the necessary informatics methods and tools for analyzing NGS data, as did the first edition, and provides detailed descriptions of algorithms, strengths and weaknesses of specific tools, pitfalls, and alternative methods. Four new chapters in this edition cover: experimental design, sample preparation, and quality assessment of NGS data; Public databases for DNA Sequencing data; De novo transcript assembly; proteogenomics; and emerging sequencing technologies. The remaining chapters from the first edition have been updated with the latest information. This book also provides extensive reference to best-practice bioinformatics methods for NGS applications and tutorials for common workflows. The second edition of Next-Generation DNA Sequencing Bioinformatics addresses the informatics needs of students, laboratory scientists, and computing specialists who wish to take advantage of the explosion of research opportunities offered by new DNA sequencing technologies.

and the Table of Contents:

1) Introduction to DNA Sequencing
Stuart M. Brown
2) Quality Control and Data Processing
Stuart M. Brown
3) History of Sequencing Informatics
Stuart M. Brown
4) Public Sequence Databases
Stuart M. Brown
5) Visualization of Next-Generation Sequencing Data
Philip Ross Smith, Kranti Konganti, and Stuart M. Brown
6) DNA Sequence Alignment
Efstratios Efstathiadis
7) Genome Assembly Using Generalized de Bruijn Digraphs
D. Frank Hsu
8) De Novo Assembly of Bacterial Genomes from Short Sequence Reads
Silvia Argimón and Stuart M. Brown
9) De Novo Transcriptome Assembly
Lisa Cohen, Steven Shen, and Efstratios Efstathiadis
10) Genome Annotation
Steven Shen and Stuart M. Brown
11) Using NGS to Detect Genome Sequence Variants
Jinhua Wang
12) ChIP-seq
Stuart M. Brown, Zuojian Tang, Christina Schweikert, and D. Frank Hsu
13) RNA-seq with Next-Generation Sequencing
Stuart M. Brown and Jeremy Goecks
14) Metagenomics
Guillermo I. Perez-Perez, Miroslav Blumenberg, and Alexander V. Alekseyenko
15) Proteogenomics
Kelly V. Ruggles and David Fenyö
16) DNA Sequencing Technologies and Applications
Gerald A. Higgins and Brian D. Athey
17) Cloud-based Next-Generation Sequencing Informatics
Konstantinos Krampis, Efstratios Efstathiadis, and Stuart M. Brown