Oct 23, 2015

Masters in Biomedical Informatics at NYU School of Medicine

We are starting a new Masters program in Biomedical Informatics at NYU School of Medicine in 2016.  We currently have about a dozen PhD students, but the Masters program is intended to serve a wider group with more diverse backgrounds.

Sep 4, 2015

Research Adventure with ENCODE Data

At NYU, first-year PhD students in the Sackler Institute start their first semester with a week-long, full-time "Research Adventure" workshop. I was asked (on short notice) to mentor a group of students on a Bioinformatics topic. Since I had recently attended the 2015 ENCODE Users Meeting, I decided to make the workshop all about working with ENCODE data.

I included tutorials on accessing ENCODE data, an Intro to Linux for complete computing novices (quite a few of our students), Genomic Intervals in the UCSC Genome Browser, use of BEDTools to compare genomic intervals for various factors, and a tutorial in R for data display. Later in the week we looked at gene expression with RNA-seq using TopHat and Cufflinks. The general plan for the 5-day workshop (for 6 students) was as follows:

9-11:00 am          Lecture (2 hr): Introduction to Gene Regulation and Epigenetics
11 am-12 pm        Lecture (1 hr): Use of the High Performance Computing Cluster
12:00-2 pm          Working Lunch with HPC System Manager (2 hr): Set up HPC account for each student, practice Linux commands, move files from laptop to HPC account
2-4 pm                  Exercise 1: Tutorials for Accessing ENCODE data through the ENCODE Portal, UCSC Genome Browser and Ensembl Browser

9-11:00 am          Lecture & Demo (2 hr): The UCSC Genome Browser, BED file format, and BEDTools software
11 am-12 pm        Exercise 2: BEDTools Tutorial
12-1:00 pm          Lunch
1-3:00 pm            Exercise 3: Use of ENCODE Data and BEDTools to compute the Intersection of DNase hypersensitive sites with promoters of all RefSeq genes

9-10:30 am          Lecture: Computing Gene Expression with RNA-Seq (1.5 hr)
10:30 am-12 pm    Exercise 4: Align ENCODE RNA-seq data to the hg19 reference genome with TopHat
12-1:00 pm          Lunch
1-4 pm                  Continue work on Exercise 4

9-10:00 am          Lecture (1 hr): Intro to data visualization with R
10 am-12 pm        Exercise 5: Try R tutorial (Code School)
12-1:00 pm          Lunch
1-2:00 pm            Lecture (1 hr): Differential Gene Expression with Cufflinks
2-4:00 pm            Planning for Research Project – choose ENCODE data for transcription factors, gene expression, and epigenetic markers. Literature search.

9 am-12 pm          Work on Research Project
12-1:00 pm          Lunch
1-4:00 pm            Work on Data analysis and prepare presentation

I had six students in our Research team: Elaine Fisher, Reuben Moncada, Shushan Sargsian, Beny Shapiro, Jong Shin, and Bo Xia. I have pasted images from their final presentation below (I can't upload PowerPoint or PDF files to Blogger).

My overall impression of the week was that the students learned a huge amount of computing skills, but it was a bit bumpy when we got to the RNA-seq methods. They had really good success comparing various Transcription Factor binding sites to known gene features (promoter region, TSS, 5'UTR, exons, introns, 3'UTR), and finding interactions between TFs by finding overlapping or nearby binding sites. We also found nice overlaps between ChIP-seq TF binding sites and DNase sensitive sites, histone modification sites, and computationally predicted TF binding sites. The students also did a nice job of measuring overlapping vs. nearby binding sites (bedtools slop), and of measuring the significance of intersections by using bedtools shuffle to create a statistical model of random intersections as a control.
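For readers who have not used BEDTools, the logic behind the slop and shuffle comparisons can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the BEDTools implementation and not the scripts the students wrote. The function names, the toy coordinates, and the choice to keep shuffled intervals on their original chromosome (similar in spirit to the `-chrom` option of bedtools shuffle) are all assumptions made for the example.

```python
import random

# An interval is (chrom, start, end), 0-based and half-open, as in BED files.

def slop(intervals, pad, chrom_sizes):
    """Extend each interval by `pad` bp on both sides, clipped to the chromosome."""
    return [(c, max(0, s - pad), min(chrom_sizes[c], e + pad))
            for c, s, e in intervals]

def count_overlapping(a, b):
    """Count intervals in `a` that overlap at least one interval in `b`."""
    by_chrom = {}
    for c, s, e in b:
        by_chrom.setdefault(c, []).append((s, e))
    hits = 0
    for c, s, e in a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(s < b_end and b_start < e for b_start, b_end in by_chrom.get(c, [])):
            hits += 1
    return hits

def shuffle(intervals, chrom_sizes, rng):
    """Move each interval to a random start on the same chromosome, keeping its length."""
    out = []
    for c, s, e in intervals:
        length = e - s
        new_start = rng.randint(0, chrom_sizes[c] - length)
        out.append((c, new_start, new_start + length))
    return out

def empirical_p(a, b, chrom_sizes, n_perm=1000, seed=0):
    """Return (observed overlaps, fraction of shuffles with at least as many)."""
    rng = random.Random(seed)
    observed = count_overlapping(a, b)
    as_extreme = sum(
        count_overlapping(shuffle(a, chrom_sizes, rng), b) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)

if __name__ == "__main__":
    # Hypothetical toy data: one short chromosome, three TF sites, three DNase sites.
    sizes = {"chr1": 10_000}
    tf_sites = [("chr1", 100, 120), ("chr1", 500, 520), ("chr1", 900, 920)]
    dnase = [("chr1", 90, 130), ("chr1", 530, 600), ("chr1", 5000, 5100)]
    print("direct overlaps:", count_overlapping(tf_sites, dnase))
    print("within 20 bp:", count_overlapping(slop(tf_sites, 20, sizes), dnase))
    print("observed, p:", empirical_p(tf_sites, dnase, sizes))
```

On real ENCODE peak files the linear scan in `count_overlapping` would be far too slow, which is one reason to use BEDTools for the actual analysis; the sketch is only meant to show what "nearby" and "random control" mean.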

FASTQ data download and alignment is slow and error-prone (we had a lot of trouble making SGE scripts that would run correctly on our compute cluster). I should have shown TopHat just as a demo and used a small local FASTQ data file as an example, rather than downloading and re-aligning ENCODE data. Using Cufflinks/Cuffdiff to compare gene expression between cell lines was feasible with real ENCODE BAM files, but we would have needed to start on this earlier in the week and spend more time creating SGE scripts that run nicely with multithreading (so that they complete in a reasonable amount of time).
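To give a sense of what such a job script involves, here is a small Python helper that writes out an SGE submission script for a multithreaded Cuffdiff run. It is a sketch under assumptions, not the script we used: the parallel environment name (`smp` here) differs from cluster to cluster, and the job name, file names, and labels in the usage example are hypothetical placeholders. The `#$ -N`, `#$ -cwd`, `#$ -j y`, and `#$ -pe` directives, the `$NSLOTS` variable, and the `-p`, `-o`, and `-L` options of cuffdiff are standard.

```python
def sge_cuffdiff_script(job_name, gtf, bams_a, bams_b, labels,
                        threads=8, pe_name="smp", out_dir="cuffdiff_out"):
    """Return the text of an SGE job script that runs cuffdiff on two conditions.

    `pe_name` is an assumption: ask your cluster administrator (or run
    `qconf -spl`) for the parallel environment names that exist on your system.
    """
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",              # job name shown by qstat
        "#$ -cwd",                        # run in the directory where qsub was called
        "#$ -j y",                        # merge stderr into stdout
        f"#$ -pe {pe_name} {threads}",    # request `threads` slots on one node
        "",
        # $NSLOTS is set by SGE to the number of slots actually granted,
        # so cuffdiff never uses more threads than the scheduler allowed.
        "cuffdiff -p $NSLOTS \\",
        f"  -o {out_dir} \\",
        f"  -L {','.join(labels)} \\",
        f"  {gtf} \\",
        f"  {','.join(bams_a)} \\",       # replicates of condition A, comma-separated
        f"  {','.join(bams_b)}",          # replicates of condition B, comma-separated
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical file names; replace with real ENCODE BAM files and annotation.
    script = sge_cuffdiff_script(
        job_name="cuffdiff_demo",
        gtf="genes.gtf",
        bams_a=["k562_rep1.bam", "k562_rep2.bam"],
        bams_b=["gm12878_rep1.bam", "gm12878_rep2.bam"],
        labels=["K562", "GM12878"],
        threads=8,
    )
    with open("cuffdiff_demo.sh", "w") as handle:
        handle.write(script)
    print(script)
```

The resulting file would be submitted with `qsub cuffdiff_demo.sh`. Passing `$NSLOTS` to `-p` instead of a hard-coded number keeps the thread count and the slot request in sync, which avoids one common source of jobs that run slowly or get killed.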

If I did this sort of tutorial again, I would figure out a way for the students to measure differential gene expression between cell lines from pre-computed, quantified ENCODE RNA-seq data (wig files).
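As a rough idea of what that exercise could look like, here is a minimal Python sketch that compares per-gene expression values between two cell lines using a log2 fold change. It assumes the per-gene values (for example FPKM) have already been extracted from the pre-computed files into simple dictionaries; that extraction step is not shown, and the gene names, numbers, pseudocount, and expression cutoff are all made up for illustration. Without replicates this is a ranking, not a statistical test.

```python
import math

def log2_fold_changes(expr_a, expr_b, pseudocount=1.0, min_expr=1.0):
    """Return {gene: log2((b + pseudocount) / (a + pseudocount))}.

    Only genes present in both inputs are compared, and genes whose
    expression is below `min_expr` in both cell lines are skipped,
    since ratios of two near-zero values are mostly noise.
    """
    results = {}
    for gene in expr_a.keys() & expr_b.keys():
        a, b = expr_a[gene], expr_b[gene]
        if max(a, b) < min_expr:
            continue
        results[gene] = math.log2((b + pseudocount) / (a + pseudocount))
    return results

if __name__ == "__main__":
    # Hypothetical expression values for two cell lines.
    cell_line_a = {"GENE1": 3.0, "GENE2": 0.1, "GENE3": 7.0}
    cell_line_b = {"GENE1": 15.0, "GENE2": 0.2, "GENE3": 7.0, "GENE4": 5.0}
    changes = log2_fold_changes(cell_line_a, cell_line_b)
    # Print genes sorted from most down-regulated to most up-regulated in B.
    for gene, lfc in sorted(changes.items(), key=lambda item: item[1]):
        print(f"{gene}\t{lfc:+.2f}")
```

The pseudocount keeps genes with zero expression in one cell line from producing infinite ratios, at the cost of shrinking fold changes for weakly expressed genes.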

Jul 31, 2015

Coffee Berry Borer genome published

Our paper on the de novo genome sequence and annotation of the Coffee Berry Borer (a beetle) is published today in Nature Scientific Reports. This was a really fun project, where I was pushed to do a lot more in-depth study of insect biology (such as antimicrobial and cytochrome P450 proteins). We also discovered that this beetle has captured a bunch of bacterial genes into its genome (horizontal gene transfer), which seems odd, but was actually previously reported for this insect and many others. Interestingly, most of these captured bacterial genes encode starch-digesting enzymes, which support the beetle's lifestyle of living entirely inside the coffee bean and eating nothing but coffee! We are of course hoping that these genes can be used as some sort of target for control of the pest, which causes something like a billion $$ of annual damage worldwide to our beloved coffee.

Jul 29, 2015

I am writing new lectures and organizing a lot of teaching material to teach 4 (!) classes this fall at two different universities (NYU and Fordham). I would like to keep the teaching materials in a nice easily accessible online location, and easily share with my students without a lot of hassle to sign them all up or whatever. I had a fairly good experience with Google Drive for a short course this Spring, so I'm trying it out now. Here is the master link to all of my 2015 teaching material:


Stuff will appear, change, and possibly disappear from this location as I keep sorting and rewriting, up to and during the classes. Most of the material is my own; the rest is journal articles that I provide as readings to my students, and some shameless theft of good lectures, exercises, and tutorials from other folks smarter or better at explaining stuff than I am.

We are also planning to make screencast-style videos of most of the lectures, which will get dumped on YouTube. I will try to find some sensible way of organizing them and sharing them via this NGS blog.

Jul 16, 2015

CSHL Press has made the RNA-seq chapter of my Next-Gen Seq book available free from their website: RNA Sequencing with Next-Generation Sequencing.

