Jan 28, 2016

The Tardigrade Miscalculation

There was a lot of publicity back in November about the genome sequence of theTardigrade (Hypsibius dujardini), a small animal (0.05 – 1mm) that is somewhat similar to nematodes. These are fascinating little creatures that have been described as incredibly resistant to all manner of physical stress – high and low temperatures (reportedly from -272oC to +151oC), high pressure, complete vacuum (Tardigrades in Space = TARDIS {I kid you not}), ionizing radiation, and can survive without food or water for more than 10 years as kind of a dehydrated little lump. 

The reason the genome of the Tardigrade was such big news in November is that the group doing the bioinformatics analysis claimed that the genome contained 6,663 genes from bacteria, a full sixth of the genome, and twice as many horizontally transferred genes as have ever been seen in any other organism (Boothby et al, PNAS 112(52):15976-81. doi: 10.1073/pnas.1510461112. PMID: 26598659). This "weird science" observation was covered by National Geographic, Science News,  Phys.org, Meta Science News, and of course the Univ. of North Carolina press site.
However, it seems quite clear now that this claim about horizontal DNA from bacteria (and maybe other phyla) in the genome of the Tardigrade was wrong. In fact, another group (, , , , , , , , ,  also working on the sequence of the exact same species has rapidly published a preprint manuscript on the bioRxiv preprint server "The genome of the tardigrade Hypsibius dujardini"  that clearly refutes the claims of Boothby et al. and points out their mistakes in genome analysis: "Cross-comparison of the assemblies, using raw read and RNA-Seq data, confirmed that the overwhelming majority of the putative HGT candidates in the previous genome were predicted from scaffolds at very low coverage and were not transcribed."
It is quite easy to get contaminants when you are doing whole genome sequencing for a multicellular organism. You grind up your target species, extract DNA and put it into the sequencing machine. Any bacteria and other small organisms on the surface or in the gut come along for the ride and can contribute their DNA to the sequencing library. Surprisingly, a small amount of bacterial contaminating DNA (perhaps just 1%) can lead to a large number of bacterial contigs in the final genome assembly. I can think of a couple of reasons for this, based on the small size of bacterial genomes (~1 MB), vs metazoan genomes (most >100 MB). First, relative genome coverage of a contaminant bacteria will be much higher for each KB of sequence data, so the 1% of contaminating DNA may have deep coverage of a bacterial genome. Second, any two bacterial DNA fragments randomly selected from a library have a much higher chance to overlap (less complex genome), so they will assemble better.
There are a few QC steps that one can take on the raw data. There is a nice tool called Kraken  (Wood DE, Salzberg SL Genome Biology 2014, 15:R46) that can quickly  run through an entire FASTQ file (4 million reads per minute on a single core) and mark each read according to a set of reference genomes based on exact matching of 31 base k-mers. The Kraken team also make available a pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq.  DeconSeq is another good tool to find contaminants with an easy web interface. Of course, some legitimate reads from any target organism will share k-mer sized chunks with some bacteria, viruses, etc. (and some sequences from contaminating bacteria will not be in any database), so one has to make some tough choices about what to remove from the data before assembly.
After assembly, there are some additional steps one can take to flag contaminants. It is extremely helpful (I would now say required) to have some RNA-seq data from the same organism. RNA-seq data is prepared using a poly-A protocol, so no bacterial RNA contaminants should be present. Any contigs (with predicted genes) that do not contain a reasonable amount of aligned RNA-seq reads are highly suspect. Any contig that has predicted genes only from a different species is clearly a red flag.
While the authors of the original have not (yet) published a retraction, the citation in PubMed does carry a link to the refuting article  provided by author Sujai Kumar
Rather than rant on about proper workflows for genome annotation (a best practices document does exist: Mark Yandell & Daniel Ence, Nature Reviews Genetics 13, 329-342 doi:10.1038/nrg3174) let me just say to the authors, the reviewers and the editors at PNAS that "EXTRODINARY CLAIMS REQUIRE EXTRODINARY EVIDENCE" (Carl Sagan). Or as said by Laplace: “The weight of evidence for an extraordinary claim must be proportioned to its strangeness.”

Jan 20, 2016

Cancer Moonshot in the Cloud

I've been reading a bit about the "Cancer Moonshot" discussion at the Davos economics conference.


Naturally I'm interested in the possible increase in funding for genomics and bioinformatics research, but also the discussion of 'big data'  and sharing of genomics data are issues that I bump into all the time. It is almost impossible to overstate the amount of hoops an ordinary scientist has to jump through to obtain access or to share human genomic data that has already been published. There is an entire system of "authorized access" that requires not only that scientists swear to handle genomic data securely and make no attempt to connect genomic data back to patient identities, but also that the University (or research institute) where they work must monitor and enforce these rules. I have had to deal with this system to upload human microbiome data (DNA sequences from bacteria found in or on the human body) that are contaminated with some human DNA. [But not with the coffee beetle genome!] Then I had to apply again for authorization to view my own data to make sure it had been loaded properly.

Why is cancer genomic data protected? Unfortunately, some annoyingly clever people such as Yaniv Erlich have shown that it is possible (fairly easy in his hands) to identify people by name and address just from some of their genome sequence. Patients who agree to participate in research are supposed to be guaranteed privacy - they wanted to share information with scientists about the genetic nature of their tumors, not to share their health care records with nosy neighbors, privacy hackers and identity thieves.

Why do we need thousands of cancer genomes? One key goal of cancer research is personalized medicine - matching up people with customized treatments based on the genetics of their cancer. Current technology is pretty good for DNA sequencing of tumors - for a single cancer patient we can come up with a list of somatic mutations (found only in the tumor) for a few thousand dollars worth of sequencing effort (and a poorly measured amount of bioinformatician and oncologist time). One of the biggest challenges right now is sorting through the list of mutations to figure out which ones are important drivers of cancer growth and disease severity - and should therefore be targeted by drugs or other therapy. Some mutations are well known to be bad actors, others are new mutations in genes that have been found to be mutated in other cancers, others are complete unknowns. Data is needed from (hundreds of) thousands of tumors together with records of treatment response and other medical outcomes in order to build strong predictive models that will reliably advise the doctor about the medical importance of each observed mutation. Another challenge is the heterogeneity of cells within a single person's cancer. As DNA sequencing technology improves, investigators have started to sequence small bits of tumors, or even single cells. They observe different mutations in different cells or sub-clones. Now a key question is if the common resistance to drug treatment is a result of new mutations that occur during (or after) treament, or if the resistant cells already exist in the tumor, but are selected for growth by drug treatment. Overall, this means that precision cancer treatment may require a large number of different genome sequences from each patient, both during diagnosis and to monitor the course of treatment and post-treatment.

So cancer genomic research requires thousands of genomes (deeply sequenced for accuracy and control of artifacts), which means that each authorized investigator must download terabytes of data, and then come up with the data storage and compute power to run his or her clever analysis. In addition to the strictly administrative hurdles of applying for and maintaining an authorized access to cancer genomic data, there are the problems of data transfer, data storage, and big computing power. So the NIH (or other funding agency) has to pay once to generate the cancer genomic data, then again to store it and provide a high bandwidth web or FTP data sharing system, then again to administer the authorized access system, then again for each interested scientist to build a local computing system powerful enough to download, store, and analyze the data (and for University administrators to triple check that they are doing it properly, and again for the NIH administrators to check up on the University administrators to insure they are doing their checking properly). This is an impressive amount of redundancy and wasted effort, even for the US Government.

There is an obvious solution to this problem: 'Use the Cloud, Luke'.  A single Cloud computing system can store all cancer genomic data in a central location, together with a sufficiently massive amount of compute power so that authorized investigators can log in and run their analysis remotely. This technology already exists; Google, Amazon, Microsoft, IBM, Verizon, and at least a dozen other companies already have data centers large enough to handle the necessary data storage and compute tasks. It would be handy to build a whizz bang compute system with all kinds of custom software designed for cancer genomics, but that would take time (and government contractors). A better, faster, simpler system would just stick the genomic data in a central location and let researchers launch virtual machines with whatever software they want (or design for themselves). Amazon EC2 has this infrastructure already in place. It could be merged with the NIH authorized access system in a week-long hackathon. Cancer research Funding agencies could award Cloud compute credits (or just let people budget for Cloud computing in the standard grant application).

Cancer Moonshot:

Use the force luke - use the cloud luke

Oct 23, 2015

Masters in Biomedical Informatics at NYU School of Medicine

We are starting a new Masters program in Biomedical Informatics at NYU School of Medicine in 2016.  We currently have about a dozen PhD students, but the Masters program is intended to serve a wider group with more diverse backgrounds.

Sep 4, 2015

Research Adventure with ENCODE Data

At NYU, first-year PhD students in the Sackler Institute start their first semester with a week-long full-time "Research Adventure" workshop.  I was asked (at short notice) to mentor a group of students for something in Bionformatics. Since I had recently attended the 2015 ENCODE Users Meeting, I decided to make the workshop all about working with ENCODE data.

I included tutorials about access to ENCODE data, an Intro to Linux for complete computing novices (quite a few of our students), Genomic Intervals in the UCSC Genome Browser, use of BEDTools to compare genomic intervals for various factors, and an a tutorial in R for data display. Later in the week we looked at gene expression with RNA-seq using TopHat and Cufflinks. The general plan for the 5-day workshop (for 6 students) was as follows:

9-11:00 am          Lecture (2 hr): Introduction to Gene Regulation and Epigenetics
11-12:00 am        Lecture (1 hr): Use of the HighPerformance Computing Cluster
12:00-2 pm          Working Lunch with HPC System Manager (2 hr): Set up HPC account for each student, practice Linux commands, move files from laptop to HPC account
2-4 pm                  Exercise 1: Tutorials for Accessing ENCODE data through the ENCODEPortal, UCSC Genome Browser and ENSEMBL Browser

9-11:00 am          Lecture & Demo: (2 hr): The UCSC Genome Browser, BED file format, and BEDTools software
11-12:00 am        Exercise 2: BEDTools Tutorial
12-1:00 pm          Lunch
1-3:00 pm            Exercise 3: Use of ENCODE Data and BEDTools to compute the Intersection of DNAse hypersensitive sites with promoters of all RefSeq genes

9-10:30 am          Lecture: Computing Gene Expression with RNA-Seq (1.5 hr)
10:30-12 am        Exercise 4: Align ENCODE RNA-seq data to hg19 reference genome with TopHat
12-1:00 pm          Lunch
1-4 pm                  Continue work on Exercise 4

9-10:00 am          Lecture (1 hr): Intro to data visualization with R
10-12:00 am       Exercise 5: TryRCodeschool tutorial.
12-1:00 pm          Lunch
1-2:00 pm            Lecture (1 hr): Differential Gene Expression with Cufflinks
2-4:00 pm            Planning for Research Project – choose ENCODE data for transcription factors, gene expression, and epigenetic markers. Literature search.

9-12:00 am          Work on Research Project
12-1:00 pm          Lunch

1-4:00 pm            Work on Data analysis and prepare presentation

I had six students in our Research team: Elaine Fisher, Reuben Moncada, Shushan Sargsian, Beny Shapiro, Jong Shin, and Bo Xia, I have pasted images from their final presentation below (can't upload PowerPoint or PDF in this Blogger).

My overall impression of the week was that the students learned a huge amount of computing skills, but it was a bit bumpy when we got to the RNA-seq methods. They had really good success comparing various Transcription Factor binding sites to known genes (promoter region, TSS, 3'UTR, exons, introns, 5'UTR), finding interactions between TF's by finding overlapping or nearby binding sites, We also found nice overlaps between ChIP-seq TF binding sites and DNAse sensitive sites, histone modification sites, and computationally predicted TF binding sites. Also, the students did a nice job of measuring overlapping vs. nearby binding sites (bedtools slop), and measuring the significance of intersections using bedtools shuffle to create a statistical model of random intersections as a control.

FASTQ data download and alignment is slow and error prone (we had a lot of trouble making SGE scripts that would run correctly on our compute cluster). I should have shown TopHat just as a demo and used a small local FASTQ data file as an example rather than download and re-align ENCODE data. Using Cufflinks/Cuffdiff to compare gene expression from different cell lines was feasible with real ENCODE BAM files, but we had to learn this earlier in the week and spend more time to create SGE scripts that would run nicely with multithreading (to complete in a reasonable amount of time).

If I did this sort of tutorial again, I would figure out a way for the students to measure differential gene expression between cell lines from pre-computed ENCODE RNA-seq quantified data (wig files).  

Jul 31, 2015

Coffee Berry Borer genome published

Our paper on the de novo genome sequence and annotation of the Coffee Berry Borer (a beetle) is published today in Nature Scientific Reports. This was a really fun project, where I was pushed to do a lot more in-depth study of insect biology (such as antimicrobial and cytochrome P450 proteins). We also discovered that this beetle has captured a bunch of bacterial proteins into its genome (horizontal gene transfer) - which seems odd, but was actually previously reported for this insect and many others. Interestingly, most of these captured bacterial proteins provide starch digesting enzymes, which support the beetle's lifestyle of living entirely inside of the coffee bean and eating nothing but coffee! We are of course hoping that these genes can be used as some sort of target for control of the pest, which causes something like a billion $$ of annual damage worldwide to our beloved coffee.