Mar 2, 2016

Irreproducible results

I get frustrated by the lack of stability in genomics data collection and analysis, which of course leads to irreproducible results. I imagine (naively I'm sure) that in physics one measures a quant or a nanode of some particle and it stays measured that way for years and decades. I do accept the inevitable technology changes that lead us to measure similar things, such as gene expression, in different ways (Northern blots, microarrays, RNA-seq). However, my lab-based collaborators become very frustrated when the exact same data (such as RNA-seq FASTQ files) produce different results, such as changes in the list of top differentially expressed (DE) genes with different p-values, when analyzed with different software.  This frustration grows even more severe when the different results come from a different version of the same software!

I was working through my RNA-seq tutorial with a group of students this week and they pointed out that my tutorial worksheet was wrong. Cufflinks did not produce any significant DE genes with our test data comparing two small RNA-seq data files. This was surprising to me, since I did the exact same workflow with the same data files last year and it worked out fine. So golly and darn it, I got hit with the irreproducible results bug.  We keep past versions of software available on our computing server with an Environment Modules system, so I was able to quickly run a test of the exact same data files (aligned with the same Tophat version to the same reference genome) using different versions of Cufflinks. We have the following versions installed:
cufflinks/2.0.2   (July 2012)
cufflinks/2.1.1   (April 2013)
cufflinks/2.2.0   (March 2014)
cufflinks/2.2.1  (May 2014)

I just used the simple Cuffidiff workflow and looked at the gene_exp.diff output file for each software version. The results are quite different. Version 2.0.2 has 46 genes called "significant=yes" (multi-test adjusted q-value less than 0.05) with q-values running as low as 4.14E-10 (ok, one has a q-value of zero).  Now this is not a great result from a biostatistics standpoint, since how can you expect to get significant p-values from RNA expression levels with two samples an no replicates? But it did make for an expedient exercise, since we could take the DE genes into DAVID and look for enriched biological functions and pathways.

Then in version 2.1.1 we have two "significant=yes" genes. In version 2.2.0 we have zero significant genes, and in version 2.2.1 also zero.

The top 10 genes, ranked by q-value also differ. There are no genes in common between the top 10 list for version 2.0.2 and 2.1.1, and only the top 2 genes are shared by version 2.2.0. Thankfully, there are no differences in top genes or q-values between 2.2.0 and 2.2.1 (versions released only 2 months apart).  I'm sure that Cole Trapnell et al. are diligently improving the software, but the consequences for those of us trying to use the tools to make some sense out of biology experiments can be unsettling.






Feb 14, 2016

RNA-seq Workshop


We ran a tutorial at NYU Med Center last week on the basics of RNA-seq data analysis. This tutorial was based on the use of our High-Performance Linux cluster, so we actually presented the class as 2 sessions: A 2-hour session on basic Linux commands (for the complete novice) plus writing and submitting Sun Grid Engine scripts.  Then in the 2nd 2-hour session we focus on TopHat/Cufflinks data processing with some sample data files. So this tutorial has some parts that are rather specific to the NYUMC computing system, but it may be quit similar to what computing resources might be available at other schools and research centers.

The YouTube links to the Linux session (screencast):
https://youtu.be/M3RVfv6lUtc

and the RNA-seq session (screencast):
https://youtu.be/hksQlJLwKqohttps://youtu.be/hksQlJLwKqo

A wiki website with the Powerpoint slides and a collection of other resources:
https://genome.med.nyu.edu/hpcf/wiki/RNA-seq_tutorial_2016

Jan 28, 2016

The Tardigrade Miscalculation

There was a lot of publicity back in November about the genome sequence of theTardigrade (Hypsibius dujardini), a small animal (0.05 – 1mm) that is somewhat similar to nematodes. These are fascinating little creatures that have been described as incredibly resistant to all manner of physical stress – high and low temperatures (reportedly from -272oC to +151oC), high pressure, complete vacuum (Tardigrades in Space = TARDIS {I kid you not}), ionizing radiation, and can survive without food or water for more than 10 years as kind of a dehydrated little lump. 

 
The reason the genome of the Tardigrade was such big news in November is that the group doing the bioinformatics analysis claimed that the genome contained 6,663 genes from bacteria, a full sixth of the genome, and twice as many horizontally transferred genes as have ever been seen in any other organism (Boothby et al, PNAS 112(52):15976-81. doi: 10.1073/pnas.1510461112. PMID: 26598659). This "weird science" observation was covered by National Geographic, Science News,  Phys.org, Meta Science News, and of course the Univ. of North Carolina press site.
 
However, it seems quite clear now that this claim about horizontal DNA from bacteria (and maybe other phyla) in the genome of the Tardigrade was wrong. In fact, another group (, , , , , , , , ,  also working on the sequence of the exact same species has rapidly published a preprint manuscript on the bioRxiv preprint server "The genome of the tardigrade Hypsibius dujardini"  that clearly refutes the claims of Boothby et al. and points out their mistakes in genome analysis: "Cross-comparison of the assemblies, using raw read and RNA-Seq data, confirmed that the overwhelming majority of the putative HGT candidates in the previous genome were predicted from scaffolds at very low coverage and were not transcribed."
 
It is quite easy to get contaminants when you are doing whole genome sequencing for a multicellular organism. You grind up your target species, extract DNA and put it into the sequencing machine. Any bacteria and other small organisms on the surface or in the gut come along for the ride and can contribute their DNA to the sequencing library. Surprisingly, a small amount of bacterial contaminating DNA (perhaps just 1%) can lead to a large number of bacterial contigs in the final genome assembly. I can think of a couple of reasons for this, based on the small size of bacterial genomes (~1 MB), vs metazoan genomes (most >100 MB). First, relative genome coverage of a contaminant bacteria will be much higher for each KB of sequence data, so the 1% of contaminating DNA may have deep coverage of a bacterial genome. Second, any two bacterial DNA fragments randomly selected from a library have a much higher chance to overlap (less complex genome), so they will assemble better.
 
There are a few QC steps that one can take on the raw data. There is a nice tool called Kraken  (Wood DE, Salzberg SL Genome Biology 2014, 15:R46) that can quickly  run through an entire FASTQ file (4 million reads per minute on a single core) and mark each read according to a set of reference genomes based on exact matching of 31 base k-mers. The Kraken team also make available a pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq.  DeconSeq is another good tool to find contaminants with an easy web interface. Of course, some legitimate reads from any target organism will share k-mer sized chunks with some bacteria, viruses, etc. (and some sequences from contaminating bacteria will not be in any database), so one has to make some tough choices about what to remove from the data before assembly.
 
After assembly, there are some additional steps one can take to flag contaminants. It is extremely helpful (I would now say required) to have some RNA-seq data from the same organism. RNA-seq data is prepared using a poly-A protocol, so no bacterial RNA contaminants should be present. Any contigs (with predicted genes) that do not contain a reasonable amount of aligned RNA-seq reads are highly suspect. Any contig that has predicted genes only from a different species is clearly a red flag.
 
While the authors of the original have not (yet) published a retraction, the citation in PubMed does carry a link to the refuting article  provided by author Sujai Kumar
 
Rather than rant on about proper workflows for genome annotation (a best practices document does exist: Mark Yandell & Daniel Ence, Nature Reviews Genetics 13, 329-342 doi:10.1038/nrg3174) let me just say to the authors, the reviewers and the editors at PNAS that "EXTRODINARY CLAIMS REQUIRE EXTRODINARY EVIDENCE" (Carl Sagan). Or as said by Laplace: “The weight of evidence for an extraordinary claim must be proportioned to its strangeness.”
 
 
 
 
 

Jan 20, 2016

Cancer Moonshot in the Cloud

I've been reading a bit about the "Cancer Moonshot" discussion at the Davos economics conference.

http://www.weforum.org/events/world-economic-forum-annual-meeting-2016/sessions/cancer-moonshot-a-call-to-action

Naturally I'm interested in the possible increase in funding for genomics and bioinformatics research, but also the discussion of 'big data'  and sharing of genomics data are issues that I bump into all the time. It is almost impossible to overstate the amount of hoops an ordinary scientist has to jump through to obtain access or to share human genomic data that has already been published. There is an entire system of "authorized access" that requires not only that scientists swear to handle genomic data securely and make no attempt to connect genomic data back to patient identities, but also that the University (or research institute) where they work must monitor and enforce these rules. I have had to deal with this system to upload human microbiome data (DNA sequences from bacteria found in or on the human body) that are contaminated with some human DNA. [But not with the coffee beetle genome!] Then I had to apply again for authorization to view my own data to make sure it had been loaded properly.

Why is cancer genomic data protected? Unfortunately, some annoyingly clever people such as Yaniv Erlich have shown that it is possible (fairly easy in his hands) to identify people by name and address just from some of their genome sequence. Patients who agree to participate in research are supposed to be guaranteed privacy - they wanted to share information with scientists about the genetic nature of their tumors, not to share their health care records with nosy neighbors, privacy hackers and identity thieves.

Why do we need thousands of cancer genomes? One key goal of cancer research is personalized medicine - matching up people with customized treatments based on the genetics of their cancer. Current technology is pretty good for DNA sequencing of tumors - for a single cancer patient we can come up with a list of somatic mutations (found only in the tumor) for a few thousand dollars worth of sequencing effort (and a poorly measured amount of bioinformatician and oncologist time). One of the biggest challenges right now is sorting through the list of mutations to figure out which ones are important drivers of cancer growth and disease severity - and should therefore be targeted by drugs or other therapy. Some mutations are well known to be bad actors, others are new mutations in genes that have been found to be mutated in other cancers, others are complete unknowns. Data is needed from (hundreds of) thousands of tumors together with records of treatment response and other medical outcomes in order to build strong predictive models that will reliably advise the doctor about the medical importance of each observed mutation. Another challenge is the heterogeneity of cells within a single person's cancer. As DNA sequencing technology improves, investigators have started to sequence small bits of tumors, or even single cells. They observe different mutations in different cells or sub-clones. Now a key question is if the common resistance to drug treatment is a result of new mutations that occur during (or after) treament, or if the resistant cells already exist in the tumor, but are selected for growth by drug treatment. Overall, this means that precision cancer treatment may require a large number of different genome sequences from each patient, both during diagnosis and to monitor the course of treatment and post-treatment.

So cancer genomic research requires thousands of genomes (deeply sequenced for accuracy and control of artifacts), which means that each authorized investigator must download terabytes of data, and then come up with the data storage and compute power to run his or her clever analysis. In addition to the strictly administrative hurdles of applying for and maintaining an authorized access to cancer genomic data, there are the problems of data transfer, data storage, and big computing power. So the NIH (or other funding agency) has to pay once to generate the cancer genomic data, then again to store it and provide a high bandwidth web or FTP data sharing system, then again to administer the authorized access system, then again for each interested scientist to build a local computing system powerful enough to download, store, and analyze the data (and for University administrators to triple check that they are doing it properly, and again for the NIH administrators to check up on the University administrators to insure they are doing their checking properly). This is an impressive amount of redundancy and wasted effort, even for the US Government.

There is an obvious solution to this problem: 'Use the Cloud, Luke'.  A single Cloud computing system can store all cancer genomic data in a central location, together with a sufficiently massive amount of compute power so that authorized investigators can log in and run their analysis remotely. This technology already exists; Google, Amazon, Microsoft, IBM, Verizon, and at least a dozen other companies already have data centers large enough to handle the necessary data storage and compute tasks. It would be handy to build a whizz bang compute system with all kinds of custom software designed for cancer genomics, but that would take time (and government contractors). A better, faster, simpler system would just stick the genomic data in a central location and let researchers launch virtual machines with whatever software they want (or design for themselves). Amazon EC2 has this infrastructure already in place. It could be merged with the NIH authorized access system in a week-long hackathon. Cancer research Funding agencies could award Cloud compute credits (or just let people budget for Cloud computing in the standard grant application).


Cancer Moonshot:

Use the force luke - use the cloud luke


Oct 23, 2015

Masters in Biomedical Informatics at NYU School of Medicine

We are starting a new Masters program in Biomedical Informatics at NYU School of Medicine in 2016.  We currently have about a dozen PhD students, but the Masters program is intended to serve a wider group with more diverse backgrounds.