Thursday, June 23, 2011

Big-Ass Servers™ and the myths of clusters in bioinformatics

Spending $55k for a 512GB machine (Big-Ass Server™ or BAS™) can be a tough sell for a bioinformatics researcher to pitch to a department head.

Dell PowerEdge r900, available in orange and lemon-lime
Speaking as someone who keeps his copy of CLR safely stored in the basement, ready to help rebuild society after a nuclear holocaust, I am painfully aware of the importance of algorithm development in the history of computing, and the possibilities for parallel computing to make problems tractable.

Having recently spent 3 years in industry, however, I am now more inclined to just throw money at problems. In the case of hardware, I think this approach is more effective than clever programming for many of the current problems posed by NGS.

From an economic and productivity perspective, I believe most bioinformatics shops doing basic research would benefit more from having access to a BAS™ than a cluster. Here's why:
  • The development of multicore/multiprocessor machines and memory capacity has outpaced the speed of networks. NGS analyses tends to be more memory-bound and IO-bound rather than CPU-bound, so relying on a cluster of smaller machines can quickly overwhelm a network.
  • NGS has forced the number of high-performance applications from BLAST and protein structure prediction to doing dozens of different little analyses, with tools that change on a monthly basis, or are homegrown to deal with special circumstances. There isn't time or ability to write each of these for parallel architectures.
If those don't sound very convincing, here is my layman's guide to dealing with the myths you might encounter concerning NGS and clusters:

Myth: Google uses server farms. We should too.

Google has to focus on doing one thing very well: search.

Bioinformatics programmers have to explore a number of different questions for any given experiment. There is not time to develop a parallel solution to many of these questions as they will lead to dead ends.

Many bioinformatic problems, de-novo assembly being a prime example, are notoriously difficult to divide among several machines without being overwhelmed with messaging. You can imagine trying to divide a jigsaw puzzle among friends sitting several tables, you would spend more time talking about the pieces than fitting them together.

Myth: Our development setup should mimic our production setup

An experimental computing structure with a BAS™ allows for researchers to freely explore big data without having to think about how to divide it efficiently. If an experiment is successful and there is the need to scale-up to a clinical or industrial platform, that can happen later.

Myth: Clusters have been around a long time so there is a lot of shell-based infrastructure to distribute workflows

There are tools for queueing jobs, but those are often quite helpless to assist in managing workflows that are written as parallel and serial steps - for example, waiting for steps to finish before merging results.

Various programming languages have features to take advantage of clusters. For example, R has SNOW. But Rsamtools requires you to load BAM files into memory, so a BAS™ is not just preferable for NGS analysis with R, it's required.

Myth: The rise of cloud computing and Hadoop means that homegrown clusters are irrelevant that but also means we don't need a BAS™

The popularity of cloud computing in bioinformatics is also driven by the newfound ability to rent time on a BAS™. The main problem with cloud computing is the bottleneck posed by transferring GBs data to the cloud.

Myth: Crossbow and Myrna are based on Hadoop, we can develop similar tools

Ben Langmead, Cole Trapnell, and Michael Schatz, alums of Steven Salzberg's group at UMD, have developed NGS solutions using the Hadoop MapReduce framework.
  • Crossbow is a Hadoop-based implementation of Bowtie.
  • Myrna is an RNA-Seq pipeline.
  • Contrail is a de novo short read assembler.
These are difficult programs to develop, and these examples are also somewhat limited experimental proofs of concept or are married to components that may be undesirable for certain analyses. The Bowtie stack (Bowtie, Tophat, Cufflinks), while revolutionary in its implementation of Burroughs-Wheeler algorithm, is itself is built around the limitations of computers in the year 2008. For many it lacks the sensitivity to deal with, for example, 1000 Genomes data.

The dynamic scripting languages used most bioinformatics programmers are not as well suited to Hadoop as Java. To imply we can all develop similar tools of this sophistication is unrealistic. Many bioinformatics programs are not even threaded, much less designed to work amongst several machines.

Myth: embarrassingly parallel problems imply a cluster is needed


A server with 4 quad-core processors is often adequate for handling these embarrassing problems. Dividing the work just tends to lead to further embarrassments.


Here is a particularly telling quote from Biohaskell developer Ketil Malde on Biostar:
In general, I think HPC are doing the wrong thing for bioinformatics. It's okay to spend six weeks to rewrite your meteorology program to take advantage of the latest supercomputer (all of which tend to be just a huge stack of small PCs these days) if the program is going to run continously for the next three years. It is not okay to spend six weeks on a script that's going to run for a couple of days.

In short, I keep asking for a big PC with a bunch of the latest Intel or AMD core, and as much RAM as we can afford.

Myth: We don't have money for a BAS™ because we need a new cluster to handle things like BLAST

IBM System x3850 X5 expandable to 1536GB, mouse not included
Even the BLAST setup we think of as being the essence of parallel (a segmented genome index - every node gets a part of the genome) is often not the one that many institutions have settled on. Many rely on farming out queries to a cluster in which every node has the full genome index in memory.

Secondly, the mpiBLAST appears to be more suited to dividing an index among older machines than today's, which typically have >32GB RAM. Here is a telling FAQ entry:

I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!

mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.

Your comments on this topic are welcome!