As scientists continue to catalog genomic variations in everything from plants to people, today’s computers are struggling to provide the power needed to find the secrets hidden within massive amounts of genomic data.
A team led by Christopher Batten, associate professor in the School of Electrical and Computer Engineering, is responding with the Panorama project, a five-year, $5 million National Science Foundation-funded effort to create the first integrated rack-scale acceleration paradigm specifically for computational pangenomics.
The project includes seven principal investigators from three universities: Cornell, the University of Washington and the University of Tennessee Health Science Center (UTHSC).
Computational genomics is “undergoing a sea change,” Batten said. The traditional method of examining DNA using a single linear reference genome is quickly giving way to a new paradigm using graph-based models that can address the sequence and variation in large collections of related genomes.
“With a single reference genome, you could understand other genomes as they relate to that single reference,” Batten said, “but it’s hard to understand how they relate to each other, and to everything else.”
Genetic researchers would be thrilled to investigate pangenome graphs that include millions of genomes, but for now it’s impossible. The computing power needed is just not available. The demands of graph-based pangenomics require rethinking the entire software/hardware stack. But this is not simply a “big data” problem.
“Yes, the data is big because there is a lot of data,” Batten said. “It’s also sparse because it’s irregular; not every sequence is the same and elements are missing. It’s dynamic because geneticists are adding newly sequenced genomes every day. And since each DNA sequence is unique to each person, we must keep it private.”
Building a computer system that can get answers from this big, sparse, dynamic and private dataset requires a collaborative approach from computer systems researchers working simultaneously on different layers of the stack.
“We need to rethink how we build computers,” Batten said. “That’s why this project is so ambitious. In the past, you just waited two years and your computers would naturally become faster. But the slowing of Moore’s law means that inevitable improvements in performance are just not occurring anymore. So you need a cross stack approach to really make an impact.”
That impact will take the shape of a prototype computer the team will design and build. Most laptop computers have four to 10 cores, or central processing units; the Panorama prototype will have 1 million. The team envisions this powerful new computing tool having an impact analogous to that of the Hubble Space Telescope: It will enable computational biologists to observe what was previously unobservable.
The team Batten assembled to build this revolutionary system includes experts in computational biology; programming languages and compilers; computer architecture; and security and privacy.
The collaboration started with a chance meeting at an open-source software and hardware conference in Belgium in January 2020, which Batten attended with longtime friend and research collaborator Michael Taylor, an associate professor of electrical and computer engineering at the University of Washington. There they connected with UTHSC assistant professor Pjotr Prins, one of the world’s leading researchers in computational genomics.
Other investigators on the project include Erik Garrison from UTHSC; Zhiru Zhang and Ed Suh of Cornell ECE; and Adrian Sampson, assistant professor of computer science at the Cornell Ann S. Bowers College of Computing and Information Science.
In pangenomics, the goal is not to understand a single individual – it’s to analyze the genomes of an entire population and study the relationships between individuals.
“Imagine sampling 1,000 salmon from a given river to understand the biodiversity in that river,” Sampson said. “Researchers are also interested in the way each individual salmon differs from every other salmon. In a sample of 1,000 salmon, there are nearly 500,000 pairs of salmon to be compared to each other to understand the entire pangenome.”
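The "nearly 500,000" figure follows from counting unordered pairs: n samples yield n(n-1)/2 possible pairings. A minimal sketch of that arithmetic in Python (the function name `pairwise_comparisons` is purely illustrative and not part of the Panorama project):

```python
import math


def pairwise_comparisons(n: int) -> int:
    """Return the number of unordered pairs among n samples, n * (n - 1) / 2."""
    return math.comb(n, 2)


# 1,000 salmon give 1000 * 999 / 2 pairs to compare.
print(pairwise_comparisons(1000))  # 499500
```

Because the count grows roughly with the square of the sample size, doubling the number of genomes about quadruples the number of comparisons.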
The Panorama project introduces totally new challenges in hardware design and programming.
“We have an opportunity to generate specialized, single-purpose hardware that is really only capable of solving these enormous genomics problems,” Sampson said. “This is not an easy task, but if we can achieve it, we’ll help biologists solve problems that they can’t even begin to approach with the computers they have today.”
Eric Laine is a communications specialist in the School of Electrical and Computer Engineering.