A GPS for the Genome

UNSWTV interviews Pete and Fede and previews some high-dimensionality models they've developed for PASTE.

How can you fund something really, really important, but really, really speculative?

We're talking about a breakthrough with the potential to revolutionize multiple industries: medicine, energy and agriculture. If we can demonstrate the validity of this approach, we'll be on the brink of a whole new bioscience, one that could:

  • Transform your own genome into a cure for whatever ails you: cancer, chronic disease, even old age.

  • Tailor microbes to efficiently convert garbage dumps and atmospheric carbon into cheap petroleum and bio-friendly plastics.

  • Engineer a synthetic ecology to make deserts fertile, restore endangered species, heal the ocean and reclaim wasteland.


How can this one breakthrough achieve so much?

Right now genomics and proteomics are generating vast amounts of data, everything we'd need to be able to personalise medicine, tailor agricultural microbes and engineer ecological remedies.

The problem is that there's too much data, with too many combinations of relationships in too many dimensions, for existing tools to cope. Scientists and engineers are like blind men in a foreign land who can't see where they're going or understand what their guides are saying.

The PASTE project gives them new eyes and lucid guides. It's based on a new number system we call Permutahedral Indexing, or P.I. for short: an N-dimensional map that locates and interrelates complex datasets in the space of all possible data. P.I. does this efficiently even when the data has hundreds of independent dimensions and comes in petabytes and exabytes.

Existing genomic tools use trial and error on small scales to feel their way to local solutions to complex problems. PASTE locates and immediately discards whole fields of bad solutions, so that finding the best possible solution on a global scale no longer requires a supercomputer.

We're not claiming to have some way of instantly searching and scaling lots of data in lots of dimensions. Rather, P.I. gives us a way to reformulate the data into a more structured form that we don't have to search and scale. Like a GPS, P.I. tells us where we are in the whole data universe, so we don't have to explore blindly to zero in on the most efficient path to where we want to go.



Prove It!

Proving this is exactly what the PASTE project is about. We're going up against the oldest, most respected, best understood and most limiting tool in all of bioinformatics, an algorithm known as BLAST. And we're not just trying to improve on BLAST. We aim to make it obsolete.

Twenty-five years after its invention, BLAST remains the state of the art in bioinformatic sequence homology searching. The computers that run BLAST have grown exponentially faster over the years, and the cost of sequencing a single genome has fallen from billions of dollars to just thousands. At the same time, the available data has grown exponentially from megabytes to exabytes.

As a result, a large fraction of the world's high performance computing facilities are still chugging away at BLAST, burning up millions of dollars per day of CPU time. PASTE will enable these data mining problems to be solved for pennies, in instants, on any researcher's individual laptop.

But PASTE isn't just about running BLAST quicker and cheaper. It's about doing it better:

  • Where BLAST only finds local alignments between sequences, PASTE will find the best possible global alignment.

  • BLAST can't handle statistically skewed sample populations, whereas PASTE treats skew as just another dimension in its model.

  • PASTE can mine the genomic and proteomic dynamism found in living microbial communities, something BLAST was never designed to do.
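
For readers who want to see the difference between local and global alignment concretely, here is a minimal sketch in Python. It is neither BLAST nor PASTE: it uses the textbook dynamic-programming methods (Smith-Waterman for local alignment, Needleman-Wunsch for global alignment), which BLAST approximates heuristically for the local case. The function names and the scoring values (match +1, mismatch -1, gap -1) are our own assumptions, chosen only for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning ALL of a against ALL of b.

    Scoring values are illustrative assumptions, not tuned parameters.
    """
    # Row 0: aligning an empty prefix of a against b costs one gap per letter.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over ANY pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere, which is
            # what makes the result local rather than global.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    a, b = "TTTACGTTTT", "GGACGGG"
    print("local :", local_score(a, b))   # only the shared "ACG" counts
    print("global:", global_score(a, b))  # flanking letters are penalised
```

With these two toy sequences the local score is 3, because only the shared "ACG" is rewarded, while the global score is -4, because every flanking letter must also be aligned or gapped. The two questions are genuinely different, which is why a tool built to answer one does not automatically answer the other.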

 

PASTE is the proof of the pudding for P.I. We want to apply P.I. technology to many other fields of science and engineering too, but to do so we first have to prove it out in just one. What better target than the genes and proteins that make up our bodies and the natural world?



What Will We Do With The Funds?

Right now we have modelled the P.I. system and described algorithms we'll use to convert genomic and proteomic data into P.I. But we need time to implement these, test them, build an environment around them to visualise their results, fine-tune parameters, and run and compare their metrics with BLAST.

Of the two of us, Fede does the science and Pete the software. We'll use your money to get these things done right. We need the help of mathematicians to show what we're doing is formally correct, and we want a small team to package and commercialise the PASTE toolset for industrial applications.

If this project is funded beyond 100%, we'll use the extra money to apply PASTE to specific problems in medicine, agriculture and ecology, starting with problems that are the immediate concern of the people who fuel our project.

So in addition to all the very cool rewards we're offering you for getting involved, we want you to tell us about the specific biological problems you want solved. Since you're making our work possible, your priorities are our priorities, and we aim to "pay it forward" by solving problems that directly concern you - for the benefit of all mankind.

 

History

We trace the origin of P.I. back to Gottfried Leibniz's search for a "Characteristica Universalis", a universal language for scientific expression. Leibniz's invention of calculus is far better known than his work on the Characteristica, though the great mathematician Kurt Gödel spent much of his later years attempting to rediscover it. In our view it was hidden in plain sight in the form of Leibniz's educational curriculum for Russia.

This was a twenty-year project sponsored by Peter the Great to create nothing less than a Russian Enlightenment. It eventually realized something like the Characteristica in the work of the Russian logician Ivan Ivanovitch Zhegalkin. Zhegalkin's work proved of central importance to the 20th century invention of the FPGA and other computing fundamentals in the West, as well as to research on ternary computing machines in the old Soviet Union. Its profound implications for the efficient solution of problems in complex engineering domains became a focus for defense activities on both sides of the Cold War.

In the 1970s the US DoD funded research by Dean Lucas and Laurie Gibson into permutahedra-based tiling methods. Their Generalized Balanced Ternary (GBT) was applied to image analysis and eventually found commercial use at Spaceimaging.com (now GeoEye) for analysing satellite imagery. In the 1980s Kenneth Happel combined Zhegalkin's concepts with the tiling methods of Lucas and Gibson to create a general solution to the generic representation of information: a self-organizing database where lexical relations between addresses represent a constant multidimensional geometry. Happel founded a European defense technology company on these ideas and licensed its technology to General Dynamics.

One of us (Peter Merel) worked from 2000 to 2002 as chief architect for Omnigon Technologies, a company licensed to apply Happel's ideas to commercial finance and bioinformatic data. Omnigon obtained patents, including 6,747,643 and 7,061,491, but the company closed its doors in the "dot com" crash. P.I. represents a new and independent approach to the development of the Characteristica, suited to bioinformatic big data, and PASTE is its application as a replacement for BLAST.