Cancer, Big Data and Storage
A few weeks ago, I was traveling and for some reason The Wall Street Journal was delivered to my hotel room door. While thumbing through it, an article on big data and cancer (subscription required) caught my eye. I also found another article on the topic at SmartPlanet, which does not require a subscription.
I was very intrigued with the concept. The SmartPlanet article links to the original research, which says, “Patients are increasingly presenting with 'rare cancers,' more narrowly defined by their molecular characteristics, sometimes making the best course of treatment unclear. Today more than ever, oncologists need real-time decision support to help them provide the most effective treatments tailored to their patients’ unique biology and tumors.”
All I can say is WOW. We all know people who have gotten cancer. Some have lived, some have died, and all have struggled with the treatment. The hope is that this new data will help improve the lives of those who get cancer, improving survivability and lowering the impact of treatment.
Analyzing all this data is going to take a lot of storage. Let’s explore this one big data application and see what the storage impact will be.
Storage, Cancer and You
To get started tailoring treatment for each patient and each tumor, the first thing physicians are going to have to do is sequence the individual's genome. That translates into about 1.5 GB (1000*1000*1000 bytes) per person.
Then they will need to sequence the cancer. I did some checking with some friends and found that the sequence data for some cancers runs up to 360 GB. Most are much smaller, but still pretty large, at over 30 GB.
To get a full understanding of how cancer treatments impact you as compared to someone else, you have to understand the genetic differences between you and that person. Then you have to characterize the treatments for each person and each cancer. Did you undergo x amount of chemo and y amount of radiation? What was the composition of the chemotherapy and how was it administered? How much radiation, and what was the angle and location compared to the location of the cancer? Of course age and gender will also likely have an impact. All of these issues, and likely more, must be described and then compared, so this information must be collected and stored.
Think about this: you need to collect the 1.5 GB for each person and likely extract the genetic markers. Then you need to analyze the cancer and the treatment data.
According to The American Cancer Society, 12,549,000 people in the U.S. have cancer. So at 1.5 GB per person, that comes out to about 18.8 PB of data — and this does not include the genetics of the cancer.
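The arithmetic behind that estimate can be checked with a short Python sketch. The patient count and the 1.5 GB genome size come from the figures above. The tumor totals are only an illustration and rest on an assumption that is not from any of the cited sources: exactly one sequenced tumor per patient, sized at the 30 GB and 360 GB figures mentioned earlier. The function and variable names are hypothetical.

```python
GB = 1000 ** 3  # decimal gigabyte, matching the definition used above
PB = 1000 ** 5  # decimal petabyte

US_CANCER_PATIENTS = 12_549_000  # American Cancer Society figure quoted above
GENOME_BYTES = 1.5 * GB          # per-person sequence size quoted above


def total_petabytes(people: int, bytes_per_person: float) -> float:
    """Return the total storage, in decimal petabytes, for a population."""
    return people * bytes_per_person / PB


genome_pb = total_petabytes(US_CANCER_PATIENTS, GENOME_BYTES)

# Assumption (not from the article): one sequenced tumor per patient,
# at the 30 GB and 360 GB sizes mentioned earlier.
tumor_low_pb = total_petabytes(US_CANCER_PATIENTS, 30 * GB)
tumor_high_pb = total_petabytes(US_CANCER_PATIENTS, 360 * GB)

print(f"Patient genomes:     {genome_pb:,.1f} PB")
print(f"Tumors at 30 GB:     {tumor_low_pb:,.1f} PB")
print(f"Tumors at 360 GB:    {tumor_high_pb:,.1f} PB")
```

The first line reproduces the roughly 18.8 PB quoted above. Under the one-tumor-per-patient assumption, the tumor data alone would be more than an order of magnitude larger than the patient genomes, before any treatment records are added.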
Clearly, the researchers want to store only what they need. But as I have often said about big data problems, you do not know what you do not know about the data. In many cases, you need to store the raw data so you can learn new things at some future date. This happens all of the time in application area after application area, from genetics to seismic processing to climate analysis.
So let’s assume that someone is going to store all of this data somewhere. That takes a significant amount of storage space, plus the computational power to process the data and correlate the best treatment options with the type of cancer, your genetic markers, your age and so on.
Cancer is a major disease, but it is just one of many diseases where patient outcomes could improve based on this type of genetic analysis combined with treatment analysis. Fortunately, the cost of sequencing each of our genomes is dropping faster than many people’s wildest expectations.
Everyone talks about a data explosion, but my feeling is that it is going to be bigger than anyone thinks. There are going to be many more new applications and new ideas than anyone has imagined up to this time.
The cancer example is just one of numerous examples that storage growth prognosticators have not planned for. For instance, another application that might be on the horizon would be tracking asteroids and other near-Earth objects, given the meteor that exploded over Russia in February.
I think that if the analysis part of the equation gets into high gear, we will all find that the amount of data collected and saved will be far more than what is currently projected. Then the problem becomes capacity. It is unclear whether the various disk and tape drive vendors (that is all we have for high density today) have the manufacturing capacity to fulfill the demand.
I used the cancer example because cancer impacts almost everyone on the planet in one way or another. Though doctors and scientists have made good progress on some cancers with both early detection and treatment, there are still others that are difficult to detect and nearly impossible to treat. For example, everyone dreads hearing the words "pancreatic cancer."
We are on the brink of having the technology and methods to be able to detect and treat many diseases cost-effectively, but this is going to require large amounts of storage and processing power, along with new methods to analyze the data. Cancer research is just one of the many areas that are banking on benefiting from big data analysis.
As more and more applications are found for big data analysis, the storage and computation requirements are going to grow. Which will come first, the chicken or the egg? Will we run out of ideas, or will we run out of storage at a reasonable cost? Without affordable storage, the ideas will not come, as the costs will be too high.