Genome data could outgrow YouTube, Twitter content by 2025 – report

Scientists have warned that the computing resources needed to handle genome data may soon exceed those required by giants like Twitter and YouTube. They estimate that between 100 million and 2 billion human genomes could be sequenced by 2025.

According to the report, published in the journal PLoS Biology, this means that between 2 and 40 exabytes of storage capacity will be needed by 2025 for human genomes alone. Although the computer scientists believe these needs can be reduced with effective data compression, they caution that “decompression times and fidelity are a major concern in compressive genomics.”
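As a rough sanity check on those figures, the storage range can be divided by the genome-count range to see what per-genome footprint it implies. The sketch below assumes decimal units (1 exabyte = 10^18 bytes) and assumes that the low storage bound pairs with the low genome count and the high bound with the high count. Neither assumption is spelled out in the figures quoted here, and the function name is purely illustrative.

```python
# Back-of-envelope check, not the report's own method.
EXABYTE = 10**18   # assumption: decimal (SI) exabyte
GIGABYTE = 10**9   # assumption: decimal (SI) gigabyte


def implied_gb_per_genome(storage_exabytes, genomes):
    """Storage per genome, in gigabytes, implied by a total and a count."""
    return storage_exabytes * EXABYTE / genomes / GIGABYTE


# Low end: 2 exabytes for 100 million genomes.
low = implied_gb_per_genome(2, 100_000_000)
# High end: 40 exabytes for 2 billion genomes.
high = implied_gb_per_genome(40, 2_000_000_000)

print(f"low end:  {low:.0f} GB per genome")
print(f"high end: {high:.0f} GB per genome")
```

Under those assumptions, both ends of the range work out to about 20 gigabytes per genome, which suggests the storage estimate scales linearly with the number of genomes sequenced.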

The team estimates that 300 hours of video are currently uploaded to YouTube every minute, and that this could “grow to 1,000–1,700 hours per minute (1–2 exabytes of video data per year) by 2025 if we extrapolate from current trends.”

Twitter, meanwhile, currently generates 500 million tweets per day, each about 3 kilobytes including metadata, the report states. “While this figure is beginning to plateau, a projected logarithmic growth rate would suggest a 2.4-fold growth by 2025, to 1.2 billion tweets per day, 1.36 petabytes/year.”
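The Twitter figures can be reproduced approximately with simple multiplication. The sketch below is only an illustration: it assumes decimal units (1 kilobyte = 1,000 bytes, 1 petabyte = 10^15 bytes), a 365-day year, and exactly 3 kilobytes per tweet, and the function name is hypothetical.

```python
# Back-of-envelope check, not the report's own calculation.
KILOBYTE = 10**3    # assumption: decimal (SI) kilobyte
PETABYTE = 10**15   # assumption: decimal (SI) petabyte


def tweet_petabytes_per_year(tweets_per_day, kb_per_tweet=3, days=365):
    """Yearly tweet data volume in petabytes for a given daily tweet count."""
    bytes_per_year = tweets_per_day * kb_per_tweet * KILOBYTE * days
    return bytes_per_year / PETABYTE


current = tweet_petabytes_per_year(500_000_000)      # today's rate
projected = tweet_petabytes_per_year(1_200_000_000)  # projected 2025 rate
growth = 1_200_000_000 / 500_000_000

print(f"current:   {current:.2f} PB/year")
print(f"projected: {projected:.2f} PB/year")
print(f"growth:    {growth:.1f}-fold")
```

With these assumptions the current rate comes to roughly 0.55 petabytes per year and the projected rate to roughly 1.31, slightly below the quoted 1.36. The gap presumably reflects different unit or rounding conventions. The 2.4-fold growth factor matches the quote.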

In other words, data acquisition in these domains is expected to grow by up to two orders of magnitude in the next decade, the researchers say.

“Although total genomic data could far exceed the demands for the others, with the right new innovations the net requirements could be similar to the domains of astronomy and YouTube,” according to the report.

The most practical, and perhaps only, solution for distributing genome sequences at a population scale, the researchers say, is to use “cloud-computing systems that minimize data movement and maximize code federation.”

The report adds that new developments from companies like Google, Amazon, and Facebook that include applications designed to “fit the frameworks of distributed computing efficient data centers and distributed storage and cloud computing paradigms” are also expected to be part of the solution.

Finally, authentication, encryption, and other security safeguards “must be developed” to ensure that genomic data remain private, the researchers conclude.