Big data is in the news a lot, but what is it? The first, and maybe most basic, level of big data is the structure itself. Where databases of the past were limited by the computational power of the single system that housed them, modern data architectures are elastic, able to store and compute over large datasets in a distributed fashion. The result is that data can be spread across many different servers within a data center, or even spread across the world.
SETI, the Search for Extraterrestrial Intelligence, for instance, launched a distributed computing project called SETI@home. It lets you volunteer your personal computer's spare cycles to help SETI comb through its data for signs of extraterrestrial life. You can, quite literally, help find ET. Big data systems similarly spread data and computation across dozens or even thousands of servers.
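The scatter/gather idea behind that kind of distribution can be sketched in miniature, with local worker threads standing in for remote servers; the `process_chunk` function and chunk sizes here are illustrative assumptions, not any real system's API.

```python
# Minimal scatter/gather sketch: split a large dataset into chunks,
# let independent workers process each chunk, then combine the partial
# results. SETI@home and big data platforms apply the same pattern
# across thousands of machines; here, threads stand in for servers.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real analysis: sum the values in this chunk.
    return sum(chunk)

def scatter_gather(data, n_workers=4):
    # Scatter: slice the dataset into roughly equal chunks.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Gather: process every chunk independently, then merge the results.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

total = scatter_gather(list(range(1000)))  # same answer as sum(range(1000))
```

The key property is that each chunk is processed without reference to any other, which is what lets the work move to wherever spare capacity exists, whether that is a home PC or a rack in a data center.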
But sometimes big data just gets dropped at your doorstep. CERN, the European Organization for Nuclear Research, has published the data from a recent run of its Large Hadron Collider in Geneva, Switzerland. You can find a link to download the data here and pore through it yourself, looking for signals of the next great breakthrough in particle physics. You, at your computer right now, could stumble upon something that thousands of scientists overlooked.
If you had a computer powerful enough.
Or even if you don’t. You could go to Amazon Web Services (AWS), or any public cloud, and fire up a cloud server with more computational capacity than existed in the entire world only a few decades ago. You could load it with Hadoop or another framework for processing extra-large data sets, and breathe in the excitement of your impending Nobel Prize.
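Hadoop's core abstraction is MapReduce, and the shape of it can be shown with a toy word count in plain Python; the three phases below mirror the pattern Hadoop runs at data-center scale, though the real framework distributes each phase across machines rather than running it in one process.

```python
# Toy MapReduce: map each record to key/value pairs, shuffle the pairs
# by key, then reduce each key's values to a final answer. This is the
# same three-phase pattern Hadoop executes across a cluster.
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    # Shuffle: group all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values (here, by summing counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {"big": 2, "data": 2, "is": 2, "everywhere": 1}
```

Because the map and reduce steps are independent per record and per key, a framework like Hadoop can run them on thousands of servers at once, which is what makes petabyte-scale datasets tractable.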
There is data of all kinds spread across the internet, waiting to be cross-referenced and cross-pollinated. The hard part isn't finding the data; it's knowing what to do with it. At this volume, the challenge is separating the wheat from the chaff, finding the good morsels in the sewage of raw data.
But data science is an interesting animal. By the time a dataset has grown complex enough to attract serious scientific analysis, the work has often gone beyond mere interest and become an investigation. What started on AWS can quickly spiral into a full-scale government research project. To make that work, you need a data center with the flexibility to connect to your AWS instance and integrate seamlessly with your private data, while moving petabyte after petabyte without breaking the bank.
Beginning in 2017, Agile Data Sites will be able to provide direct connections to AWS through its network. That means you won't have to transfer that data over the public internet; a direct fiber connection can reduce your costs dramatically. You can leverage your own servers in your own space for the computational power you need, without paying public cloud prices for it. You get the best of both worlds: the flexibility of the public cloud and the control of the private cloud.