|By Madhusudan Hanumantha Rao||
|May 11, 2012 09:00 AM EDT||
The Large Hadron Collider (LHC), the world's largest particle accelerator, which is being used to find the elusive "God particle," generates about 40 terabytes (1 TB = 1,000 GB) of data per day from its four main detectors. Assuming an average size of 1 GB for a movie, that amounts to a data stream worth 40,000 movies(!) in a day. This massive amount of data is distributed to selected institutions across the world for further research. Thus, in one year, CERN pumps about 15 petabytes (1 PB = 1,000,000 GB) of data into its private network and the Internet. Similarly, multiple information sources across the world are generating data so large as to challenge the technology that stores and processes it. To put things in a broader perspective, 90% of the world's data was created in the last two years!
Not only is data flooding us at huge volumes, it is also arriving in various forms: thousands of blogs and microblogs written all over the world (text); the constant stream of images produced by surveillance cameras (video); mobile phone conversations captured by government agencies (voice); and countless business events such as financial transactions, changes in shop-floor inventory levels, PoS transactions at a grocery store, or spikes in a telecom provider's call volumes. The proliferation of such "unstructured" data poses a difficult challenge for analytics. Moreover, a piece of information can be multi-dimensional; for example, a company's sales figures can be viewed by geography, time interval, product line and so on. Analyzing and extracting useful information from this ocean of raw data is an imperative for organizations across the world.
Imagine you are walking the streets of Manhattan, navigating with the help of your smartphone and looking for a birthday gift for your friend. You have spent quite some time shopping for that perfect watch and still haven't found one that matches your friend's taste and wrist size. Just when you are about to give up, you get an alert on your phone with the name of, and directions to, the closest store that has the watch you are looking for. This is possible because a shopping service has gathered real-time location and shopping-preference data, analyzed it and found a matching store. To remain competitive, businesses must be equipped with technology that can process data arriving at a fast pace, like the continuous stream of GPS coordinates from a smartphone.
The three paragraphs above summarize, respectively, the key features of what is termed "Big Data": Volume, Variety and Velocity, known as the 3 Vs. In simple terms, any data that is too big for its underlying infrastructure to store and process within acceptable time limits can be called Big Data. Nothing about Big Data is absolute: its "bigness" depends on an enterprise's capability to handle it, changes with time, is not measured in standard units, and is highly subjective.
But why has data exploded so much in recent times? The simple reason is that it is easier for all of us to produce data than ever before. Back in high school, my friends and I used to gawk wide-eyed and open-mouthed at the bulky personal computer with its 5.25-inch floppy disk drives. Today, I receive a somewhat similar look when I am seen without a smartphone in hand. The cost and size of computing hardware have fallen rapidly over the years, making it easier for the common man to own a device and thereby create more data. Similarly, networks are omnipresent in today's world, with LANs and WANs becoming the forgotten ancestors of today's 3G and 4G. The cost of producing software has also plummeted with the proliferation of the open-source and similar movements, hurtling software development into a low-cost or zero-cost zone. In a nutshell, hardware, software and networks, the ingredients for creating and unleashing data into the world, have all become exponentially cheaper by the day and have contributed to Big Data.
As the cost of producing data continues its downward spiral toward zero, it is naturally accompanied by a tremendous increase in the rate of data production itself, as seen earlier. This has led to a spurt in demand for technologies that can store and process huge amounts of data. The MapReduce framework has gained wide acceptance as the de facto standard for Big Data processing. In effect, this framework divides a large data set into smaller chunks and has them processed in a distributed computing environment. The framework then gathers the processed chunks back and combines them into a result that addresses the original problem. Google pioneered the Big Data technology revolution by using its version of MapReduce and the Google File System (GFS) for high-speed searching of remotely held data stores. Apache Hadoop, which is also based on the MapReduce framework, is the emerging standard for Big Data implementations. The future will see these and other similar solutions being optimized and targeting industry-specific Big Data pain points. It is estimated that more than half of the data in the world will be stored in Hadoop within the next few years. A key component in addressing Big Data problems is efficiently connecting dissimilar data types that are scattered across the Internet. This has led to the concept of "Linked Data," which builds relationships between data elements and allows the seeker of information to extract meaningful information from them.
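To make the divide-process-gather idea concrete, here is a minimal single-process sketch of the classic MapReduce word-count pattern in Python. This is not Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase) and the sample chunks are illustrative assumptions that only mimic what a real framework would distribute across many machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle step: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: combine each word's counts into a single total."""
    return {word: sum(counts) for word, counts in groups.items()}

# The framework would split the input and ship each chunk to a worker;
# here we just iterate over the chunks in one process.
chunks = ["big data is big", "data about data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

In a real Hadoop job, the map and reduce functions follow the same shape, but the chunks live in a distributed file system and the shuffle happens over the network between worker nodes.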
The Big Data movement, whether it means mere searching through large sets of data or a more controlled processing of data to gather useful information through data relationships, is here to stay. The time is ripe for software vendors to build solutions that address customers' requirements, which themselves are as diverse as the data we see around us.