If all the hype & deluge of headlines, articles & advanced analytics and reporting material is anything to go by, BIG DATA is the next big thing. At times you may even wonder what we have been doing in the name of analytics & insight generation thus far. So how much of the hype is justified? Is it just a bunch of data & BI companies that like to use the two magic words, BIG DATA & “Hadoop”, or is there a solution ready for enterprise-level needs?
So what is BIG DATA, one more time?
To begin with, BIG DATA is NOT just “BIG”. The name is a misnomer that implies it is only about size. Put simply, it is big, fast & diverse data that can come from varying sources and channels (offline & online) and cannot be processed or analyzed using traditional processes, databases or even data warehouses. It is also a methodology and approach (not just a technology solution) to collect, store, analyze and convert the volume, velocity and variety of data into business-critical, actionable insights that help organizations get ahead of the competition. A quick view of the key characteristics, the 3 Vs:
• Volume – A shift from managing terabytes to managing petabytes, exabytes & zettabytes of data. Facebook & Twitter alone reportedly generate approx. 20 terabytes of data each day.
• Variety – A complex combination of raw, structured, semi-structured and unstructured data from web pages, log files, indexes, social media, emails, documents and sensor data from active & passive systems, driven by the explosion of sensors, smart devices, communication & social collaboration.
• Velocity – The speed at which data flows in and has to be analyzed to provide near-real-time, actionable insights. It calls for the capability to parse data in motion and not just data at rest.
What is it trying to solve?
For me, it is not so much about solving a problem as about creating an opportunity that has been around for a while but has never been tapped. It is an attempt to provide businesses with insights, hidden behaviors and patterns that they didn’t know they didn’t know. If executed successfully, organizations could benefit by:
• Applying predictive models and scoring against fast-moving data and complex event streams for smarter decisions in real time
• Turning massive amounts of data from online customer behavior and social media activity into valuable and timely business insight
• Becoming a proactive organization by using big data analytics to speed up recognition and resolution of problems in customer experiences, supply chains, and business processes
• Addressing new challenges posed by streaming data, social media data, content, events and so on
How is it (BD) different from a conventional Data Warehouse (DW)?
There are a few fundamental differences:
• Variety – A DW is better suited to analyzing structured data; BD solves the “variety” challenge
• Processing – Data in a DW is usually cleansed, enriched and modeled before being stored, which gives it a higher value per byte, whereas data in BD does not go through the same quality controls & checks because of the obvious cost. It is typically stored in its native format.
• Shelf Life – Data in a DW can have a much longer shelf life than data in BD
In a well laid out enterprise solution, a BD solution could push its “reduced” data from a “MapReduce” program permanently into a DW. In other words, BIG DATA will never replace a DW but will complement it.
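As a minimal sketch of that hand-off, the snippet below loads already “reduced” (aggregated) rows into a warehouse-style table. Everything in it is an assumption made for illustration: an in-memory SQLite database stands in for the DW, the table name page_views_daily and the function load_reduced_output are hypothetical, and the sample rows are made up.

```python
import sqlite3


def load_reduced_output(conn, reduced_rows):
    """Store aggregated (page, views) rows in a warehouse-style table."""
    # Hypothetical table; a real DW would have its own modeled schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views_daily ("
        "page TEXT PRIMARY KEY, views INTEGER NOT NULL)"
    )
    # Upsert so that re-running the load does not duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO page_views_daily (page, views) VALUES (?, ?)",
        reduced_rows,
    )
    conn.commit()


if __name__ == "__main__":
    # SQLite in memory is only a stand-in for an enterprise data warehouse.
    conn = sqlite3.connect(":memory:")
    reduced = [("/home", 1200), ("/pricing", 310)]  # made-up sample output
    load_reduced_output(conn, reduced)
    query = "SELECT page, views FROM page_views_daily ORDER BY views DESC"
    print(conn.execute(query).fetchall())
```

The point of the sketch is the division of labor: the raw, high-volume data stays on the BD side, and only the small, cleansed summary is kept long term in the DW.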
How Does It Work – Technology?
The technology could be summarized as an implementation of Hadoop, Apache’s Java-based open source computing environment built on top of a distributed, clustered file system designed for large-scale data operations. It is based on the “MapReduce” programming paradigm, which breaks a massive job into sub-tasks (mappers and reducers) that manipulate data stored across a cluster of hundreds or thousands of servers for massive parallelism.
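To make the paradigm a little more concrete, here is a rough sketch that imitates the map, shuffle and reduce phases in plain Python on a single machine. It is not Hadoop's Java API: the function names mapper, shuffle, reducer and run_job are hypothetical, and the word-count task is just an assumed example.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence on one machine.

    On a real cluster the mappers and reducers would run in parallel
    on many servers, each working on its own slice of the data.
    """
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data is big", "data in motion"]))
```

Because each mapper only sees its own line and each reducer only sees one key, the sub-tasks do not depend on each other, which is what allows the work to be spread across a cluster.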
Hadoop provides a base platform with Java APIs; applications have to be built on top of it, often using higher-level languages like Pig, Hive and Jaql that abstract some of the internal complexities. An ideal solution would extend the Hadoop platform & framework with enterprise-grade security, governance, availability and scalability, along with a data visualization tool for easy analysis and insight generation.
It would be premature to either write it off or treat it as a universal solution to all measurement, analytics & insight needs, but it has enough momentum to deserve attention, investment & a well-planned, well-architected execution. It is definitely a step in the right direction for any business, and an initiative that is here to stay.