Big Data & Hadoop: An Introductory Guide


Information is the oil of the 21st century, and analytics is the combustion engine.
– Peter Sondergaard, Gartner Group.

It is a capital mistake to theorize before one has data. Insensibly, one begins to twist the facts to suit theories, instead of theories to suit facts.
– Sherlock Holmes by Arthur Conan Doyle.

There are other kinds of data, and then there is Big Data. Data sets in this category typically run to a terabyte and well beyond, which is the most obvious reason an individual computer cannot handle data at such staggering levels.

Big Data Origins
In the early 2000s, Google was wrestling with how to store ever-growing stacks of data. Managing files by hand across machines was a painstaking, time-consuming process, so Google built an automated, distributed storage layer and described it in a 2003 paper, ‘The Google File System’. Doug Cutting, who was then building an open-source search framework, began putting the paper’s ideas into practice. Word got around and Yahoo hired Cutting; the work was developed at scale there and donated to Apache, where Hadoop became a full-fledged project.
Google then went a step further and looked for a similar breakthrough in analytics, publishing a second paper, ‘MapReduce: Simplified Data Processing on Large Clusters’ (2004). Cutting and the Yahoo team used that paper to implement a matching processing engine, which was likewise handed over to Apache. This is how Hadoop evolved into a complete framework: distributed storage plus distributed processing.

Database: Then & Now
Relational databases, with their fixed schemas, served well into the early 2000s. Then applications with enormous user bases, such as Twitter and Facebook, arrived, and huge volumes of unstructured data, including video, audio and other files, began raising issues. While Google, Doug Cutting and Yahoo were working on handling huge data sets, IBM articulated what exactly Big Data was. Its defining characteristics have been cited repeatedly ever since.

Three Big Data Identifiers
As attributed to IBM, the three main characteristics that define Big Data are: 1. Volume, 2. Velocity and 3. Variety:

The volume of data is so large that it goes beyond an organization’s ability to handle it using conventional methods; relational databases are simply not scalable enough for data at this level.

By velocity, we mean that data arrives so fast that the whole process of gathering, analyzing and interpreting it, if carried out with the usual methods, could take a year. The flow of data quickly exceeds the company’s capability to handle it, and conventional technologies cannot process it in time.

Variety in data is a major reason organizations look for Big Data solutions. Information arrives in many different formats, from structured to unstructured and everything in between: not just documents, but 3D models, photographs, video and audio elements, and uncategorized data are all part of the mix. Such diversity makes it difficult to interpret the data in the usual way; hence Big Data. A fourth quality often added is veracity, which concerns the authenticity of the data and the need to verify it before analysis.

Why Big Data?
As things stand, many companies are collecting and then discarding vast amounts of potentially important data. To take real-life examples, data gathered by a certain ‘loyalty card’ company was never interpreted for further use, and, at the time of writing, most hospital video footage is deleted within weeks of being recorded. Thanks to social networking sites and prolific online activity, healthcare, retail, finance and other sectors have started maintaining databases of customer activity. It is now possible to extract data from videos and still images, the Internet of Things (IoT), and purchasing patterns. Big Data techniques are what make proficient use of this voluminous, otherwise wasted, information possible. Three real-life cases of resourceful Big Data usage:

A top telecom company that provides data, Internet and voice services to over 6 million customers has put Big Data analytics to good use: its Big Data specialists handle more than 1 TB of data every day.

One of the most tangible beneficiaries has been the retail industry. Top retail chains are turning to Big Data to beat the competition: trends can be predicted and demand estimated for the best-selling items. Online retailers are at an even greater advantage; they gain knowledge of browsing patterns, demographic information and individual customers’ purchase histories, and can segment customers according to expected purchase behaviour.

Big Data has also had a telling effect on healthcare, where documenting signs and symptoms at scale can help reduce hospital and clinic visits.

Big Data & Hadoop
Big Data can be put to use only with the right tools, ones that can bring together varied data types from many different sources. Among the many tools and technologies used to interpret Big Data, Hadoop is a leading one. The Hadoop library is a framework for dependable, distributed computing, and it provided the first feasible platform for Big Data analytics. Early adopters are using Hadoop to carry out work that was previously considered impossible; LinkedIn, for example, generates over 100 billion customized recommendations every week with the help of Hadoop.

What exactly does Hadoop do?
In simple terms, Hadoop does two things: 1. It stores intimidatingly large data sets (using the Hadoop Distributed File System, HDFS). 2. It runs analytics over those data sets (using the MapReduce programming model from Google’s paper). One can thus think of Hadoop as the combination of a distributed file system and a data processing framework. MapReduce spreads a huge data set across many servers; each server processes the portion of the data it holds (the map step), and the partial results are then combined into a final summary (the reduce step). That division of labour is what allows answers to be extracted from huge data sets quickly.
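
To make the map-and-reduce picture concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API, lightly commented. The class names and the input/output paths are only illustrative; a real job would be packaged as a JAR and submitted to the cluster with the paths supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node scans its local slice of the input and
  // emits (word, 1) for every word it sees.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups all counts emitted for the same
  // word, and each reducer sums them into a final total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Even this small example shows the pattern: the mapper and reducer never know how large the data set is or which machines hold it, because the framework handles the distribution and the grouping between the two steps.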

Hadoop distributes data storage across many ordinary server computers, a cheaper and more reliable arrangement than a single costly high-end server. Rather than depending on the hardware for fault tolerance, Hadoop detects and handles failures at the application level: blocks of data are replicated on several machines, so the loss of any one node does not interrupt service, and a large group of individual computers is far less likely to fail all at once. Cloud-based Big Data infrastructure is another option for organizations that cannot afford to build it in-house. Running the whole show in the cloud can make good economic sense; for one thing, data that already lives in the cloud does not have to be downloaded before it can be processed.
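
As a brief illustration of the storage side, the sketch below uses Hadoop's FileSystem Java API to write a file to HDFS and read it back. The path and the file contents are made up for the example, and the cluster settings are assumed to come from the usual configuration files (core-site.xml, hdfs-site.xml) on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // Cluster settings (NameNode address, replication factor, ...) are
    // read from the Hadoop configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");  // hypothetical path

    // Write: HDFS splits the file into blocks and replicates each block
    // on several DataNodes (three copies by default).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("Hello from HDFS");
    }

    // Read: the client fetches blocks from whichever DataNodes hold them;
    // if one node has failed, another replica is used instead.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}
```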

Video Links: Big Data & Hadoop      

The first video shows Hadoop being used for something as critical as climate analysis, and includes use cases.

A short video talks about use cases of Hadoop across industries.