Skip to content
Cuelogic
  • Services
        • Product Engineering
          • Product Development
          • UX Consulting
          • Application Development
          • Application Modernization
          • Quality Assurance Services
        • Cloud Engineering
          • Cloud Services
          • DevOps Services
          • Cloud Migration
          • Cloud Optimization
          • Cloud Computing Services
        • Data & Machine Learning
          • Big Data Services
          • AI Consulting
        • Internet of Things
          • IoT Consulting
          • IoT Application Development
        • Innovation Lab as a Service
        • Cybersecurity Services
        • Healthcare IT Services
  • Company
    • About
    • Culture
    • Current Openings
  • Insights
Tell Us Your Project  ❯
Big Data

Big Data Frameworks

7 Mins Read

Introduction

The term ‘Big Data’ evokes images of large datasets - both structured and unstructured, having varied formats and sourced from various data sources. The volume of data alone does not define Big Data. There are 3V’s that are vital for classifying data as Big Data. These include Volume, Velocity and Veracity. When we speak of data volumes it is in terms of terabytes, petabytes and so on. Velocity is to do with the high speed of data movement like real-time data streaming at a rapid rate in microseconds. Veracity involves the handling approach for both structured and unstructured data.  

Implementation of Big Data infrastructure and technology can be seen in various industries like banking, retail, insurance, healthcare, media, etc. Big Data management functions like storage, sorting, processing and analysis for such colossal volumes cannot be handled by the existing database systems or technologies. Frameworks come into picture in such scenarios. Frameworks are nothing but toolsets that offer innovative, cost-effective solutions to the problems posed by Big Data processing and helps in providing insights, incorporating metadata and aids decision making aligned to the business needs. 

There are many frameworks presently existing in this space. Some of the popular ones are Spark, Hadoop, Hive and Storm. Some score high on utility index like Presto while frameworks like Flink have great potential. There are still others which need some mention like the Samza, Impala, Apache Pig, etc. Some of these frameworks have been briefly discussed below. 

Apache Hadoop

Apache-Hadoop Framework

Hadoop is a Java-based platform founded by Mike Cafarella and Doug Cutting. This open-source framework provides batch data processing as well as data storage services across a group of hardware machines arranged in clusters. Hadoop consists of multiple layers like HDFS and YARN that work together to carry out data processing.

  • HDFS (Hadoop Distributed File System) is the hardware layer that ensures coordination of data replication and storage activities across various data clusters. In the event of a cluster node failure, real-time can still be made available for processing.
  • YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling.
  • MapReduce is the software layer that functions as the batch processing engine.

Pros include cost-effective solution, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, better suited for R&D, high availability through excellent failure handling mechanism.

Cons include vulnerability to security breaches, does not perform in-memory computation hence suffers processing overheads, not suited for stream processing and real-time processing, issues in processing small files in large numbers.

Organisations powered by Hadoop include Amazon, Adobe, AOL, Alibaba, EBay, Facebook, etc.

Apache Spark 

Apache-Spark Framework

The Spark framework was formed at the University of California, Berkeley. It is a batch processing framework with enhanced data streaming processing.  With full in-memory computation and processing optimisation, it promises a lightning fast cluster computing system.

Spark framework is composed of five layers.

  • HDFS and HBASE: They form the first layer of data storage systems. 
  • YARN and Mesos: They form the resource management layer. 
  • Core engine: This forms the third layer.
  • Library: This forms the fourth layer containing Spark SQL for SQL queries while stream processing, GraphX and Spark R utilities for processing graph data and  MLlib for machine learning algorithms.
  • The fifth layer contains an application program interface such as Java or Scala.  

Spark can function as a standalone cluster along with a capable storage layer or it can provide seamless integration with Hadoop.

It supports some of the popular languages like Python, R, Java, and Scala.

Pros include scalability, lightning processing speeds through reduced number of I/O operations to disk, fault tolerance, supports advanced analytics applications with superior AI implementation and seamless integration with Hadoop

Cons include complexity of setup and implementation, language support limitation, not a genuine streaming engine.

Organisations powered by Spark include Alibaba TaoBao, Amazon, Autodesk, Baidu, Hitachi Solutions, NASA JPL – Deep Space Network, Nokia Solutions and Networks, etc.

Storm 

Storm Framework

Storm, an open source framework, was developed in Clojure language specifically for near real-time data streaming. It is an application development platform-independent, can be used with any programming language and guarantees delivery of data with the least latency.

In Storm architecture, there are 2 nodes - the Master Node and Worker/ Supervisor Node. The master node monitors the failures of machines and is responsible for task allocation. In case of a cluster failure, the task is reassigned to another one. 

Pros include ease in setup and operation, high scalability, good speed, fault tolerance, support for a wide range of languages

Cons include complex implementation, debugging issues and not very learner-friendly 

Organisations powered by Storm include Twitter, Yahoo, Verisign, Baidu, Alibaba, etc.

Apache Flink

Apache-Flink Framework

Apache Flink, an open-source framework is equally good for both batch as well as stream data processing. It is suited for cluster environments. It is based on transformations - streams concept. It is also the 4G of Big Data. It is the100 times faster than Hadoop -Map Reduce.

Flink system contains multiple layers

  • Deploy Layer
  • Runtime Layer
  • Library Layer

Pros include low latency, high throughput, fault tolerance, entry by entry processing, ease of batch and stream data processing, compatibility with Hadoop.

Cons include few scalability issues.

Organisations powered by Flink include AWS, Uber, CapitalOne, Alibaba.com, etc.

Hive

Hive Framework

Apache Hive, designed by Facebook, is an ETL (Extract / Transform/ Load) and data warehousing system. It is built on top of the Hadoop –HDFS platform.

The key components of the Hive Architecture include

  • Hive Clients
  • Hive Services
  • Hive Storage and Computing

The Hive engine converts SQL- queries or requests to MapReduce task chains. The engine comprises of 

Parser: It goes through the incoming SQL-requests and sorts them

Optimizer: It goes through the sorted requests and optimises them

Executor: It sends tasks to the Map Reduce framework

Pros include own query language HiveQL similar to SQL, suited for data-intensive jobs, support for a wide range of storages, shorter learning curve

Cons include not suited for online transaction processing.

Organisations powered by Hive include PayPal, Johnson & Johnson, Accenture PLC, Facebook Inc., J. P. Morgan, HortonWorks Inc, Qubole, etc. 

Presto

Presto Framework

Presto is the open-source distributed SQL tool most suited for smaller datasets up to 3Tb.

Presto engine includes a coordinator and multiple workers. When client submits queries, these are parsed, analysed, their execution planned and distributed for processing among the workers by the coordinator.

Pros include least query degradation even in the event of increased concurrent query workload. It has a query execution rate that is three times faster than Hive. Ease in adding images and embedding links. Highly user-friendly.  

Cons include reliability issues

Organisations powered by Presto include AirBnb, Facebook, NetFlix, DropBox, NasDAQ, Uber, etc.

Impala

Impala Framework

Impala is an open-source MPP (Massive Parallel Processing) query engine that runs on multiple systems under a Hadoop cluster. It has been written in C++ and Java. 

It is not coupled with its storage engine. It includes 3 main components

  • Impala Daemon (Impalad): It is executed on every node where Impala is installed.
  • Impala StateStore
  • Impala MetaStore

Impala has its query language like SQL.

Pros include supports in-memory computation hence accesses data without movement directly from Hadoop nodes, smooth integration with BI tools like Tableau, ZoomData, etc., supports a wide range of file formats.

Cons include no support for serialisation and deserialization of data, inability to read custom binary files, table refresh needed for every record addition.

Organisations powered by Impala include Bank of America, J. P. Morgan, Apple, MetLife, etc.

Samza

Samza Framework

Samza is an open-source tool for streaming data processing that was designed at LinkedIn.

It has 3 layers 

  • Streaming
  • Execution 
  • Processing

Pros include operational ease, high performance, horizontal scalability, ability to execute same code for batch processing as well as streaming data and pluggable architecture 

Cons include unsuitable for extremely low latency processing.

Organisations powered by Samza include Optimizely, Expedia, VMWare, ADP, etc.

Conclusion

There is no dearth for frameworks in the market currently for Big Data processing. There is no single framework that is best fit for all business needs. But to highlight a few frameworks, Storm seems best suited for streaming while Spark is the winner for batch processing. For every organisation or business, one’s own data is most valuable.  Investing in Big Data frameworks involves spending. Many frameworks are freely available while some come with a price. Depending on the project needs, avail of trial versions offered.

Understand the goals of the business. If possible, experiment with the framework on a smaller scale project to understand its functioning better. Scalability is an aspect which should be borne in mind for future implementations. At times, the solution may lie in using multiple frameworks depending on its feasibility for the various process components involved. Investing in the   right framework can pave the way for success in business.

Tags
Big Data Frameworks big data frameworks 2019 big data frameworks tools and systems big data tools and techniques Big Data Tools for Data Analysis Data Processing data processing software list data processing tools
Share This Blog
Share on facebook
Share on twitter
Share on linkedin

People Also Read

DevOps
What is Infrastructure as Code and How Can You Leverage It?
8 Mins Read
Quality Engineering
Cypress vs. Selenium: Which is the Superior Testing Tool?
7 Mins Read
Product Development
Micro Frontend Deep Dive – Top 10 Frameworks To Know About
10 Mins Read

Other Categories

case study
Case Studies
e book
E-Book
video
Webinars
Subscribe to our Blog
Subscribe to our newsletter to receive the latest thought leadership by Cuelogic experts, delivered straight to your inbox!
Services
Product Engineering
  • Product Development
  • UX Consulting
  • Application Development
  • Application Modernization
  • Quality Assurance
Menu
  • Product Development
  • UX Consulting
  • Application Development
  • Application Modernization
  • Quality Assurance
Data & Machine Learning
  • Big Data Services
  • AI Consulting
Menu
  • Big Data Services
  • AI Consulting
Innovation Lab as a Service
Cybersecurity Services
Healthcare IT Solutions
Cloud Engineering
  • Cloud Services
  • DevOps Services
  • Cloud Migration
  • Cloud Optimization
  • Cloud Computing Services
Menu
  • Cloud Services
  • DevOps Services
  • Cloud Migration
  • Cloud Optimization
  • Cloud Computing Services
Internet of Things
  • IoT Consulting
  • IoT App Development
Menu
  • IoT Consulting
  • IoT App Development
Company
  • About
  • Culture
  • Current Openings
Menu
  • About
  • Culture
  • Current Openings
We are Global
India  |  USA  | Australia
We are Social
Facebook
Twitter
Linkedin
Youtube
Subscribe to our Newsletter
cuelogic

We are Hiring!

Blogs
  • What is Infrastructure as Code and How Can You Leverage It?
  • Cypress vs. Selenium: Which is the Superior Testing Tool?
  • Micro Frontend Deep Dive – Top 10 Frameworks To Know About
  • Micro Frontends – Revolutionizing Front-end Development with Microservices
  • Decoding Pipeline as Code (With Jenkins)
  • DevOps Metrics : 15 KPIs that Boost Results & RoI
Menu
  • What is Infrastructure as Code and How Can You Leverage It?
  • Cypress vs. Selenium: Which is the Superior Testing Tool?
  • Micro Frontend Deep Dive – Top 10 Frameworks To Know About
  • Micro Frontends – Revolutionizing Front-end Development with Microservices
  • Decoding Pipeline as Code (With Jenkins)
  • DevOps Metrics : 15 KPIs that Boost Results & RoI
cuelogic

We are Hiring!

Blogs
  • What is Infrastructure as Code and How Can You Leverage It?
  • Cypress vs. Selenium: Which is the Superior Testing Tool?
  • Micro Frontend Deep Dive – Top 10 Frameworks To Know About
  • Micro Frontends – Revolutionizing Front-end Development with Microservices
  • Decoding Pipeline as Code (With Jenkins)
  • DevOps Metrics : 15 KPIs that Boost Results & RoI
Menu
  • What is Infrastructure as Code and How Can You Leverage It?
  • Cypress vs. Selenium: Which is the Superior Testing Tool?
  • Micro Frontend Deep Dive – Top 10 Frameworks To Know About
  • Micro Frontends – Revolutionizing Front-end Development with Microservices
  • Decoding Pipeline as Code (With Jenkins)
  • DevOps Metrics : 15 KPIs that Boost Results & RoI
We are Global
India  |  USA  | Australia
We are Social
Facebook
Twitter
Linkedin
Youtube
Subscribe Newsletter
Services
Product Engineering

Product Development

UX Consulting

Application Development

Application Modernization

Quality Assurance Services

Cloud Engineering

Cloud Services

DevOps Services

Cloud Migration

Cloud Optimization

Cloud Computing Services

Data & Machine Learning

Big Data Services

AI Consulting

Internet of Things

IoT Consulting

IoT Application Services

Innovation Lab As A Service
Cybersecurity Services
Healthcare IT Services
Company

About

Culture

Current Openings

Insights
Privacy Policy  
All Rights Reserved @ Cuelogic 2021

Close