Big Data with Hadoop
1.3 Comparison with Other Systems
1.3.1 Traditional Data Processing Limitations
For decades, organizations managed their data using systems designed for a different era – an era where data was typically structured, came in manageable quantities, and wasn’t generated at lightning speeds. These traditional methods generally involved:
● Single, Powerful Machines: Relying on one very powerful server (often called vertical scaling) to handle all data storage and processing. This works well for many applications, but there’s a limit to how big and powerful one machine can get.
● Structured Data Focus: Primarily designed to handle data that fits neatly into rows and columns, like customer records or inventory lists.
● Batch Processing: Often processing data in large batches overnight or periodically, rather than continuously in real time.
The core limitation is that these traditional systems were simply not built to handle the “3 V’s” of Big Data. They would quickly become overwhelmed, slow down, or simply crash when faced with petabytes of varied, fast-moving information.
1.3.2 Relational Database Management Systems
One of the most common and widely used traditional data processing systems is the Relational Database Management System (RDBMS). You might have heard of databases like MySQL, Oracle, SQL Server, or PostgreSQL.
1.3.2.1 Characteristics and Use Cases of RDBMS
RDBMS are like highly organized digital filing cabinets. They store data in tables, which have predefined schemas (like a blueprint for the table, defining columns and their data types). Data in different tables can be related to each other using common fields, hence the “relational” in their name.
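This “related through common fields” idea can be sketched with Python’s built-in sqlite3 module. This is a minimal illustration, not a production schema; the table names and rows are hypothetical:

```python
import sqlite3

# Two tables with predefined schemas, related through a common field
# (orders.customer_id refers to customers.id). Hypothetical example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'laptop'), (11, 1, 'mouse'), (12, 2, 'desk')")

# SQL expresses the relationship declaratively: join the tables on the common field.
rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()
# rows pairs each customer with their ordered items
```

The JOIN is the “relational” part in action: neither table duplicates the other’s data, yet a single query can combine them through the shared key.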
● Characteristics:
○ Structured Data: Primarily designed for structured data that fits into rows and columns.
○ SQL: They use SQL (Structured Query Language), a powerful language for defining, manipulating, and querying data. It’s very efficient for complex queries on structured data.
○ ACID Properties: RDBMS are known for ensuring Atomicity, Consistency, Isolation, and Durability, which together guarantee reliable transaction processing. This means that data changes are either fully completed or not applied at all, maintaining data integrity.
○ Vertical Scalability: Traditionally, RDBMS scale by upgrading to a more powerful server (more CPU, more RAM, faster disks).
● Use Cases:
○ Transactional Systems: Perfect for online transaction processing (OLTP) applications like banking systems, e-commerce order processing, or airline reservation systems, where data integrity and consistent transactions are paramount.
○ Business Applications: Storing customer information, product catalogs, financial records, and employee data.
○ Web Applications: Many websites use an RDBMS as their backend to store user data, content, etc.
Example: Imagine a bank’s system. When you transfer money, the RDBMS ensures that the money is deducted from one account and added to the other as a single, correct operation, maintaining perfect balance and preventing errors, even if the system crashes midway.
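The atomicity behind this guarantee can be sketched with Python’s built-in sqlite3 module. This is a toy illustration of an ACID transaction, not a real banking system; the account names and balances are hypothetical:

```python
import sqlite3

# Hypothetical accounts table with two balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one atomic unit: both happen, or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            new_src = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                   (src,)).fetchone()[0]
            if new_src < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the partial debit was rolled back; balances are unchanged

transfer(conn, "alice", "bob", 30)   # succeeds atomically
transfer(conn, "alice", "bob", 500)  # fails mid-way and is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed second transfer leaves no half-done debit behind, which is exactly the “fully completed or not at all” property described above.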
1.3.2.2 Limitations of RDBMS for Big Data
While RDBMS are excellent for what they were designed for, they encounter significant hurdles when faced with Big Data:
1. Scalability: RDBMS primarily rely on vertical scaling (making one server bigger). This has physical limits and becomes extremely expensive. They struggle to scale out horizontally across many commodity servers, which is crucial for Big Data volumes.
2. Variety: They are not well-suited for unstructured or semi-structured data like text, images, videos, or sensor data. Trying to force this type of data into a rigid table structure is inefficient and often impossible.
3. Schema Rigidity: The requirement for a predefined schema (a fixed structure for your data) means that if your data type or structure changes frequently, it’s cumbersome to update the database schema. Big Data often involves rapidly evolving data formats.
4. Cost: Scaling an RDBMS vertically to handle large loads can become prohibitively expensive, requiring specialized hardware and licensing.
5. Performance for Big Data Queries: While SQL is powerful, running complex analytical queries on massive datasets in traditional RDBMS can be very slow, often taking hours or days.
In short, RDBMS are fantastic for structured, transactional data, but they hit a wall when dealing with the scale, speed, and diversity of Big Data.
1.3.3 Grid Computing
Moving beyond single-machine limitations, let’s look at another distributed computing paradigm: Grid Computing.
1.3.3.1 Definition and Principles
Grid Computing involves using a network of many geographically dispersed computers (often from different organizations) to work together on a common task. Think of it like a distributed supercomputer. The idea is to aggregate unused computing power from various sources and apply it to complex problems.
● Principles:
○ Resource Sharing: Computers contribute their idle CPU cycles, storage, and network bandwidth.
○ Heterogeneity: The computers in a grid can be diverse (different operating systems, hardware).
○ Geographical Distribution: Resources can be spread across different locations and organizations.
○ Focus on CPU Cycles: Often used for computationally intensive tasks that can be broken into many small, independent sub-problems.
Example: Imagine a massive scientific calculation that requires more processing power than any single supercomputer can provide. A grid could link universities’ and research labs’ computers, using their spare capacity to crunch the numbers.
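The “many small, independent sub-problems” idea can be sketched in Python. Here threads stand in for grid nodes purely for illustration (Python threads don’t actually parallelize CPU-bound work, and a real grid ships each sub-task to a remote machine); the prime-counting task and chunk sizes are arbitrary choices:

```python
from concurrent.futures import ThreadPoolExecutor

def count_primes(lo, hi):
    """CPU-bound sub-problem: count the primes in [lo, hi)."""
    def is_prime(n):
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))
    return sum(1 for n in range(lo, hi) if is_prime(n))

# Split the range [0, 10000) into independent chunks, one per "node".
# No chunk depends on any other, so workers can run in any order.
chunks = [(i, i + 1000) for i in range(0, 10000, 1000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(lambda c: count_primes(*c), chunks))

# Aggregate the small per-chunk results into the final answer.
total = sum(partial_counts)
```

Because each chunk is self-contained, a coordinator only needs to hand out ranges and sum the answers, which is why loosely coupled, heterogeneous machines can cooperate on such problems.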
1.3.3.2 How Grid Computing Differs from Big Data Approaches
While both Big Data systems and Grid Computing involve distributed resources, there are crucial differences:
| Characteristic | Grid Computing | Big Data (e.g., Hadoop) |
| --- | --- | --- |
| Primary Focus | CPU-intensive computation | Data-intensive processing |
| Data Handling | Relatively smaller datasets, often transferred to the computation | Massive datasets that stay where they are stored |
| Fault Tolerance | Can be challenging; if a node fails, its task is typically lost and must be resubmitted | Designed with inherent fault tolerance; if a node fails, the framework automatically reruns its work on another node |
| Data Locality | Less emphasis; data is often moved to the computation | High emphasis; computation is moved to the data to minimize network transfer |
| Resource Type | Often heterogeneous, geographically dispersed, sometimes unstable | Typically homogeneous clusters within a data center, built from commodity hardware |
| Typical Usage | Scientific simulations, rendering, drug discovery | Large-scale data analytics, machine learning, search indexing |
The key takeaway is that grid computing is often about moving data to computation on diverse, sometimes unstable, CPU resources, while Big Data frameworks like Hadoop are designed around moving computation to the data on clusters optimized for massive I/O and fault tolerance.
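The “move computation to the data” principle can be sketched in a few lines of Python. The lists below are hypothetical stand-ins for data blocks stored on three different nodes; the point is that only a small function and small summaries cross the (imagined) network, never the raw records:

```python
from collections import Counter

# Hypothetical data blocks, as they might sit on three different nodes.
blocks = [
    ["error", "info", "error"],
    ["info", "warn", "error"],
    ["warn", "warn", "info"],
]

def local_count(block):
    """Runs where the block lives; returns a small summary, not the raw data."""
    return Counter(block)

# Conceptually, the function is "shipped" to each block; only the tiny
# Counter summaries travel back to be merged.
partials = [local_count(b) for b in blocks]
total = sum(partials, Counter())
```

If the blocks held terabytes of log lines instead of three strings, shipping `local_count` to each node and merging kilobyte-sized counters would be vastly cheaper than hauling the blocks to one machine, which is the economics Hadoop is built around.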
1.3.4 Volunteer Computing
A specific type of grid computing, and one that highlights the concept of distributed power, is Volunteer Computing.
1.3.4.1 Concept and Examples
Volunteer Computing is a form of distributed computing where individuals voluntarily donate their computer’s unused processing power to scientific research projects. You download a small client program that runs in the background, fetching computational tasks, processing them, and sending the results back.
● Examples:
○ SETI@home: One of the most famous projects, which searched for extraterrestrial intelligence by analyzing radio telescope data.
○ Folding@home: Simulating protein folding to understand diseases like Alzheimer’s, Parkinson’s, and COVID-19.
○ BOINC: A platform that supports many different volunteer computing projects.
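The fetch-process-report loop a volunteer client runs can be sketched in Python. The in-memory queue below is a hypothetical stand-in for a project server (such as one built on BOINC), and the sum-of-squares task is an arbitrary placeholder computation:

```python
import queue

# Stand-in for the project server: a queue of work units, each a number range.
work_units = queue.Queue()
for lo in range(0, 30, 10):
    work_units.put((lo, lo + 10))

results = []  # stand-in for results reported back to the server

def run_client():
    """Fetch a task, process it, report the result; repeat until no work is left."""
    while True:
        try:
            lo, hi = work_units.get_nowait()        # fetch a work unit
        except queue.Empty:
            return                                   # no more tasks: go idle
        answer = sum(n * n for n in range(lo, hi))   # placeholder computation
        results.append(((lo, hi), answer))           # report back

run_client()
```

A real client adds what this sketch omits: retries when the volunteer machine goes offline mid-task, and redundant assignment of the same unit to several volunteers so results can be cross-checked.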
1.3.4.2 Its Role in Distributed Computing
Volunteer computing has demonstrated the power of harnessing collective, distributed resources for massive computational problems.
● Benefits:
○ Cost-Effective: It leverages free, donated computing power, making it very economical for projects that couldn’t afford supercomputers.
○ Massive Scale: Can aggregate enormous amounts of processing power.
● Limitations for Big Data:
○ Unreliability: Individual volunteer nodes can go offline at any time, requiring robust mechanisms for task redistribution and result validation.
○ Lack of Data Locality: Data for processing has to be sent to each volunteer computer, and results sent back. This creates significant network overhead and is highly inefficient for the very large datasets that Big Data deals with. It’s simply not practical to send petabytes of data to thousands of individual home computers.
○ Security Concerns: A less controlled environment compared to dedicated data centers.
○ Not for Real-Time: The asynchronous nature and potential for delays make it unsuitable for applications requiring real-time or near real-time data processing.
So, while volunteer computing showcases the potential of distributed power, its lack of data locality, reliability, and real-time capabilities makes it unsuitable for the core challenges of Big Data processing.