Big Data with Hadoop
1.4 A Brief History of Hadoop
The emergence of Big Data challenges wasn’t something that happened overnight. It was a gradual build-up, and the solutions, including Hadoop, evolved to meet these growing demands.
1.4.1 Origins and Evolution
Hadoop didn’t just appear out of thin air. Its roots can be traced back to the need to process incredibly large amounts of data, particularly in the context of search engines. Imagine the sheer scale of data that Google, for example, needs to process every single day to index the entire internet and respond to billions of search queries.
The inspiration for Hadoop came directly from two landmark papers published by Google:
● Google File System Paper (2003): This paper described how Google stored its massive datasets across thousands of commodity servers. They couldn’t rely on expensive, high-end storage systems. Instead, they built a distributed file system that was designed to be fault-tolerant and highly scalable, even with inexpensive hardware.
● MapReduce Paper (2004): This paper outlined a new programming model for processing and generating large datasets. It introduced the concepts of "Map" (transforming data) and "Reduce" (aggregating data) operations, which could be executed in parallel across a cluster of machines (see the sketch after this list).
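To make the Map and Reduce ideas concrete, here is a minimal Python sketch that simulates the model in a single process. It is purely illustrative: the function names and the in-memory "shuffle" step are a simplification of my own, not Hadoop’s actual Java API, and a real cluster would run many mappers and reducers in parallel on different machines.

from collections import defaultdict

def map_phase(document):
    # "Map": transform each input record into (key, value) pairs.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key; on a real cluster this grouping
    # happens between the map and reduce stages, over the network.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # "Reduce": aggregate all values that share the same key.
    return (key, sum(values))

documents = ["big data with hadoop", "hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'with': 1, 'hadoop': 2, 'processes': 1}

The word-count example used here is the same one the original MapReduce paper uses to introduce the model: the mapper emits a count of 1 per word, and the reducer sums the counts for each word.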
Inspired by these papers, Doug Cutting and Mike Cafarella began developing an open-source project called Nutch. Nutch was a web search engine designed to crawl and index the web on a massive scale. As Nutch grew, they realized the need for a distributed file system and a distributed processing framework similar to what Google had described.
In 2006, Doug Cutting split the distributed storage and computing parts out of Nutch and created a new project called Hadoop. The name "Hadoop" actually came from Doug Cutting’s son’s toy elephant!
So, in essence, Hadoop was born out of a practical need to build a web search engine capable of handling web-scale data, by adopting the groundbreaking ideas proposed by Google. It was designed from the ground up to handle massive datasets on clusters of commodity hardware, addressing the very limitations we discussed earlier.
1.4.2 Key Milestones in Hadoop’s Development
Hadoop quickly gained traction and evolved significantly over the years. Here are some key milestones:
● 2004–2005: Doug Cutting and Mike Cafarella build a distributed file system and a MapReduce implementation inside Nutch, inspired by Google’s GFS and MapReduce papers.
● 2006: The distributed computing code is split out of Nutch into a new project named Hadoop. Doug Cutting joins Yahoo!, which becomes a major backer and user of Hadoop, providing significant resources and testing grounds. This was a critical period for Hadoop’s development, as it was battle-tested on Yahoo!’s massive data infrastructure.
● 2008: Hadoop becomes a top-level Apache project, signifying its maturity and robust open-source community support. This is a big deal, as it means Hadoop is recognized as a foundational technology.
● 2009: Hadoop sorts a terabyte of data in 62 seconds using roughly 1,460 nodes in Yahoo!’s data centers, and a petabyte in about 16.25 hours, showcasing its scalability.
● 2011: Hadoop processes 500 terabytes of data in 23 minutes, demonstrating further improvements in efficiency and scalability.
● Post-2011 (Hadoop 2.x and Beyond): The introduction of YARN marks a significant architectural shift. YARN separated the resource management and job scheduling functions from MapReduce, allowing other processing frameworks (like Apache Spark and Apache Tez) to run on top of Hadoop’s distributed storage. This transformed Hadoop from just a MapReduce platform into a broader ecosystem for various Big Data workloads (see the sketch after this list).
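To illustrate what the YARN split enables, here is a minimal PySpark sketch of a non-MapReduce framework running on a Hadoop cluster. It assumes a working YARN cluster with PySpark installed and configured, and the input path hdfs:///data/logs.txt is a hypothetical placeholder. The point is simply that Spark asks YARN for cluster resources and reads from Hadoop’s distributed storage, with no MapReduce involved.

from pyspark.sql import SparkSession

# Request cluster resources from YARN rather than running a MapReduce job;
# HDFS still provides the distributed storage underneath.
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-demo")
         .getOrCreate())

# Count the lines of a file stored on HDFS (placeholder path).
lines = spark.read.text("hdfs:///data/logs.txt")
print(lines.count())

spark.stop()

Before YARN, the cluster’s resources were tied to MapReduce itself; with YARN as a shared resource manager, this Spark job and a classic MapReduce job can share the same cluster and the same data.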
These milestones show Hadoop’s journey from an experimental project to a robust, scalable, and versatile platform capable of handling enormous amounts of data. It moved from being a single, integrated solution to a modular architecture with a shared resource manager, which paved the way for the rich Hadoop ecosystem we see today.