Chapter 4: Hadoop I/O
Easy Level Questions
These questions aim to assess the students’ ability to recall fundamental facts and basic concepts related to Hadoop I/O.
1. Define the term "data integrity" in the context of Hadoop and explain why it is crucial for Big Data processing. [CO3]
2. Name two common compression codecs used in Hadoop and state a primary benefit of using compression with large datasets. [CO3]
3. What is serialization in Hadoop, and why is it necessary for data transfer between nodes? [CO3]
4. Briefly describe the purpose of Writable classes in Hadoop, noting their role in data serialization and deserialization. [CO3]
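For reference alongside questions 3 and 4, the following is a minimal sketch of a custom Writable implementation. The class name PageViewWritable and its fields are illustrative assumptions, not part of any Hadoop API.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A hypothetical record type holding a timestamp and a view count.
public class PageViewWritable implements Writable {
    private long timestamp;
    private int viewCount;

    // The framework requires a no-argument constructor for deserialization.
    public PageViewWritable() {}

    public PageViewWritable(long timestamp, int viewCount) {
        this.timestamp = timestamp;
        this.viewCount = viewCount;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        out.writeLong(timestamp);
        out.writeInt(viewCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in the same order they were written.
        timestamp = in.readLong();
        viewCount = in.readInt();
    }
}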
Moderate Level Questions
These questions require students to apply their knowledge, interpret information, and make connections between different Hadoop I/O concepts.
5. Explain how Hadoop ensures data integrity during data transfer and storage, mentioning at least one specific mechanism. [CO3]
6. Discuss the trade-offs involved when choosing a compression codec for a Hadoop job, considering factors such as compression ratio and processing speed. [CO3]
7. Compare and contrast Hadoop’s Writable interface with standard Java serialization. When would you prefer to use Writable? [CO3]
8. Describe the characteristics and typical use cases of MapFile in Hadoop. How does it differ from simple flat file storage? [CO3]
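To make the contrast in question 8 concrete, here is a small sketch of writing and reading a MapFile; the output directory demo/wordcounts.map and the sample records are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path("demo/wordcounts.map");  // hypothetical output directory

        // A MapFile is a directory containing a sorted data SequenceFile plus an index;
        // keys must be appended in ascending order.
        try (MapFile.Writer writer = new MapFile.Writer(conf, dir,
                MapFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("apple"), new IntWritable(3));
            writer.append(new Text("banana"), new IntWritable(7));
        }

        // The index enables random lookup by key, unlike a flat-file scan.
        try (MapFile.Reader reader = new MapFile.Reader(dir, conf)) {
            IntWritable value = new IntWritable();
            reader.get(new Text("banana"), value);
            System.out.println("banana -> " + value);
        }
    }
}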
Difficult Level Questions
These questions challenge students to synthesize information, critically evaluate concepts, and design solutions, demonstrating a deeper understanding of Hadoop I/O.
9. Analyze the impact of data compression on the overall performance of a MapReduce job, considering both I/O operations and CPU utilization. Provide an example where compression might negatively affect performance. [CO3, CO5]
10. Propose a scenario where a custom Writable class would be essential for efficient data processing in Hadoop. Outline the key considerations for designing such a class. [CO3, CO5]
11. Discuss the benefits of using file-based data structures like SequenceFile or MapFile over raw text files for storing intermediate and final output in complex Hadoop workflows. Consider factors such as splittability, metadata, and performance. [CO3]
12. Design a strategy for optimizing data input and output operations for a Hadoop cluster that frequently processes large volumes of small files. Consider how serialization, compression, and file formats could be leveraged to improve efficiency. [CO3, CO5]
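As one possible starting point for questions 11 and 12, the sketch below packs many small records into a single block-compressed SequenceFile, which is splittable and avoids the overhead of storing thousands of tiny files in HDFS. The destination path, key/value choice (file name to raw bytes), and sample contents are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("packed/small-files.seq");  // hypothetical destination

        // One splittable container file instead of many tiny HDFS files:
        // key = original file name, value = raw file bytes, block-compressed.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            byte[] contents = "example file contents".getBytes(StandardCharsets.UTF_8);
            writer.append(new Text("sample-0001.txt"), new BytesWritable(contents));
        }
    }
}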