Chapter 4: Hadoop I/O  

Easy Level Questions

These questions aim to assess the students’ ability to recall fundamental facts and basic concepts related to Hadoop I/O.

1. Define the term "data integrity" in the context of Hadoop and explain why it is crucial for Big Data processing. [CO3]

2. Name two common compression codecs used in Hadoop and state a primary benefit of using compression with large datasets. [CO3]

3. What is serialization in Hadoop, and why is it necessary for data transfer between nodes? [CO3]

4. Briefly describe the purpose of Writable classes in Hadoop, noting their role in data serialization and deserialization (a minimal round-trip sketch follows this list). [CO3]
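For reference when attempting question 4: a minimal sketch of serializing and deserializing a Writable by hand, assuming only the Hadoop client library (org.apache.hadoop.io) is on the classpath. The class name WritableRoundTrip and the value 163 are illustrative, not part of any Hadoop API.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;

    public class WritableRoundTrip {
        public static void main(String[] args) throws IOException {
            // Serialize: a Writable writes its own fields to any DataOutput.
            IntWritable original = new IntWritable(163);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            // Deserialize: readFields() repopulates an existing, reusable object,
            // avoiding the per-record object creation of standard Java serialization.
            IntWritable restored = new IntWritable();
            restored.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            System.out.println(restored.get());   // prints 163
        }
    }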

Moderate Level Questions

These questions require students to apply their knowledge, interpret information, and make connections between different Hadoop I/O concepts.

5. Explain how Hadoop ensures data integrity during data transfer and storage, mentioning at least one specific mechanism. [CO3]

6. Discuss the trade-offs involved when choosing a compression codec for a Hadoop job, considering factors like compression ratio and processing speed. [CO3]

7. Compare and contrast Hadoop’s Writable interface with standard Java serialization. When would you prefer to use Writable? [CO3]

8. Describe the characteristics and typical use cases of MapFile in Hadoop. How does it differ from simple flat-file storage? (A short read/write sketch follows this list.) [CO3]
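For reference when attempting question 8: a sketch of writing and then randomly reading a MapFile, assuming a Hadoop 2.x or later client. The path /tmp/numbers.map and the record contents are hypothetical. Keys must be appended in ascending order, because a MapFile is a sorted SequenceFile plus an index that enables lookups by key.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class MapFileSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path dir = new Path("/tmp/numbers.map");   // hypothetical output directory

            // Write keys in ascending order; MapFile builds the index as it goes.
            MapFile.Writer writer = new MapFile.Writer(conf, dir,
                    MapFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class));
            try {
                for (int i = 0; i < 1000; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            } finally {
                writer.close();
            }

            // Random lookup by key, which a flat text file cannot offer.
            MapFile.Reader reader = new MapFile.Reader(dir, conf);
            try {
                Text value = new Text();
                reader.get(new IntWritable(163), value);
                System.out.println(value);   // prints record-163
            } finally {
                reader.close();
            }
        }
    }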

Difficult Level Questions

These questions challenge students to synthesize information, critically evaluate concepts, and design solutions, demonstrating a deeper understanding of Hadoop I/O.

9. Analyze the impact of data compression on the overall performance of a MapReduce job, considering both I/O operations and CPU utilization. Provide an example where compression might negatively affect performance. [CO3, CO5]

10. Propose a scenario where a custom Writable class would be essential for efficient data processing in Hadoop. Outline the key considerations for designing such a class (see the custom Writable sketch after this list). [CO3, CO5]

11. Discuss the benefits of using file-based data structures like SequenceFile or MapFile over raw text files for storing intermediate and final output in complex Hadoop workflows. Consider factors such as splittability, metadata, and performance. [CO3]

12. Design a strategy for optimizing data input and output operations for a Hadoop cluster that frequently processes large volumes of small files. Consider how serialization, compression, and file formats could be leveraged to improve efficiency (see the small-file packing sketch after this list). [CO3, CO5]
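For reference when attempting question 10: a sketch of one possible custom WritableComparable, a composite key holding a user id and an event timestamp so that records can be grouped by user and ordered by time. The class and field names (UserEventKey, userId, eventTime) are hypothetical; only the write, readFields, and compareTo contract comes from Hadoop.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class UserEventKey implements WritableComparable<UserEventKey> {
        private final Text userId = new Text();
        private final IntWritable eventTime = new IntWritable();

        public void set(String user, int time) {
            userId.set(user);
            eventTime.set(time);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // Fields are written in a fixed order...
            userId.write(out);
            eventTime.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // ...and must be read back in exactly the same order.
            userId.readFields(in);
            eventTime.readFields(in);
        }

        @Override
        public int compareTo(UserEventKey other) {
            int cmp = userId.compareTo(other.userId);
            return cmp != 0 ? cmp : eventTime.compareTo(other.eventTime);
        }

        @Override
        public int hashCode() {
            // Used by the default HashPartitioner to route keys to reducers.
            return userId.hashCode() * 31 + eventTime.get();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof UserEventKey)) return false;
            UserEventKey other = (UserEventKey) o;
            return userId.equals(other.userId) && eventTime.equals(other.eventTime);
        }

        @Override
        public String toString() {
            return userId + "\t" + eventTime;
        }
    }

The key design considerations the sketch illustrates: a fixed field order in write and readFields, reusable mutable fields rather than per-record allocation, a total ordering via compareTo, and hashCode/equals so partitioning and grouping behave predictably.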
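For reference when attempting questions 11 and 12: a sketch that packs a directory of small files into one block-compressed SequenceFile keyed by file name, assuming a Hadoop 2.x or later client. The command-line paths, the class name SmallFilePacker, and the choice of DefaultCodec are illustrative; the point is that a splittable, block-compressed SequenceFile reduces NameNode metadata pressure and avoids launching one map task per tiny file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]);   // directory full of small files
            Path packed = new Path(args[1]);     // single output SequenceFile

            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(packed),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    // Block compression groups many records per compressed block,
                    // trading some CPU for far fewer bytes stored and shuffled.
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new DefaultCodec()));
            try {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    // Assumes each small file fits comfortably in memory.
                    byte[] contents = new byte[(int) status.getLen()];
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        in.readFully(contents);
                    } finally {
                        in.close();
                    }
                    // Key = original file name, value = raw bytes of that file.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }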