Big Data – Question Bank

 

Chapter 1: Introduction to Big Data & Hadoop

1.    Illustrate how Hadoop can be used to analyze large datasets compared to traditional RDBMS systems.

2.    Define Big Data and explain the four Vs.

3.    Explain the history of Apache Hadoop and how it differs from Grid Computing.

4.    Apply the concept of data locality to explain how Hadoop processes data efficiently.

5.    Analyse the role of data analysis in Hadoop versus traditional RDBMS systems.

6.    Describe the 3Vs of Big Data.

7.    What is the fundamental purpose of Apache Hadoop in Big Data processing?

8.    Name and briefly describe any two core components of the Hadoop ecosystem.

9.    Differentiate between traditional RDBMS and Hadoop systems.

10.  Explain the architecture of the Hadoop ecosystem with a diagram.

11.  What are the advantages and limitations of Big Data technologies?

12.  Explain the concept of distributed computing in Big Data.

13.  Compare structured, semi-structured, and unstructured data.
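As a study aid for the data-locality question above, the sketch below is a toy Python model (not Hadoop code) of the idea: the scheduler prefers to run a task on a node that already stores the block it needs, so no block data crosses the network. The block IDs and node names are invented for illustration.

```python
# Illustrative sketch of data locality (not Hadoop code): prefer running
# a task on a node that already holds a replica of the required block.

# Hypothetical block placement: block id -> nodes holding a replica.
block_locations = {
    "blk_001": ["node1", "node2"],
    "blk_002": ["node2", "node3"],
    "blk_003": ["node1", "node3"],
}

def schedule(block_id, free_nodes):
    """Pick a free node that holds the block if possible (data-local),
    otherwise fall back to any free node (the block must be shipped)."""
    for node in block_locations[block_id]:
        if node in free_nodes:
            return node, "data-local"
    return free_nodes[0], "remote"

node, locality = schedule("blk_002", ["node3", "node1"])
print(node, locality)  # node3 holds blk_002, so the task runs data-local
```

Real Hadoop schedulers apply the same preference order (node-local, then rack-local, then off-rack), which is why moving computation to the data is cheaper than moving data to the computation.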

 

Chapter 2: MapReduce

1.    Explain the roles of Map and Reduce functions in the MapReduce framework.

2.    Evaluate the effectiveness of the Combiner function in reducing network traffic.

3.    Discuss the concept of "scaling out" in MapReduce and its importance.

4.    Explain the concept of Key-Value pairs in MapReduce.

5.    Explain the role of a Combiner function in a MapReduce job.

6.    Explain the purpose of a Combiner function and justify why it is called a local reducer.

7.    What is a Combiner function? Explain its importance in MapReduce scaling.

8.    Explain the fundamental roles of Map and Reduce functions in a MapReduce job.

9.    Explain in brief the role of a Combiner function and how it improves performance.

10.  Explain the working of MapReduce with a neat diagram.

11.  Differentiate between Mapper and Reducer.

12.  What is partitioning in MapReduce? Explain its importance.

13.  Explain the role of InputFormat and OutputFormat in MapReduce.
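Several questions above concern the Map, Combine, and Reduce roles. The following is a pure-Python sketch of the word-count flow (map, combine, shuffle, reduce), not real Hadoop code, which would be written in Java against the `org.apache.hadoop.mapreduce` API. Note how the combiner pre-aggregates each mapper's output locally, so fewer records cross the network during the shuffle.

```python
# Illustrative pure-Python sketch of the MapReduce word-count flow:
# map -> combine -> shuffle -> reduce.
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner ("local reducer"): pre-aggregate one mapper's output
    # so fewer intermediate records are shuffled across the network.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def reduce_phase(shuffled):
    # Reducer: sum the values grouped under each key.
    return {key: sum(values) for key, values in shuffled.items()}

lines = ["big data big hadoop", "hadoop big"]
mapped = [combine(map_phase(line)) for line in lines]  # one combine per mapper

shuffled = defaultdict(list)   # shuffle: group intermediate values by key
for partial in mapped:
    for key, value in partial:
        shuffled[key].append(value)

print(reduce_phase(shuffled))  # {'big': 3, 'data': 1, 'hadoop': 2}
```

The combiner is safe here because addition is associative and commutative; Hadoop makes no guarantee about how many times (if at all) a combiner runs, which is why it must not change the job's result.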

 

Chapter 3: HDFS

1.    Explain the roles of NameNode and DataNode in HDFS.

2.    Illustrate the process of writing data into HDFS with block placement.

3.    Compare HDFS Federation with single NameNode architecture.

4.    Demonstrate any four basic HDFS command-line operations.

5.    What is a block in HDFS and why is it large?

6.    Explain how HDFS achieves fault tolerance to prevent data loss.

7.    Distinguish between NameNode and DataNode in HDFS.

8.    Explain the concept of blocks in HDFS and why they are large.

9.    Explain how HDFS achieves fault tolerance in case of node failure.

10.  Explain the HDFS architecture with a diagram.

11.  What is replication in HDFS? Explain its significance.

12.  Compare HDFS with traditional file systems.

13.  Explain the read operation in HDFS.
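For the questions on blocks and replication, the sketch below models how HDFS splits a file into fixed-size blocks and places replicas of each block on several DataNodes. The block size and node names are toy values chosen for readability; real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware (not round-robin) replica placement.

```python
# Illustrative sketch of HDFS block splitting and replica placement.
BLOCK_SIZE = 4        # bytes per block (toy value; HDFS default: 128 MB)
REPLICATION = 3       # copies of each block (HDFS default)
DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNodes

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # A file is stored as a sequence of fixed-size blocks; only the
    # final block may be shorter than the block size.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes=DATANODES, replication=REPLICATION):
    # Round-robin placement sketch; real HDFS placement is rack-aware.
    placement = {}
    for blk in range(num_blocks):
        placement[blk] = [nodes[(blk + r) % len(nodes)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs!")
print(blocks)                      # [b'hell', b'o hd', b'fs!']
print(place_replicas(len(blocks)))
```

Fault tolerance follows directly: if one DataNode fails, every block it held still exists on the other replicas, and the NameNode re-replicates those blocks to restore the target replication factor.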

 

Chapter 4: Hadoop I/O & Data Integrity

1.    Explain how Hadoop ensures data integrity during data transfer and storage.

2.    Analyse how checksum verification is performed by a DataNode.

3.    Define data integrity and explain the use of checksums in Hadoop I/O.

4.    Explain the benefits of compression in Big Data storage.

5.    Describe why data integrity is critical in distributed systems.

6.    Illustrate how combining small files into a SequenceFile improves performance.

7.    Why is data integrity important in Hadoop? How does it use checksums?

8.    Describe the structure of a MapFile (data file and index file).

9.    Explain the steps to create a custom Writable data type in MapReduce.

10.  Explain different types of compression techniques in Hadoop.

11.  What is serialization in Hadoop? Explain the Writable interface.

12.  Explain SequenceFile and its advantages.

13.  Compare MapFile and SequenceFile.
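For the checksum questions, the sketch below shows the idea behind Hadoop's checksum-based data integrity: a checksum is stored at write time and re-verified at read time, so silent corruption is detected. HDFS actually computes CRC32C checksums over every 512-byte chunk by default; this toy uses Python's `zlib.crc32` over one whole buffer.

```python
# Illustrative sketch of checksum-based data integrity, the idea behind
# HDFS's per-chunk CRC verification (HDFS uses CRC32C every 512 bytes
# by default; this toy checksums a single buffer with zlib.crc32).
import zlib

def write_with_checksum(data: bytes):
    # On write, store a checksum alongside the data.
    return data, zlib.crc32(data)

def verify(data: bytes, stored_checksum: int) -> bool:
    # On read, recompute the checksum and compare;
    # a mismatch means the data was corrupted in transit or at rest.
    return zlib.crc32(data) == stored_checksum

data, checksum = write_with_checksum(b"hadoop block contents")
print(verify(data, checksum))           # True: data intact
corrupted = b"hadooP block contents"    # a single flipped byte
print(verify(corrupted, checksum))      # False: corruption detected
```

In HDFS this verification is performed by the client on read (and periodically by DataNodes in the background); a block that fails verification is reported to the NameNode and re-replicated from a healthy copy.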

 

Chapter 5: MapReduce Workflow & Tools

1.    Analyse how Hadoop logs help in identifying and resolving issues in a MapReduce job.

2.    Describe the steps involved in configuring a Hadoop development environment.

3.    Explain what happens when output of one MapReduce job is used as input for another job.

4.    Describe the process of retrieving final results of a MapReduce job from HDFS.

5.    Demonstrate how to monitor progress of a running MapReduce job using Hadoop tools.

6.    What are Hadoop logs? Explain their role in debugging.

7.    Explain the concept of Job Control or workflow engine in MapReduce.

8.    Evaluate the importance of proper configuration in executing MapReduce applications.

9.    What are essential components required before writing MapReduce programs?

10.  Explain the YARN architecture and its role in Hadoop.

11.  What is job scheduling in Hadoop?

12.  Explain the role of ResourceManager and NodeManager.

13.  Compare MapReduce v1 and YARN.
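For the question on chaining jobs, the sketch below models (in plain Python, not Hadoop code) how the output of one MapReduce job becomes the input of the next: job 1 produces word counts, and job 2 consumes them to group words by frequency. In real Hadoop this is done by pointing job 2's input path at job 1's HDFS output directory, often coordinated by a workflow engine such as Apache Oozie or by `JobControl`.

```python
# Illustrative sketch of job chaining: job 2 reads job 1's output.
from collections import Counter, defaultdict

def job1_word_count(lines):
    # Job 1: classic word count over the raw input.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

def job2_group_by_frequency(word_counts):
    # Job 2: consumes job 1's output, grouping words by their count.
    by_freq = defaultdict(list)
    for word, count in word_counts.items():
        by_freq[count].append(word)
    return dict(by_freq)

intermediate = job1_word_count(["big big data", "data big"])
print(intermediate)                           # {'big': 3, 'data': 2}
print(job2_group_by_frequency(intermediate))  # {3: ['big'], 2: ['data']}
```

The key point for the exam answer: each job writes its results to a directory, the next job declares that directory as its input, and the workflow engine's job is only to enforce the ordering (job 2 starts after job 1 succeeds) and handle failures.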