Big Data – Question Bank
Chapter 1: Introduction to Big Data & Hadoop
1. Illustrate how Hadoop can be used to analyze large datasets compared to traditional RDBMS systems.
2. Define Big Data and explain the four Vs.
3. Explain the history of Apache Hadoop and how it differs from Grid Computing.
4. Apply the concept of data locality to explain how Hadoop processes data efficiently.
5. Analyse the role of data analysis in Hadoop versus traditional RDBMS systems.
6. Describe the 3Vs of Big Data.
7. What is the fundamental purpose of Apache Hadoop in Big Data processing?
8. Name and briefly describe any two core components of the Hadoop ecosystem.
9. Differentiate between traditional RDBMS and Hadoop systems.
10. Explain the architecture of the Hadoop ecosystem with a diagram.
11. What are the advantages and limitations of Big Data technologies?
12. Explain the concept of distributed computing in Big Data.
13. Compare structured, semi-structured, and unstructured data.
Chapter 2: MapReduce
1. Explain the roles of the Map and Reduce functions in the MapReduce framework.
2. Evaluate the effectiveness of the Combiner function in reducing network traffic.
3. Discuss the concept of "scaling out" in MapReduce and its importance.
4. Explain the concept of key-value pairs in MapReduce.
5. Explain the role of a Combiner function in a MapReduce job.
6. Explain the purpose of a Combiner function and justify why it is called a local reducer.
7. What is a Combiner function? Explain its importance in MapReduce scaling.
8. Explain the fundamental roles of the Map and Reduce functions in a MapReduce job.
9. Explain in brief the role of a Combiner function and how it improves performance.
10. Explain the working of MapReduce with a neat diagram.
11. Differentiate between Mapper and Reducer.
12. What is partitioning in MapReduce? Explain its importance.
13. Explain the role of InputFormat and OutputFormat in MapReduce.
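As a study aid for the Map, Reduce, and Combiner questions above, the data flow can be sketched in plain Python. This is a conceptual simulation, not the Hadoop Java API; all function names here (map_fn, combine, shuffle, reduce_fn) are illustrative, and the key point it shows is how the Combiner acts as a "local reducer" that shrinks the mapper's output before the shuffle.

```python
from collections import defaultdict

def map_fn(line):
    """Map phase: emit a (word, 1) key-value pair for every word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner ('local reducer'): pre-aggregate one mapper's pairs
    so fewer key-value pairs cross the network to the reducers."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(all_pairs):
    """Shuffle phase: group values by key across all mapper outputs."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    """Reduce phase: sum the partial counts for one key."""
    return (key, sum(values))

# Simulate two input splits, each handled by one "mapper".
splits = ["big data big hadoop", "hadoop big"]
mapper_outputs = [combine(map_fn(split)) for split in splits]
grouped = shuffle(pair for output in mapper_outputs for pair in output)
result = dict(reduce_fn(k, v) for k, v in grouped.items())
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Note that the first mapper emits ('big', 2) instead of two separate ('big', 1) pairs because the Combiner already aggregated them locally, which is exactly the network saving the Combiner questions ask about.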
Chapter 3: HDFS
1. Explain the roles of the NameNode and DataNode in HDFS.
2. Illustrate the process of writing data into HDFS with block placement.
3. Compare HDFS Federation with the single-NameNode architecture.
4. Demonstrate any four basic HDFS command-line operations.
5. What is a block in HDFS and why is it large?
6. Explain how HDFS achieves fault tolerance to prevent data loss.
7. Distinguish between the NameNode and DataNode in HDFS.
8. Explain the concept of blocks in HDFS and why they are large.
9. Explain how HDFS achieves fault tolerance in case of node failure.
10. Explain the HDFS architecture with a diagram.
11. What is replication in HDFS? Explain its significance.
12. Compare HDFS with traditional file systems.
13. Explain the read operation in HDFS.
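For the block and replication questions above, the arithmetic is worth practicing. A minimal sketch, assuming the common defaults of a 128 MB block size and a replication factor of 3 (both are configurable per cluster):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size in recent releases
REPLICATION = 3                  # default replication factor

def hdfs_blocks(file_size):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

def raw_storage(file_size):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * REPLICATION

one_gb = 1024 * 1024 * 1024
print(hdfs_blocks(one_gb))            # 8 blocks of 128 MB
print(raw_storage(one_gb) // 2**30)   # 3 GB of raw disk across the cluster
```

A partial last block only consumes its actual data size on disk, which is why raw storage is computed from the file size rather than from the block count.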
Chapter 4: Hadoop I/O & Data Integrity
1. Explain how Hadoop ensures data integrity during data transfer and storage.
2. Analyse how checksum verification is performed by a DataNode.
3. Define data integrity and explain the use of checksums in Hadoop I/O.
4. Explain the benefits of compression in Big Data storage.
5. Describe why data integrity is critical in distributed systems.
6. Illustrate how combining small files into a SequenceFile improves performance.
7. Why is data integrity important in Hadoop? How does it use checksums?
8. Describe the structure of a MapFile (data file and index file).
9. Explain the steps to create a custom Writable data type in MapReduce.
10. Explain the different types of compression techniques in Hadoop.
11. What is serialization in Hadoop? Explain the Writable interface.
12. Explain SequenceFile and its advantages.
13. Compare MapFile and SequenceFile.
Chapter 5: MapReduce Workflow & Tools
1. Analyse how Hadoop logs help in identifying and resolving issues in a MapReduce job.
2. Describe the steps involved in configuring a Hadoop development environment.
3. Explain what happens when the output of one MapReduce job is used as input for another job.
4. Describe the process of retrieving the final results of a MapReduce job from HDFS.
5. Demonstrate how to monitor the progress of a running MapReduce job using Hadoop tools.
6. What are Hadoop logs? Explain their role in debugging.
7. Explain the concept of Job Control or a workflow engine in MapReduce.
8. Evaluate the importance of proper configuration in executing MapReduce applications.
9. What are the essential components required before writing MapReduce programs?
10. Explain the YARN architecture and its role in Hadoop.
11. What is job scheduling in Hadoop?
12. Explain the roles of the ResourceManager and NodeManager.
13. Compare MapReduce v1 and YARN.
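The job-chaining question above (output of one job feeding another) has a simple shape worth internalizing. A minimal Python sketch, where each function stands in for a MapReduce job and the intermediate dictionary stands in for the HDFS directory job 1 writes and job 2 reads; the job names are illustrative, not Hadoop APIs:

```python
from collections import Counter

def job1_word_count(lines):
    """First 'job': count words (stands in for a job writing output to HDFS)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

def job2_top_word(counts):
    """Second 'job': consumes job 1's output and picks the most frequent word."""
    return max(counts.items(), key=lambda kv: kv[1])

# job 1's output directory becomes job 2's input path
intermediate = job1_word_count(["big data", "big hadoop", "big yarn"])
print(job2_top_word(intermediate))  # ('big', 3)
```

In real Hadoop this dependency is what JobControl or a workflow engine such as Oozie manages: job 2 must not start until job 1 has committed its output.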