Big Data with Hadoop

Practical 2: Word Frequency Count using MapReduce

Objective:
To implement and execute a MapReduce program that calculates the frequency of each word in a given text file, demonstrating the core concepts of distributed data processing.

Introduction:
The Word Count example is the "Hello World" of MapReduce programming. It effectively illustrates how the Map and Reduce functions work together to process large datasets in a distributed manner. In this practical, you will write a Java-based MapReduce application to count the occurrences of every word in an input text file. This will solidify your understanding of the MapReduce data flow, key-value pairs, and job configuration.

Step-by-Step Guide:

1. Prepare Your Development Environment:

  • Create a new Java project. If using Maven, add the Hadoop client dependencies (e.g., hadoop-client, hadoop-common); a sample declaration is sketched after this list.
  • Ensure your JAVA_HOME and Hadoop environment variables are correctly set as per Practical 1.
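
For reference, a minimal Maven dependency declaration looks like the sketch below; the version number is only an assumption and should be changed to match the Hadoop release you installed in Practical 1:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- Assumption: replace 3.3.6 with your installed Hadoop version -->
    <version>3.3.6</version>
</dependency>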

2. Input Data Preparation:

  • Create a sample text file (e.g., input.txt) with some sentences, for example:

hello hadoop world
hadoop is big data
big data world

  • Upload this file to HDFS:

hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input/input.txt
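
To confirm the upload, you can list the directory and read the file back from HDFS:

hdfs dfs -ls /input
hdfs dfs -cat /input/input.txt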

3. Crafting the Mapper Class:

  • The Mapper takes input key-value pairs (byte offset, line of text) and emits intermediate key-value pairs (word, 1).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordFrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().toLowerCase()); // Convert to lowercase
            context.write(word, one); // Emit (word, 1)
        }
    }
}
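
As a quick illustration of the Mapper's behaviour, the first line of the sample input from Step 2 ("hello hadoop world") would produce the following intermediate (word, 1) pairs, one per whitespace-separated token:

(hello, 1)
(hadoop, 1)
(world, 1)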

4. Developing the Reducer Class:

  • The Reducer aggregates intermediate key-value pairs to emit final key-value pairs (word, total count).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordFrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // Emit (word, total_count)
    }
}
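
Between the map and reduce phases, the framework shuffles and sorts the intermediate pairs so that all counts for the same word reach a single reduce call. For the sample input from Step 2, the word "hadoop" appears twice, so the corresponding reduce call receives and emits:

reduce("hadoop", [1, 1])  ->  (hadoop, 2)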

5. Configuring and Submitting the MapReduce Job:

  • The Driver class sets up the job configuration, specifies Mapper and Reducer classes, and handles input/output paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordFrequencyDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordfrequency <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word frequency count");
        job.setJarByClass(WordFrequencyDriver.class);
        job.setMapperClass(WordFrequencyMapper.class);
        job.setCombinerClass(WordFrequencyReducer.class); // Optional: valid here because summing counts is associative and commutative
        job.setReducerClass(WordFrequencyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0])); // Input path on HDFS
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // Output path on HDFS (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
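
Note that setOutputKeyClass and setOutputValueClass describe the Reducer's (and therefore the job's) output types; the Mapper's output types default to the same classes, which is why no extra declaration is needed here. If your Mapper ever emits different types than the Reducer, you would declare them explicitly, for example:

job.setMapOutputKeyClass(Text.class);          // key type emitted by the Mapper
job.setMapOutputValueClass(IntWritable.class); // value type emitted by the Mapper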

6. Compiling and Packaging Your Application:

  • Compile your Java code into a .jar file.
  • If using Maven: mvn clean package
  • Otherwise, manually:

mkdir -p classes
javac -cp "$(hadoop classpath)" -d classes WordFrequencyMapper.java WordFrequencyReducer.java WordFrequencyDriver.java
jar cvf wordfrequency.jar -C classes/ .
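
You can verify that the archive contains all three classes before submitting it:

jar tf wordfrequency.jar

The listing should include WordFrequencyMapper.class, WordFrequencyReducer.class, and WordFrequencyDriver.class.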

7. Running the Job on Hadoop:

  • Execute the JAR file on Hadoop. Remove any previous output directory first:

hdfs dfs -rm -r /output
hadoop jar wordfrequency.jar WordFrequencyDriver /input/input.txt /output
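
After the job reports completion, a successful run leaves an empty _SUCCESS marker next to the result file(s) in the output directory, which you can confirm with:

hdfs dfs -ls /output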

8. Verifying the Output:

  • View the output stored in HDFS:

hdfs dfs -cat /output/part-r-00000
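
For the sample input from Step 2, the output should resemble the following, with each line containing a word and its total count separated by a tab, sorted by key:

big	2
data	2
hadoop	2
hello	1
is	1
world	2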

Conclusion:
You have successfully implemented, compiled, and run your first MapReduce application to calculate word frequencies. This practical has demonstrated the end-to-end process of developing a distributed data processing job, from writing Mapper and Reducer logic to deploying and verifying results on a Hadoop cluster. This foundational understanding is critical for tackling more complex Big Data problems.