Big Data: Presentation and Case Study
Internal Evaluation: Big Data and Hadoop Presentation
Objective: This presentation assesses your understanding of key concepts in Big Data and Hadoop, as covered in the course syllabus. You will demonstrate your ability to articulate these concepts clearly, provide relevant examples, and structure information effectively for an audience.
Task: You are required to select one specific topic from Chapters 1, 2, 3, 4, or 5 of our course syllabus, "Mastering Big Data and Engineering Applications." For your chosen topic, you will prepare and deliver a concise presentation.
Presentation Requirements:
1. Topic Selection: Choose one concept or sub-topic from Chapters 1, 2, 3, 4, or 5 of the book outline. For example, instead of just "Chapter 1," you might select "The 5 Vs of Big Data," "MapReduce Data Flow," or "HDFS Architecture."
2. Presentation Content:
o Theoretical Explanation: Provide a detailed theoretical explanation of your chosen topic. Define all necessary terms, elaborate on core concepts, and explain the topic's significance within the broader Big Data/Hadoop ecosystem.
o Suitable Examples: Illustrate your explanation with suitable real-world or conceptual examples to enhance understanding.
o Arrangement: Ensure the content is logically structured and presented in an easy-to-follow manner.
3. Deliverables/Submission:
o Presentation Slides: Create a set of clear and visually appealing presentation slides (e.g., PowerPoint, Google Slides). These slides should summarize key points, include relevant diagrams/illustrations, and serve as visual aids during your presentation.
o Theoretical Write-up/Notes: Submit a written document detailing the theory behind your chosen topic. This can take the form of detailed speaker notes for your presentation or a concise theoretical summary. This document should reflect the depth of your understanding and the content you plan to deliver verbally.
4. Presentation Duration: Your presentation should be between 7 and 10 minutes in length. Practice your timing to ensure you stay within this limit.
Internal Evaluation: Simplified Case Study (10 Marks)
Objective: This case study requires you to apply fundamental concepts
of Big Data, HDFS, and basic MapReduce to a
straightforward business problem. You will demonstrate your ability to identify
Big Data characteristics in a simpler context and propose a basic Hadoop-based solution.
Case Study Scenario: "LocalBytes Online Grocery"
LocalBytes is a small but growing online grocery store that primarily serves a single city. They manage their online catalog and customer orders using a traditional SQL database. Recently, their website traffic and order volumes have surged, leading to some data management challenges.
LocalBytes collects two main types of data:
1. Website Access Logs: Every time a user visits a product page, the web server generates a log entry including: timestamp, user_id, product_id_viewed, IP_address, browser_type. These logs are generated continuously and stored as simple text files.
2. Daily Sales Records: At the end of each day, all successful orders are compiled into a summary file containing: order_id, customer_id, list_of_product_ids_purchased, total_amount, delivery_status. These records are more structured but growing rapidly.
Their current SQL
database is struggling to efficiently analyze the
large volume of daily website access logs to understand which products are most
frequently viewed. Generating a simple report of the top 10 most viewed
products from the previous day’s logs now takes several hours, impacting their
ability to quickly adapt marketing strategies. The IT team is also concerned
about the storage capacity for ever-growing historical logs.
LocalBytes needs a solution to:
1. Efficiently store and manage their increasing volume of website access logs and sales records.
2. Quickly identify the most popular (most viewed) product pages from their daily website access logs.
Case Study Question:
Based on the "LocalBytes Online Grocery" scenario and your
understanding of Big Data and Hadoop fundamentals
(from Chapters 1, 2, and 3 of our syllabus), answer the following:
1. Identifying Big Data Characteristics:
o Describe how at least three of the 5 Vs of Big Data are relevant to LocalBytes' data challenges. Provide a specific example from the scenario for each 'V' you identify.
2. HDFS for Storage:
o Explain why the Hadoop Distributed File System (HDFS) would be a beneficial choice for LocalBytes to store their growing website access logs, particularly in comparison to their current traditional storage methods. Focus on the advantages HDFS offers for this type of data.
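As background for this question, loading the day's raw log files into HDFS uses the standard `hdfs dfs` shell commands. A minimal sketch follows; the directory layout and file paths are illustrative assumptions, not part of the scenario:

```shell
# Create a date-partitioned directory for the raw access logs
hdfs dfs -mkdir -p /localbytes/access_logs/2024-05-01

# Copy the day's log files from local disk into HDFS; HDFS splits each
# file into blocks and replicates every block across DataNodes
# (3x replication by default)
hdfs dfs -put /var/log/web/access-2024-05-01*.log /localbytes/access_logs/2024-05-01/

# Verify the files landed and check the space they consume
hdfs dfs -ls /localbytes/access_logs/2024-05-01
hdfs dfs -du -h /localbytes/access_logs
```

Commands like these are a useful reference point when comparing HDFS against single-server storage in your answer.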
3. Basic MapReduce for Product Popularity:
o Design a simple MapReduce job to help LocalBytes find the top 10 most viewed product pages from their daily website access logs.
o Assume each line in the website access log files represents one page view and has the format: timestamp,user_id,product_id_viewed,IP_address,browser_type.
o Mapper Function: Clearly describe what your Mapper would do. What does it take as input (per line)? What key-value pair(s) would it output for each relevant input line?
o Reducer Function: Clearly describe what your Reducer would do. What type of key-value pairs does it receive? What processing does it perform to determine the total views for each product, and what would its final output be?
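For reference, the key-value flow this question asks you to describe follows the classic word-count pattern. Below is a minimal single-process Python sketch of that general pattern (the shuffle/sort stage is simulated in memory; on a real cluster the mapper and reducer would run as distributed tasks, e.g. via Hadoop Streaming, and the sample log lines are invented for illustration):

```python
from itertools import groupby

def mapper(line):
    # Input: one log line "timestamp,user_id,product_id_viewed,IP_address,browser_type"
    # Output: a (product_id, 1) pair for each valid page view
    fields = line.strip().split(",")
    if len(fields) == 5:
        yield fields[2], 1

def reducer(product_id, counts):
    # Input: one product_id plus all the 1s emitted for it;
    # output: (product_id, total number of views)
    return product_id, sum(counts)

def run(lines, top_n=10):
    # Simulate the shuffle/sort stage by sorting mapper output on the key,
    # then hand each key group to the reducer
    pairs = sorted(kv for line in lines for kv in mapper(line))
    totals = [reducer(key, (v for _, v in group))
              for key, group in groupby(pairs, key=lambda kv: kv[0])]
    # Rank by view count to get the most viewed products
    return sorted(totals, key=lambda kv: kv[1], reverse=True)[:top_n]

logs = [
    "2024-05-01T10:00,u1,p42,1.2.3.4,chrome",
    "2024-05-01T10:01,u2,p42,5.6.7.8,firefox",
    "2024-05-01T10:02,u1,p7,1.2.3.4,chrome",
]
print(run(logs))  # [('p42', 2), ('p7', 1)]
```

Your written answer should explain each of these stages in prose, mapping them onto Hadoop's Mapper, shuffle/sort, and Reducer phases.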
Submission Requirements:
· Provide a written response to all parts of the question. Your response should be well-structured, clear, and concise.
· Ensure your explanations are detailed enough to demonstrate a fundamental understanding of the concepts.
· Use appropriate terminology from the course material.