Big Data: Presentation and Case Study
Internal Evaluation: Big Data and Hadoop Presentation
Objective: This presentation assesses your understanding of key concepts in Big Data and Hadoop, as covered in the course syllabus. You will demonstrate your ability to articulate these concepts clearly, provide relevant examples, and structure information effectively for an audience.
Task: You are required to select one specific topic from Chapters 1, 2, 3, 4, or 5 of our course syllabus, "Mastering Big Data and Engineering Applications." For your chosen topic, you will prepare and deliver a concise presentation.
Presentation Requirements:
1. Topic Selection: Choose one concept or sub-topic from Chapters 1, 2, 3, 4, or 5 of the book outline. For example, instead of just "Chapter 1," you might select "The 5 Vs of Big Data," "MapReduce Data Flow," or "HDFS Architecture."
2. Presentation Content:
o Theoretical Explanation: Provide a detailed theoretical explanation of your chosen topic. Define all necessary terms, elaborate on core concepts, and explain the topic's significance within the broader Big Data/Hadoop ecosystem.
o Suitable Examples: Illustrate your explanation with suitable real-world or conceptual examples to enhance understanding.
o Arrangement: Ensure the content is logically structured and presented in an easy-to-follow manner.
3. Deliverables/Submission:
o Presentation Slides: Create a set of clear and visually appealing presentation slides (e.g., PowerPoint, Google Slides). These slides should summarize key points, include relevant diagrams/illustrations, and serve as visual aids during your presentation.
o Theoretical Write-up/Notes: Submit a written document detailing the theory behind your chosen topic. This can take the form of detailed speaker notes for your presentation or a concise theoretical summary. This document should reflect the depth of your understanding and the content you plan to deliver verbally.
4. Presentation Duration: Your presentation should be between 7 and 10 minutes in length. Practice your timing to ensure you stay within this limit.
Internal Evaluation: Simplified Case Study (10 Marks)
Objective: This case study requires you to apply fundamental concepts
of Big Data, HDFS, and basic MapReduce to a
straightforward business problem. You will demonstrate your ability to identify
Big Data characteristics in a simpler context and propose a basic Hadoop-based solution.
Case Study Scenario: "LocalBytes Online Grocery"
LocalBytes is a small but growing online grocery store that primarily serves a single city. They manage their online catalog and customer orders using a traditional SQL database. Recently, their website traffic and order volumes have surged, leading to some data management challenges.
LocalBytes collects two main types of data:
1. Website Access Logs: Every time a user visits a product page, the web server generates a log entry including: timestamp, user_id, product_id_viewed, IP_address, browser_type. These logs are generated continuously and stored as simple text files.
2. Daily Sales Records: At the end of each day, all successful orders are compiled into a summary file containing: order_id, customer_id, list_of_product_ids_purchased, total_amount, delivery_status. These records are more structured but growing rapidly.
Their current SQL
database is struggling to efficiently analyze the
large volume of daily website access logs to understand which products are most
frequently viewed. Generating a simple report of the top 10 most viewed
products from the previous day’s logs now takes several hours, impacting their
ability to quickly adapt marketing strategies. The IT team is also concerned
about the storage capacity for ever-growing historical logs.
LocalBytes needs a solution to:
1. Efficiently store and manage their increasing volume of website access logs and sales records.
2. Quickly identify the most popular (most viewed) product pages from their daily website access logs.
Case Study Question:
Based on the "LocalBytes Online Grocery" scenario and your
understanding of Big Data and Hadoop fundamentals
(from Chapters 1, 2, and 3 of our syllabus), answer the following:
1. Identifying Big Data Characteristics:
o Describe how at least three of the 5 Vs of Big Data are relevant to LocalBytes' data challenges. Provide a specific example from the scenario for each 'V' you identify.
2. HDFS for Storage:
o Explain why the Hadoop Distributed File System (HDFS) would be a beneficial choice for LocalBytes to store their growing website access logs, particularly in comparison to their current traditional storage methods. Focus on the advantages HDFS offers for this type of data.
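As background for this question, loading the day's raw log files into HDFS uses the standard `hdfs dfs` shell commands. A minimal sketch follows; the directory layout and file paths are illustrative assumptions, not part of the scenario:

```shell
# Create a date-partitioned directory for the raw access logs
hdfs dfs -mkdir -p /localbytes/access_logs/2024-05-01

# Copy the day's log files from local disk into HDFS; HDFS splits each
# file into blocks and replicates every block across DataNodes
# (3x replication by default)
hdfs dfs -put /var/log/web/access-2024-05-01*.log /localbytes/access_logs/2024-05-01/

# Verify the files landed and check the space they consume
hdfs dfs -ls /localbytes/access_logs/2024-05-01
hdfs dfs -du -h /localbytes/access_logs
```

Commands like these are a useful reference point when comparing HDFS against single-server storage in your answer.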
3. Basic MapReduce for Product Popularity:
o Design a simple MapReduce job to help LocalBytes find the top 10 most viewed product pages from their daily website access logs.
o Assume each line in the website access log files represents one page view and has the format: timestamp,user_id,product_id_viewed,IP_address,browser_type.
o Mapper Function: Clearly describe what your Mapper would do. What does it take as input (per line)? What key-value pair(s) would it output for each relevant input line?
o Reducer Function: Clearly describe what your Reducer would do. What type of key-value pairs does it receive? What processing does it perform to determine the total views for each product, and what would its final output be?
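For reference, the key-value flow this question asks you to describe follows the classic word-count pattern. Below is a minimal single-process Python sketch of that general pattern (the shuffle/sort stage is simulated in memory; on a real cluster the mapper and reducer would run as distributed tasks, e.g. via Hadoop Streaming, and the sample log lines are invented for illustration):

```python
from itertools import groupby

def mapper(line):
    # Input: one log line "timestamp,user_id,product_id_viewed,IP_address,browser_type"
    # Output: a (product_id, 1) pair for each valid page view
    fields = line.strip().split(",")
    if len(fields) == 5:
        yield fields[2], 1

def reducer(product_id, counts):
    # Input: one product_id plus all the 1s emitted for it;
    # output: (product_id, total number of views)
    return product_id, sum(counts)

def run(lines, top_n=10):
    # Simulate the shuffle/sort stage by sorting mapper output on the key,
    # then hand each key group to the reducer
    pairs = sorted(kv for line in lines for kv in mapper(line))
    totals = [reducer(key, (v for _, v in group))
              for key, group in groupby(pairs, key=lambda kv: kv[0])]
    # Rank by view count to get the most viewed products
    return sorted(totals, key=lambda kv: kv[1], reverse=True)[:top_n]

logs = [
    "2024-05-01T10:00,u1,p42,1.2.3.4,chrome",
    "2024-05-01T10:01,u2,p42,5.6.7.8,firefox",
    "2024-05-01T10:02,u1,p7,1.2.3.4,chrome",
]
print(run(logs))  # [('p42', 2), ('p7', 1)]
```

Your written answer should explain each of these stages in prose, mapping them onto Hadoop's Mapper, shuffle/sort, and Reducer phases.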
Submission Requirements:
· Provide a written response to all parts of the question. Your response should be well-structured, clear, and concise.
· Ensure your explanations are detailed enough to demonstrate a fundamental understanding of the concepts.
· Use appropriate terminology from the course material.