Text and Social Media Analysis - Lab Solutions

SET A 1. SPPU

Consider any text paragraph. Preprocess the text to remove any special characters and digits. Generate a summary using the extractive summarization process.

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def preprocess_text(text):
    # Keep only letters and spaces (this also removes digits and special characters)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    return text

def extractive_summarization(text, summary_length=2):
    # Split into sentences BEFORE preprocessing, because preprocessing strips the periods
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    
    # Preprocess each sentence individually
    preprocessed = [preprocess_text(s) for s in sentences]
    
    # Calculate TF-IDF vectors for the sentences
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(preprocessed)
    
    # Score each sentence by its average cosine similarity to all sentences:
    # sentences that share vocabulary with many others rank as more central
    sentence_scores = cosine_similarity(tfidf_matrix).mean(axis=1)
    
    # Rank sentences by score, highest first
    ranked = sorted(zip(sentence_scores, sentences), key=lambda pair: pair[0], reverse=True)
    
    # Select the top N sentences for the summary
    summary = ' '.join(sentence for _, sentence in ranked[:summary_length])
    return summary

# Example text
text = """
The quick brown fox jumps over the lazy dog. The dog barked at the fox. 
The fox ran away quickly. It was a sunny day, and the birds were chirping. 
The dog went back to sleep. The fox found a new place to rest.
"""

# Generate summary
summary = extractive_summarization(text, summary_length=2)
print("Summary:")
print(summary)
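
A note on the design: the regex split in extractive_summarization is a deliberately simple sentence tokenizer. NLTK's sent_tokenize (used in SET A 2 below) handles abbreviations and other edge cases more robustly; a minimal sketch, assuming NLTK and its punkt model are installed:

import nltk
nltk.download('punkt')  # One-time download of the punkt sentence tokenizer
from nltk.tokenize import sent_tokenize

sample = "The quick brown fox jumps over the lazy dog. The dog barked at the fox."
print(sent_tokenize(sample))
# ['The quick brown fox jumps over the lazy dog.', 'The dog barked at the fox.']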

SET A 2. SPPU

Consider any text paragraph. Remove the stopwords. Tokenize the paragraph to extract words and sentences. Calculate the word frequency distribution and plot the frequencies. Plot the word cloud of the text.

Step 1: Set Up the Environment

  1. Install Python: Ensure Python is installed on your system. You can download it from python.org.

  2. Install Required Libraries:

    • Open a terminal or command prompt.

    • Run the following command to install the necessary libraries:

      pip install nltk matplotlib wordcloud

Step 2: Import NLTK and Download Required Resources

When you run the following code:

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

NLTK fetches the resources described below.

Download punkt
    • What is punkt?

      • punkt is a pre-trained tokenizer model used for splitting text into sentences and words.

      • It is essential for functions like word_tokenize() and sent_tokenize().

    • What is Downloaded?

      • The punkt dataset includes:

        • A pre-trained model for tokenization.

        • Language-specific rules for splitting text.

    • Location:

      • The dataset is saved in the nltk_data/tokenizers/punkt directory.
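
Once these downloads complete, a quick check confirms that the tokenizers work before running the full program:

from nltk.tokenize import word_tokenize, sent_tokenize

demo = "NLTK is ready. Tokenization works!"
print(sent_tokenize(demo))  # ['NLTK is ready.', 'Tokenization works!']
print(word_tokenize(demo))  # ['NLTK', 'is', 'ready', '.', 'Tokenization', 'works', '!']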

# Step 1: Import required libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Step 2: Download NLTK datasets (if not already downloaded)
nltk.download('punkt')  # Pre-trained tokenizer models
nltk.download('punkt_tab')  # Tokenizer tables required by newer NLTK releases
nltk.download('stopwords')  # Stopword lists for many languages

# Step 3: Define the text paragraph
text = """
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language. 
The goal is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. 
NLP is used in various applications such as chatbots, sentiment analysis, and machine translation.
"""

# Step 4: Preprocess the text
# Convert to lowercase
text = text.lower()

# Step 5: Tokenize the text into words and sentences
words = word_tokenize(text)  # Tokenize into words
sentences = sent_tokenize(text)  # Tokenize into sentences

# Step 6: Remove stopwords and non-alphanumeric tokens
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]

# Step 7: Calculate word frequency
word_freq = Counter(filtered_words)

# Step 8: Plot word frequency distribution
plt.figure(figsize=(10, 6))
plt.bar(word_freq.keys(), word_freq.values())
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Word Frequency Distribution')
plt.xticks(rotation=45)
plt.show()

# Step 9: Generate and plot word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud')
plt.show()

# Step 10: Print results
print("Sentences:", sentences)
print("Filtered Words:", filtered_words)
print("Word Frequencies:", word_freq)

SET A 3. SPPU

Consider the following review messages. Perform sentiment analysis on the messages. 

i. I purchased headphones. I am very happy with the product. 

ii. I saw the movie yesterday. The animation was really good but the script was ok.

iii. I enjoy listening to music.

iv. I take a walk in the park every day.

Sentiment analysis is a Natural Language Processing (NLP) technique used to determine the emotional tone or attitude expressed in a piece of text. It is widely used in various applications, such as analyzing customer reviews, social media posts, and feedback, to understand whether the sentiment is positive, negative, or neutral.

In this experiment, we will perform sentiment analysis on a set of review messages using the TextBlob library in Python. The goal is to:

  1. Preprocess the Text: Analyze the given review messages.

  2. Calculate Sentiment: Use TextBlob to compute the polarity (positive/negative/neutral) and subjectivity (objective/subjective) of each message.

  3. Classify Sentiment: Based on the polarity score, classify the sentiment of each review.

  4. Interpret Results: Understand why a particular sentiment was assigned to each review.

Steps to Perform Sentiment Analysis

  1. Install Required Libraries:

    • Install TextBlob and its dependencies:

      pip install textblob
      python -m textblob.download_corpora
  2. Analyze Sentiment:

    • Use TextBlob to calculate the polarity and subjectivity of each message.

      • Polarity: Ranges from -1 (negative) to 1 (positive).

      • Subjectivity: Ranges from 0 (objective) to 1 (subjective).

  3. Classify Sentiment:

    • Based on the polarity score, classify the sentiment as:

      • Positive: Polarity > 0

      • Neutral: Polarity = 0

      • Negative: Polarity < 0

Python Code

from textblob import TextBlob

# Review messages
reviews = [
    "I purchased headphones. I am very happy with the product.",
    "I saw the movie yesterday. The animation was really good but the script was ok.",
    "I enjoy listening to music.",
    "I take a walk in the park everyday."
]

# Perform sentiment analysis on each review
for i, review in enumerate(reviews, 1):
    # Create a TextBlob object
    blob = TextBlob(review)
    
    # Get polarity and subjectivity
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    
    # Classify sentiment based on polarity
    if polarity > 0:
        sentiment = "Positive"
    elif polarity == 0:
        sentiment = "Neutral"
    else:
        sentiment = "Negative"
    
    # Print results
    print(f"Review {i}: {review}")
    print(f"Sentiment: {sentiment} (Polarity: {polarity:.2f}, Subjectivity: {subjectivity:.2f})")
    print("-" * 50)