Text and Social Media Analysis - Lab Solutions
SET A 1. SPPU
Consider any text paragraph. Preprocess the text to remove any special characters and digits. Generate the summary using the extractive summarization process.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def preprocess_text(text):
    # Remove special characters and digits, keeping only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()                       # Convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    return text

def extractive_summarization(text, summary_length=2):
    # Split into sentences BEFORE preprocessing, since preprocessing strips the periods
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    # Preprocess each sentence for scoring
    cleaned = [preprocess_text(s) for s in sentences]
    # Calculate TF-IDF scores
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(cleaned)
    # Calculate sentence importance (cosine similarity to the last sentence)
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix).flatten()
    # Rank sentences by score (highest first)
    ranked_sentences = [s for _, s in sorted(zip(sentence_scores, sentences), key=lambda p: p[0], reverse=True)]
    # Select the top N sentences for the summary
    return '. '.join(ranked_sentences[:summary_length]) + '.'

# Example text
text = """
The quick brown fox jumps over the lazy dog. The dog barked at the fox.
The fox ran away quickly. It was a sunny day, and the birds were chirping.
The dog went back to sleep. The fox found a new place to rest.
"""

# Generate summary
summary = extractive_summarization(text, summary_length=2)
print("Summary:")
print(summary)
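The scoring step above ranks every sentence by its similarity to the last sentence, which makes the summary depend heavily on how the paragraph ends. A common alternative, sketched below as a minimal example (the function name summarize_by_tfidf and the stop-word filtering are our own choices, not part of the original solution), is to rank sentences by their average TF-IDF weight:

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize_by_tfidf(text, summary_length=2):
    # Split sentences on the raw text so punctuation is still available
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    # Vectorize; stop_words='english' keeps very common words out of the scoring
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(sentences)
    # Score each sentence by the mean weight of its terms
    scores = tfidf.mean(axis=1).A1  # .A1 flattens the sparse mean to a 1-D array
    # Pick the top-scoring sentence indices, then restore original order
    top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:summary_length])
    return ' '.join(sentences[i] for i in top)

print(summarize_by_tfidf(text, summary_length=2))  # reuses the example text above

Keeping the selected sentences in their original order usually reads more naturally than ordering them by score.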
SET A 2. SPPU
Consider any text paragraph. Remove the stopwords. Tokenize the paragraph to extract words and sentences. Calculate the word frequency distribution and plot the frequencies. Plot the word cloud of the text.
Step 1: Set Up the Environment
Install Python: Ensure Python is installed on your system. You can download it from python.org.
Install Required Libraries:
Open a terminal or command prompt.
Run the following commands to install the necessary libraries:
pip install nltk matplotlib wordcloud
Import NLTK and Download Required Resources
When you run the following code:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
Download punkt

What is punkt?
punkt is a pre-trained tokenizer model used for splitting text into sentences and words. It is essential for functions like word_tokenize() and sent_tokenize().

What is Downloaded?
The punkt dataset includes:
A pre-trained model for tokenization.
Language-specific rules for splitting text.

Location:
The dataset is saved in the nltk_data/tokenizers/punkt directory.
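Once the resources above are downloaded, a short snippet makes it easy to see what punkt actually does (a minimal sketch with a made-up sample sentence):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # punkt powers both tokenizers below

sample = "NLP is fun. It powers chatbots!"
print(sent_tokenize(sample))  # ['NLP is fun.', 'It powers chatbots!']
print(word_tokenize(sample))  # ['NLP', 'is', 'fun', '.', 'It', 'powers', 'chatbots', '!']

Note that word_tokenize() keeps punctuation marks as separate tokens, which is why the main solution below filters tokens with isalnum().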
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Step 2: Download NLTK datasets (if not already downloaded)
nltk.download('punkt')      # For tokenization
nltk.download('punkt_tab')  # For the punkt_tab resource
nltk.download('stopwords')  # For stopwords

# Step 3: Define the text paragraph
text = """
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on
the interaction between computers and humans using natural language. The goal is to enable
computers to understand, interpret, and generate human language in a way that is both
meaningful and useful. NLP is used in various applications such as chatbots, sentiment
analysis, and machine translation.
"""

# Step 4: Preprocess the text
text = text.lower()  # Convert to lowercase

# Step 5: Tokenize the text into words and sentences
words = word_tokenize(text)      # Tokenize into words
sentences = sent_tokenize(text)  # Tokenize into sentences

# Step 6: Remove stopwords and non-alphanumeric tokens
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]

# Step 7: Calculate word frequency
word_freq = Counter(filtered_words)

# Step 8: Plot word frequency distribution
plt.figure(figsize=(10, 6))
plt.bar(word_freq.keys(), word_freq.values())
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Word Frequency Distribution')
plt.xticks(rotation=45)
plt.show()

# Step 9: Generate and plot word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud')
plt.show()

# Step 10: Print results
print("Sentences:", sentences)
print("Filtered Words:", filtered_words)
print("Word Frequencies:", word_freq)
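For longer paragraphs the bar chart in Step 8 can become crowded. One option, sketched below assuming the word_freq Counter from Step 7 is still in scope, is to plot only the most frequent words using Counter.most_common():

# Plot only the 10 most frequent words (assumes word_freq and plt from the script above)
top_words = word_freq.most_common(10)
labels, counts = zip(*top_words)
plt.figure(figsize=(10, 6))
plt.bar(labels, counts)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Word Frequencies')
plt.xticks(rotation=45)
plt.show()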
SET A 3. SPPU
Consider the following review messages. Perform sentiment analysis on the messages.
i. I purchased headphones. I am very happy with the product.
ii. I saw the movie yesterday. The animation was really good but the script was ok.
iii. I enjoy listening to music.
iv. I take a walk in the park every day.
Sentiment analysis is a Natural Language Processing (NLP) technique used to determine the emotional tone or attitude expressed in a piece of text. It is widely used in various applications, such as analyzing customer reviews, social media posts, and feedback, to understand whether the sentiment is positive, negative, or neutral.
In this experiment, we will perform sentiment analysis on a set of review messages using the TextBlob library in Python. The goal is to:
Preprocess the Text: Analyze the given review messages.
Calculate Sentiment: Use TextBlob to compute the polarity (positive/negative/neutral) and subjectivity (objective/subjective) of each message.
Classify Sentiment: Based on the polarity score, classify the sentiment of each review.
Interpret Results: Understand why a particular sentiment was assigned to each review.
Steps to Perform Sentiment Analysis
Install Required Libraries:
Install TextBlob and its dependencies:
pip install textblob
python -m textblob.download_corpora
Analyze Sentiment:
Use TextBlob to calculate the polarity and subjectivity of each message.
Polarity: Ranges from -1 (negative) to 1 (positive).
Subjectivity: Ranges from 0 (objective) to 1 (subjective).
Classify Sentiment:
Based on the polarity score, classify the sentiment as:
Positive: Polarity > 0
Neutral: Polarity = 0
Negative: Polarity < 0
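Before running the full script, a single TextBlob call shows these thresholds in action (a minimal sketch; the exact scores depend on the TextBlob version installed):

from textblob import TextBlob

blob = TextBlob("I am very happy with the product.")
print(blob.sentiment)  # Sentiment(polarity=..., subjectivity=...)
# polarity > 0 here, so the rule above classifies this sentence as Positive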
Python Code
from textblob import TextBlob

# Review messages
reviews = [
    "I purchased headphones. I am very happy with the product.",
    "I saw the movie yesterday. The animation was really good but the script was ok.",
    "I enjoy listening to music.",
    "I take a walk in the park every day."
]

# Perform sentiment analysis on each review
for i, review in enumerate(reviews, 1):
    # Create a TextBlob object
    blob = TextBlob(review)

    # Get polarity and subjectivity
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity

    # Classify sentiment based on polarity
    if polarity > 0:
        sentiment = "Positive"
    elif polarity == 0:
        sentiment = "Neutral"
    else:
        sentiment = "Negative"

    # Print results
    print(f"Review {i}: {review}")
    print(f"Sentiment: {sentiment} (Polarity: {polarity:.2f}, Subjectivity: {subjectivity:.2f})")
    print("-" * 50)
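Interpreting the results: the first and third reviews contain clearly positive words ("very happy", "enjoy"), so they should come out Positive. The second review mixes "really good" with "ok", which TextBlob typically nets out to a mildly positive polarity. The fourth review states a habit with no opinion words, so its polarity is usually 0 and the script labels it Neutral.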