Logistic Regression Lab

1. Objective

       To understand and apply logistic regression for binary classification problems. This lab will cover the fundamental concepts of logistic regression, model interpretation, and a practical application with a simple example.

2. Introduction

         Logistic regression is a supervised learning algorithm used for binary classification. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of a categorical outcome, typically binary (0 or 1, True or False). A typical application is predicting customer churn, a binary outcome (churn or not churn). It is commonly used in fields such as customer analytics. While random forests are gaining popularity for prediction, logistic regression remains valuable for its interpretability.

3. Theory

         Logistic regression uses the sigmoid function to model the probability:

P(Y=1|X) = 1 / (1 + e^(-(β₀ + β₁X₁ + … + βₙXₙ)))

Where:

  • P(Y=1|X) is the probability of the positive outcome (Y=1) given the input features X.
  • X₁, X₂, …, Xₙ are the independent variables.
  • β₀, β₁, …, βₙ are the coefficients to be estimated. β₀ is the intercept, and each βᵢ is the change in the log-odds of Y=1 for a one-unit increase in Xᵢ, with the other features held fixed. (2023) describes the role of the model coefficients.
  • e is the base of the natural logarithm.

The model learns the coefficients (β values) during training to best fit the data. This is typically done by maximizing the likelihood function.
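
As a rough illustration of what "maximizing the likelihood" means, the sketch below evaluates the log-likelihood of a one-feature model and improves it with plain gradient ascent. The toy data, the learning rate, the number of steps, and the function names are all assumptions made for this example; library implementations use more sophisticated optimizers.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(y_i | x_i) under P(Y=1|X) = sigmoid(b0 + b1*X)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient_ascent(xs, ys, lr=0.1, steps=2000):
    """Estimate b0 and b1 by stepping along the gradient of the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient components: sum of (y - p) and sum of (y - p) * x
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical toy data, made up for illustration (classes overlap on purpose)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = gradient_ascent(xs, ys)
print("log-likelihood at (0, 0):", log_likelihood(0.0, 0.0, xs, ys))
print("log-likelihood after fit:", log_likelihood(b0, b1, xs, ys))
```

The second printed value should be larger (less negative) than the first, which is the sense in which the fitted coefficients fit the data better than the starting guess.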

4. Example: Pass/Fail Prediction

Let’s predict whether a student passes or fails an exam based on study hours.

Data:
Study Hours (X)    Pass/Fail (Y)
 2                 0
 4                 0
 6                 1
 8                 1
10                 1

Steps:

  1. Model Building:
    We’ll use a simplified model:
    P(Y=1|X) = 1 / (1 + e^(-(β₀ + β₁X)))
  2. Training:
    Let’s assume (for simplicity – in reality, coefficients are learned through optimization algorithms) that training yields β₀ = -4 and β₁ = 1.
  3. Prediction:
    • If a student studies 5 hours:
      P(Y=1|X=5) = 1 / (1 + e^(-(-4 + 1*5))) = 1 / (1 + e^(-1)) ≈ 0.73.
      Since this probability is > 0.5, we predict the student will pass. (2023) mentions using a cutoff value (default 0.5) for binary classification.
    • If a student studies 3 hours:
      P(Y=1|X=3) = 1 / (1 + e^(-(-4 + 1*3))) = 1 / (1 + e^(1)) ≈ 0.27.
      We predict the student will fail.
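
The two predictions above can be checked with a few lines of Python. This is a minimal sketch that hard-codes the assumed coefficients β₀ = -4 and β₁ = 1 and the 0.5 cutoff; the function names are chosen for illustration only.

```python
import math

def predict_proba(hours, b0=-4.0, b1=1.0):
    """P(Y=1 | X=hours) for the one-feature model with the assumed coefficients."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))

def predict_label(hours, cutoff=0.5):
    """Return 1 (pass) if the probability exceeds the cutoff, else 0 (fail)."""
    return 1 if predict_proba(hours) > cutoff else 0

for hours in (3, 5):
    p = predict_proba(hours)
    label = "pass" if predict_label(hours) == 1 else "fail"
    print(f"{hours} hours: P(pass) = {p:.2f} -> {label}")
# 3 hours: P(pass) = 0.27 -> fail
# 5 hours: P(pass) = 0.73 -> pass
```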

5. Evaluation

Several metrics evaluate a classification model:

  • Accuracy: The percentage of correctly classified instances.
  • Precision: Proportion of true positives out of predicted positives.
  • Recall: Proportion of true positives out of actual positives.
  • F1-score: Harmonic mean of precision and recall.
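
To make the four definitions concrete, here is a small sketch that computes them from the counts of true/false positives and negatives. The label lists are made up for illustration and do not come from any dataset in this lab; in practice, sklearn.metrics offers equivalent functions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here the model finds 2 of the 4 actual positives (recall 0.5) and 2 of its 3 positive predictions are correct (precision about 0.667), which shows how the two metrics can differ on the same predictions.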

Assignment 1. Set A 3) SPPU:

Create a ‘User’ data set with 5 columns: User Id, Gender, Age, Estimated Salary, and Purchased. Build a logistic regression model that predicts, from the given parameters, whether a person will buy a car.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Create the dataset
np.random.seed(42)
data = {
    'User Id': range(1, 501),
    'Gender': np.random.choice(['Male', 'Female'], 500),
    'Age': np.random.randint(18, 65, 500),
    'Estimated Salary': np.random.randint(20000, 150000, 500),
    'Purchased': np.random.choice([0, 1], 500)  # 0: Not Purchased, 1: Purchased
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Step 2: Define independent and target variables
X = df[['Age', 'Estimated Salary']]  # Independent variables
y = df['Purchased']  # Target variable

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Build a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

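# Step 5: Predict and evaluate the model
# Note: 'Purchased' was generated at random, independent of the features,
# so accuracy is expected to be roughly at chance level (around 0.5).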
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Step 6: Predict for a new person
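# Use a DataFrame with the training column names to avoid a feature-name warning
new_person = pd.DataFrame([[30, 70000]], columns=['Age', 'Estimated Salary'])  # Age = 30, Estimated Salary = 70,000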
prediction = model.predict(new_person)

# Output the prediction
if prediction[0] == 1:
    print("The person is likely to buy a car.")
else:
    print("The person is not likely to buy a car.")

# Step 7: Plot the decision boundary and scatter plot
plt.figure(figsize=(10, 6))

# Scatter plot for the data points
sns.scatterplot(x=X_test['Age'], y=X_test['Estimated Salary'], hue=y_test, palette='coolwarm', s=100)
plt.title('Logistic Regression: Decision Boundary')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')

# Create a grid to plot the decision boundary
age_range = np.arange(X_test['Age'].min() - 1, X_test['Age'].max() + 1, 0.1)
salary_range = np.arange(X_test['Estimated Salary'].min() - 1000, X_test['Estimated Salary'].max() + 1000, 100)
age_grid, salary_grid = np.meshgrid(age_range, salary_range)
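# Keep the training column names so predict() does not warn about feature names
grid_points = pd.DataFrame(np.c_[age_grid.ravel(), salary_grid.ravel()], columns=['Age', 'Estimated Salary'])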

# Predict the class for each point in the grid
grid_predictions = model.predict(grid_points).reshape(age_grid.shape)

# Plot the decision boundary
plt.contourf(age_grid, salary_grid, grid_predictions, alpha=0.3, cmap='coolwarm')
plt.colorbar(label='Purchased (1) or Not Purchased (0)')

plt.legend(title='Purchased', loc='upper right')
plt.show()
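
The boundary drawn by the plot is the set of points where the predicted probability equals 0.5, that is, where β₀ + β₁·Age + β₂·Salary = 0. The sketch below solves that equation for salary at a given age. The coefficient values are hypothetical, picked only to give round numbers; for a fitted scikit-learn model they would be read from model.intercept_ and model.coef_.

```python
import math

def boundary_salary(age, b0, b_age, b_salary):
    """Salary at which P(Purchased=1) equals 0.5 for the given age."""
    return -(b0 + b_age * age) / b_salary

# Hypothetical coefficients (not taken from a fitted model)
b0, b_age, b_salary = -12.0, 0.2, 0.0001

for age in (20, 40):
    salary = boundary_salary(age, b0, b_age, b_salary)
    # Check: the linear score is (numerically) zero on the boundary
    score = b0 + b_age * age + b_salary * salary
    prob = 1.0 / (1.0 + math.exp(-score))
    print(f"age {age}: boundary salary = {salary:.0f}, P = {prob:.2f}")
```

With two features the boundary is a straight line in the Age/Salary plane; points on one side are classified as 1 and points on the other side as 0.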

Assignment 1. Set B 2) SPPU:

Use the iris dataset. Write a Python program to view some basic statistical details, such as percentiles, mean, and standard deviation, for the species Iris-setosa, Iris-versicolor, and Iris-virginica. Apply logistic regression to the dataset to identify the species (setosa, versicolor, virginica) of Iris flowers given just 4 features: sepal and petal lengths and widths.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Load the Iris dataset from the current directory
df = pd.read_csv('iris.csv')

# Display the first few rows of the dataset
print("Sample of the dataset:")
print(df.head())

# Step 2: View basic statistical details for each species
print("\nBasic Statistical Details:")
print(df.groupby('Species').describe())

# Step 3: Prepare the data for logistic regression
# Independent variables (features): SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm
X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
# Target variable: Species
y = df['Species']

# Step 4: Split the data into training and testing sets (80:20 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Build a logistic regression model
model = LogisticRegression(max_iter=200)  # Increase max_iter for convergence
model.fit(X_train, y_train)

# Step 6: Predict on the testing set
y_pred = model.predict(X_test)

# Step 7: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy:", accuracy)

# Step 8: Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))