Logistic Regression Lab
1. Objective
To understand and apply logistic regression for binary classification problems. This lab will cover the fundamental concepts of logistic regression, model interpretation, and a practical application with a simple example.
2. Introduction
Logistic regression is a supervised learning algorithm used for binary classification. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of a categorical outcome, typically binary (0 or 1, True or False). For example, it can be used to predict customer churn, a binary outcome (churn or not churn). It is commonly used in fields such as customer analytics. While random forests are gaining popularity for prediction, logistic regression remains valuable for its interpretability.
3. Theory
Logistic regression uses the sigmoid function to model the probability:
P(Y=1|X) = 1 / (1 + e^(-(β₀ + β₁X₁ + … + βₙXₙ)))
Where:
- P(Y=1|X) is the probability of the positive outcome (Y=1) given the input features X.
- X₁, X₂, …, Xₙ are the independent variables.
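- β₀, β₁, …, βₙ are the coefficients to be estimated. β₀ is the intercept, and each remaining βᵢ gives the change in the log-odds of the positive outcome for a one-unit increase in Xᵢ, with the other features held fixed. (2023) describes the role of the model coefficients.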
- e is the base of the natural logarithm.
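One way to read a coefficient is through the odds, p / (1 - p). The sketch below, using hypothetical values β₀ = -4 and β₁ = 1 chosen only for illustration, checks numerically that raising X by one unit multiplies the odds by e^β₁. The helper names `sigmoid` and `odds` are assumptions of this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p: float) -> float:
    """Convert a probability p to odds p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -4.0, 1.0

p_at_5 = sigmoid(beta0 + beta1 * 5)
p_at_6 = sigmoid(beta0 + beta1 * 6)

# A one-unit increase in X multiplies the odds by e^beta1.
odds_ratio = odds(p_at_6) / odds(p_at_5)
print("Observed odds ratio:", round(odds_ratio, 4))
print("e^beta1:            ", round(math.exp(beta1), 4))
```

The two printed values should agree, which is why e^βᵢ is often reported as the odds ratio for feature Xᵢ.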
The model learns the coefficients (β values) during training to best fit the data. This is typically done by maximizing the likelihood function.
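To make "maximizing the likelihood" concrete, here is a minimal sketch for a single feature that uses plain gradient ascent on the log-likelihood. The toy data, learning rate, and step count are hypothetical, and gradient ascent is only one possible optimizer; libraries generally use more sophisticated solvers.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(y | x) over the data for coefficients b0, b1."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(xs, ys, lr=0.01, steps=5000):
    """Gradient ascent on the log-likelihood, starting from zeros."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (y - p) and (y - p) * x.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        g0 = sum(errors)
        g1 = sum(e * x for e, x in zip(errors, xs))
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1


# Hypothetical toy data; the classes overlap, so the maximum is finite.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = fit(xs, ys)
print("Log-likelihood at zeros:", log_likelihood(0.0, 0.0, xs, ys))
print("Log-likelihood after fit:", log_likelihood(b0, b1, xs, ys))
```

The log-likelihood after fitting should be higher (closer to zero) than at the starting point, which is what training aims for.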
4. Example: Pass/Fail Prediction
Let’s predict whether a student passes or fails an exam based on study hours.
Data:
Study Hours (X)    Pass/Fail (Y)
2                  0
4                  0
6                  1
8                  1
10                 1
Steps:
- Model Building:
We’ll use a simplified model with a single feature:
P(Y=1|X) = 1 / (1 + e^(-(β₀ + β₁X)))
- Training:
Let’s assume, for simplicity, that training yields β₀ = -4 and β₁ = 1. In reality, coefficients are learned through optimization algorithms.
- Prediction:
  - If a student studies 5 hours:
  P(Y=1|X=5) = 1 / (1 + e^(-(-4 + 1*5))) = 1 / (1 + e^(-1)) ≈ 0.73.
  Since this probability is > 0.5, we predict the student will pass. (2023) mentions using a cutoff value (default 0.5) for binary classification.
  - If a student studies 3 hours:
  P(Y=1|X=3) = 1 / (1 + e^(-(-4 + 1*3))) = 1 / (1 + e^(1)) ≈ 0.27.
  Since this probability is < 0.5, we predict the student will fail.
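The two predictions above can be reproduced with a short script. This is a sketch that hard-codes the assumed coefficients β₀ = -4 and β₁ = 1 from the example; the function names `predict_proba` and `predict_label` are hypothetical.

```python
import math


def predict_proba(hours, beta0=-4.0, beta1=1.0):
    """Probability of passing under the assumed coefficients."""
    z = beta0 + beta1 * hours
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(hours, threshold=0.5):
    """Return 1 (pass) if the probability exceeds the cutoff, else 0 (fail)."""
    return 1 if predict_proba(hours) > threshold else 0


for hours in (3, 5):
    p = predict_proba(hours)
    print(f"{hours} hours: P(pass) = {p:.2f}, predicted label = {predict_label(hours)}")
```

Changing `threshold` shows how the cutoff value, not just the probability, determines the final class.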
5. Evaluation
Several metrics evaluate a classification model:
- Accuracy: The percentage of correctly classified instances.
- Precision: Proportion of true positives out of predicted positives.
- Recall: Proportion of true positives out of actual positives.
- F1-score: Harmonic mean of precision and recall.
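As an illustration of how these four metrics relate, the sketch below computes them from true and predicted labels using only counts of true/false positives and negatives. The label lists are hypothetical, and `classification_metrics` is a helper name introduced here, not a library function.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In practice, library routines such as `accuracy_score` and `classification_report` from scikit-learn, used in the assignments below, report the same quantities.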
Assignment 1. Set A 3) SPPU:
Create a ‘User’ data set with 5 columns: User Id, Gender, Age, Estimated Salary, and Purchased. Build a logistic regression model that predicts, from the given parameters, whether a person will buy a car.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Create the dataset
np.random.seed(42)
data = {
    'User Id': range(1, 501),
    'Gender': np.random.choice(['Male', 'Female'], 500),
    'Age': np.random.randint(18, 65, 500),
    'Estimated Salary': np.random.randint(20000, 150000, 500),
    # 0: Not Purchased, 1: Purchased.
    # The labels are random and unrelated to the features,
    # so an accuracy close to 0.5 is expected.
    'Purchased': np.random.choice([0, 1], 500)
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Step 2: Define independent and target variables
X = df[['Age', 'Estimated Salary']]  # Independent variables
y = df['Purchased']                  # Target variable

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 4: Build a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 5: Predict and evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Step 6: Predict for a new person (Age = 30, Estimated Salary = 70,000)
# A DataFrame with the training column names avoids a feature-name warning.
new_person = pd.DataFrame([[30, 70000]], columns=X.columns)
prediction = model.predict(new_person)

# Output the prediction
if prediction[0] == 1:
    print("The person is likely to buy a car.")
else:
    print("The person is not likely to buy a car.")

# Step 7: Plot the decision boundary and scatter plot
plt.figure(figsize=(10, 6))

# Scatter plot for the data points
sns.scatterplot(x=X_test['Age'], y=X_test['Estimated Salary'],
                hue=y_test, palette='coolwarm', s=100)
plt.title('Logistic Regression: Decision Boundary')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')

# Create a grid to plot the decision boundary
age_range = np.arange(X_test['Age'].min() - 1, X_test['Age'].max() + 1, 0.1)
salary_range = np.arange(X_test['Estimated Salary'].min() - 1000,
                         X_test['Estimated Salary'].max() + 1000, 100)
age_grid, salary_grid = np.meshgrid(age_range, salary_range)
grid_points = pd.DataFrame(np.c_[age_grid.ravel(), salary_grid.ravel()],
                           columns=X.columns)

# Predict the class for each point in the grid
grid_predictions = model.predict(grid_points).reshape(age_grid.shape)

# Plot the decision boundary
plt.contourf(age_grid, salary_grid, grid_predictions, alpha=0.3, cmap='coolwarm')
plt.colorbar(label='Purchased (1) or Not Purchased (0)')
plt.legend(title='Purchased', loc='upper right')
plt.show()
Assignment 1. Set B 2) SPPU:
Use the iris dataset. Write a Python program to view some basic statistical details like percentile, mean, std, etc. of the species Iris-setosa, Iris-versicolor, and Iris-virginica. Apply logistic regression on the dataset to identify the different species (setosa, versicolor, virginica) of Iris flowers given just 4 features: sepal and petal lengths and widths.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Load the Iris dataset from the current directory
df = pd.read_csv('iris.csv')

# Display the first few rows of the dataset
print("Sample of the dataset:")
print(df.head())

# Step 2: View basic statistical details for each species
print("\nBasic Statistical Details:")
print(df.groupby('Species').describe())

# Step 3: Prepare the data for logistic regression
# Independent variables (features): SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm
X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
# Target variable: Species
y = df['Species']

# Step 4: Split the data into training and testing sets (80:20 ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 5: Build a logistic regression model
model = LogisticRegression(max_iter=200)  # Increase max_iter for convergence
model.fit(X_train, y_train)

# Step 6: Predict on the testing set
y_pred = model.predict(X_test)

# Step 7: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy:", accuracy)

# Step 8: Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))