Simple Linear Regression
1. Objective
To understand and apply linear regression for predicting continuous numerical values. This lab covers the fundamental concepts, model interpretation, and a practical application with a simple example. Linear regression is one of the most basic supervised learning algorithms; in Python it is commonly implemented with libraries such as NumPy and scikit-learn.
2. Introduction
Linear regression is a supervised learning algorithm used to model the relationship between a dependent variable and one or more independent variables. It assumes that this relationship is linear. It is widely used across many fields, from finance to biology, because it is simple to fit and its coefficients are easy to interpret.
3. Theory
Simple linear regression (one independent variable) can be represented as:
Y = β₀ + β₁X + ε
Where:
- Y is the dependent variable (the value we want to predict).
- X is the independent variable (the predictor).
- β₀ is the y-intercept.
- β₁ is the slope (representing the change in Y for a unit change in X).
- ε is the error term (the part of Y that the linear relationship does not explain, i.e., the difference between the actual value and the value given by β₀ + β₁X).
Multiple linear regression (multiple independent variables) extends this to:
Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
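In practice the coefficients are usually estimated from data by ordinary least squares, which chooses the line that minimizes the sum of squared residuals. The following is a minimal sketch of the closed-form estimates for the simple (one-variable) case. The data is synthetic, and the "true" values 3.0 and 2.0 are assumptions chosen for illustration, not values from this lab.

```python
import numpy as np

# Synthetic data (assumed for illustration): Y = 3 + 2X + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

# Closed-form least-squares estimates for simple linear regression
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Residuals: observed values minus fitted values
residuals = y - (beta_0 + beta_1 * x)

print(f"Estimated intercept: {beta_0:.2f}")
print(f"Estimated slope: {beta_1:.2f}")
```

The estimates should land close to the assumed values, and the residuals of a least-squares fit with an intercept sum to (numerically) zero.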
4. Example: Pizza Price Prediction
Let’s predict the price of a pizza based on its diameter.
Data:
Diameter (inches) | Price ($)
6 | 8
8 | 10
10 | 12
12 | 14
14 | 16
Steps:
- Model Building:
Every price in the table is exactly 2 more than the diameter, so the model after training is:
Y = 2 + X
- Prediction:
If a pizza has a diameter of 11 inches, its predicted price would be:
Y = 2 + 11 = 13
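The fit on the pizza table can be checked with a few lines of NumPy. This is only a sketch: np.polyfit with degree 1 performs a least-squares line fit, and the variable names are chosen here for illustration.

```python
import numpy as np

# Data from the pizza table above
diameter = np.array([6, 8, 10, 12, 14], dtype=float)
price = np.array([8, 10, 12, 14, 16], dtype=float)

# For degree 1, np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(diameter, price, 1)

# Predict the price of an 11-inch pizza
predicted_price = intercept + slope * 11

print(f"Price = {intercept:.2f} + {slope:.2f} * Diameter")
print(f"Predicted price for an 11-inch pizza: ${predicted_price:.2f}")
```

Because every price in the table equals the diameter plus 2, the fit is exact: the intercept comes out as 2, the slope as 1, and the 11-inch prediction as 13 (up to floating-point rounding).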
5. Model Evaluation
Common metrics for evaluating regression models include:
- Mean Squared Error (MSE): Average of the squared differences between predicted and actual values. Lower is better.
- R-squared: Represents the proportion of variance in the dependent variable explained by the model. It is at most 1, with higher values indicating a better fit. On the training data of a least-squares model with an intercept it lies between 0 and 1, but on unseen data it can be negative when the model predicts worse than the mean of the actual values.
- Root Mean Squared Error (RMSE): Square root of the MSE. Easier to interpret because it is in the same units as the dependent variable.
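The three metrics can be computed directly from their definitions. The sketch below uses a small set of hypothetical actual and predicted values, made up purely to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([8, 10, 12, 14, 16], dtype=float)
y_pred = np.array([9, 10, 11, 15, 16], dtype=float)

# Mean Squared Error: average squared difference
mse = np.mean((y_true - y_pred) ** 2)

# Root Mean Squared Error: same units as the target
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"MSE: {mse:.3f}")              # 0.600
print(f"RMSE: {rmse:.3f}")            # 0.775
print(f"R-squared: {r_squared:.3f}")  # 0.925
```

Here the squared errors are 1, 0, 1, 1, 0, so the MSE is 3/5 = 0.6, and the total sum of squares around the mean (12) is 40, giving R-squared = 1 - 3/40 = 0.925.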
6. Python Implementation: Predicting Ice Cream Sales
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Load the dataset
df = pd.read_csv('ice_cream_sales.csv')
# Display the first few rows of the dataset
print(df.head())
# Features (independent variable)
X = df[['Temperature']]
# Target (dependent variable)
y = df['Sales']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate the R-squared value
r_squared = model.score(X_test, y_test)
print(f'R-squared: {r_squared:.2f}')
# Plot all data points
plt.scatter(X, y, color='blue', label='All Data Points')
# Plot the regression line using the entire range of X values
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('Temperature')
plt.ylabel('Sales')
plt.title('Ice Cream Sales vs Temperature')
plt.legend()
plt.show()
# Predict sales for a temperature of 33
# (a DataFrame with the same column name the model was trained on)
new_temperature = pd.DataFrame({'Temperature': [33]})
predicted_sales = model.predict(new_temperature)
print(f'Predicted Sales for Temperature 33: {predicted_sales[0]:.2f}')
The following walks through the code step by step and explains what each part does. This will help you understand how the linear regression model is implemented and how the data is processed and visualized.
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
Functionality:
- pandas: Used for loading and manipulating the dataset (e.g., reading CSV files, working with DataFrames).
- numpy: Provides support for numerical operations (not used directly here, but often required in data science tasks).
- matplotlib.pyplot: Used for data visualization (e.g., plotting scatter plots and regression lines).
- LinearRegression: A class from sklearn.linear_model used to create and train a linear regression model.
- train_test_split: A function from sklearn.model_selection used to split the dataset into training and testing sets.
Step 2: Load the Dataset
df = pd.read_csv('ice_cream_sales.csv')
print(df.head())
Functionality:
- read_csv('ice_cream_sales.csv'): Reads the dataset from a CSV file named ice_cream_sales.csv into a Pandas DataFrame.
- head(): Displays the first 5 rows of the dataset to give a quick overview of its structure.
Example Dataset:
Assume the dataset (ice_cream_sales.csv) looks like this:
Temperature | Sales
25 | 100
30 | 150
35 | 200
40 | 250
45 | 300
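If you do not have the file yet, a sample can be generated from the table. The sketch below writes the five rows shown above to a CSV file and reads it back. The temporary folder is an assumption made so the example can run anywhere; in the lab you would normally save the file next to your script.

```python
import os
import tempfile

import pandas as pd

# The five rows from the example table
sample = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})

# Written to a temporary folder here (an assumption for portability)
csv_path = os.path.join(tempfile.mkdtemp(), "ice_cream_sales.csv")
sample.to_csv(csv_path, index=False)  # index=False keeps only the two data columns

# Read it back the same way the lab code does
df = pd.read_csv(csv_path)
print(df.head())
print(df.shape)
```

Note that with only five rows, test_size=0.2 leaves a single test row, and R-squared is not well defined on one sample, so a real run needs a larger dataset than this table.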
Step 3: Prepare the Data
X = df[['Temperature']]
y = df['Sales']
Functionality:
- X = df[['Temperature']]: Extracts the Temperature column as the feature (independent variable). The double brackets [['Temperature']] make X a two-dimensional DataFrame; scikit-learn expects the features as a 2-D structure (rows × columns), even when there is only one feature.
- y = df['Sales']: Extracts the Sales column as the target (dependent variable). This is a single column (a Pandas Series).
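The difference between single and double brackets is easiest to see by printing the type and shape of each result. A small sketch, using the rows of the example table as stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})

X = df[["Temperature"]]          # double brackets: DataFrame, 2-D
y = df["Sales"]                  # single brackets: Series, 1-D
X_as_series = df["Temperature"]  # what single brackets would give for the feature

print(type(X).__name__, X.shape)                      # DataFrame (5, 1)
print(type(y).__name__, y.shape)                      # Series (5,)
print(type(X_as_series).__name__, X_as_series.shape)  # Series (5,)
```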
Step 4: Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Functionality:
- train_test_split(X, y, test_size=0.2, random_state=42):
- Splits the dataset into training and testing sets.
- test_size=0.2: 20% of the data is used for testing, and 80% is used for training.
- random_state=42: Ensures the split is reproducible (i.e., the same split will occur every time the code is run).
Output:
- X_train: Features (Temperature) for training.
- X_test: Features (Temperature) for testing.
- y_train: Target (Sales) for training.
- y_test: Target (Sales) for testing.
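To see the effect of test_size and random_state, the split can be run twice on a small dataset and compared. The ten rows below are hypothetical values made up for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical rows, for illustration only
df = pd.DataFrame({
    "Temperature": np.arange(20, 40, 2),
    "Sales": np.arange(50, 250, 20),
})
X = df[["Temperature"]]
y = df["Sales"]

# Same arguments, called twice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Training rows:", len(X_train))  # 8
print("Testing rows:", len(X_test))    # 2
print("Same split both times:", X_test.equals(X_test_2))  # True
```

With 10 rows and test_size=0.2, 2 rows go to the test set and 8 to the training set; fixing random_state makes both calls pick the same rows.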
Step 5: Create and Train the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)
Functionality:
- model = LinearRegression(): Creates an instance of the LinearRegression class.
- fit(X_train, y_train): Trains the model using the training data (X_train and y_train). The model learns the relationship between Temperature and Sales by estimating the intercept and the slope.
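After fitting, the learned intercept and slope are available as model.intercept_ and model.coef_. The sketch below fits on the five rows of the example table (used here in full as training data, which is an assumption made for illustration) and shows that a prediction is simply intercept + slope × Temperature.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The five rows from the example table, all used for training here
train = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})

model = LinearRegression()
model.fit(train[["Temperature"]], train["Sales"])

slope = model.coef_[0]        # change in Sales per one-degree rise
intercept = model.intercept_  # predicted Sales at Temperature 0

# Predict with a DataFrame that uses the same column name as in training
new_day = pd.DataFrame({"Temperature": [33]})
predicted = model.predict(new_day)[0]
by_hand = intercept + slope * 33

print(f"Sales = {intercept:.2f} + {slope:.2f} * Temperature")
print(f"model.predict: {predicted:.2f}, by hand: {by_hand:.2f}")
```

For this table the relationship is exactly Sales = -150 + 10 × Temperature, so both the model and the hand calculation give 180 for a temperature of 33.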
Step 6: Make Predictions
y_pred = model.predict(X_test)
Functionality:
- predict(X_test): Uses the trained model to predict Sales for the test data (X_test).
- y_pred: Contains the predicted values of Sales for the test data.
Step 7: Evaluate the Model
r_squared = model.score(X_test, y_test)
print(f'R-squared: {r_squared:.2f}')
Functionality:
- score(X_test, y_test): Calculates the R-squared value on the test set, which measures how well the model explains the variance in the target variable.
- R-squared is at most 1, where 1 indicates a perfect fit; on test data it can be negative if the model predicts worse than simply using the mean of y_test.
- print(f'R-squared: {r_squared:.2f}'): Displays the R-squared value rounded to 2 decimal places.
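The same value that model.score returns can be obtained from r2_score in sklearn.metrics, alongside other metrics such as the MSE. The sketch below uses hypothetical predictions (made up for illustration) to show a good fit and a fit that is worse than predicting the mean, where R-squared becomes negative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical test targets and two sets of predictions
y_test = np.array([1.0, 2.0, 3.0])
good_pred = np.array([1.1, 1.9, 3.2])  # close to the actual values
bad_pred = np.array([3.0, 2.0, 1.0])   # trend reversed

r2_good = r2_score(y_test, good_pred)
r2_bad = r2_score(y_test, bad_pred)
rmse_good = np.sqrt(mean_squared_error(y_test, good_pred))

print(f"R-squared (good predictions): {r2_good:.2f}")  # 0.97
print(f"R-squared (bad predictions): {r2_bad:.2f}")    # -3.00
print(f"RMSE (good predictions): {rmse_good:.3f}")     # 0.141
```

Note the argument order: r2_score and mean_squared_error take the actual values first and the predictions second.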
Step 8: Visualize the Results
plt.scatter(X, y, color='blue', label='All Data Points')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('Temperature')
plt.ylabel('Sales')
plt.title('Ice Cream Sales vs Temperature')
plt.legend()
plt.show()
Functionality:
- scatter(X, y, color='blue', label='All Data Points'):
- Plots all data points (Temperature vs Sales) as blue dots.
- label='All Data Points': Adds a label for the legend.
- plot(X, model.predict(X), color='red', label='Regression Line'):
- Plots the regression line (the model's predictions over all X values) as a red line.
- label='Regression Line': Adds a label for the legend.
- xlabel('Temperature'): Labels the x-axis as "Temperature".
- ylabel('Sales'): Labels the y-axis as "Sales".
- title('Ice Cream Sales vs Temperature'): Adds a title to the plot.
- legend(): Displays the legend to differentiate between data points and the regression line.
- show(): Displays the plot.
Assignment 1. Set A 1) SPPU:
Create a sales data set with 5 columns, namely id, tv, radio, newspaper and sales (500 random entries). Identify the independent and target variables, split them into training and testing sets in a 7:3 ratio, and print them. Build a simple linear regression model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Set a random seed for reproducibility
np.random.seed(42)
# Create the dataset
data = {
    'ID': range(1, 501),
    'TV': np.random.randint(1000, 5000, 500),
    'Radio': np.random.randint(100, 1000, 500),
    'Newspaper': np.random.randint(100, 1000, 500),
    'Sales': np.random.randint(10000, 50000, 500)
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Display the first few rows of the dataset
print("Sample of the dataset:")
print(df.head())
# Define independent variable (feature) and target variable
X = df[['TV']]  # Independent variable (only TV)
y = df['Sales']  # Target variable
# Split the data into training and testing sets (70:30 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Print the shapes of the training and testing sets
print("\nTraining set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)
# Build a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Print the model coefficients
print("\nModel coefficients:")
print("Intercept:", model.intercept_)
print("Coefficient (TV):", model.coef_[0])
# Predict sales on the testing set
y_pred = model.predict(X_test)
# Evaluate the model using R^2 score
r2 = r2_score(y_test, y_pred)
print("\nModel R^2 score on testing set:", r2)
# Plotting the regression line
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_test['TV'], y=y_test, color='blue', label='Actual Sales')
sns.lineplot(x=X_test['TV'], y=y_pred, color='red', label='Predicted Sales')
plt.title('Simple Linear Regression: TV vs Sales')
plt.xlabel('TV Advertising Budget')
plt.ylabel('Sales')
plt.legend()
plt.show()
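Because every column in the solution above is drawn independently at random, Sales has no real relationship with TV, so the R^2 score will be close to zero. If you want to see the model recover a known relationship, one option is to generate Sales from TV plus noise. The sketch below does this; the intercept 5000, slope 8, and noise level 3000 are arbitrary assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# TV budgets drawn at random, as in the assignment
tv = rng.integers(1000, 5000, n)

# Assumed relationship (illustration only): Sales = 5000 + 8 * TV + noise
sales = 5000 + 8 * tv + rng.normal(0, 3000, n)

df = pd.DataFrame({"ID": range(1, n + 1), "TV": tv, "Sales": sales})

# 70:30 split, then fit and evaluate
X_train, X_test, y_train, y_test = train_test_split(
    df[["TV"]], df["Sales"], test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

print("Training rows:", len(X_train), "Testing rows:", len(X_test))
print(f"Estimated slope: {model.coef_[0]:.2f}")
print(f"R^2 on testing set: {r2:.2f}")
```

With data built this way, the estimated slope should come out near the assumed value of 8 and the R^2 score should be high, which makes it easier to confirm that the pipeline works before moving to real data.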
Assignment 1. Set A 2) SPPU:
Create a 'realestate' data set with 4 columns, namely ID, flat, houses and purchases (500 random entries). Identify the independent and target variables, split them into training and testing sets, and print them. Build a simple linear regression model for predicting purchases.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Set a random seed for reproducibility
np.random.seed(42)
# Create the dataset
data = {
    'ID': range(1, 501),  # Unique ID for each entry
    'flat': np.random.randint(10, 100, 500),  # Random number of flats
    'houses': np.random.randint(1, 50, 500),  # Random number of houses
    'purchases': np.random.randint(100, 1000, 500)  # Random number of purchases
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Display the first few rows of the dataset
print("Sample of the dataset:")
print(df.head())
# Define independent variable (feature) and target variable
X = df[['flat']]  # Independent variable (only flat)
y = df['purchases']  # Target variable
# Split the data into training and testing sets (70:30 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Print the shapes of the training and testing sets
print("\nTraining set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)
# Build a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Print the model coefficients
print("\nModel coefficients:")
print("Intercept:", model.intercept_)
print("Coefficient (flat):", model.coef_[0])
# Predict purchases on the testing set
y_pred = model.predict(X_test)
# Evaluate the model using R^2 score
r2 = r2_score(y_test, y_pred)
print("\nModel R^2 score on testing set:", r2)
# Plotting scatter plot for flat vs purchases
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_test['flat'], y=y_test, color='blue', label='Actual Purchases')
sns.lineplot(x=X_test['flat'], y=y_pred, color='red', label='Predicted Purchases')
plt.title('Simple Linear Regression: Flat vs Purchases')
plt.xlabel('Number of Flats')
plt.ylabel('Purchases')
plt.legend()
plt.show()
# Plotting predicted vs actual purchases
plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_test, y=y_pred, color='purple')
plt.title('Predicted vs Actual Purchases')
plt.xlabel('Actual Purchases')
plt.ylabel('Predicted Purchases')
# Diagonal line: points on it are predicted exactly
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='black', linestyle='--')
plt.show()
Assignment 1. Set B 1) SPPU:
Build a simple linear regression model for Fish Species Weight Prediction. (Use Fish.csv file for data set.)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Step 1: Load the dataset from the current directory
df = pd.read_csv('Fish.csv')
# Display the first few rows of the dataset
print("Sample of the dataset:")
print(df.head())
# Step 2: Select independent variable (Length1) and target variable (Weight)
X = df[['Length1']]  # Independent variable (Length1)
y = df['Weight']  # Target variable (Weight)
# Step 3: Split the data into training and testing sets (80:20 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Build a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 5: Predict on the testing set
y_pred = model.predict(X_test)
# Step 6: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("\nModel Evaluation:")
print("Mean Squared Error (MSE):", mse)
print("R^2 Score:", r2)
# Step 7: Plot the regression line and scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual Weight')
plt.plot(X_test, y_pred, color='red', label='Predicted Weight')
plt.title('Simple Linear Regression: Length1 vs Weight')
plt.xlabel('Length1 (cm)')
plt.ylabel('Weight (g)')
plt.legend()
plt.show()