Simple Linear Regression

          Linear regression is a fundamental machine learning technique used for predicting a continuous outcome variable based on one or more predictor variables. It assumes a linear relationship between the predictors and the outcome. Imagine drawing a straight line that best fits a scatter plot of your data – that’s the essence of linear regression.

Core Idea

          The goal is to find the equation of this “best-fit” line, which can then be used to predict the outcome for new, unseen data points. The equation of a straight line is typically represented as:

y = mx + c

Where:

  • y is the predicted outcome.
  •  x  is the predictor variable.
  • m is the slope of the line (how much y changes for a unit change in x).
  • c is the y-intercept (the value of y when x is zero).

In linear regression, we aim to find the optimal values of m and c that minimize the difference between the predicted values (y) and the actual values in our dataset. 
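For a single predictor, the optimal m and c can be written in closed form (ordinary least squares): the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the fact that the line passes through the point of means. The snippet below is a minimal sketch of that calculation with NumPy; the five data points are made up purely for illustration.

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of co-deviations divided by sum of squared x-deviations
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the best-fit line passes through (x_mean, y_mean)
c = y_mean - m * x_mean

print(f"m = {m:.3f}, c = {c:.3f}")
```

Libraries such as scikit-learn arrive at the same values for this one-variable case, so the formula is mainly useful for building intuition.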

Example: Predicting Ice Cream Sales

Let’s say we want to predict ice cream sales based on the temperature outside, using historical data on temperature and sales.

 

We can plot this data and try to fit a straight line through it. Linear regression helps us find the best line, meaning the one that minimizes the error in our predictions. Let’s imagine the line we find has the equation:

Sales = 10 * Temperature - 100

Now, if the temperature is 40°C, we can predict ice cream sales:

Sales = 10 * 40 - 100 = 300

So, we predict sales of 300 units.
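The same arithmetic can be wrapped in a small function. This is only a sketch: the name predict_sales and its default arguments are hypothetical, and the slope of 10 and intercept of -100 are the imagined values from the equation above, not values learned from data.

```python
def predict_sales(temperature_c: float, slope: float = 10.0, intercept: float = -100.0) -> float:
    """Evaluate the illustrative line Sales = slope * Temperature + intercept."""
    return slope * temperature_c + intercept


print(predict_sales(40))  # 300.0
```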

How the “Best-Fit” Line is Determined

The “best-fit” line is found by minimizing a loss function, typically the Mean Squared Error (MSE). MSE is the average of the squared differences between predicted and actual values. The smaller the MSE, the better the line fits the data. Many libraries provide functions (such as LinearRegression().fit() in Python's scikit-learn) that carry out this minimization automatically, so you do not have to compute the line by hand.
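To make the idea concrete, here is a small sketch that computes the MSE by hand for two candidate lines. The five rows are the example dataset shown later in this article; the helper name mean_squared_error and the two candidate lines are chosen here for illustration.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


temperature = np.array([25, 30, 35, 40, 45])
sales = np.array([100, 150, 200, 250, 300])

# Candidate A: Sales = 10 * Temperature - 100
mse_a = mean_squared_error(sales, 10 * temperature - 100)
# Candidate B: Sales = 10 * Temperature - 150
mse_b = mean_squared_error(sales, 10 * temperature - 150)

print(f"MSE of candidate A: {mse_a}")
print(f"MSE of candidate B: {mse_b}")
```

Candidate A is off by 50 units on every row, while candidate B passes through every point, so B has the lower MSE and is the better fit for this particular data.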

       This example demonstrates the core concept of linear regression. Of course, real-world applications often involve multiple predictor variables and more complex datasets. Understanding this basic example, however, sets the foundation for learning more advanced regression techniques.

Python Implementation: Predicting Ice Cream Sales

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('ice_cream_sales.csv')

# Display the first few rows of the dataset
print(df.head())

# Features (independent variable)
X = df[['Temperature']]

# Target (dependent variable)
y = df['Sales']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate the R-squared value
r_squared = model.score(X_test, y_test)
print(f'R-squared: {r_squared:.2f}')

# Plot all data points and the regression line
plt.scatter(X, y, color='blue', label='All Data Points')  # Plot all data points
plt.plot(X_test, y_pred, color='red', label='Regression Line')  # Plot the regression line
plt.xlabel('Temperature')
plt.ylabel('Sales')
plt.title('Ice Cream Sales vs Temperature')
plt.legend()
plt.show()

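# Predict sales for a new temperature of 33 degrees.
# Passing a DataFrame with the same column name used in training
# avoids scikit-learn's feature-name warning.
new_data = pd.DataFrame({'Temperature': [33]})
predicted_sales = model.predict(new_data)

print(predicted_sales)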

The following sections walk through the code step by step and explain what each part does. This will help you understand how the linear regression model is implemented and how the data is processed and visualized.

Step 1: Import Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Functionality:
  • pandas: Used for loading and manipulating the dataset (e.g., reading CSV files, working with DataFrames).
  • numpy: Provides support for numerical operations (not explicitly used here but often required in data science tasks).
  • matplotlib.pyplot (imported as plt): Used for data visualization (e.g., plotting scatter plots and regression lines).
  • LinearRegression: A class from sklearn.linear_model used to create and train a linear regression model.
  • train_test_split: A function from sklearn.model_selection used to split the dataset into training and testing sets.

Step 2: Load the Dataset

df = pd.read_csv('ice_cream_sales.csv')
print(df.head())

Functionality:
  • read_csv('ice_cream_sales.csv'): Reads the dataset from a CSV file named ice_cream_sales.csv into a Pandas DataFrame.
  • head(): Displays the first 5 rows of the dataset to give a quick overview of its structure.
Example Dataset:

Assume the dataset (ice_cream_sales.csv) looks like this:

Temperature    Sales
25             100
30             150
35             200
40             250
45             300
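If you do not have the file at hand, the table above can be written out as ice_cream_sales.csv with a few lines of pandas. This is a sketch that simply reproduces the five example rows; it is not the original dataset.

```python
import pandas as pd

# Recreate the five example rows from the table above
example = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})

# Write the file in the current working directory, without the index column
example.to_csv("ice_cream_sales.csv", index=False)

# Read it back to confirm the structure
df = pd.read_csv("ice_cream_sales.csv")
print(df.head())
```

Keep in mind that five rows are only enough to illustrate the mechanics. A 20% test split of five rows holds out a single row, and scikit-learn's R-squared is not well-defined for fewer than two samples, so a larger dataset is needed for the evaluation step to be meaningful.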

Step 3: Prepare the Data

 

X = df[['Temperature']]
y = df['Sales']

Functionality:
  • X = df[['Temperature']]: Extracts the Temperature column as the feature (independent variable). The double brackets [['Temperature']] ensure that X is a two-dimensional DataFrame, which is the shape scikit-learn expects for its feature input.
  • y = df['Sales']: Extracts the Sales column as the target (dependent variable). This is a single column (a one-dimensional Pandas Series).
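The difference between single and double brackets is easiest to see by printing the shapes. The sketch below rebuilds the five example rows in memory (an assumption made so the snippet runs on its own) and compares the two forms.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})

X = df[["Temperature"]]  # double brackets: DataFrame with shape (rows, 1)
y = df["Sales"]          # single brackets: Series with shape (rows,)

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```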

Step 4: Split the Data into Training and Testing Sets

 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Functionality:
  • train_test_split(X, y, test_size=0.2, random_state=42):
    • Splits the dataset into training and testing sets.
    • test_size=0.2: 20% of the data is used for testing, and 80% is used for training.
    • random_state=42: Ensures the split is reproducible (i.e., the same split will occur every time the code is run).
Output:
  • X_train: Features (Temperature) for training.
  • X_test: Features (Temperature) for testing.
  • y_train: Target (Sales) for training.
  • y_test: Target (Sales) for testing.
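A quick way to check the split sizes and the effect of random_state is to run the split twice. The sketch below uses ten made-up rows (generated with NumPy, not taken from the article's dataset) so that a 20% test split yields two rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical rows following Sales = 10 * Temperature - 150
temps = np.arange(20, 40, 2)
df = pd.DataFrame({"Temperature": temps, "Sales": 10 * temps - 150})

X = df[["Temperature"]]
y = df["Sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Splitting again with the same random_state selects the same rows
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
print(X_test.index.equals(X_test2.index))
```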

Step 5: Create and Train the Linear Regression Model

model = LinearRegression()
model.fit(X_train, y_train)

Functionality:
  • model = LinearRegression(): Creates an instance of the LinearRegression class.
  • fit(X_train, y_train): Trains the model using the training data (X_train and y_train). The model learns the relationship between Temperature and Sales.
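After fitting, the learned slope and intercept (the m and c from the line equation) are available as the coef_ and intercept_ attributes. The sketch below fits on all five example rows, which is an assumption made to keep the snippet short, and prints the resulting equation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})

model = LinearRegression()
model.fit(df[["Temperature"]], df["Sales"])

slope = model.coef_[0]        # one coefficient per feature; here there is one feature
intercept = model.intercept_  # value of Sales when Temperature is zero

print(f"Sales = {slope:.2f} * Temperature + ({intercept:.2f})")
```

Because the example rows lie exactly on a straight line, the fitted slope comes out at approximately 10 and the intercept at approximately -150.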

Step 6: Make Predictions

 

y_pred = model.predict(X_test)

Functionality:
  • predict(X_test): Uses the trained model to predict Sales for the test data (X_test).
  • y_pred: Contains the predicted values of Sales for the test data.
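The same predict call works for temperatures that are not in the dataset at all. In this sketch the model is fitted on the five example rows (an assumption for brevity), and the new temperatures 33 and 38 are arbitrary values picked for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})
model = LinearRegression().fit(df[["Temperature"]], df["Sales"])

# New inputs must have the same column layout as the training features
new_temps = pd.DataFrame({"Temperature": [33, 38]})
new_sales = model.predict(new_temps)

for t, s in zip(new_temps["Temperature"], new_sales):
    print(f"{t} degrees -> predicted sales of about {s:.0f} units")
```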

Step 7: Evaluate the Model

 

r_squared = model.score(X_test, y_test)
print(f'R-squared: {r_squared:.2f}')

Functionality:
  • score(X_test, y_test): Calculates the R-squared value on the test set, which measures how much of the variance in the target variable the model explains.
    • An R-squared of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean. On held-out data the value can even be negative.
  • print(f'R-squared: {r_squared:.2f}'): Displays the R-squared value rounded to 2 decimal places.
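R-squared can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this for made-up actual and predicted values and compares the result with scikit-learn's r2_score; the second set of predictions is deliberately poor to show a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual values and two sets of predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_good = np.array([110.0, 140.0, 210.0, 240.0])
y_bad = np.array([250.0, 200.0, 150.0, 100.0])


def r_squared(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)       # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


print(f"good predictions: {r_squared(y_true, y_good):.3f}")
print(f"bad predictions:  {r_squared(y_true, y_bad):.3f}")
print(f"scikit-learn:     {r2_score(y_true, y_good):.3f}")
```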

Step 8: Visualize the Results

plt.scatter(X, y, color='blue', label='All Data Points')
plt.plot(X_test, y_pred, color='red', label='Regression Line')
plt.xlabel('Temperature')
plt.ylabel('Sales')
plt.title('Ice Cream Sales vs Temperature')
plt.legend()
plt.show()

Functionality:
  • scatter(X, y, color='blue', label='All Data Points'):
    • Plots all data points (Temperature vs Sales) as blue dots.
    • label='All Data Points': Adds a label for the legend.
  • plot(X_test, y_pred, color='red', label='Regression Line'):
    • Plots the regression line (predicted values) as a red line.
    • label='Regression Line': Adds a label for the legend.
  • xlabel('Temperature'): Labels the x-axis as “Temperature”.
  • ylabel('Sales'): Labels the y-axis as “Sales”.
  • title('Ice Cream Sales vs Temperature'): Adds a title to the plot.
  • legend(): Displays the legend to differentiate between data points and the regression line.
  • show(): Displays the plot.
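One possible variation is to draw the regression line across the full temperature range rather than only at the test points, and to save the figure to a file. The sketch below assumes the five example rows, fits on all of them, and uses a hypothetical output filename; the Agg backend line is there only so the snippet runs without a display and can be removed for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Sales": [100, 150, 200, 250, 300],
})
X = df[["Temperature"]]
y = df["Sales"]
model = LinearRegression().fit(X, y)

# Evenly spaced temperatures covering the observed range
grid = pd.DataFrame({
    "Temperature": np.linspace(X["Temperature"].min(), X["Temperature"].max(), 50)
})

fig, ax = plt.subplots()
ax.scatter(X["Temperature"], y, color="blue", label="All Data Points")
ax.plot(grid["Temperature"], model.predict(grid), color="red", label="Regression Line")
ax.set_xlabel("Temperature")
ax.set_ylabel("Sales")
ax.set_title("Ice Cream Sales vs Temperature")
ax.legend()

# Hypothetical output filename
fig.savefig("ice_cream_regression.png")
```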