Implementing Simple Linear Regression in Python
In the previous parts of this tutorial series, we focused entirely on building intuition. We learned what Linear Regression is, why it is used for regression problems, how data is visualized using scatter plots, how the concept of a best-fit line emerges, and how the mathematical equation of a straight line can be used to make predictions.
At this point, you already understand the theory behind Linear Regression. The next step is to see how these ideas are implemented in Python.
One of the most exciting moments in Machine Learning is when a mathematical concept is transformed into a working model that can learn from data and make predictions. In this part, we will take our placement prediction example and build a complete Linear Regression model using Python and Scikit-Learn.
By the end of this tutorial, you will understand how to prepare the data, split it into training and testing sets, train a Linear Regression model, make predictions, and visualize the results.
The Dataset
Throughout this tutorial series, we have been using a placement prediction example where a student's CGPA is used to predict the expected placement package.
For simplicity, let us assume we have the following dataset.
| CGPA | Package (LPA) |
|---|---|
| 6.5 | 3.0 |
| 7.0 | 4.0 |
| 8.0 | 6.5 |
| 9.0 | 8.5 |
| 6.8 | 3.8 |
| 7.5 | 5.2 |
| 8.2 | 6.9 |
| 9.2 | 9.0 |
In a real-world scenario, the dataset would typically contain hundreds or thousands of records. However, a small dataset is sufficient for understanding the implementation process.
Installing the Required Libraries
Before writing any code, we need to install a few Python libraries.
pip install pandas numpy matplotlib scikit-learn
Once they are installed, we import the modules used in this tutorial.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Each library serves a specific purpose.
- Pandas is used for loading and manipulating data.
- NumPy provides numerical operations.
- Matplotlib helps visualize the data.
- train_test_split is used to divide the dataset into training and testing portions.
- LinearRegression provides Scikit-Learn's implementation of the Linear Regression algorithm.
Creating the Dataset in Python
For this tutorial, we will create the dataset directly in Python.
data = {
    "CGPA": [6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2],
    "Package": [3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0]
}
df = pd.DataFrame(data)
print(df)
Output:
CGPA Package
0 6.5 3.0
1 7.0 4.0
2 8.0 6.5
3 9.0 8.5
4 6.8 3.8
5 7.5 5.2
6 8.2 6.9
7 9.2 9.0
At this stage, the dataset exists as a Pandas DataFrame.
Understanding Features and Target Variables
Before training a machine learning model, we must separate the input and output variables.
In our placement prediction problem:
- CGPA is the input feature.
- Package is the target variable.
The feature contains the information provided to the model, while the target contains the values that the model must learn to predict.
X = df[["CGPA"]]
y = df["Package"]
Notice the double brackets around "CGPA".
X = df[["CGPA"]]
This ensures that X remains a two-dimensional structure, which is what Scikit-Learn expects.
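To make the difference between single and double brackets concrete, the short sketch below rebuilds the tutorial's DataFrame and prints the type and shape of each selection. The variable names `X_2d` and `x_1d` are illustrative and are not used elsewhere in the tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    "CGPA": [6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2],
    "Package": [3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0],
})

X_2d = df[["CGPA"]]  # a list of column names returns a DataFrame (rows x columns)
x_1d = df["CGPA"]    # a single column name returns a Series (rows only)

print(type(X_2d).__name__, X_2d.shape)  # DataFrame (8, 1)
print(type(x_1d).__name__, x_1d.shape)  # Series (8,)
```

The `(8, 1)` shape reads as "8 samples, 1 feature", which is the layout Scikit-Learn estimators expect for their input.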
Why Do We Split the Dataset?
Many beginners wonder why we cannot simply train a model using the entire dataset. The reason is that our goal is not to memorize historical data. Our goal is to determine whether the model can make accurate predictions on unseen data.
Imagine a student preparing for an exam. If the student only memorizes questions they have already seen, it becomes impossible to determine whether they truly understand the subject. Similarly, if we train and evaluate the model on the same data, we cannot measure its ability to generalize.
To solve this problem, the dataset is divided into:
- Training Data
- Testing Data
The training data teaches the model. The testing data evaluates the model.
Train-Test Split
Scikit-Learn provides a convenient function for splitting datasets.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
Let us understand each parameter.
test_size=0.2
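This means:
80% → Training Data
20% → Testing Data
The model learns from roughly 80% of the records and is evaluated on the remaining 20%. With only eight rows, 20% works out to 1.6 records, which Scikit-Learn rounds up to 2, so here the model trains on 6 records and is tested on 2.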
random_state=42
Dataset splitting involves randomness. Without a fixed random state, the training and testing records may change every time the program runs.
Setting:
random_state=42
ensures reproducible results.
Whenever someone runs the code, the same records will be selected for training and testing.
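The following sketch checks both parameters on the tutorial's eight-row dataset: it prints how many rows end up in each portion, then repeats the split with the same seed to confirm that the same rows are selected. The names `X_test_again` and `same_split` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "CGPA": [6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2],
    "Package": [3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0],
})
X = df[["CGPA"]]
y = df["Package"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 6 2

# Repeat the split with the same seed and compare which rows were held out.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = list(X_test.index) == list(X_test_again.index)
print(same_split)  # True
```

Changing or removing `random_state` is an easy way to see how much a model trained on such a small dataset depends on which rows happen to be held out.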
Visualizing the Data
Before training the model, it is often useful to visualize the data.
plt.scatter(X, y)
plt.xlabel("CGPA")
plt.ylabel("Package (LPA)")
plt.title("CGPA vs Package")
plt.show()
The resulting scatter plot should look similar to the visualizations we discussed in Part 2. The points should generally move upward as CGPA increases, indicating a positive relationship between CGPA and package. This upward trend suggests that Linear Regression may be an appropriate model.
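The visual impression of an upward trend can also be expressed as a number. The sketch below uses the Pearson correlation coefficient, computed with pandas. This check is an addition for illustration and not a required step of the workflow; a value close to +1 indicates a strong positive linear relationship.

```python
import pandas as pd

df = pd.DataFrame({
    "CGPA": [6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2],
    "Package": [3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0],
})

# Series.corr uses the Pearson method by default.
r = df["CGPA"].corr(df["Package"])
print(f"Correlation between CGPA and Package: {r:.3f}")
```

For this dataset the coefficient is very close to 1, which agrees with what the scatter plot shows.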
Creating the Linear Regression Model
Now comes the actual Machine Learning step. We create a Linear Regression model using Scikit-Learn.
model = LinearRegression()
At this stage, the model exists but has not learned anything yet. Think of it as an empty notebook waiting to learn patterns from data.
Training the Model
Training is the process of allowing the model to learn the relationship between the feature and target variable.
model.fit(X_train, y_train)
This single line performs all the mathematics internally.
During training, the algorithm:
- Examines the training data.
- Finds the best-fit line.
- Calculates the optimal slope.
- Calculates the optimal intercept.
- Stores these values inside the model.
After training, the model knows how to convert CGPA values into package predictions.
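For a single feature, the best-fit slope and intercept can be written with the standard least-squares formulas: the slope is the sum of products of the deviations of CGPA and package from their means, divided by the sum of squared CGPA deviations, and the intercept follows from the two means. The sketch below computes both by hand with NumPy and compares them with Scikit-Learn's result. It is an illustration of the idea, not a description of Scikit-Learn's internal solver. Note that it fits on all eight rows rather than on the training split, so the numbers will not match the example outputs in this tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

cgpa = np.array([6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2])
package = np.array([3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0])

# Closed-form least-squares estimates for one feature.
x_mean, y_mean = cgpa.mean(), package.mean()
slope = np.sum((cgpa - x_mean) * (package - y_mean)) / np.sum((cgpa - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Scikit-Learn expects a 2-D feature array, hence the reshape.
model = LinearRegression().fit(cgpa.reshape(-1, 1), package)

print("By hand:     ", slope, intercept)
print("Scikit-Learn:", model.coef_[0], model.intercept_)
```

On these eight rows both approaches give a slope of about 2.22 and an intercept of about -11.40.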
Viewing the Learned Equation
One of the advantages of Linear Regression is that we can inspect the learned parameters.
Slope
print(model.coef_)
Example output (illustrative; the exact value depends on which rows end up in the training set):
[1.92]
This means that each one-point increase in CGPA raises the predicted package by approximately 1.92 LPA.
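Intercept
print(model.intercept_)
Example output (illustrative):
-9.45
The intercept is the predicted package at a CGPA of 0. No student in the data is anywhere near that value, so the negative number is simply where the line crosses the vertical axis and should not be read as a real salary.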
Learned Equation
Combining both values gives:
Package = 1.92 × CGPA - 9.45
This is the equation discovered by the algorithm. Notice how the model has automatically learned the slope and intercept from historical data.
Making Predictions
Now that the model has learned the relationship, we can use it to make predictions. Suppose a new student has a CGPA of 8.3.
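# Pass a DataFrame with the same column name used in training;
# a plain list works too but triggers a feature-name warning.
prediction = model.predict(pd.DataFrame({"CGPA": [8.3]}))
print(prediction)
Example output (illustrative):
[6.48]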
This means the model predicts a placement package of approximately 6.48 LPA. The prediction process is exactly what we discussed in the mathematical representation tutorial. The model simply substitutes the CGPA into the learned equation and calculates the result.
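The claim that prediction is just substitution into the learned equation can be checked directly. The sketch below trains the model as in this tutorial, then compares `model.predict` with a manual calculation from `coef_` and `intercept_`. The names `by_model` and `by_hand` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "CGPA": [6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2],
    "Package": [3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["CGPA"]], df["Package"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

new_student = pd.DataFrame({"CGPA": [8.3]})
by_model = model.predict(new_student)[0]

# Same calculation written out: slope * CGPA + intercept.
by_hand = model.coef_[0] * 8.3 + model.intercept_

print(by_model, by_hand)
```

The two printed numbers agree up to floating-point rounding.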
Predicting Multiple Students
The model can predict packages for multiple students simultaneously.
new_students = pd.DataFrame({"CGPA": [7.2, 8.5, 9.1]})
predictions = model.predict(new_students)
print(predictions)
Example output (illustrative; NumPy prints arrays without commas):
[4.37 6.87 8.02]
This capability allows organizations to make predictions for thousands of records in seconds.
Predicting the Test Dataset
We can also ask the model to predict the packages of students in the testing dataset.
y_pred = model.predict(X_test)
At this stage:
y_test → Actual Values
y_pred → Predicted Values
Comparing these values allows us to evaluate the model's performance.
Comparing Actual and Predicted Values
A simple comparison can be created using Pandas.
result = pd.DataFrame({
    "Actual": y_test,
    "Predicted": y_pred
})
print(result)
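Example output (illustrative; the left-hand column holds each test row's original DataFrame index, so your labels need not be 0 and 1):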
Actual Predicted
0 6.90 6.48
1 8.50 8.20
The values may not match exactly. This is expected because real-world data contains noise and uncertainty. The goal of Linear Regression is not perfect prediction. The goal is to capture the underlying trend as accurately as possible.
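A side-by-side table can be condensed into a single number. The sketch below uses the mean absolute error (MAE), the average size of the gap between actual and predicted packages. MAE is chosen here only as a simple illustration, and with just two test rows the result is a rough indication rather than a reliable estimate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "CGPA": [6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2],
    "Package": [3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["CGPA"]], df["Package"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-row error: actual minus predicted.
errors = y_test.to_numpy() - y_pred

mae_manual = np.mean(np.abs(errors))
mae_sklearn = mean_absolute_error(y_test, y_pred)

print("Errors:", errors)
print("MAE (manual):      ", mae_manual)
print("MAE (scikit-learn):", mae_sklearn)
```

An MAE of 0.3, for example, would mean the predictions are off by 0.3 LPA on average.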
Visualizing the Regression Line
One of the most satisfying parts of Linear Regression is visualizing the learned line.
plt.scatter(X, y)
plt.plot(
    X,
    model.predict(X),
    color="red"
)
plt.xlabel("CGPA")
plt.ylabel("Package (LPA)")
plt.title("Linear Regression Best-Fit Line")
plt.show()
The graph contains:
- Blue points representing actual students.
- A red line representing the learned regression model.
This red line is the same best-fit line that we spent the previous tutorials discussing conceptually. Now we can finally see the algorithm discovering it automatically.
Understanding What Happened Internally
Although the implementation required only a few lines of code, a significant amount of mathematics happened behind the scenes.
When we called:
model.fit(X_train, y_train)
the algorithm:
- Analyzed the training records.
- Looked for the line that minimizes the total squared prediction error.
- Calculated the slope of that line.
- Calculated its intercept.
- Stored both values as the best-fit line.
Once training completed, the model stored the equation internally. Every future prediction uses that equation. This is the essence of supervised learning. Historical examples are used to learn a pattern, and that pattern is then applied to future data.
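To see what "minimizing prediction error" means in practice, the sketch below compares the sum of squared errors of the fitted line with that of two slightly altered lines. The helper `sum_squared_error` and the offsets 0.3 and 0.5 are arbitrary choices for illustration, and the model is fitted on all eight rows to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

cgpa = np.array([6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2])
package = np.array([3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0])

model = LinearRegression().fit(cgpa.reshape(-1, 1), package)


def sum_squared_error(slope, intercept):
    """Total squared gap between actual packages and a line's predictions."""
    predicted = slope * cgpa + intercept
    return float(np.sum((package - predicted) ** 2))


best = sum_squared_error(model.coef_[0], model.intercept_)
steeper = sum_squared_error(model.coef_[0] + 0.3, model.intercept_)
shifted = sum_squared_error(model.coef_[0], model.intercept_ + 0.5)

print("Fitted line: ", best)
print("Steeper line:", steeper)
print("Shifted line:", shifted)
```

Any change to the slope or intercept increases the total squared error, which is what makes the fitted line the best fit for this data.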
Complete Program
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Dataset
data = {
    "CGPA": [6.5, 7.0, 8.0, 9.0, 6.8, 7.5, 8.2, 9.2],
    "Package": [3.0, 4.0, 6.5, 8.5, 3.8, 5.2, 6.9, 9.0]
}
df = pd.DataFrame(data)
# Features and Target
X = df[["CGPA"]]
y = df["Package"]
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
# Create Model
model = LinearRegression()
# Train Model
model.fit(X_train, y_train)
# Slope
print("Slope:")
print(model.coef_)
# Intercept
print("\nIntercept:")
print(model.intercept_)
# Predict New Student (a DataFrame with the training column name avoids a feature-name warning)
prediction = model.predict(pd.DataFrame({"CGPA": [8.3]}))
print("\nPrediction for CGPA 8.3:")
print(prediction)
# Plot
plt.scatter(X, y)
plt.plot(
    X,
    model.predict(X),
    color="red"
)
plt.xlabel("CGPA")
plt.ylabel("Package (LPA)")
plt.title("Linear Regression")
plt.show()