1) What is regression?
Regression tries to find a mathematical relationship between input and output.
Regression is a method used to model the relationship between variables. Usually, we have an input variable x and an output variable y. We want to find a function that connects them.
In the simplest case, we assume the relationship is linear. That means we believe the points roughly follow a straight line. Then we try to find the best straight line that passes as close as possible to the data.
Here, m is the slope and c is the intercept. The slope tells us how much y changes when x increases by 1. The intercept tells us the value of y when x is 0.
Simple intuition
Suppose a student studies more hours and usually gets better marks. The exact marks may not lie perfectly on a straight line, but the overall trend may still be roughly linear.
- x = study hours
- y = marks scored
- Regression line = best trend line
2) Why do we need regression?
Because real data is noisy, and we still want a useful trend.
Prediction
Once we fit a line or polynomial, we can predict the output for a new input value. If we know future x, we can estimate future y.
Understanding trend
Regression tells us whether the data is generally rising, falling, or curved. It gives numerical meaning to a visible pattern.
Model comparison
We can compare different models, such as a line, a quadratic curve, or a cubic curve, and see which one matches the data better.
3) Equation of a straight line
This is the core model for linear regression.
- y = predicted output
- x = input
- m = slope of the line
- c = y-intercept
If the slope is positive, the line rises from left to right. If the slope is negative, the line falls. If the slope is zero, the line is flat.
Example
Here the slope is 2. That means when x increases by 1, y increases by 2. The intercept is 3. That means when x = 0, y = 3.
If x = 5, then y = 2 × 5 + 3 = 13.
4) Derivation of linear regression formulas
We derive the best line by minimizing total squared error.
Suppose we have points:
We want a line:
The prediction for each point is:
The error for each point is:
We square each error so that negative and positive errors do not cancel each other. Then we add them:
We want the values of m and c that make S as small as possible. This is called the least squares method.
Now solve the two equations together and we get:
5) Coefficient of correlation and R²
These help us judge how well the model matches the data.
Pearson correlation coefficient
The value of r lies between -1 and +1.
- r close to +1 means strong positive linear relationship
- r close to -1 means strong negative linear relationship
- r close to 0 means weak linear relationship
Coefficient of determination
R² tells us what fraction of the variation in y is explained by the model. Closer to 1 usually means a better fit on the current dataset.
6) Comparing the three methods
All three can fit a line, but they differ in convenience and depth.
| Method | Main idea | Good for | Limitation |
|---|---|---|---|
| Basic Python | Use manual formulas for slope and intercept | Learning the math and interviews | More code and no advanced statistics built in |
| NumPy polyfit | Use least squares polynomial fitting | Quick fitting, polynomial models, concise code | You compute many metrics yourself |
| SciPy linregress | Specialized linear regression function | Linear regression with r, p-value, std errors | Mainly for linear fit, not general polynomial search |
7) Worked example by hand
Let us calculate slope and intercept for a small dataset.
Dataset
First compute: n, Σx, Σy, Σxy, Σx²
Now apply the formula
8) Interactive regression playground
Enter your own x and y values, then compare the methods and see the chart.
The lesson will calculate manual linear regression, NumPy-style polynomial fitting, SciPy-style statistics, and best polynomial degree match.
Data and fitted curves
Points are the dataset. The straight line is the linear fit. The curved line is the best polynomial fit found.
Degree vs fit quality
This chart shows how the score changes with polynomial degree. High training fit is useful, but too much degree can overfit.
Basic Python result
NumPy-style result
SciPy-style result
9) How the best degree finder works
We test several polynomial degrees and choose the best fit score.
Idea
We start from degree 1 and go up to a chosen maximum degree. For each degree, we:
- fit the polynomial
- calculate predictions on the given data
- compute the coefficient of correlation between actual and predicted y
- compute R²
- keep the best one
Important warning
For stronger model selection, use a validation set or cross-validation.
10A) Should we also learn numpy.polynomial?
Yes. It is the modern NumPy polynomial API and is worth knowing.
Why it matters
NumPy's older np.polyfit is still widely used and very useful.
But NumPy also provides the newer numpy.polynomial API.
In many new projects, Polynomial.fit is the better modern choice.
Side-by-side modern example
import numpy as np
from numpy.polynomial import Polynomial
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7], dtype=float)
# Older common style
coeffs = np.polyfit(x, y, 1)
poly_old = np.poly1d(coeffs)
print("np.polyfit slope, intercept:", coeffs)
print("Prediction at x=7:", poly_old(7))
# Modern NumPy polynomial API
model = Polynomial.fit(x, y, deg=1)
print("Scaled-domain model:", model)
# Convert to ordinary power basis if you want readable coefficients
model_power = model.convert()
print("Power-basis coefficients [intercept, slope]:", model_power.coef)
print("Prediction at x=7:", model(7))
Notice that Polynomial.fit returns a polynomial object directly.
After convert(), the coefficients are shown in increasing-power order:
intercept first, then slope, then higher terms.
10) Full Python code
This single script compares all methods and finds the best degree. Below it, you also get a fourth modern NumPy example using Polynomial.fit.
import math
import numpy as np
from scipy import stats
# =========================================================
# SAMPLE DATA
# =========================================================
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]
# =========================================================
# UTILITY FUNCTIONS
# =========================================================
def mean(arr):
return sum(arr) / len(arr)
def ss_total(y_true):
y_bar = mean(y_true)
return sum((yi - y_bar) ** 2 for yi in y_true)
def ss_residual(y_true, y_pred):
return sum((yi - yp) ** 2 for yi, yp in zip(y_true, y_pred))
def r_squared(y_true, y_pred):
sst = ss_total(y_true)
if sst == 0:
return 1.0
ssr = ss_residual(y_true, y_pred)
return 1 - (ssr / sst)
def pearson_r_basic(a, b):
if len(a) != len(b):
raise ValueError("Both arrays must have the same length.")
if len(a) < 2:
raise ValueError("At least 2 points are required.")
ma = mean(a)
mb = mean(b)
num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
den_a = math.sqrt(sum((ai - ma) ** 2 for ai in a))
den_b = math.sqrt(sum((bi - mb) ** 2 for bi in b))
if den_a == 0 or den_b == 0:
return 0.0
return num / (den_a * den_b)
# =========================================================
# 1) BASIC PYTHON LINEAR REGRESSION
# =========================================================
def linear_regression_basic(x, y):
if len(x) != len(y):
raise ValueError("x and y must have same length.")
n = len(x)
if n < 2:
raise ValueError("At least 2 data points are required.")
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
denominator = n * sum_x2 - sum_x ** 2
if denominator == 0:
raise ValueError("Cannot fit line: all x values may be identical.")
slope = (n * sum_xy - sum_x * sum_y) / denominator
intercept = (sum_y - slope * sum_x) / n
y_pred = [slope * xi + intercept for xi in x]
r2 = r_squared(y, y_pred)
r = pearson_r_basic(x, y)
return {
"method": "Basic Python Linear Regression",
"slope": slope,
"intercept": intercept,
"y_pred": y_pred,
"r": r,
"r2": r2,
}
# =========================================================
# 2) NUMPY POLYFIT / POLY1D
# =========================================================
def regression_numpy_polyfit(x, y, degree=1):
x_np = np.array(x, dtype=float)
y_np = np.array(y, dtype=float)
coeffs = np.polyfit(x_np, y_np, degree)
poly = np.poly1d(coeffs)
y_pred = poly(x_np)
r = float(np.corrcoef(y_np, y_pred)[0, 1]) if len(y_np) > 1 else 0.0
r2 = float(r_squared(list(y_np), list(y_pred)))
return {
"method": f"NumPy polyfit degree {degree}",
"degree": degree,
"coefficients": coeffs,
"poly1d": poly,
"y_pred": y_pred,
"r": r,
"r2": r2,
}
# =========================================================
# 3) SCIPY LINEAR REGRESSION
# =========================================================
def regression_scipy_linregress(x, y):
result = stats.linregress(x, y)
y_pred = [result.slope * xi + result.intercept for xi in x]
r2 = result.rvalue ** 2
return {
"method": "SciPy stats.linregress",
"slope": result.slope,
"intercept": result.intercept,
"r": result.rvalue,
"r2": r2,
"p_value": result.pvalue,
"slope_stderr": result.stderr,
"intercept_stderr": result.intercept_stderr,
"y_pred": y_pred,
}
# =========================================================
# 4) FIND BEST DEGREE
# =========================================================
def find_best_degree(x, y, min_degree=1, max_degree=6):
x_np = np.array(x, dtype=float)
y_np = np.array(y, dtype=float)
if len(x_np) != len(y_np):
raise ValueError("x and y must have same length.")
if len(x_np) < 2:
raise ValueError("At least 2 data points are required.")
results = []
for deg in range(min_degree, max_degree + 1):
if deg >= len(x_np):
continue
coeffs = np.polyfit(x_np, y_np, deg)
poly = np.poly1d(coeffs)
y_pred = poly(x_np)
r = float(np.corrcoef(y_np, y_pred)[0, 1]) if len(y_np) > 1 else 0.0
r2 = float(r_squared(list(y_np), list(y_pred)))
results.append({
"degree": deg,
"coefficients": coeffs,
"poly1d": poly,
"y_pred": y_pred,
"r": r,
"abs_r": abs(r),
"r2": r2,
})
if not results:
raise ValueError("No valid degree could be tested.")
best = max(results, key=lambda item: (item["r2"], item["abs_r"]))
return {
"all_results": results,
"best": best,
}
# =========================================================
# 5) COMPARISON DRIVER
# =========================================================
def compare_regression_methods(x, y, max_degree=6):
basic = linear_regression_basic(x, y)
numpy_linear = regression_numpy_polyfit(x, y, degree=1)
scipy_linear = regression_scipy_linregress(x, y)
degree_search = find_best_degree(x, y, min_degree=1, max_degree=max_degree)
print("=" * 70)
print("1) BASIC PYTHON LINEAR REGRESSION")
print("=" * 70)
print(f"Slope : {basic['slope']:.6f}")
print(f"Intercept : {basic['intercept']:.6f}")
print(f"r : {basic['r']:.6f}")
print(f"R^2 : {basic['r2']:.6f}")
print()
print("=" * 70)
print("2) NUMPY POLYFIT LINEAR REGRESSION (DEGREE 1)")
print("=" * 70)
print("Coefficients [slope, intercept]:")
print(numpy_linear["coefficients"])
print(f"r : {numpy_linear['r']:.6f}")
print(f"R^2 : {numpy_linear['r2']:.6f}")
print()
print("=" * 70)
print("3) SCIPY STATS LINREGRESS")
print("=" * 70)
print(f"Slope : {scipy_linear['slope']:.6f}")
print(f"Intercept : {scipy_linear['intercept']:.6f}")
print(f"r : {scipy_linear['r']:.6f}")
print(f"R^2 : {scipy_linear['r2']:.6f}")
print(f"p-value : {scipy_linear['p_value']:.6f}")
print(f"Slope Std Error : {scipy_linear['slope_stderr']:.6f}")
print(f"Intercept Std Error : {scipy_linear['intercept_stderr']:.6f}")
print()
print("=" * 70)
print("4) BEST DEGREE SEARCH")
print("=" * 70)
print("Degree-wise summary:")
for item in degree_search["all_results"]:
print(
f"Degree {item['degree']:>2} | "
f"r = {item['r']:.6f} | "
f"|r| = {item['abs_r']:.6f} | "
f"R^2 = {item['r2']:.6f}"
)
best = degree_search["best"]
print()
print("BEST DEGREE FOUND")
print(f"Best degree : {best['degree']}")
print(f"Best coefficient of corr. r : {best['r']:.6f}")
print(f"Best |r| : {best['abs_r']:.6f}")
print(f"Best R^2 : {best['r2']:.6f}")
print("Best polynomial coefficients:")
print(best["coefficients"])
print("Polynomial:")
print(best["poly1d"])
return {
"basic": basic,
"numpy_linear": numpy_linear,
"scipy_linear": scipy_linear,
"degree_search": degree_search,
}
# =========================================================
# RUN
# =========================================================
if __name__ == "__main__":
output = compare_regression_methods(x, y, max_degree=5)
10B) Fourth implementation: NumPy Polynomial.fit
This keeps the lesson current with the newer NumPy polynomial API.
import numpy as np
from numpy.polynomial import Polynomial
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1], dtype=float)
# Fit degree-1 polynomial using the modern API
model = Polynomial.fit(x, y, deg=1)
# Predict on existing x values
y_pred = model(x)
# Convert to ordinary power basis for readable coefficients
power_model = model.convert()
intercept = power_model.coef[0]
slope = power_model.coef[1]
# Correlation between actual and predicted y
r = np.corrcoef(y, y_pred)[0, 1]
# R^2
ss_total = np.sum((y - np.mean(y)) ** 2)
ss_res = np.sum((y - y_pred) ** 2)
r2 = 1 - (ss_res / ss_total)
print("Polynomial.fit model:", model)
print("Converted power-basis model:", power_model)
print("Intercept:", intercept)
print("Slope:", slope)
print("r:", r)
print("R^2:", r2)
print("Prediction at x=9:", model(9))
What to notice
Polynomial.fitgives you a fitted polynomial object directly.model.convert()changes it into the ordinary power basis.- The converted coefficients come as [intercept, slope, ...].
- Predictions can be made simply by calling the model like a function.
11) MCQ quiz
Test whether the concepts are clear.
Read the explanations carefully. They are part of the lesson.
12) Final summary
What to remember from this lesson.
Basic Python
Use this when you want to understand the derivation and the real formulas behind linear regression.
NumPy polyfit and Polynomial.fit
Use np.polyfit when you want fast numerical fitting and when you want to
test higher polynomial degrees easily. Also learn Polynomial.fit as the
newer NumPy polynomial API.
SciPy linregress
Use this when you want linear regression plus useful statistics such as r-value, p-value, and standard errors.