Programmer's Picnic — Regression Master Lesson

Ground zero to practical implementation with derivation, charts, code, and quiz

📈 Regression • Correlation • Best Degree Fit

Linear Regression using Basic Python, NumPy Polyfit, and SciPy

In this lesson, we will understand what regression is, why we need it, how the straight-line equation is derived, how to implement it using plain Python, NumPy, and SciPy, and how to search for the best polynomial degree for a dataset. We will also compute the coefficient of correlation and compare the three approaches clearly.

4
Main implementations
1
Best degree finder
2
Interactive charts
10
MCQs with score

1) What is regression?

Regression tries to find a mathematical relationship between input and output.

Regression is a method used to model the relationship between variables. Usually, we have an input variable x and an output variable y. We want to find a function that connects them.

In the simplest case, we assume the relationship is linear. That means we believe the points roughly follow a straight line. Then we try to find the best straight line that passes as close as possible to the data.

y = mx + c

Here, m is the slope and c is the intercept. The slope tells us how much y changes when x increases by 1. The intercept tells us the value of y when x is 0.

Simple intuition

Suppose a student studies more hours and usually gets better marks. The exact marks may not lie perfectly on a straight line, but the overall trend may still be roughly linear.

  • x = study hours
  • y = marks scored
  • Regression line = best trend line
Regression is about fitting a function. Correlation is about measuring how strongly two variables move together.

2) Why do we need regression?

Because real data is noisy, and we still want a useful trend.

Prediction

Once we fit a line or polynomial, we can predict the output for a new input value. If we know future x, we can estimate future y.

Understanding trend

Regression tells us whether the data is generally rising, falling, or curved. It gives numerical meaning to a visible pattern.

Model comparison

We can compare different models, such as a line, a quadratic curve, or a cubic curve, and see which one matches the data better.

3) Equation of a straight line

This is the core model for linear regression.

y = mx + c
  • y = predicted output
  • x = input
  • m = slope of the line
  • c = y-intercept

If the slope is positive, the line rises from left to right. If the slope is negative, the line falls. If the slope is zero, the line is flat.

Example

y = 2x + 3

Here the slope is 2. That means when x increases by 1, y increases by 2. The intercept is 3. That means when x = 0, y = 3.

If x = 5, then y = 2 × 5 + 3 = 13.

4) Derivation of linear regression formulas

We derive the best line by minimizing total squared error.

Suppose we have points:

(x₁, y₁), (x₂, y₂), (x₃, y₃), ... , (xₙ, yₙ)

We want a line:

ŷ = mx + c

The prediction for each point is:

ŷᵢ = mxᵢ + c

The error for each point is:

eᵢ = yᵢ - ŷᵢ = yᵢ - (mxᵢ + c)

We square each error so that negative and positive errors do not cancel each other. Then we add them:

S = Σ (yᵢ - mxᵢ - c)²

We want the values of m and c that make S as small as possible. This is called the least squares method.

∂S / ∂c = -2 Σ (yᵢ - mxᵢ - c) = 0
Σy = mΣx + nc
c = (Σy - mΣx) / n
∂S / ∂m = -2 Σ xᵢ(yᵢ - mxᵢ - c) = 0
Σxy = mΣx² + cΣx

Now solve the two equations together and we get:

m = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²]
c = [Σy - mΣx] / n
These two formulas are the heart of manual linear regression in basic Python.

5) Coefficient of correlation and R²

These help us judge how well the model matches the data.

Pearson correlation coefficient

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √(Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²)

The value of r lies between -1 and +1.

  • r close to +1 means strong positive linear relationship
  • r close to -1 means strong negative linear relationship
  • r close to 0 means weak linear relationship

Coefficient of determination

R² = 1 - (SSres / SStotal)
SSres = Σ(yᵢ - ŷᵢ)²
SStotal = Σ(yᵢ - ȳ)²

R² tells us what fraction of the variation in y is explained by the model. Closer to 1 usually means a better fit on the current dataset.

6) Comparing the three methods

All three can fit a line, but they differ in convenience and depth.

Method Main idea Good for Limitation
Basic Python Use manual formulas for slope and intercept Learning the math and interviews More code and no advanced statistics built in
NumPy polyfit Use least squares polynomial fitting Quick fitting, polynomial models, concise code You compute many metrics yourself
SciPy linregress Specialized linear regression function Linear regression with r, p-value, std errors Mainly for linear fit, not general polynomial search

7) Worked example by hand

Let us calculate slope and intercept for a small dataset.

Dataset

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 5, 7]

First compute: n, Σx, Σy, Σxy, Σx²

n = 6
Σx = 1+2+3+4+5+6 = 21
Σy = 2+4+5+4+5+7 = 27
Σxy = 2+8+15+16+25+42 = 108
Σx² = 1+4+9+16+25+36 = 91

Now apply the formula

m = [6×108 - 21×27] / [6×91 - 21²]
m = (648 - 567) / (546 - 441)
m = 81 / 105 = 0.77142857...
c = [27 - (0.77142857 × 21)] / 6
c = (27 - 16.2) / 6
c = 1.8
So the fitted line is approximately: y = 0.7714x + 1.8

8) Interactive regression playground

Enter your own x and y values, then compare the methods and see the chart.

Ready to run

The lesson will calculate manual linear regression, NumPy-style polynomial fitting, SciPy-style statistics, and best polynomial degree match.

Basic Python line
NumPy degree 1 line
SciPy line
Best degree
Best coefficient of correlation
Predicted y for given x

Data and fitted curves

Points are the dataset. The straight line is the linear fit. The curved line is the best polynomial fit found.

Degree vs fit quality

This chart shows how the score changes with polynomial degree. High training fit is useful, but too much degree can overfit.

Basic Python result

Run the demo to see details.

NumPy-style result

Run the demo to see details.

SciPy-style result

Run the demo to see details.

9) How the best degree finder works

We test several polynomial degrees and choose the best fit score.

Idea

We start from degree 1 and go up to a chosen maximum degree. For each degree, we:

  • fit the polynomial
  • calculate predictions on the given data
  • compute the coefficient of correlation between actual and predicted y
  • compute R²
  • keep the best one

Important warning

A very high degree can memorize the training data too closely. That can look excellent on the current points but perform badly on new points. So “best degree” on training data is not always best for real prediction.

For stronger model selection, use a validation set or cross-validation.

10A) Should we also learn numpy.polynomial?

Yes. It is the modern NumPy polynomial API and is worth knowing.

Why it matters

NumPy's older np.polyfit is still widely used and very useful. But NumPy also provides the newer numpy.polynomial API. In many new projects, Polynomial.fit is the better modern choice.

A simple rule is this: learn np.polyfit because you will see it everywhere, and also learn Polynomial.fit because it is the modern NumPy direction.

Side-by-side modern example

import numpy as np
from numpy.polynomial import Polynomial

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7], dtype=float)

# Older common style
coeffs = np.polyfit(x, y, 1)
poly_old = np.poly1d(coeffs)
print("np.polyfit slope, intercept:", coeffs)
print("Prediction at x=7:", poly_old(7))

# Modern NumPy polynomial API
model = Polynomial.fit(x, y, deg=1)
print("Scaled-domain model:", model)

# Convert to ordinary power basis if you want readable coefficients
model_power = model.convert()
print("Power-basis coefficients [intercept, slope]:", model_power.coef)
print("Prediction at x=7:", model(7))

Notice that Polynomial.fit returns a polynomial object directly. After convert(), the coefficients are shown in increasing-power order: intercept first, then slope, then higher terms.

10) Full Python code

This single script compares all methods and finds the best degree. Below it, you also get a fourth modern NumPy example using Polynomial.fit.

import math
import numpy as np
from scipy import stats

# =========================================================
# SAMPLE DATA
# =========================================================
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]


# =========================================================
# UTILITY FUNCTIONS
# =========================================================
def mean(arr):
    return sum(arr) / len(arr)


def ss_total(y_true):
    y_bar = mean(y_true)
    return sum((yi - y_bar) ** 2 for yi in y_true)


def ss_residual(y_true, y_pred):
    return sum((yi - yp) ** 2 for yi, yp in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    sst = ss_total(y_true)
    if sst == 0:
        return 1.0
    ssr = ss_residual(y_true, y_pred)
    return 1 - (ssr / sst)


def pearson_r_basic(a, b):
    if len(a) != len(b):
        raise ValueError("Both arrays must have the same length.")
    if len(a) < 2:
        raise ValueError("At least 2 points are required.")

    ma = mean(a)
    mb = mean(b)

    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den_a = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    den_b = math.sqrt(sum((bi - mb) ** 2 for bi in b))

    if den_a == 0 or den_b == 0:
        return 0.0

    return num / (den_a * den_b)


# =========================================================
# 1) BASIC PYTHON LINEAR REGRESSION
# =========================================================
def linear_regression_basic(x, y):
    if len(x) != len(y):
        raise ValueError("x and y must have same length.")
    n = len(x)
    if n < 2:
        raise ValueError("At least 2 data points are required.")

    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("Cannot fit line: all x values may be identical.")

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n

    y_pred = [slope * xi + intercept for xi in x]
    r2 = r_squared(y, y_pred)
    r = pearson_r_basic(x, y)

    return {
        "method": "Basic Python Linear Regression",
        "slope": slope,
        "intercept": intercept,
        "y_pred": y_pred,
        "r": r,
        "r2": r2,
    }


# =========================================================
# 2) NUMPY POLYFIT / POLY1D
# =========================================================
def regression_numpy_polyfit(x, y, degree=1):
    x_np = np.array(x, dtype=float)
    y_np = np.array(y, dtype=float)

    coeffs = np.polyfit(x_np, y_np, degree)
    poly = np.poly1d(coeffs)
    y_pred = poly(x_np)

    r = float(np.corrcoef(y_np, y_pred)[0, 1]) if len(y_np) > 1 else 0.0
    r2 = float(r_squared(list(y_np), list(y_pred)))

    return {
        "method": f"NumPy polyfit degree {degree}",
        "degree": degree,
        "coefficients": coeffs,
        "poly1d": poly,
        "y_pred": y_pred,
        "r": r,
        "r2": r2,
    }


# =========================================================
# 3) SCIPY LINEAR REGRESSION
# =========================================================
def regression_scipy_linregress(x, y):
    result = stats.linregress(x, y)
    y_pred = [result.slope * xi + result.intercept for xi in x]
    r2 = result.rvalue ** 2

    return {
        "method": "SciPy stats.linregress",
        "slope": result.slope,
        "intercept": result.intercept,
        "r": result.rvalue,
        "r2": r2,
        "p_value": result.pvalue,
        "slope_stderr": result.stderr,
        "intercept_stderr": result.intercept_stderr,
        "y_pred": y_pred,
    }


# =========================================================
# 4) FIND BEST DEGREE
# =========================================================
def find_best_degree(x, y, min_degree=1, max_degree=6):
    x_np = np.array(x, dtype=float)
    y_np = np.array(y, dtype=float)

    if len(x_np) != len(y_np):
        raise ValueError("x and y must have same length.")
    if len(x_np) < 2:
        raise ValueError("At least 2 data points are required.")

    results = []

    for deg in range(min_degree, max_degree + 1):
        if deg >= len(x_np):
            continue

        coeffs = np.polyfit(x_np, y_np, deg)
        poly = np.poly1d(coeffs)
        y_pred = poly(x_np)

        r = float(np.corrcoef(y_np, y_pred)[0, 1]) if len(y_np) > 1 else 0.0
        r2 = float(r_squared(list(y_np), list(y_pred)))

        results.append({
            "degree": deg,
            "coefficients": coeffs,
            "poly1d": poly,
            "y_pred": y_pred,
            "r": r,
            "abs_r": abs(r),
            "r2": r2,
        })

    if not results:
        raise ValueError("No valid degree could be tested.")

    best = max(results, key=lambda item: (item["r2"], item["abs_r"]))

    return {
        "all_results": results,
        "best": best,
    }


# =========================================================
# 5) COMPARISON DRIVER
# =========================================================
def compare_regression_methods(x, y, max_degree=6):
    basic = linear_regression_basic(x, y)
    numpy_linear = regression_numpy_polyfit(x, y, degree=1)
    scipy_linear = regression_scipy_linregress(x, y)
    degree_search = find_best_degree(x, y, min_degree=1, max_degree=max_degree)

    print("=" * 70)
    print("1) BASIC PYTHON LINEAR REGRESSION")
    print("=" * 70)
    print(f"Slope     : {basic['slope']:.6f}")
    print(f"Intercept : {basic['intercept']:.6f}")
    print(f"r         : {basic['r']:.6f}")
    print(f"R^2       : {basic['r2']:.6f}")
    print()

    print("=" * 70)
    print("2) NUMPY POLYFIT LINEAR REGRESSION (DEGREE 1)")
    print("=" * 70)
    print("Coefficients [slope, intercept]:")
    print(numpy_linear["coefficients"])
    print(f"r         : {numpy_linear['r']:.6f}")
    print(f"R^2       : {numpy_linear['r2']:.6f}")
    print()

    print("=" * 70)
    print("3) SCIPY STATS LINREGRESS")
    print("=" * 70)
    print(f"Slope               : {scipy_linear['slope']:.6f}")
    print(f"Intercept           : {scipy_linear['intercept']:.6f}")
    print(f"r                   : {scipy_linear['r']:.6f}")
    print(f"R^2                 : {scipy_linear['r2']:.6f}")
    print(f"p-value             : {scipy_linear['p_value']:.6f}")
    print(f"Slope Std Error     : {scipy_linear['slope_stderr']:.6f}")
    print(f"Intercept Std Error : {scipy_linear['intercept_stderr']:.6f}")
    print()

    print("=" * 70)
    print("4) BEST DEGREE SEARCH")
    print("=" * 70)
    print("Degree-wise summary:")
    for item in degree_search["all_results"]:
        print(
            f"Degree {item['degree']:>2} | "
            f"r = {item['r']:.6f} | "
            f"|r| = {item['abs_r']:.6f} | "
            f"R^2 = {item['r2']:.6f}"
        )

    best = degree_search["best"]
    print()
    print("BEST DEGREE FOUND")
    print(f"Best degree                 : {best['degree']}")
    print(f"Best coefficient of corr. r : {best['r']:.6f}")
    print(f"Best |r|                    : {best['abs_r']:.6f}")
    print(f"Best R^2                    : {best['r2']:.6f}")
    print("Best polynomial coefficients:")
    print(best["coefficients"])
    print("Polynomial:")
    print(best["poly1d"])

    return {
        "basic": basic,
        "numpy_linear": numpy_linear,
        "scipy_linear": scipy_linear,
        "degree_search": degree_search,
    }


# =========================================================
# RUN
# =========================================================
if __name__ == "__main__":
    output = compare_regression_methods(x, y, max_degree=5)

10B) Fourth implementation: NumPy Polynomial.fit

This keeps the lesson current with the newer NumPy polynomial API.

import numpy as np
from numpy.polynomial import Polynomial

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1], dtype=float)

# Fit degree-1 polynomial using the modern API
model = Polynomial.fit(x, y, deg=1)

# Predict on existing x values
y_pred = model(x)

# Convert to ordinary power basis for readable coefficients
power_model = model.convert()
intercept = power_model.coef[0]
slope = power_model.coef[1]

# Correlation between actual and predicted y
r = np.corrcoef(y, y_pred)[0, 1]

# R^2
ss_total = np.sum((y - np.mean(y)) ** 2)
ss_res = np.sum((y - y_pred) ** 2)
r2 = 1 - (ss_res / ss_total)

print("Polynomial.fit model:", model)
print("Converted power-basis model:", power_model)
print("Intercept:", intercept)
print("Slope:", slope)
print("r:", r)
print("R^2:", r2)
print("Prediction at x=9:", model(9))

What to notice

  • Polynomial.fit gives you a fitted polynomial object directly.
  • model.convert() changes it into the ordinary power basis.
  • The converted coefficients come as [intercept, slope, ...].
  • Predictions can be made simply by calling the model like a function.

11) MCQ quiz

Test whether the concepts are clear.

Not submitted yet
Your score: 0 / 10

Read the explanations carefully. They are part of the lesson.

12) Final summary

What to remember from this lesson.

Basic Python

Use this when you want to understand the derivation and the real formulas behind linear regression.

NumPy polyfit and Polynomial.fit

Use np.polyfit when you want fast numerical fitting and when you want to test higher polynomial degrees easily. Also learn Polynomial.fit as the newer NumPy polynomial API.

SciPy linregress

Use this when you want linear regression plus useful statistics such as r-value, p-value, and standard errors.

The line itself is only one part of the story. Always check fit quality, understand what the coefficients mean, and be careful with very high polynomial degrees.