What is Data Cleaning?
Data cleaning means preparing raw data so that it becomes useful for analysis and machine learning. Real-world data often contains missing values, duplicate rows, spelling mistakes, impossible values, wrong data types, and outliers.
Cleaning Data Using Basic Python Only
Before using pandas, it helps to do the cleaning with plain Python lists and dictionaries; this shows what pandas does for you internally.
Messy Dataset
students = [
    {"hours": 1, "score": 50},
    {"hours": 2, "score": 55},
    {"hours": 2, "score": 55},     # duplicate
    {"hours": None, "score": 65},  # missing hours
    {"hours": 4, "score": "bad"},  # bad score
    {"hours": 5, "score": 70},
]
Step 1: Remove Duplicates
unique_students = []
seen = set()
for row in students:
    key = (row["hours"], row["score"])
    if key not in seen:
        unique_students.append(row)
        seen.add(key)
print(unique_students)
Step 2: Remove Missing and Bad Values
clean_students = []
for row in unique_students:
    hours = row["hours"]
    score = row["score"]
    if hours is None or score is None:
        continue
    if not isinstance(hours, (int, float)):
        continue
    if not isinstance(score, (int, float)):
        continue
    clean_students.append(row)
print(clean_students)
Step 3: Separate x and y
x = []
y = []
for row in clean_students:
    x.append(row["hours"])
    y.append(row["score"])
print("x values:", x)
print("y values:", y)
Step 4: Basic Linear Regression Logic
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
numerator = 0
denominator = 0
for i in range(n):
    numerator += (x[i] - mean_x) * (y[i] - mean_y)
    denominator += (x[i] - mean_x) ** 2
slope = numerator / denominator
intercept = mean_y - slope * mean_x
print("Slope:", slope)
print("Intercept:", intercept)
prediction = intercept + slope * 7
print("Prediction for 7 hours:", prediction)
This is the ground-zero idea: remove duplicates, drop missing and bad values, separate the input x from the output y, and compute the regression line by hand.
Basic Cleaning Mindset
Before cleaning, always inspect the dataset. Do not blindly delete rows; first understand what is wrong. Given a DataFrame df, these calls summarize its structure, statistics, missing values, and duplicates:
import pandas as pd
import numpy as np
from numpy.polynomial import Polynomial

df.info()              # column dtypes and non-null counts
df.describe()          # summary statistics for numeric columns
df.isnull().sum()      # missing values per column
df.duplicated().sum()  # number of duplicate rows
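As a minimal, self-contained sketch of what those inspection calls report, consider a small DataFrame with one missing value and one duplicate row:

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 2, None],
    "score": [50, 55, 55, 65],
})

print(df.isnull().sum())      # hours: 1, score: 0
print(df.duplicated().sum())  # 1 duplicate row
```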
Linear Regression Idea
Linear regression tries to find a straight-line relationship between input x and output y.
# y = c + mx
model = Polynomial.fit(x, y, deg=1)
normal_model = model.convert()
print(normal_model.coef)
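One detail worth pinning down is the order of the returned coefficients: after convert(), coef[0] is the intercept c and coef[1] is the slope m. A small sketch with points lying exactly on y = 1 + 2x:

```python
import numpy as np
from numpy.polynomial import Polynomial

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]  # exactly y = 1 + 2x

model = Polynomial.fit(x, y, deg=1)
c, m = model.convert().coef
print(c, m)  # approximately 1.0 2.0
```

Note that convert() is needed because Polynomial.fit works in a scaled internal domain; convert() maps the coefficients back to the standard basis.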
Incomplete Data + Duplicate Rows
df1 = pd.DataFrame({
    "hours": [1, 2, 2, 3, None, 5],
    "score": [50, 55, 55, None, 65, 70]
})
df1 = df1.drop_duplicates()
df1 = df1.dropna()
x = df1["hours"]
y = df1["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
Here we remove duplicate rows and then remove rows where hours or score is missing.
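By default, dropna() removes a row if any column is missing. When only some columns matter, the subset parameter restricts the check; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, None, 3],
    "score": [50, 55, None],
})

# Drop only rows where "hours" is missing; a missing score is kept
kept = df.dropna(subset=["hours"])
print(len(kept))  # 2
```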
Bad Text Values Inside Numeric Columns
df2 = pd.DataFrame({
    "hours": [1, "two", 3, 4, 5],
    "score": [50, 55, "??", 65, 70]
})
df2["hours"] = pd.to_numeric(df2["hours"], errors="coerce")
df2["score"] = pd.to_numeric(df2["score"], errors="coerce")
df2 = df2.dropna()
x = df2["hours"]
y = df2["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
The function pandas.to_numeric converts valid values to numbers and, with errors="coerce", turns everything else into NaN; dropna() then removes those rows.
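A minimal sketch of that coercion on a standalone Series:

```python
import pandas as pd

s = pd.Series(["1", "two", "3.5"])
out = pd.to_numeric(s, errors="coerce")
print(out.tolist())  # [1.0, nan, 3.5]
```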
Outliers
df3 = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 100],
    "score": [50, 55, 60, 65, 70, 10]
})
df3 = df3[(df3["hours"] < 10) & (df3["score"] > 40)]
x = df3["hours"]
y = df3["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
Outliers are extreme values. Sometimes they are mistakes. Sometimes they are important signals. Expert cleaning means checking before removing.
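One common way to check before removing is the interquartile-range (IQR) rule: flag values more than 1.5 × IQR outside the quartiles, then decide what to do with them. A sketch on the hours column above:

```python
import pandas as pd

hours = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1
flagged = hours[(hours < q1 - 1.5 * iqr) | (hours > q3 + 1.5 * iqr)]
print(flagged.tolist())  # [100]
```

Flagging first, instead of filtering with hard-coded thresholds, lets you inspect the suspicious rows before dropping them.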
Mixed Problems
df4 = pd.DataFrame({
    "hours": [1, 2, 2, "three", None, 5],
    "score": [50, 55, 55, 60, "N/A", 70]
})
df4["hours"] = pd.to_numeric(df4["hours"], errors="coerce")
df4["score"] = pd.to_numeric(df4["score"], errors="coerce")
df4 = df4.drop_duplicates()
df4 = df4.fillna(df4.mean(numeric_only=True))
x = df4["hours"]
y = df4["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
This dataset contains duplicates, missing values, and text inside number columns. We use a complete cleaning sequence.
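One reason the sequence matters: coercing to numbers before dropping duplicates also catches duplicates that differ only in type, such as 2 and the string "2". A small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({"hours": [2, "2"], "score": [55, 55]})

# Deduplicating first: 2 and "2" look different, so both rows survive
print(len(df.drop_duplicates()))  # 2

# Coercing first makes the rows identical, so one is dropped
coerced = df.apply(pd.to_numeric, errors="coerce")
print(len(coerced.drop_duplicates()))  # 1
```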
Reusable Expert Cleaning Function
def clean_dataset(df):
    df = df.copy()
    # Remove duplicate rows
    df = df.drop_duplicates()
    # Convert everything possible to numbers
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Fill missing values with column mean
    df = df.fillna(df.mean(numeric_only=True))
    return df
df5 = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, "bad", None, 6],
    "score": [52, 54, 63, 64, 68, 72, "??", None, 72]
})
df5 = clean_dataset(df5)
x = df5["hours"]
y = df5["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
At the expert level, we do not repeat cleaning code again and again; we wrap the steps in a reusable function.
Practice in the Live Python Editor
Use the editor below to paste the examples, modify the datasets, add your own bad data, and test the cleaning pipeline.
One Full Program
import pandas as pd
import numpy as np
from numpy.polynomial import Polynomial
def clean_dataset(df):
    df = df.copy()
    df = df.drop_duplicates()
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.fillna(df.mean(numeric_only=True))
    return df
datasets = []
datasets.append(pd.DataFrame({
    "hours": [1, 2, 2, 3, None, 5],
    "score": [50, 55, 55, None, 65, 70]
}))
datasets.append(pd.DataFrame({
    "hours": [1, "two", 3, 4, 5],
    "score": [50, 55, "??", 65, 70]
}))
datasets.append(pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 100],
    "score": [50, 55, 60, 65, 70, 10]
}))
datasets.append(pd.DataFrame({
    "hours": [1, 2, 2, "three", None, 5],
    "score": [50, 55, 55, 60, "N/A", 70]
}))
datasets.append(pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, "bad", None, 6],
    "score": [52, 54, 63, 64, 68, 72, "??", None, 72]
}))
for i, df in enumerate(datasets, start=1):
    print("\nDATASET", i)
    print("Raw data:")
    print(df)

    clean_df = clean_dataset(df)

    # Simple outlier protection
    clean_df = clean_df[
        (clean_df["hours"] < clean_df["hours"].quantile(0.95)) |
        (clean_df["hours"] == clean_df["hours"].median())
    ]

    print("Clean data:")
    print(clean_df)

    x = clean_df["hours"]
    y = clean_df["score"]

    model = Polynomial.fit(x, y, deg=1)
    converted = model.convert()

    print("Regression coefficients:")
    print(converted.coef)
    print("Prediction for 7 hours:")
    print(converted(7))