What is Data Cleaning?
Data cleaning means preparing raw data so that it becomes useful for analysis and machine learning. Real-world data often contains missing values, duplicate rows, spelling mistakes, impossible values, wrong data types, and outliers.
Cleaning Data Using Basic Python Only
Before using pandas, it helps to do the cleaning with plain Python lists and dictionaries; this shows what pandas does for you internally.
Messy Dataset
students = [
    {"hours": 1, "score": 50},
    {"hours": 2, "score": 55},
    {"hours": 2, "score": 55},     # duplicate
    {"hours": None, "score": 65},  # missing hours
    {"hours": 4, "score": "bad"},  # bad score
    {"hours": 5, "score": 70},
]
Step 1: Remove Duplicates
unique_students = []
seen = set()
for row in students:
    key = (row["hours"], row["score"])
    if key not in seen:
        unique_students.append(row)
        seen.add(key)
print(unique_students)
Step 2: Remove Missing and Bad Values
clean_students = []
for row in unique_students:
    hours = row["hours"]
    score = row["score"]
    if hours is None or score is None:
        continue
    if not isinstance(hours, (int, float)):
        continue
    if not isinstance(score, (int, float)):
        continue
    clean_students.append(row)
print(clean_students)
Step 3: Separate x and y
x = []
y = []
for row in clean_students:
    x.append(row["hours"])
    y.append(row["score"])
print("x values:", x)
print("y values:", y)
Step 4: Basic Linear Regression Logic
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
numerator = 0
denominator = 0
for i in range(n):
    numerator += (x[i] - mean_x) * (y[i] - mean_y)
    denominator += (x[i] - mean_x) ** 2
slope = numerator / denominator
intercept = mean_y - slope * mean_x
print("Slope:", slope)
print("Intercept:", intercept)
prediction = intercept + slope * 7
print("Prediction for 7 hours:", prediction)
This is the ground-zero idea: remove duplicates, drop missing and bad values, separate the input x from the output y, and compute the regression line by hand.
Basic Cleaning Mindset
Before cleaning, always inspect the dataset. Do not blindly delete rows; first understand what is wrong. Given a DataFrame df, these calls summarize its structure, statistics, missing values, and duplicates:
import pandas as pd
import numpy as np
from numpy.polynomial import Polynomial

df.info()              # column dtypes and non-null counts
df.describe()          # summary statistics for numeric columns
df.isnull().sum()      # missing values per column
df.duplicated().sum()  # number of duplicate rows
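As a minimal, self-contained sketch of what those inspection calls report, consider a small DataFrame with one missing value and one duplicate row:

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 2, None],
    "score": [50, 55, 55, 65],
})

print(df.isnull().sum())      # hours: 1, score: 0
print(df.duplicated().sum())  # 1 duplicate row
```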
Linear Regression Idea
Linear regression tries to find a straight-line relationship between input x and output y.
# y = c + mx
model = Polynomial.fit(x, y, deg=1)
normal_model = model.convert()
print(normal_model.coef)
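One detail worth pinning down is the order of the returned coefficients: after convert(), coef[0] is the intercept c and coef[1] is the slope m. A small sketch with points lying exactly on y = 1 + 2x:

```python
import numpy as np
from numpy.polynomial import Polynomial

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]  # exactly y = 1 + 2x

model = Polynomial.fit(x, y, deg=1)
c, m = model.convert().coef
print(c, m)  # approximately 1.0 2.0
```

Note that convert() is needed because Polynomial.fit works in a scaled internal domain; convert() maps the coefficients back to the standard basis.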
Incomplete Data + Duplicate Rows
df1 = pd.DataFrame({
    "hours": [1, 2, 2, 3, None, 5],
    "score": [50, 55, 55, None, 65, 70]
})
df1 = df1.drop_duplicates()
df1 = df1.dropna()
x = df1["hours"]
y = df1["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
Here we remove duplicate rows and then remove rows where hours or score is missing.
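By default, dropna() removes a row if any column is missing. When only some columns matter, the subset parameter restricts the check; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, None, 3],
    "score": [50, 55, None],
})

# Drop only rows where "hours" is missing; a missing score is kept
kept = df.dropna(subset=["hours"])
print(len(kept))  # 2
```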
Bad Text Values Inside Numeric Columns
df2 = pd.DataFrame({
    "hours": [1, "two", 3, 4, 5],
    "score": [50, 55, "??", 65, 70]
})
df2["hours"] = pd.to_numeric(df2["hours"], errors="coerce")
df2["score"] = pd.to_numeric(df2["score"], errors="coerce")
df2 = df2.dropna()
x = df2["hours"]
y = df2["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
The function pandas.to_numeric converts valid values to numbers and, with errors="coerce", turns everything else into NaN; dropna() then removes those rows.
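A minimal sketch of that coercion on a standalone Series:

```python
import pandas as pd

s = pd.Series(["1", "two", "3.5"])
out = pd.to_numeric(s, errors="coerce")
print(out.tolist())  # [1.0, nan, 3.5]
```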
Outliers
df3 = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 100],
    "score": [50, 55, 60, 65, 70, 10]
})
df3 = df3[(df3["hours"] < 10) & (df3["score"] > 40)]
x = df3["hours"]
y = df3["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
Outliers are extreme values. Sometimes they are mistakes. Sometimes they are important signals. Expert cleaning means checking before removing.
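One common way to check before removing is the interquartile-range (IQR) rule: flag values more than 1.5 × IQR outside the quartiles, then decide what to do with them. A sketch on the hours column above:

```python
import pandas as pd

hours = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1
flagged = hours[(hours < q1 - 1.5 * iqr) | (hours > q3 + 1.5 * iqr)]
print(flagged.tolist())  # [100]
```

Flagging first, instead of filtering with hard-coded thresholds, lets you inspect the suspicious rows before dropping them.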
Mixed Problems
df4 = pd.DataFrame({
    "hours": [1, 2, 2, "three", None, 5],
    "score": [50, 55, 55, 60, "N/A", 70]
})
df4["hours"] = pd.to_numeric(df4["hours"], errors="coerce")
df4["score"] = pd.to_numeric(df4["score"], errors="coerce")
df4 = df4.drop_duplicates()
df4 = df4.fillna(df4.mean(numeric_only=True))
x = df4["hours"]
y = df4["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
This dataset contains duplicates, missing values, and text inside number columns. We use a complete cleaning sequence.
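One reason the sequence matters: coercing to numbers before dropping duplicates also catches duplicates that differ only in type, such as 2 and the string "2". A small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({"hours": [2, "2"], "score": [55, 55]})

# Deduplicating first: 2 and "2" look different, so both rows survive
print(len(df.drop_duplicates()))  # 2

# Coercing first makes the rows identical, so one is dropped
coerced = df.apply(pd.to_numeric, errors="coerce")
print(len(coerced.drop_duplicates()))  # 1
```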
Reusable Expert Cleaning Function
def clean_dataset(df):
    df = df.copy()
    # Remove duplicate rows
    df = df.drop_duplicates()
    # Convert everything possible to numbers
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Fill missing values with column mean
    df = df.fillna(df.mean(numeric_only=True))
    return df
df5 = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, "bad", None, 6],
    "score": [52, 54, 63, 64, 68, 72, "??", None, 72]
})
df5 = clean_dataset(df5)
x = df5["hours"]
y = df5["score"]
model = Polynomial.fit(x, y, deg=1)
print(model.convert().coef)
At the expert level, we do not repeat cleaning code again and again; we wrap the steps in a reusable function.
Practice in the Live Python Editor
Use the editor below to paste the examples, modify the datasets, add your own bad data, and test the cleaning pipeline.
One Full Program
import pandas as pd
import numpy as np
from numpy.polynomial import Polynomial
def clean_dataset(df):
    df = df.copy()
    df = df.drop_duplicates()
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.fillna(df.mean(numeric_only=True))
    return df
datasets = []
datasets.append(pd.DataFrame({
    "hours": [1, 2, 2, 3, None, 5],
    "score": [50, 55, 55, None, 65, 70]
}))
datasets.append(pd.DataFrame({
    "hours": [1, "two", 3, 4, 5],
    "score": [50, 55, "??", 65, 70]
}))
datasets.append(pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 100],
    "score": [50, 55, 60, 65, 70, 10]
}))
datasets.append(pd.DataFrame({
    "hours": [1, 2, 2, "three", None, 5],
    "score": [50, 55, 55, 60, "N/A", 70]
}))
datasets.append(pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, "bad", None, 6],
    "score": [52, 54, 63, 64, 68, 72, "??", None, 72]
}))
for i, df in enumerate(datasets, start=1):
    print("\nDATASET", i)
    print("Raw data:")
    print(df)

    clean_df = clean_dataset(df)

    # Simple outlier protection
    clean_df = clean_df[
        (clean_df["hours"] < clean_df["hours"].quantile(0.95)) |
        (clean_df["hours"] == clean_df["hours"].median())
    ]

    print("Clean data:")
    print(clean_df)

    x = clean_df["hours"]
    y = clean_df["score"]

    model = Polynomial.fit(x, y, deg=1)
    converted = model.convert()

    print("Regression coefficients:")
    print(converted.coef)
    print("Prediction for 7 hours:")
    print(converted(7))