Statistics for AI-ML with Direct Method, Python, NumPy, and Matplotlib

This version gives you collapsible topic cards, copy buttons, a live embedded Python editor, and scored MCQs, with direct calculation, plain Python, NumPy, and Matplotlib charts for every major statistic.

Topics: Mean, Median, Mode, Variance, Standard Deviation, Percentiles, Correlation, Normalization, Matplotlib, MCQ Score
Main habit: Inspect the data first, then build the model.
Best coding path: Hand → Python → NumPy → Charts, for deep understanding.
Main gain: Better preprocessing, better ML results.
Main lesson: Statistics is the foundation, not decoration.

1. What is Statistics?

Statistics is the study of data. It helps us collect data, organize it, summarize it, understand patterns, compare groups, and make decisions.

Simple idea

If data is the raw material of AI-ML, statistics is the measuring and inspection toolkit.

2. Why Statistics Matters in AI-ML

Understand the center

Mean, median, and mode help us know what is typical.

Understand the spread

Variance and standard deviation show how scattered the data is.

Detect strange values

Min, max, range, and percentiles help reveal outliers.

Compare features

Some inputs may carry stronger signal than others.

Prepare the data

Normalization and standardization directly use statistics.

Check relationships

Correlation helps us inspect how variables move together.

3. Live Python Editor

Practice the code directly inside the embedded editor below, or open the Live Python Editor in a new tab.

4. Foundation Dataset

We will use this dataset in many examples:

[2, 4, 4, 4, 5, 5, 7, 9]

5. Topic Cards

Tap any card to open it.

Mean

Meaning

The average value of the dataset.

Formula

Mean = Sum of all values / Number of values

AI-ML Use

Used to summarize and standardize features.

Direct Calculation

2 + 4 + 4 + 4 + 5 + 5 + 7 + 9 = 40

Count = 8

Mean = 40 / 8 = 5

Important

Mean is sensitive to outliers. Very large or small values can pull it.

data = [2, 4, 4, 4, 5, 5, 7, 9]

total = 0
for x in data:
    total += x

mean = total / len(data)
print("Mean:", mean)
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Mean:", np.mean(data))
import matplotlib.pyplot as plt

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

plt.figure(figsize=(8, 4))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(mean, linestyle="--", label=f"Mean = {mean}")
plt.title("Mean of Dataset")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
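The Important note above says the mean is pulled by extreme values. Here is a small sketch that appends a hypothetical outlier (the value 100 is purely illustrative) and recomputes the mean:

data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = data + [100]              # 100 is a hypothetical extreme value

mean_before = sum(data) / len(data)
mean_after = sum(with_outlier) / len(with_outlier)

print("Mean before outlier:", mean_before)   # 5.0
print("Mean after outlier:", mean_after)     # about 15.56, pulled up sharply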
Median

Meaning

The middle value after sorting.

Formula Idea

Sort the values. Take the middle one, or average the two middle values.

AI-ML Use

Robust summary when outliers exist.

Direct Calculation

Sorted data: 2, 4, 4, 4, 5, 5, 7, 9

Middle values = 4 and 5

Median = (4 + 5) / 2 = 4.5

Important

Median is often better than mean when the dataset contains extreme values.

data = [2, 4, 4, 4, 5, 5, 7, 9]
sorted_data = sorted(data)
n = len(sorted_data)

if n % 2 == 0:
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    median = sorted_data[n//2]

print("Median:", median)
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Median:", np.median(data))
import matplotlib.pyplot as plt

data = [2, 4, 4, 4, 5, 5, 7, 9]
sorted_data = sorted(data)
n = len(sorted_data)

if n % 2 == 0:
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    median = sorted_data[n//2]

plt.figure(figsize=(8, 4))
plt.plot(range(len(sorted_data)), sorted_data, marker="o", label="Sorted Data")
plt.axhline(median, linestyle="--", label=f"Median = {median}")
plt.title("Median of Dataset")
plt.xlabel("Sorted Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Mode

Meaning

The value that appears most often in the dataset.

Formula Idea

Count the frequency of each value. Highest frequency gives the mode.

AI-ML Use

Useful for repeated values and categorical data analysis.

Direct Calculation

Data: 2, 4, 4, 4, 5, 5, 7, 9

Counts: 2→1, 4→3, 5→2, 7→1, 9→1

Mode = 4

Important

A dataset may have one mode, many modes, or no mode if frequencies are equal.

data = [2, 4, 4, 4, 5, 5, 7, 9]

counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1

mode = max(counts, key=counts.get)

print("Counts:", counts)
print("Mode:", mode)
import numpy as np
from collections import Counter

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

counts = Counter(data)
mode = counts.most_common(1)[0][0]

print("Counts:", counts)
print("Mode:", mode)
import matplotlib.pyplot as plt
from collections import Counter

data = [2, 4, 4, 4, 5, 5, 7, 9]
counts = Counter(data)

x = list(counts.keys())
y = list(counts.values())
mode = max(counts, key=counts.get)

plt.figure(figsize=(8, 4))
plt.bar(x, y)
plt.title(f"Mode Frequency Chart (Mode = {mode})")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(True, axis="y")
plt.show()
Variance

Meaning

How far values spread from the mean.

Formula

Variance = average of squared distance from the mean.

AI-ML Use

Helps understand noise and feature spread.

Direct Calculation

Mean = 5

Differences: -3, -1, -1, -1, 0, 0, 2, 4

Squares: 9, 1, 1, 1, 0, 0, 4, 16

Sum = 32

Variance = 32 / 8 = 4

Why square?

Squaring removes sign and gives more weight to larger deviations.
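A quick sketch of why the sign must be removed: the signed deviations from the mean always cancel out to zero, while the squared deviations do not.

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

signed = [x - mean for x in data]
print("Sum of signed deviations:", sum(signed))                    # 0.0, they cancel
print("Sum of squared deviations:", sum(d ** 2 for d in signed))   # 32.0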

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

sq_sum = 0
for x in data:
    sq_sum += (x - mean) ** 2

variance = sq_sum / len(data)
print("Variance:", variance)
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Variance:", np.var(data))
import matplotlib.pyplot as plt

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
sq_dev = [(x - mean) ** 2 for x in data]

plt.figure(figsize=(8, 4))
plt.bar(range(len(sq_dev)), sq_dev)
plt.title("Squared Deviations Used in Variance")
plt.xlabel("Index")
plt.ylabel("Squared Deviation")
plt.grid(True, axis="y")
plt.show()
Standard Deviation

Meaning

Spread in the same unit as the original data.

Formula

Standard Deviation = √Variance

AI-ML Use

Used in standardization and anomaly detection.

Direct Calculation

Variance = 4

Standard Deviation = √4 = 2

Interpretation

Higher standard deviation means more spread. Lower means tighter grouping.

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

sq_sum = 0
for x in data:
    sq_sum += (x - mean) ** 2

variance = sq_sum / len(data)
std = variance ** 0.5

print("Standard Deviation:", std)
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Standard Deviation:", np.std(data))
import matplotlib.pyplot as plt

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std = variance ** 0.5

plt.figure(figsize=(8, 4))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(mean, linestyle="--", label=f"Mean = {mean}")
plt.axhline(mean + std, linestyle=":", label=f"Mean + Std = {mean + std}")
plt.axhline(mean - std, linestyle=":", label=f"Mean - Std = {mean - std}")
plt.title("Standard Deviation Around Mean")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Range

Meaning

Boundaries of the dataset.

Formula

Range = Maximum - Minimum

AI-ML Use

Useful in scaling and outlier inspection.

Direct Calculation

Minimum = 2

Maximum = 9

Range = 9 - 2 = 7

Important

Range looks only at two values, so use it with other measures as well.

data = [2, 4, 4, 4, 5, 5, 7, 9]

minimum = min(data)
maximum = max(data)
data_range = maximum - minimum

print("Minimum:", minimum)
print("Maximum:", maximum)
print("Range:", data_range)
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print("Minimum:", np.min(data))
print("Maximum:", np.max(data))
print("Range:", np.max(data) - np.min(data))
import matplotlib.pyplot as plt

data = [2, 4, 4, 4, 5, 5, 7, 9]
minimum = min(data)
maximum = max(data)

plt.figure(figsize=(8, 4))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(minimum, linestyle="--", label=f"Min = {minimum}")
plt.axhline(maximum, linestyle="--", label=f"Max = {maximum}")
plt.title(f"Range of Dataset = {maximum - minimum}")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Percentiles

Meaning

Position of values inside the distribution.

Main fact

50th percentile is the median.

AI-ML Use

Useful for thresholds, quartiles, and outlier handling.

Direct Understanding

Sorted data: 2, 4, 4, 4, 5, 5, 7, 9

25th percentile is near the lower quarter.

50th percentile is the median.

75th percentile is near the upper quarter.

Important

Exact percentile values depend on the interpolation rule being used; libraries compute them with a fixed, documented method (NumPy's default is linear interpolation).

# Percentiles are usually computed using a library
# after sorting and locating the relative position.
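For reference, here is a minimal plain-Python sketch of the linear-interpolation rule that NumPy applies by default; the percentile helper below is illustrative, not a library function.

data = [2, 4, 4, 4, 5, 5, 7, 9]

def percentile(values, p):
    s = sorted(values)
    rank = (p / 100) * (len(s) - 1)      # fractional position in the sorted list
    low = int(rank)
    frac = rank - low
    if low + 1 < len(s):
        return s[low] + frac * (s[low + 1] - s[low])
    return s[low]

print("25th:", percentile(data, 25))     # 4.0
print("50th:", percentile(data, 50))     # 4.5
print("75th:", percentile(data, 75))     # 5.5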
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print("25th:", np.percentile(data, 25))
print("50th:", np.percentile(data, 50))
print("75th:", np.percentile(data, 75))
import numpy as np
import matplotlib.pyplot as plt

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
p25 = np.percentile(data, 25)
p50 = np.percentile(data, 50)
p75 = np.percentile(data, 75)

plt.figure(figsize=(8, 4))
plt.plot(sorted(data), marker="o", label="Sorted Data")
plt.axhline(p25, linestyle="--", label=f"25th = {p25}")
plt.axhline(p50, linestyle="--", label=f"50th = {p50}")
plt.axhline(p75, linestyle="--", label=f"75th = {p75}")
plt.title("Percentiles of Dataset")
plt.xlabel("Sorted Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Correlation

Meaning

How strongly two variables move together.

Interpretation

Close to +1 means a strong positive relation, close to -1 a strong negative relation, and close to 0 a weak linear relation.

AI-ML Use

Feature analysis and early relationship checking.

Direct Understanding

Hours: 1, 2, 3, 4, 5

Marks: 20, 35, 45, 60, 75

As hours increase, marks also increase.

This suggests positive correlation.

Very Important

Correlation does not prove causation.

# Full manual coding of correlation is longer.
# First learn the interpretation well.
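Still, a minimal plain-Python sketch of the Pearson formula can help once the interpretation is clear; it should match np.corrcoef on this data.

hours = [1, 2, 3, 4, 5]
marks = [20, 35, 45, 60, 75]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# The 1/n factors in the covariance and the standard deviations cancel,
# so plain sums of deviations are enough here.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
var_x = sum((x - mean_x) ** 2 for x in hours)
var_y = sum((y - mean_y) ** 2 for y in marks)

corr = cov / (var_x ** 0.5 * var_y ** 0.5)
print("Correlation:", corr)              # about 0.998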
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
marks = np.array([20, 35, 45, 60, 75])

corr = np.corrcoef(hours, marks)
print(corr)
print("Correlation:", corr[0, 1])
import matplotlib.pyplot as plt
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
marks = np.array([20, 35, 45, 60, 75])

plt.figure(figsize=(8, 4))
plt.scatter(hours, marks, s=80)
plt.plot(hours, marks)
plt.title("Correlation Between Study Hours and Marks")
plt.xlabel("Hours")
plt.ylabel("Marks")
plt.grid(True)
plt.show()

6. Full Plain Python Program

data = [2, 4, 4, 4, 5, 5, 7, 9]

total = 0
for x in data:
    total += x

mean = total / len(data)

sorted_data = sorted(data)
n = len(sorted_data)
if n % 2 == 0:
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    median = sorted_data[n//2]

counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1

mode = max(counts, key=counts.get)

sq_sum = 0
for x in data:
    sq_sum += (x - mean) ** 2

variance = sq_sum / len(data)
std = variance ** 0.5

minimum = min(data)
maximum = max(data)
data_range = maximum - minimum

print("Data:", data)
print("Sum:", total)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Variance:", variance)
print("Standard Deviation:", std)
print("Minimum:", minimum)
print("Maximum:", maximum)
print("Range:", data_range)

7. Full NumPy Program

import numpy as np
from collections import Counter

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

counts = Counter(data)
mode = counts.most_common(1)[0][0]

print("Data:", data)
print("Sum:", np.sum(data))
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", mode)
print("Variance:", np.var(data))
print("Standard Deviation:", np.std(data))
print("Minimum:", np.min(data))
print("Maximum:", np.max(data))
print("Range:", np.max(data) - np.min(data))
print("25th Percentile:", np.percentile(data, 25))
print("50th Percentile:", np.percentile(data, 50))
print("75th Percentile:", np.percentile(data, 75))

8. Full Matplotlib Program

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(data)
median = np.median(data)
variance = np.var(data)
std = np.std(data)
minimum = np.min(data)
maximum = np.max(data)
data_range = maximum - minimum
p25 = np.percentile(data, 25)
p75 = np.percentile(data, 75)

counts = Counter(data)
mode = counts.most_common(1)[0][0]

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Variance:", variance)
print("Standard Deviation:", std)
print("Range:", data_range)
print("25th Percentile:", p25)
print("75th Percentile:", p75)

plt.figure(figsize=(10, 5))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(mean, linestyle="--", label=f"Mean = {mean}")
plt.axhline(median, linestyle=":", label=f"Median = {median}")
plt.axhline(mode, linestyle="-.", label=f"Mode = {mode}")
plt.axhline(minimum, linestyle="--", label=f"Min = {minimum}")
plt.axhline(maximum, linestyle="--", label=f"Max = {maximum}")
plt.axhline(p25, linestyle=":", label=f"25th = {p25}")
plt.axhline(p75, linestyle=":", label=f"75th = {p75}")

plt.title("Statistics Overview of Dataset")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()

plt.figure(figsize=(8, 4))
plt.bar(list(counts.keys()), list(counts.values()))
plt.title("Frequency Chart for Mode")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(True, axis="y")
plt.show()

plt.figure(figsize=(8, 4))
plt.bar(range(len(data)), (data - mean) ** 2)
plt.title("Squared Deviations for Variance")
plt.xlabel("Index")
plt.ylabel("Squared Deviation")
plt.grid(True, axis="y")
plt.show()

9. Normalization and Standardization

Normalization

Min-max normalization scales values into the range 0 to 1.

normalized = (x - min_value) / (max_value - min_value)
Standardization

Centers at mean 0 and standard deviation 1.

z = (x - mean) / std
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
normalized = (data - np.min(data)) / (np.max(data) - np.min(data))
print(normalized)
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
standardized = (data - np.mean(data)) / np.std(data)
print(standardized)
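A quick check of the claim that standardization centers the data at mean 0 and standard deviation 1:

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
standardized = (data - np.mean(data)) / np.std(data)

print("Mean of standardized:", np.mean(standardized))   # ~0, up to floating-point error
print("Std of standardized:", np.std(standardized))     # 1.0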
import numpy as np
import matplotlib.pyplot as plt

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
normalized = (data - np.min(data)) / (np.max(data) - np.min(data))
standardized = (data - np.mean(data)) / np.std(data)

plt.figure(figsize=(8, 4))
plt.plot(data, marker="o", label="Original")
plt.plot(normalized, marker="o", label="Normalized")
plt.plot(standardized, marker="o", label="Standardized")
plt.title("Original vs Normalized vs Standardized")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()

10. Final Project — Student Statistics Analyzer

Build a tool that accepts student marks and prints the main statistics before any ML work.

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

marks = np.array([67, 72, 81, 90, 76, 88, 59, 94])

counts = Counter(marks)
mode = counts.most_common(1)[0][0]

print("Marks:", marks)
print("Sum:", np.sum(marks))
print("Mean:", np.mean(marks))
print("Median:", np.median(marks))
print("Mode:", mode)
print("Variance:", np.var(marks))
print("Standard Deviation:", np.std(marks))
print("Minimum:", np.min(marks))
print("Maximum:", np.max(marks))
print("Range:", np.max(marks) - np.min(marks))
print("25th Percentile:", np.percentile(marks, 25))
print("75th Percentile:", np.percentile(marks, 75))

plt.figure(figsize=(9, 4))
plt.plot(marks, marker="o", label="Marks")
plt.axhline(np.mean(marks), linestyle="--", label=f"Mean = {np.mean(marks):.2f}")
plt.axhline(np.median(marks), linestyle=":", label=f"Median = {np.median(marks)}")
plt.title("Student Marks Overview")
plt.xlabel("Student Index")
plt.ylabel("Marks")
plt.legend()
plt.grid(True)
plt.show()
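The project code above uses a fixed array of marks; to literally accept marks from the user, a minimal sketch (assuming comma-separated keyboard input) could look like this:

import numpy as np

raw = input("Enter marks separated by commas: ")   # e.g. 67, 72, 81
marks = np.array([float(x) for x in raw.split(",") if x.strip()])

print("Mean:", np.mean(marks))
print("Median:", np.median(marks))
print("Standard Deviation:", np.std(marks))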
Self Work

Write the two most important reasons to inspect data statistically before training an ML model.

Strong answers usually mention outliers, scaling, data spread, repeated values, or relationships between features and target.

Try It in the Embedded Editor


11. MCQ Zone

Q1. What is the mean of 10, 20, and 30?

Q2. What is the median of 1, 2, 100?

Q3. What is the mode of 3, 4, 4, 5, 6?

Q4. If all values in a dataset are the same, what is the variance?

Q5. If variance is 9, what is the standard deviation?

Q6. What is the range of 3, 8, 10, 15?

Q7. Which percentile is the median?

Q8. If x increases and y also increases consistently, the correlation is usually:

12. Conclusion

Statistics is not optional in AI-ML. It helps you inspect the data, understand its center and spread, detect strange values, compare features, and prepare the data correctly.

The strongest learning path is direct logic first, then plain Python, then NumPy, and then Matplotlib charts for visual understanding.
