Statistics for AI-ML with Direct Method, Python, NumPy, and Matplotlib
This guide gives you topic cards, a live Python editor, scored MCQs, and, for every major calculation, four views of the same idea: direct calculation, plain Python, NumPy, and Matplotlib charts.
1. What is Statistics?
Statistics is the study of data. It helps us collect data, organize it, summarize it, understand patterns, compare groups, and make decisions.
If data is the raw material of AI-ML, statistics is the measuring and inspection toolkit.
2. Why Statistics Matters in AI-ML
Understand the center: mean, median, and mode help us know what is typical.
Understand the spread: variance and standard deviation show how scattered the data is.
Detect strange values: min, max, range, and percentiles help reveal outliers.
Compare features: some inputs may carry stronger signal than others.
Prepare the data: normalization and standardization directly use statistics.
Check relationships: correlation helps us inspect how variables move together.
3. Live Python Editor
Practice the code directly inside the embedded editor, or open the Live Python Editor in a new tab.
4. Foundation Dataset
We will use this dataset in many examples:
[2, 4, 4, 4, 5, 5, 7, 9]
5. Topic Cards
Mean
The average value of the dataset.
Mean = Sum of all values / Number of values
Used to summarize and standardize features.
2 + 4 + 4 + 4 + 5 + 5 + 7 + 9 = 40
Count = 8
Mean = 40 / 8 = 5
Mean is sensitive to outliers. Very large or small values can pull it.
data = [2, 4, 4, 4, 5, 5, 7, 9]
total = 0
for x in data:
    total += x
mean = total / len(data)
print("Mean:", mean)
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Mean:", np.mean(data))
import matplotlib.pyplot as plt
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
plt.figure(figsize=(8, 4))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(mean, linestyle="--", label=f"Mean = {mean}")
plt.title("Mean of Dataset")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Median
The middle value after sorting.
Sort the values. Take the middle one, or average the two middle values.
Robust summary when outliers exist.
Sorted data: 2, 4, 4, 4, 5, 5, 7, 9
Middle values = 4 and 5
Median = (4 + 5) / 2 = 4.5
Median is often better than mean when the dataset contains extreme values.
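To see this robustness directly, compare both measures before and after one extreme value is added (a small sketch using the same dataset):

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = data + [100]  # one extreme value appended

def mean(values):
    return sum(values) / len(values)

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    # Average the two middle values when the count is even.
    return (s[mid - 1] + s[mid]) / 2 if n % 2 == 0 else s[mid]

print("Mean:", mean(data), "->", mean(with_outlier))        # 5.0 -> about 15.56
print("Median:", median(data), "->", median(with_outlier))  # 4.5 -> 5
```

One extreme value moved the mean by more than ten units but shifted the median by only half a unit.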
data = [2, 4, 4, 4, 5, 5, 7, 9]
sorted_data = sorted(data)
n = len(sorted_data)
if n % 2 == 0:
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    median = sorted_data[n//2]
print("Median:", median)
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Median:", np.median(data))
import matplotlib.pyplot as plt
data = [2, 4, 4, 4, 5, 5, 7, 9]
sorted_data = sorted(data)
n = len(sorted_data)
if n % 2 == 0:
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    median = sorted_data[n//2]
plt.figure(figsize=(8, 4))
plt.plot(range(len(sorted_data)), sorted_data, marker="o", label="Sorted Data")
plt.axhline(median, linestyle="--", label=f"Median = {median}")
plt.title("Median of Dataset")
plt.xlabel("Sorted Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Mode
The value that appears most often in the dataset.
Count the frequency of each value. Highest frequency gives the mode.
Useful for repeated values and categorical data analysis.
Data: 2, 4, 4, 4, 5, 5, 7, 9
Counts: 2→1, 4→3, 5→2, 7→1, 9→1
Mode = 4
A dataset may have one mode, several modes, or no meaningful mode when every value appears equally often.
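When several values tie for the highest frequency, Python's standard library can list all of them. A small sketch using `statistics.multimode` (available from Python 3.8):

```python
from statistics import multimode

print(multimode([2, 4, 4, 4, 5, 5, 7, 9]))  # [4]: a single mode
print(multimode([2, 4, 4, 5, 5, 9]))        # [4, 5]: two values tie
```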
data = [2, 4, 4, 4, 5, 5, 7, 9]
counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1
mode = max(counts, key=counts.get)
print("Counts:", counts)
print("Mode:", mode)
import numpy as np
from collections import Counter
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
counts = Counter(data)
mode = counts.most_common(1)[0][0]
print("Counts:", counts)
print("Mode:", mode)
import matplotlib.pyplot as plt
from collections import Counter
data = [2, 4, 4, 4, 5, 5, 7, 9]
counts = Counter(data)
x = list(counts.keys())
y = list(counts.values())
mode = max(counts, key=counts.get)
plt.figure(figsize=(8, 4))
plt.bar(x, y)
plt.title(f"Mode Frequency Chart (Mode = {mode})")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(True, axis="y")
plt.show()
Variance
How far values spread from the mean.
Variance = average of squared distance from the mean.
Helps understand noise and feature spread.
Mean = 5
Differences: -3, -1, -1, -1, 0, 0, 2, 4
Squares: 9, 1, 1, 1, 0, 0, 4, 16
Sum = 32
Variance = 32 / 8 = 4
Squaring removes sign and gives more weight to larger deviations.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
sq_sum = 0
for x in data:
    sq_sum += (x - mean) ** 2
variance = sq_sum / len(data)
print("Variance:", variance)
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Variance:", np.var(data))
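`np.var` divides by n by default, matching the direct calculation above (the population variance). For a sample drawn from a larger population, dividing by n - 1 is common; NumPy supports this through the `ddof` parameter:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Population variance:", np.var(data))      # 32 / 8 = 4.0
print("Sample variance:", np.var(data, ddof=1))  # 32 / 7, about 4.571
```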
import matplotlib.pyplot as plt
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
sq_dev = [(x - mean) ** 2 for x in data]
plt.figure(figsize=(8, 4))
plt.bar(range(len(sq_dev)), sq_dev)
plt.title("Squared Deviations Used in Variance")
plt.xlabel("Index")
plt.ylabel("Squared Deviation")
plt.grid(True, axis="y")
plt.show()
Standard Deviation
Spread in the same unit as the original data.
Standard Deviation = √Variance
Used in standardization and anomaly detection.
Variance = 4
Standard Deviation = √4 = 2
Higher standard deviation means more spread. Lower means tighter grouping.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
sq_sum = 0
for x in data:
    sq_sum += (x - mean) ** 2
variance = sq_sum / len(data)
std = variance ** 0.5
print("Standard Deviation:", std)
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Standard Deviation:", np.std(data))
import matplotlib.pyplot as plt
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std = variance ** 0.5
plt.figure(figsize=(8, 4))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(mean, linestyle="--", label=f"Mean = {mean}")
plt.axhline(mean + std, linestyle=":", label=f"Mean + Std = {mean + std}")
plt.axhline(mean - std, linestyle=":", label=f"Mean - Std = {mean - std}")
plt.title("Standard Deviation Around Mean")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Range
Boundaries of the dataset.
Range = Maximum - Minimum
Useful in scaling and outlier inspection.
Minimum = 2
Maximum = 9
Range = 9 - 2 = 7
Range looks only at two values, so use it with other measures as well.
data = [2, 4, 4, 4, 5, 5, 7, 9]
minimum = min(data)
maximum = max(data)
data_range = maximum - minimum
print("Minimum:", minimum)
print("Maximum:", maximum)
print("Range:", data_range)
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Minimum:", np.min(data))
print("Maximum:", np.max(data))
print("Range:", np.max(data) - np.min(data))
import matplotlib.pyplot as plt
data = [2, 4, 4, 4, 5, 5, 7, 9]
minimum = min(data)
maximum = max(data)
plt.figure(figsize=(8, 4))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(minimum, linestyle="--", label=f"Min = {minimum}")
plt.axhline(maximum, linestyle="--", label=f"Max = {maximum}")
plt.title(f"Range of Dataset = {maximum - minimum}")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Percentiles
Position of values inside the distribution.
50th percentile is the median.
Useful for thresholds, quartiles, and outlier handling.
Sorted data: 2, 4, 4, 4, 5, 5, 7, 9
25th percentile is near the lower quarter.
50th percentile is the median.
75th percentile is near the upper quarter.
Exact percentile values depend on the interpolation rule; libraries such as NumPy apply a fixed, documented method.
# Percentiles are usually computed using a library
# after sorting and locating the relative position.
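For intuition, here is a minimal sketch of the linear-interpolation rule that NumPy uses by default: find the fractional rank (p / 100) * (n - 1) in the sorted data and interpolate between the two neighbouring values.

```python
def percentile(values, p):
    # Linear interpolation between sorted neighbours,
    # matching NumPy's default "linear" method.
    s = sorted(values)
    rank = (p / 100) * (len(s) - 1)  # fractional position in sorted data
    lower = int(rank)
    frac = rank - lower
    if lower + 1 < len(s):
        return s[lower] + frac * (s[lower + 1] - s[lower])
    return s[lower]

data = [2, 4, 4, 4, 5, 5, 7, 9]
print("25th:", percentile(data, 25))  # 4.0
print("50th:", percentile(data, 50))  # 4.5
print("75th:", percentile(data, 75))  # 5.5
```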
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("25th:", np.percentile(data, 25))
print("50th:", np.percentile(data, 50))
print("75th:", np.percentile(data, 75))
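The distance between the 75th and 25th percentiles is the interquartile range (IQR), a robust spread measure often used for outlier detection; a common rule flags values outside p25 - 1.5 * IQR and p75 + 1.5 * IQR. A sketch on the same data:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
p25, p75 = np.percentile(data, [25, 75])
iqr = p75 - p25
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
print("IQR:", iqr)                          # 5.5 - 4.0 = 1.5
print("Fences:", lower_fence, upper_fence)  # 1.75 and 7.75
print("Outliers:", data[(data < lower_fence) | (data > upper_fence)])  # [9]
```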
import numpy as np
import matplotlib.pyplot as plt
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
p25 = np.percentile(data, 25)
p50 = np.percentile(data, 50)
p75 = np.percentile(data, 75)
plt.figure(figsize=(8, 4))
plt.plot(sorted(data), marker="o", label="Sorted Data")
plt.axhline(p25, linestyle="--", label=f"25th = {p25}")
plt.axhline(p50, linestyle="--", label=f"50th = {p50}")
plt.axhline(p75, linestyle="--", label=f"75th = {p75}")
plt.title("Percentiles of Dataset")
plt.xlabel("Sorted Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
Correlation
How strongly two variables move together.
Values close to 1 indicate a strong positive relation, close to -1 a strong negative relation, and close to 0 a weak linear relation.
Feature analysis and early relationship checking.
Hours: 1, 2, 3, 4, 5
Marks: 20, 35, 45, 60, 75
As hours increase, marks also increase.
This suggests positive correlation.
Correlation does not prove causation.
# Full manual coding of correlation is longer.
# First learn the interpretation well.
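For readers who want the direct version anyway, here is a plain-Python sketch of the Pearson correlation coefficient: the covariance of the two variables divided by the product of their spreads.

```python
def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how the two variables move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: the individual spreads of each variable.
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

hours = [1, 2, 3, 4, 5]
marks = [20, 35, 45, 60, 75]
print("Correlation:", pearson(hours, marks))  # about 0.998
```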
import numpy as np
hours = np.array([1, 2, 3, 4, 5])
marks = np.array([20, 35, 45, 60, 75])
corr = np.corrcoef(hours, marks)
print(corr)
print("Correlation:", corr[0, 1])
import matplotlib.pyplot as plt
import numpy as np
hours = np.array([1, 2, 3, 4, 5])
marks = np.array([20, 35, 45, 60, 75])
plt.figure(figsize=(8, 4))
plt.scatter(hours, marks, s=80)
plt.plot(hours, marks)
plt.title("Correlation Between Study Hours and Marks")
plt.xlabel("Hours")
plt.ylabel("Marks")
plt.grid(True)
plt.show()
6. Full Plain Python Program
data = [2, 4, 4, 4, 5, 5, 7, 9]
total = 0
for x in data:
    total += x
mean = total / len(data)
sorted_data = sorted(data)
n = len(sorted_data)
if n % 2 == 0:
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    median = sorted_data[n//2]
counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1
mode = max(counts, key=counts.get)
sq_sum = 0
for x in data:
    sq_sum += (x - mean) ** 2
variance = sq_sum / len(data)
std = variance ** 0.5
minimum = min(data)
maximum = max(data)
data_range = maximum - minimum
print("Data:", data)
print("Sum:", total)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Variance:", variance)
print("Standard Deviation:", std)
print("Minimum:", minimum)
print("Maximum:", maximum)
print("Range:", data_range)
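The same numbers can be cross-checked with Python's built-in `statistics` module; `pvariance` and `pstdev` are the population versions that match the divide-by-n calculation above.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print("Mean:", statistics.mean(data))           # 5
print("Median:", statistics.median(data))       # 4.5
print("Mode:", statistics.mode(data))           # 4
print("Variance:", statistics.pvariance(data))  # 4
print("Std Dev:", statistics.pstdev(data))      # 2.0
```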
7. Full NumPy Program
import numpy as np
from collections import Counter
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
counts = Counter(data)
mode = counts.most_common(1)[0][0]
print("Data:", data)
print("Sum:", np.sum(data))
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", mode)
print("Variance:", np.var(data))
print("Standard Deviation:", np.std(data))
print("Minimum:", np.min(data))
print("Maximum:", np.max(data))
print("Range:", np.max(data) - np.min(data))
print("25th Percentile:", np.percentile(data, 25))
print("50th Percentile:", np.percentile(data, 50))
print("75th Percentile:", np.percentile(data, 75))
8. Full Matplotlib Program
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = np.mean(data)
median = np.median(data)
variance = np.var(data)
std = np.std(data)
minimum = np.min(data)
maximum = np.max(data)
data_range = maximum - minimum
p25 = np.percentile(data, 25)
p75 = np.percentile(data, 75)
counts = Counter(data)
mode = counts.most_common(1)[0][0]
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Variance:", variance)
print("Standard Deviation:", std)
print("Range:", data_range)
print("25th Percentile:", p25)
print("75th Percentile:", p75)
plt.figure(figsize=(10, 5))
plt.plot(range(len(data)), data, marker="o", label="Data")
plt.axhline(mean, linestyle="--", label=f"Mean = {mean}")
plt.axhline(median, linestyle=":", label=f"Median = {median}")
plt.axhline(mode, linestyle="-.", label=f"Mode = {mode}")
plt.axhline(minimum, linestyle="--", label=f"Min = {minimum}")
plt.axhline(maximum, linestyle="--", label=f"Max = {maximum}")
plt.axhline(p25, linestyle=":", label=f"25th = {p25}")
plt.axhline(p75, linestyle=":", label=f"75th = {p75}")
plt.title("Statistics Overview of Dataset")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
plt.figure(figsize=(8, 4))
plt.bar(list(counts.keys()), list(counts.values()))
plt.title("Frequency Chart for Mode")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(True, axis="y")
plt.show()
plt.figure(figsize=(8, 4))
plt.bar(range(len(data)), (data - mean) ** 2)
plt.title("Squared Deviations for Variance")
plt.xlabel("Index")
plt.ylabel("Squared Deviation")
plt.grid(True, axis="y")
plt.show()
9. Normalization and Standardization
Normalization
Often scales values between 0 and 1.
normalized = (x - min_value) / (max_value - min_value)
Standardization
Centers the data at mean 0 and standard deviation 1.
z = (x - mean) / std
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
normalized = (data - np.min(data)) / (np.max(data) - np.min(data))
print(normalized)
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
standardized = (data - np.mean(data)) / np.std(data)
print(standardized)
import numpy as np
import matplotlib.pyplot as plt
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
normalized = (data - np.min(data)) / (np.max(data) - np.min(data))
standardized = (data - np.mean(data)) / np.std(data)
plt.figure(figsize=(8, 4))
plt.plot(data, marker="o", label="Original")
plt.plot(normalized, marker="o", label="Normalized")
plt.plot(standardized, marker="o", label="Standardized")
plt.title("Original vs Normalized vs Standardized")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.show()
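A quick sanity check on both transforms: normalized values should span exactly 0 to 1, and standardized values should end up with mean 0 and standard deviation 1.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
normalized = (data - np.min(data)) / (np.max(data) - np.min(data))
standardized = (data - np.mean(data)) / np.std(data)
print("Normalized range:", normalized.min(), "to", normalized.max())  # 0.0 to 1.0
print("Standardized mean:", standardized.mean())                      # 0.0
print("Standardized std:", standardized.std())                        # 1.0
```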
10. Final Project — Student Statistics Analyzer
Build a tool that accepts student marks and prints the main statistics before any ML work.
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
marks = np.array([67, 72, 81, 90, 76, 88, 59, 94])
counts = Counter(marks)
mode = counts.most_common(1)[0][0]
print("Marks:", marks)
print("Sum:", np.sum(marks))
print("Mean:", np.mean(marks))
print("Median:", np.median(marks))
print("Mode:", mode)
print("Variance:", np.var(marks))
print("Standard Deviation:", np.std(marks))
print("Minimum:", np.min(marks))
print("Maximum:", np.max(marks))
print("Range:", np.max(marks) - np.min(marks))
print("25th Percentile:", np.percentile(marks, 25))
print("75th Percentile:", np.percentile(marks, 75))
plt.figure(figsize=(9, 4))
plt.plot(marks, marker="o", label="Marks")
plt.axhline(np.mean(marks), linestyle="--", label=f"Mean = {np.mean(marks):.2f}")
plt.axhline(np.median(marks), linestyle=":", label=f"Median = {np.median(marks)}")
plt.title("Student Marks Overview")
plt.xlabel("Student Index")
plt.ylabel("Marks")
plt.legend()
plt.grid(True)
plt.show()
Exercise: write down the two most important reasons to inspect data statistically before training an ML model, then try the program above in the embedded editor.
11. MCQ Zone
Q1. What is the mean of 10, 20, and 30?
Q2. What is the median of 1, 2, 100?
Q3. What is the mode of 3, 4, 4, 5, 6?
Q4. If all values in a dataset are the same, what is the variance?
Q5. If variance is 9, what is the standard deviation?
Q6. What is the range of 3, 8, 10, 15?
Q7. Which percentile is the median?
Q8. If x increases and y also increases consistently, the correlation is usually:
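Every answer above follows from the definitions in the topic cards; a short script (a checking sketch, not part of the scored quiz) can verify them:

```python
import numpy as np

print("Q1:", np.mean([10, 20, 30]))                      # 20.0
print("Q2:", np.median([1, 2, 100]))                     # 2.0
values, counts = np.unique([3, 4, 4, 5, 6], return_counts=True)
print("Q3:", values[np.argmax(counts)])                  # 4
print("Q4:", np.var([7, 7, 7, 7]))                       # 0.0
print("Q5:", 9 ** 0.5)                                   # 3.0
print("Q6:", max([3, 8, 10, 15]) - min([3, 8, 10, 15]))  # 12
# Q7: the median is the 50th percentile.
# Q8: x and y rising together suggests positive correlation.
```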
12. Conclusion
Statistics is not optional in AI-ML. It helps you inspect the data, understand its center and spread, detect strange values, compare features, and prepare the data correctly.
The strongest learning path is direct logic first, then plain Python, then NumPy, and then Matplotlib charts for visual understanding.