Programmers Picnic
AI-ML Classes by Champak Roy
NLP 31% to 40% • Text Similarity

From Words to Meaningful Comparison

In this lesson, students learn how NLP compares two pieces of text. We move from simple words to vectors, CountVectorizer, cosine similarity, and finally a mini project: Similar Sentence Finder.

31% to 40% Roadmap

10 small steps
  • 31% What is text similarity?
    Understanding when two sentences are close, related, or different.
  • 32% Why computers cannot directly understand meaning
    Computers need numbers, not raw human language.
  • 33% Converting text into numbers
    The first big idea behind NLP models.
  • 34% Bag of Words idea
    Represent a sentence by counting words.
  • 35% CountVectorizer concept
    Use Python to automatically create word-count vectors.
  • 36% What is Cosine Similarity?
    Compare the direction of two vectors.
  • 37% Comparing two sentences
    Calculate a similarity score between sentences.
  • 38% Mini project
    Build a Similar Sentence Finder.
  • 39% Real use cases
    Search, recommendation, doubt matching, chatbot matching.
  • 40% Revision and test
    Check whether the student can explain and code the concept.

Level 0 Explanation

Beginner friendly

What is text similarity?

Text similarity means checking how close two pieces of text are. The two texts may not be exactly the same, but they may still talk about the same idea.

Example:
“I like machine learning” and “I enjoy learning AI” are not identical, but both are related to learning and AI.

Why convert text into numbers?

A computer can store letters, but mathematical comparison is easier with numbers. So NLP converts words and sentences into numerical representations.

Simple rule:
Text → Words → Numbers → Comparison → Result

The NLP Pipeline

From sentence to similarity score
Text (student sentence) → Words (token list) → Numbers (vector) → Score (similarity)
NLP comparison becomes possible after text becomes numerical.
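
As a rough illustration of the first three pipeline stages, the sketch below splits a sentence into lowercase tokens and gives each distinct word a number. The helper names `tokenize` and `build_vocabulary` are made up for this example, and the simple regular expression is an assumption, not the exact rule any particular library uses.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(tokens):
    """Give each distinct word an index, in alphabetical order."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}

sentence = "I like machine learning"
tokens = tokenize(sentence)
vocabulary = build_vocabulary(tokens)
ids = [vocabulary[word] for word in tokens]

print(tokens)      # ['i', 'like', 'machine', 'learning']
print(vocabulary)  # {'i': 0, 'learning': 1, 'like': 2, 'machine': 3}
print(ids)         # [0, 2, 3, 1]
```

The numbers themselves carry no meaning yet. They only give every word a fixed position that later steps can count against.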

Bag of Words

The first simple representation

Idea

Bag of Words ignores grammar and word order. It simply counts which words appear. This is not perfect, but it is a very important first step.

Sentence:
I like machine learning

Words:
I, like, machine, learning

Count:
I = 1
like = 1
machine = 1
learning = 1
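
The counting step can be sketched with `Counter` from Python's standard library. The function name `bag_of_words` is a made-up helper, and splitting on spaces is a simplifying assumption. The second half shows what "ignores word order" means in practice: two sentences with opposite meanings get the same bag.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how many times each lowercase word appears."""
    return Counter(sentence.lower().split())

bag = bag_of_words("I like machine learning")
print(dict(bag))  # {'i': 1, 'like': 1, 'machine': 1, 'learning': 1}

# Word order is lost: these two sentences produce the same counts.
first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")
print(first == second)  # True
```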

Example Table

Word       Sentence 1   Sentence 2
machine    1            0
learning   1            1
AI         0            1
enjoy      0            1

The more useful overlap two sentences have, the more related they may be.
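
The table can be rebuilt in a few lines, assuming Sentence 1 is "I like machine learning" and Sentence 2 is "I enjoy learning AI" (the pair used earlier in this lesson). The helper `count_vector` is a hypothetical name for this sketch. Each sentence becomes one column of the table, written as a list.

```python
def count_vector(sentence, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

# The same four words as the table, lowercased.
vocabulary = ["machine", "learning", "ai", "enjoy"]

vector_1 = count_vector("I like machine learning", vocabulary)
vector_2 = count_vector("I enjoy learning AI", vocabulary)

# Words counted in both sentences form the overlap.
shared = [term for term, a, b in zip(vocabulary, vector_1, vector_2) if a and b]

print(vector_1)  # [1, 1, 0, 0]
print(vector_2)  # [0, 1, 1, 1]
print(shared)    # ['learning']
```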

Cosine Similarity

The main formula

Cosine similarity compares the direction of two vectors. If two text vectors point in a similar direction, the similarity score is high. If they point in very different directions, the score is low.

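cosine similarity = 1      means the same direction (very similar)
cosine similarity near 1   means mostly similar
cosine similarity near 0   means mostly different
cosine similarity = 0      means no shared words (no useful similarity)

Word counts are never negative, so for word-count vectors the score always lies between 0 and 1.
For beginner NLP, we can think of cosine similarity as: How close are these two sentences after they become number lists?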
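
Written out, the usual formula is cos(A, B) = (A · B) / (|A| × |B|): the dot product of the two vectors divided by the product of their lengths. The sketch below is a from-scratch version for learning purposes, not the scikit-learn function. The name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions of this example.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length number lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector has no similarity to anything
    return dot / (norm_a * norm_b)

# Same direction, different length: the score is still 1.
print(round(cosine([1, 1, 0], [2, 2, 0]), 3))        # 1.0
# No position is non-zero in both vectors: the score is 0.
print(cosine([1, 0, 0], [0, 1, 0]))                  # 0.0
# One shared word out of several: a score in between.
print(round(cosine([1, 1, 0, 0], [0, 1, 1, 1]), 3))  # 0.408
```

The first call shows why the method compares direction and not size: repeating every word twice doubles the vector but leaves the score unchanged.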

Score near 1

Very similar text.

Python is useful
Python is very useful

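Score medium

Some relation, but not exactly the same.

I like AI
I study machine learning

Note: these two sentences are related in meaning, but the only word they share is “I”, which CountVectorizer ignores by default (it keeps words of two or more characters). A word-count comparison therefore gives 0 here. A pair with a shared word, such as “I like machine learning” and “I enjoy learning AI”, lands in the middle range.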

Score near 0

Mostly unrelated text.

Python programming
The mango is sweet

Python Example 1

CountVectorizer and cosine similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I like machine learning",
    "I enjoy learning AI",
    "The cat is sleeping"
]

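# Learn the vocabulary and count the words in each sentence.
# By default CountVectorizer lowercases text and drops one-letter words such as "I".
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)

# Compare every sentence with every other sentence (a 3 x 3 matrix).
similarity = cosine_similarity(vectors)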

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nVectors:")
print(vectors.toarray())

print("\nSimilarity Matrix:")
print(similarity)
The output will show three things: vocabulary, vectors, and the similarity matrix.

Understanding the Similarity Matrix

Reading the output

When we compare three sentences with each other, Python gives a matrix. Each row and column represents one sentence.

             Sentence 1        Sentence 2        Sentence 3
Sentence 1   1.00              Some similarity   Low similarity
Sentence 2   Some similarity   1.00              Low similarity
Sentence 3   Low similarity    Low similarity    1.00

A sentence compared with itself gives 1.00 because it is identical to itself. The matrix is also symmetric: comparing Sentence 1 with Sentence 2 gives the same score as comparing Sentence 2 with Sentence 1.
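
One way to read such a matrix in code is to skip the diagonal and look for the highest remaining score. The function `most_similar_pair` is a hypothetical helper. The numbers are illustrative values for the three sentences in Python Example 1, rounded to two decimals and assuming CountVectorizer's default settings.

```python
def most_similar_pair(matrix):
    """Return (row, column, score) for the highest score off the diagonal."""
    best = None
    for i in range(len(matrix)):
        # Start at i + 1 to skip self-comparisons and mirrored cells.
        for j in range(i + 1, len(matrix)):
            if best is None or matrix[i][j] > best[2]:
                best = (i, j, matrix[i][j])
    return best

# Illustrative, rounded values (rows and columns are Sentence 1, 2, 3).
matrix = [
    [1.00, 0.33, 0.00],
    [0.33, 1.00, 0.00],
    [0.00, 0.00, 1.00],
]

row, column, score = most_similar_pair(matrix)
print(f"Sentence {row + 1} and Sentence {column + 1}: {score}")
# Sentence 1 and Sentence 2: 0.33
```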

Python Example 2

Compare one question with lesson titles

This example is useful for Programmers Picnic AI-ML Classes. A student asks a question. The program compares the question with stored lesson titles.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

student_question = "What is Python used for?"

lesson_titles = [
    "Python for AI and Machine Learning",
    "Introduction to HTML and CSS",
    "Sorting algorithms in Python",
    "How to deploy Angular on GitHub Pages",
    "Natural Language Processing basics"
]

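# Put the question first so it becomes row 0 of the vectors.
all_texts = [student_question] + lesson_titles

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(all_texts)

# Compare row 0 (the question) with every lesson title row.
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()

# argmax returns the position of the highest score.
best_index = scores.argmax()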
best_score = scores[best_index]
best_lesson = lesson_titles[best_index]

print("Student question:")
print(student_question)

print("\nBest matching lesson:")
print(best_lesson)

print("\nSimilarity score:")
print(best_score)
This is the basic idea behind search, recommendation, and doubt matching.
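
Search and recommendation usually show several results in order, not only the single best one. The sketch below ranks every title by score. It is a standard-library stand-in for the CountVectorizer and cosine_similarity calls, written so the arithmetic is visible. The helper names and the `top_k` parameter are assumptions of this example, and the `\w\w+` pattern is meant to imitate CountVectorizer's default of keeping words with two or more characters.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count words of two or more characters."""
    return Counter(re.findall(r"\w\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_titles(question, titles, top_k=3):
    """Return the top_k (score, title) pairs, highest score first."""
    query = word_counts(question)
    scored = [(cosine(query, word_counts(title)), title) for title in titles]
    return sorted(scored, reverse=True)[:top_k]

lesson_titles = [
    "Python for AI and Machine Learning",
    "Introduction to HTML and CSS",
    "Sorting algorithms in Python",
]

ranked = rank_titles("What is Python used for?", lesson_titles)
for score, title in ranked:
    print(round(score, 3), "->", title)
```

Here the first title ranks highest partly because it shares the common word "for" with the question, which is a reminder that plain word counts treat every word as equally important.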

Mini Project

Similar Sentence Finder

Project Goal

The student enters one sentence. The program compares it with a list of stored sentences. Then it finds the most similar sentence.

Step 1

Create a list of stored sentences.

Step 2

Take one input sentence from the user.

Step 3

Use CountVectorizer and cosine similarity.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stored_sentences = [
    "Python is used for artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style websites",
    "Machine learning helps computers learn from data",
    "Natural language processing works with human language",
    "Sorting algorithms arrange data in order"
]

user_sentence = input("Enter your sentence: ")

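# Row 0 is the user's sentence; rows 1 onward are the stored sentences.
all_sentences = [user_sentence] + stored_sentences

# Turn every sentence into a word-count vector over one shared vocabulary.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(all_sentences)

# One score per stored sentence.
similarity_scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()

# Pick the stored sentence with the highest score.
best_index = similarity_scores.argmax()
best_match = stored_sentences[best_index]
best_score = similarity_scores[best_index]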

print("\nYour sentence:")
print(user_sentence)

print("\nMost similar stored sentence:")
print(best_match)

print("\nSimilarity score:")
print(best_score)

print("\nAll scores:")
for sentence, score in zip(stored_sentences, similarity_scores):
    print(round(score, 3), "->", sentence)
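
One edge case is worth handling in the project. If the input shares no words with any stored sentence, every score is 0 and `argmax` returns position 0, so the first stored sentence would be printed as the "most similar" even though nothing matched. A possible guard is sketched below. The function `pick_match` and the cut-off value 0.2 are assumptions chosen for illustration, not recommended settings.

```python
def pick_match(scores, sentences, min_score=0.2):
    """Return (sentence, score) for the best match, or (None, score) if it is too weak."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    best_score = scores[best_index]
    if best_score < min_score:
        return None, best_score
    return sentences[best_index], best_score

stored = ["HTML is used to create web pages", "CSS is used to style websites"]

print(pick_match([0.0, 0.0], stored))   # (None, 0.0)
print(pick_match([0.15, 0.6], stored))  # ('CSS is used to style websites', 0.6)
```

When the function returns None, the program can print a message such as "No similar sentence found" in place of a misleading match.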

Real-Life Uses

Where text similarity appears

Search

When a user searches for “Python AI”, a website can show pages related to Python and artificial intelligence.

Recommendation

If a student reads a lesson on machine learning, the website can recommend another related lesson.

Doubt Matching

If a student asks a doubt, the system can search old solved doubts and find the closest one.

Chatbots

A chatbot can compare the user message with known questions and choose the closest answer.

Duplicate Detection

Two similar articles, questions, or support tickets can be detected.

Lesson Finder

Programmers Picnic can use similarity to match a student question with the best AI-ML lesson.

Common Mistakes

Important warnings
Mistake: Thinking exact match and similarity are the same
Problem: Exact match needs the same words. Similarity can find related text.
Better thinking: Use similarity when wording may be different.

Mistake: Thinking CountVectorizer understands deep meaning
Problem: CountVectorizer mainly counts words.
Better thinking: It is a beginner method, not a full human-like understanding system.

Mistake: Ignoring text cleaning
Problem: Messy text can reduce result quality.
Better thinking: Clean text before comparison.

Mistake: Expecting perfect results
Problem: Basic similarity can fail when the words are different but the meaning is the same.
Better thinking: Later we learn better methods like TF-IDF and embeddings.
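
A minimal cleaning step might look like the sketch below. The function `clean_text` is a hypothetical helper, and it assumes English text: it lowercases, replaces anything that is not a letter, digit, or space with a space, and collapses repeated spaces. CountVectorizer already lowercases and skips punctuation by default, so a step like this matters most when comparing text in other ways or when the input has other kinds of noise.

```python
import re

def clean_text(text):
    """Lowercase, replace punctuation with spaces, and collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  What is   PYTHON used for??  "))  # what is python used for
print(clean_text("Machine-Learning, explained!"))      # machine learning explained
```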

Practice in Python Editor

Run and modify the examples

Use the embedded editor below. Paste the code examples, run them, and change the sentences. Try your own lesson titles from Programmers Picnic.

Classroom Assignment

Student work

Assignment A

  1. Create 5 sentences about Python.
  2. Create 5 sentences about web development.
  3. Compare one new sentence with all 10 sentences.
  4. Print the most similar sentence.

Assignment B

  1. Create 8 lesson titles for AI-ML classes.
  2. Ask the user to enter a doubt.
  3. Find the closest lesson title.
  4. Print the similarity score.

Revision Test

What is text similarity?

Text similarity means checking how close or related two pieces of text are.

Why do we convert text into numbers?

Because mathematical comparison is easier when text is represented as numbers.

What does CountVectorizer do?

It converts text into word-count vectors.

What is Bag of Words?

It is a simple NLP representation that counts words and ignores grammar or word order.

What is cosine similarity used for?

It is used to compare the direction of two vectors and produce a similarity score.

What score means two texts are very similar?

A score close to 1 means the two texts are very similar.

What score means two texts are mostly unrelated?

A score close to 0 means they are mostly unrelated.

Give one real use of text similarity.

Search, recommendation, chatbot matching, doubt matching, or duplicate detection.

What is the mini project in this lesson?

Similar Sentence Finder.

What is one limitation of CountVectorizer?

It mostly counts words and does not deeply understand meaning like advanced NLP models.

40% Checkpoint

Before moving ahead

Students should now be able to explain:

  • What text similarity means.
  • Why text must be converted into numbers.
  • What Bag of Words means.
  • What CountVectorizer does.
  • What cosine similarity means.
  • How to compare one sentence with many stored sentences.
  • How search and recommendation systems use similarity.
Next 41% to 50%:
Improve text comparison using TF-IDF, stop words, better ranking, and cleaner search results.