From Words to Meaningful Comparison
In this lesson, students learn how NLP compares two pieces of text. We move from simple words to vectors, CountVectorizer, cosine similarity, and finally a mini project: Similar Sentence Finder.
31% to 40% Roadmap
10 small steps
- 31%: What is text similarity? Understanding when two sentences are close, related, or different.
- 32%: Why computers cannot directly understand meaning. Computers need numbers, not raw human language.
- 33%: Converting text into numbers. The first big idea behind NLP models.
- 34%: Bag of Words idea. Represent a sentence by counting words.
- 35%: CountVectorizer concept. Use Python to automatically create word-count vectors.
- 36%: What is Cosine Similarity? Compare the direction of two vectors.
- 37%: Comparing two sentences. Calculate a similarity score between sentences.
- 38%: Mini project. Build a Similar Sentence Finder.
- 39%: Real use cases. Search, recommendation, doubt matching, chatbot matching.
- 40%: Revision and test. Check whether the student can explain and code the concept.
Level 0 Explanation
Beginner friendly
What is text similarity?
Text similarity means checking how close two pieces of text are. The two texts may not be exactly the same, but they may still talk about the same idea.
“I like machine learning” and “I enjoy learning AI” are not identical, but both are related to learning and AI.
Why convert text into numbers?
A computer can store letters, but mathematical comparison is easier with numbers. So NLP converts words and sentences into numerical representations.
Text → Words → Numbers → Comparison → Result
The NLP Pipeline
From sentence to similarity score
Bag of Words
The first simple representation
Idea
Bag of Words ignores grammar and word order. It simply counts which words appear. This is not perfect, but it is a very important first step.
Sentence:
I like machine learning
Words:
I, like, machine, learning
Count:
I = 1
like = 1
machine = 1
learning = 1
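The counting step above can be sketched with Python's standard library. This is a minimal illustration, assuming words are separated by spaces; the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each word in a sentence (split on spaces, order ignored)."""
    return Counter(sentence.split())

bag = bag_of_words("I like machine learning")
print(bag)

# Word order does not matter: a reordered sentence gives the same bag.
reordered = bag_of_words("learning machine like I")
print(bag == reordered)  # prints True
```

The second print shows why the method is called a "bag": only the counts are kept, not the positions of the words.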
Example Table
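Sentence 1 is “I like machine learning” and Sentence 2 is “I enjoy learning AI”.
| Word | Sentence 1 | Sentence 2 |
|---|---|---|
| I | 1 | 1 |
| like | 1 | 0 |
| machine | 1 | 0 |
| learning | 1 | 1 |
| enjoy | 0 | 1 |
| AI | 0 | 1 |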
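A table like the one above can be produced by counting both sentences against one shared vocabulary. The sketch below is plain Python, assumes words are separated by spaces, and keeps the original capitalization; the variable names are chosen only for this example.

```python
sentence_1 = "I like machine learning"
sentence_2 = "I enjoy learning AI"

words_1 = sentence_1.split()
words_2 = sentence_2.split()

# Shared vocabulary: every distinct word, in order of first appearance.
vocabulary = list(dict.fromkeys(words_1 + words_2))

# One count per vocabulary word gives each sentence's vector.
vector_1 = [words_1.count(word) for word in vocabulary]
vector_2 = [words_2.count(word) for word in vocabulary]

print("Word       S1  S2")
for word, count_1, count_2 in zip(vocabulary, vector_1, vector_2):
    print(f"{word:<10} {count_1:<3} {count_2}")
```

Each column of numbers is the vector for one sentence. Both vectors have the same length because they use the same vocabulary, which is what makes them comparable.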
Cosine Similarity
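The main formula
Cosine similarity compares the direction of two vectors. If two text vectors point in a similar direction, the similarity score is high. If they point in very different directions, the score is low.
cosine similarity = (A · B) / (|A| × |B|)
Word counts are never negative, so for word-count vectors the score always lies between 0 and 1.
cosine similarity = 1 means the same direction (the same words in the same proportions)
cosine similarity = 0 means the two texts share no counted words
cosine similarity near 0 means mostly different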
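To make the score less mysterious, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The function below is a minimal sketch in plain Python; the name `cosine` and the sample vectors are assumptions for illustration, and returning 0.0 for an all-zero vector is a choice made for this example.

```python
import math

def cosine(vector_a, vector_b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    length_a = math.sqrt(sum(a * a for a in vector_a))
    length_b = math.sqrt(sum(b * b for b in vector_b))
    if length_a == 0 or length_b == 0:
        return 0.0  # an all-zero vector has no direction; treat as no similarity
    return dot / (length_a * length_b)

# Sample count vectors over a four-word vocabulary.
print(cosine([1, 1, 1, 0], [1, 1, 1, 0]))  # same direction, about 1
print(cosine([1, 1, 0, 0], [0, 0, 1, 1]))  # no shared words, 0
print(cosine([1, 1, 1, 0], [1, 1, 1, 1]))  # one extra word, between 0 and 1
```

Because of floating-point rounding, the first result may differ from 1 in the last decimal places.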
Score near 1
Very similar text.
Python is useful
Python is very useful
Score medium
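Some relation, but not exactly the same. Note that these two sentences share only the word “I”, so a pure word-count method scores them low even though the topics are related.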
I like AI
I study machine learning
Score near 0
Mostly unrelated text.
Python programming
The mango is sweet
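The three example pairs can be checked with the same tools used later in this lesson. This is a sketch under default settings; the helper name `pair_score` is made up for this example. One detail worth knowing: with its default settings, CountVectorizer lowercases text and keeps only tokens of two or more characters, so the one-letter word "I" is not counted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    ("Python is useful", "Python is very useful"),
    ("I like AI", "I study machine learning"),
    ("Python programming", "The mango is sweet"),
]

def pair_score(text_a, text_b):
    """Cosine similarity of two texts using word-count vectors."""
    vectors = CountVectorizer().fit_transform([text_a, text_b])
    return cosine_similarity(vectors[0:1], vectors[1:2])[0, 0]

pair_scores = [pair_score(a, b) for a, b in pairs]
for (a, b), score in zip(pairs, pair_scores):
    print(round(score, 3), "->", repr(a), "vs", repr(b))
```

With these settings the first pair scores about 0.866 and the other two score 0. The middle pair is related in meaning, but it has no counted words in common, which is exactly the limitation of word counting discussed later in this lesson.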
Python Example 1
CountVectorizer and cosine similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
sentences = [
"I like machine learning",
"I enjoy learning AI",
"The cat is sleeping"
]
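# Learn the vocabulary and turn each sentence into a word-count vector.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)
# Compare every sentence with every other sentence.
similarity = cosine_similarity(vectors)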
print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("\nVectors:")
print(vectors.toarray())
print("\nSimilarity Matrix:")
print(similarity)
Understanding the Similarity Matrix
Reading the output
When we compare three sentences with each other, Python gives a matrix. Each row and column represents one sentence.
| | Sentence 1 | Sentence 2 | Sentence 3 |
|---|---|---|---|
| Sentence 1 | 1.00 | Some similarity | Low similarity |
| Sentence 2 | Some similarity | 1.00 | Low similarity |
| Sentence 3 | Low similarity | Low similarity | 1.00 |
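To see actual numbers in place of the labels in the table, the matrix can be printed row by row. This sketch reuses the three sentences from Python Example 1 with default CountVectorizer settings; the row labels are added only for readability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I like machine learning",
    "I enjoy learning AI",
    "The cat is sleeping",
]

vectors = CountVectorizer().fit_transform(sentences)
matrix = cosine_similarity(vectors)

# Print each row with a label so the cells are easy to read.
for row_number, row in enumerate(matrix, start=1):
    cells = "  ".join(f"{value:.2f}" for value in row)
    print(f"Sentence {row_number}: {cells}")

# The matrix is symmetric: comparing 1 with 2 equals comparing 2 with 1.
print(abs(matrix[0, 1] - matrix[1, 0]) < 1e-12)
```

The diagonal is 1 because every sentence is identical to itself. Sentences 1 and 2 share only the word "learning", which gives a score of about 0.33. Sentence 3 shares no words with the others, so its scores are 0.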
Python Example 2
Compare one question with lesson titles
This example is useful for Programmers Picnic AI-ML Classes. A student asks a question. The program compares the question with stored lesson titles.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
student_question = "What is Python used for?"
lesson_titles = [
"Python for AI and Machine Learning",
"Introduction to HTML and CSS",
"Sorting algorithms in Python",
"How to deploy Angular on GitHub Pages",
"Natural Language Processing basics"
]
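# Put the question first so it becomes row 0 of the vector matrix.
all_texts = [student_question] + lesson_titles
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(all_texts)
# Compare row 0 (the question) with the remaining rows (the lesson titles).
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
# argmax returns the position of the highest score.
best_index = scores.argmax()
best_score = scores[best_index]
best_lesson = lesson_titles[best_index]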
print("Student question:")
print(student_question)
print("\nBest matching lesson:")
print(best_lesson)
print("\nSimilarity score:")
print(best_score)
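The example above reports only the single best title. A ranked list shows how close the other titles were, which is often more useful for search. The sketch below uses the same question and titles; showing the top three is an arbitrary choice for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

student_question = "What is Python used for?"
lesson_titles = [
    "Python for AI and Machine Learning",
    "Introduction to HTML and CSS",
    "Sorting algorithms in Python",
    "How to deploy Angular on GitHub Pages",
    "Natural Language Processing basics",
]

vectors = CountVectorizer().fit_transform([student_question] + lesson_titles)
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()

# argsort sorts ascending, so reverse it to get the highest scores first.
ranked_indices = scores.argsort()[::-1]
top_matches = [(lesson_titles[i], float(scores[i])) for i in ranked_indices[:3]]

for title, score in top_matches:
    print(f"{score:.3f} -> {title}")
```

Only the two titles containing "Python" share words with the question. The remaining titles all score 0, so the order among them carries no meaning.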
Mini Project
Similar Sentence Finder
Project Goal
The student enters one sentence. The program compares it with a list of stored sentences. Then it finds the most similar sentence.
- Create a list of stored sentences.
- Take one input sentence from the user.
- Use CountVectorizer and cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
stored_sentences = [
"Python is used for artificial intelligence",
"HTML is used to create web pages",
"CSS is used to style websites",
"Machine learning helps computers learn from data",
"Natural language processing works with human language",
"Sorting algorithms arrange data in order"
]
user_sentence = input("Enter your sentence: ")
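# Put the user's sentence first so it becomes row 0 of the vector matrix.
all_sentences = [user_sentence] + stored_sentences
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(all_sentences)
# Compare row 0 (the user's sentence) with every stored sentence.
similarity_scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
# Pick the stored sentence with the highest score.
best_index = similarity_scores.argmax()
best_match = stored_sentences[best_index]
best_score = similarity_scores[best_index]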
print("\nYour sentence:")
print(user_sentence)
print("\nMost similar stored sentence:")
print(best_match)
print("\nSimilarity score:")
print(best_score)
print("\nAll scores:")
for sentence, score in zip(stored_sentences, similarity_scores):
    print(round(score, 3), "->", sentence)
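One edge case is worth handling in the project. If the input shares no words with any stored sentence, every score is 0 and `argmax` returns index 0, so the first stored sentence is reported as the "best" match. The sketch below wraps the same logic in a function and returns no match when the best score is too low. The function name `find_most_similar` and the cut-off `min_score=0.1` are assumptions for illustration, not standard values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_most_similar(query, candidates, min_score=0.1):
    """Return (best_candidate, score), or (None, 0.0) if nothing is close enough.

    min_score is an assumed cut-off for this sketch, not a standard value.
    """
    vectors = CountVectorizer().fit_transform([query] + candidates)
    scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
    best_index = scores.argmax()
    best_score = float(scores[best_index])
    if best_score < min_score:
        return None, 0.0
    return candidates[best_index], best_score

stored = [
    "Python is used for artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style websites",
]

print(find_most_similar("Which language is used for web pages?", stored))
print(find_most_similar("Bananas taste great", stored))
```

The first call returns the HTML sentence. The second returns `(None, 0.0)`, so an unrelated sentence is not presented as a match.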
Real-Life Uses
Where text similarity appears
Search
When a user searches for “Python AI”, a website can show pages related to Python and artificial intelligence.
Recommendation
If a student reads a lesson on machine learning, the website can recommend another related lesson.
Doubt Matching
If a student asks a doubt, the system can search old solved doubts and find the closest one.
Chatbots
A chatbot can compare the user message with known questions and choose the closest answer.
Duplicate Detection
Two similar articles, questions, or support tickets can be detected.
Lesson Finder
Programmers Picnic can use similarity to match a student question with the best AI-ML lesson.
Common Mistakes
Important warnings
| Mistake | Problem | Better thinking |
|---|---|---|
| Thinking exact match and similarity are the same | Exact match needs the same words. Similarity can find related text. | Use similarity when wording may be different. |
| Thinking CountVectorizer understands deep meaning | CountVectorizer mainly counts words. | It is a beginner method, not a full human-like understanding system. |
| Ignoring text cleaning | Messy text can reduce result quality. | Clean text before comparison. |
| Expecting perfect results | Basic similarity can fail when the words are different but the meaning is the same. | Later we learn better methods like TF-IDF and embeddings. |
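Two of the warnings in the table can be demonstrated directly. The sketch below, using default CountVectorizer settings, compares one pair with the same meaning but different words and one pair with opposite meaning but shared words. The sentences and the helper name `count_similarity` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def count_similarity(text_a, text_b):
    """Cosine similarity of two texts using word-count vectors."""
    vectors = CountVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(vectors[0:1], vectors[1:2])[0, 0])

# Same meaning, different words: no shared words, so the score is 0.
same_meaning = count_similarity("Buy a car", "Purchase an automobile")

# Opposite meaning, shared words: the score is high (about 0.71).
opposite_meaning = count_similarity("I like dogs", "I do not like dogs")

print("Same meaning, different words:", round(same_meaning, 3))
print("Opposite meaning, shared words:", round(opposite_meaning, 3))
```

Word counting rewards shared words, not shared meaning, so the scores come out the opposite way from what a human reader would expect.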
Practice in Python Editor
Run and modify the examples
Use the embedded editor below. Paste the code examples, run them, and change the sentences. Try your own lesson titles from Programmers Picnic.
Classroom Assignment
Student work
Assignment A
- Create 5 sentences about Python.
- Create 5 sentences about web development.
- Compare one new sentence with all 10 sentences.
- Print the most similar sentence.
Assignment B
- Create 8 lesson titles for AI-ML classes.
- Ask the user to enter a doubt.
- Find the closest lesson title.
- Print the similarity score.
Revision Test
What is text similarity?
Text similarity means checking how close or related two pieces of text are.
Why do we convert text into numbers?
Because mathematical comparison is easier when text is represented as numbers.
What does CountVectorizer do?
It converts text into word-count vectors.
What is Bag of Words?
It is a simple NLP representation that counts words and ignores grammar or word order.
What is cosine similarity used for?
It is used to compare the direction of two vectors and produce a similarity score.
What score means two texts are very similar?
A score close to 1 means the two texts are very similar.
What score means two texts are mostly unrelated?
A score close to 0 means they are mostly unrelated.
Give one real use of text similarity.
Search, recommendation, chatbot matching, doubt matching, or duplicate detection.
What is the mini project in this lesson?
Similar Sentence Finder.
What is one limitation of CountVectorizer?
It mostly counts words and does not deeply understand meaning like advanced NLP models.
40% Checkpoint
Before moving ahead
Students should now be able to explain:
- What text similarity means.
- Why text must be converted into numbers.
- What Bag of Words means.
- What CountVectorizer does.
- What cosine similarity means.
- How to compare one sentence with many stored sentences.
- How search and recommendation systems use similarity.
Next step: improve text comparison using TF-IDF, stop words, better ranking, and cleaner search results.