Better Text Search with TF-IDF
In the previous lesson, we compared text using simple word counts. Now we improve the search system by learning stop words, TF, IDF, TF-IDF, TfidfVectorizer, and better ranking with cosine similarity.
41% to 50% Roadmap
10 focused steps
- 41% Problem with simple word counting: Every word gets equal treatment, even weak words.
- 42% What are common words? Words that appear everywhere and do not add strong meaning.
- 43% Stop words: Words like is, am, are, the, a, an, of, to.
- 44% Why all words are not equally important: Some words carry meaning. Some words only support grammar.
- 45% What is TF? Term Frequency means how often a word appears in a document.
- 46% What is IDF? Inverse Document Frequency means how rare or special a word is.
- 47% What is TF-IDF? TF-IDF gives higher importance to useful and meaningful words.
- 48% TF-IDF Vectorizer in Python: Use TfidfVectorizer from scikit-learn.
- 49% Better search using TF-IDF + Cosine Similarity: Rank documents by meaning-heavy words.
- 50% Revision and mini test: Check whether students can explain and code the idea.
The Problem with Simple Word Counting
Why CountVectorizer is not enough
CountVectorizer counts all words
In CountVectorizer, every word is counted. This is simple and useful, but it has a weakness. Common words can appear many times even when they are not important.
Sentence 1:
Python is used for machine learning
Sentence 2:
The book is on the table
Common word:
is
The word "is" appears in both sentences, but it does not tell us the main topic of either one.
Important words carry meaning
Words like Python, machine, learning, NLP, cosine, vector, and similarity carry stronger meaning than words like is, the, a, and of.
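To make the weakness concrete, here is a minimal sketch that counts the words of the two sentences above with Python's standard `collections.Counter`. It is not CountVectorizer itself; splitting on spaces and lowercasing are simplifying assumptions.

```python
from collections import Counter

sentences = [
    "Python is used for machine learning",
    "The book is on the table",
]

# Count lowercase words in each sentence (simple whitespace split)
counts = [Counter(sentence.lower().split()) for sentence in sentences]

# Words that appear in both sentences
shared = sorted(set(counts[0]) & set(counts[1]))

print(shared)            # ['is']
print(counts[1]["the"])  # 2
```

The only shared word is "is", and the most frequent word in the second sentence is "the". Neither word says anything about the topic.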
Common Words and Stop Words
42% and 43%
Stop words are very common words that usually do not carry strong topic meaning. They are important for grammar, but often weak for search ranking.
| Type | Examples | Meaning Strength for Search |
|---|---|---|
| Stop words | is, am, are, the, a, an, of, to, in, on | Usually weak |
| Topic words | Python, AI, machine, learning, NLP, vector | Usually strong |
| Action words | search, compare, classify, predict, recommend | Often useful |
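A small sketch of stop word filtering, assuming a hand-written stop word set taken from the table above. The names `STOP_WORDS` and `remove_stop_words` are hypothetical, and real stop word lists are much longer.

```python
# Hand-written stop word set taken from the table (real lists are longer)
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "of", "to", "in", "on"}

def remove_stop_words(text):
    """Return the lowercase words of text that are not in STOP_WORDS."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The book is on the table"))
# ['book', 'table']
print(remove_stop_words("Python is used for machine learning"))
# ['python', 'used', 'for', 'machine', 'learning']
```

Note that "for" survives because it is not in this tiny set. The quality of filtering depends on the list you use.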
Visual Idea: Word Importance
All words are not equal
In search ranking, some words should have more importance than others.
[Visualization: word importance in search ranking]
TF, IDF, and TF-IDF
45% to 47%
TF
TF means Term Frequency. It asks: how often does this word appear in this document?
TF = word count in current document
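As a quick illustration, TF in this simple form can be computed with `collections.Counter`. The example sentence is made up for this sketch.

```python
from collections import Counter

# Hypothetical example document, chosen so that one word repeats
document = "machine learning makes machine translation possible"

# Term frequency: how many times each word appears in this document
tf = Counter(document.lower().split())

print(tf["machine"])   # 2
print(tf["learning"])  # 1
print(tf["python"])    # 0 (word not in the document)
```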
IDF
IDF means Inverse Document Frequency. It asks: how rare or special is this word across all documents?
Rare word = more important
Common word = less important
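One common textbook form of IDF is log(total documents / documents containing the word). The sketch below uses that form on four short documents. The function name `idf` is hypothetical, and scikit-learn applies a smoothed variant of this formula, so its numbers will differ.

```python
import math

documents = [
    "Python is used for machine learning",
    "Machine learning is a part of artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style web pages",
]

def idf(word, docs):
    """Textbook IDF: log(total documents / documents containing the word)."""
    containing = sum(1 for doc in docs if word in doc.lower().split())
    if containing == 0:
        return 0.0  # assumption: a word seen nowhere gets no weight
    return math.log(len(docs) / containing)

print(round(idf("python", documents), 3))   # 1.386 (appears in 1 of 4)
print(round(idf("machine", documents), 3))  # 0.693 (appears in 2 of 4)
print(idf("is", documents))                 # 0.0   (appears in all 4)
```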
TF-IDF
TF-IDF combines both ideas. A word gets a high score when it appears often in the current document and is rare across the other documents.
TF-IDF = TF × IDF
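The formula can be worked through by hand. This is a minimal sketch assuming raw counts for TF and the textbook log(N / df) form for IDF. The helper name `tf_idf` is hypothetical, and scikit-learn's TfidfVectorizer uses a smoothed IDF plus normalization, so its values are not the same.

```python
import math
from collections import Counter

documents = [
    "Python is used for machine learning",
    "Machine learning is a part of artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style web pages",
]
tokenized = [doc.lower().split() for doc in documents]

def tf_idf(word, doc_index):
    """TF-IDF = (count of word in the document) x log(N / documents with word)."""
    tf = Counter(tokenized[doc_index])[word]
    df = sum(1 for tokens in tokenized if word in tokens)
    if df == 0:
        return 0.0  # assumption: unseen words score zero
    return tf * math.log(len(tokenized) / df)

# Scores for three words of the first document
for word in ["python", "machine", "is"]:
    print(word, round(tf_idf(word, 0), 3))
# python 1.386
# machine 0.693
# is 0.0
```

All three words appear once in the first document, so the difference in score comes entirely from how rare each word is across the four documents.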
TF-IDF Pipeline
From query to ranked results
CountVectorizer vs TF-IDF
Important comparison
CountVectorizer
Counts how many times each word appears.
Python = 1
is = 1
used = 1
for = 1
machine = 1
learning = 1
TF-IDF
Gives weight based on usefulness and rarity.
Python = high
machine = high
learning = high
is = low
for = low
Python Example 1
TfidfVectorizer basics
This example compares a search query with different documents and finds the best match.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = [
    "Python is used for machine learning",
    "Machine learning is a part of artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style web pages"
]
query = "Python machine learning"
all_texts = [query] + documents
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(all_texts)
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
best_index = scores.argmax()
print("Query:")
print(query)
print("\nBest matching document:")
print(documents[best_index])
print("\nScore:")
print(scores[best_index])
print("\nAll scores:")
for document, score in zip(documents, scores):
    print(round(score, 3), "->", document)
stop_words="english" tells the vectorizer to drop the words in scikit-learn's built-in English stop word list. Dropped words are not part of the vocabulary, so they do not affect the scores.
Python Example 2
Programmers Picnic lesson search
This example searches through lesson titles from Programmers Picnic AI-ML Classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lesson_titles = [
    "Python basics for beginners",
    "HTML and CSS introduction",
    "Machine Learning with Python",
    "Angular deployment on GitHub Pages",
    "NLP text similarity using cosine similarity",
    "TF IDF search ranking in natural language processing",
    "Sorting algorithms in Python",
    "NumPy arrays for AI and ML"
]
query = input("Search for a lesson: ")
all_texts = [query] + lesson_titles
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(all_texts)
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
best_index = scores.argmax()
best_lesson = lesson_titles[best_index]
best_score = scores[best_index]
print("\nYour search:")
print(query)
print("\nBest matching lesson:")
print(best_lesson)
print("\nSimilarity score:")
print(round(best_score, 3))
print("\nRanked results:")
ranked = sorted(
    zip(lesson_titles, scores),
    key=lambda item: item[1],
    reverse=True
)
for title, score in ranked:
    print(round(score, 3), "->", title)
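One edge case is worth handling. If the query shares no words with any title, every score is 0, and `argmax()` returns index 0, so the first title would be reported as the best match. A possible guard, sketched with a hypothetical helper `best_match` that works on plain lists:

```python
def best_match(titles, scores):
    """Return the title with the highest score, or None if nothing matched."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_index] == 0:
        return None  # no title shares a word with the query
    return titles[best_index]

titles = ["Python basics for beginners", "HTML and CSS introduction"]

print(best_match(titles, [0.0, 0.62]))  # HTML and CSS introduction
print(best_match(titles, [0.0, 0.0]))   # None
```

The program can then print a message such as "No matching lesson found" instead of a misleading result.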
Understanding Ranked Results
Why ranking matters
A search system should not only find one result. It should rank all results from most useful to least useful.
| Rank | Lesson | Possible Score | Meaning |
|---|---|---|---|
| 1 | Machine Learning with Python | 0.74 | Strong match |
| 2 | NumPy arrays for AI and ML | 0.31 | Some relation |
| 3 | HTML and CSS introduction | 0.00 | Not related |
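Ranking is just sorting by score, highest first. The sketch below reuses the illustrative scores from the table. They are example values, not the output of a real run.

```python
# (title, score) pairs using the illustrative scores from the table
results = [
    ("HTML and CSS introduction", 0.00),
    ("Machine Learning with Python", 0.74),
    ("NumPy arrays for AI and ML", 0.31),
]

# Sort by score, highest first
ranked = sorted(results, key=lambda item: item[1], reverse=True)

for rank, (title, score) in enumerate(ranked, start=1):
    print(rank, title, score)
```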
Mini Project
Better Lesson Search using TF-IDF
Project Goal
The student enters a search query. The program compares it with a list of lesson titles and descriptions. Then it returns the best matching lessons in ranked order.
- Create a list of lesson dictionaries.
- Combine title and description into searchable text.
- Use TF-IDF and cosine similarity to rank results.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lessons = [
    {
        "title": "Python Basics",
        "description": "Learn variables, input, output, arithmetic operators and beginner programming."
    },
    {
        "title": "NumPy for AI and ML",
        "description": "Learn arrays, matrix operations, numerical computing and AI data handling."
    },
    {
        "title": "Machine Learning with Python",
        "description": "Learn supervised learning, training data, prediction and model evaluation."
    },
    {
        "title": "NLP Text Similarity",
        "description": "Learn CountVectorizer, cosine similarity and sentence comparison."
    },
    {
        "title": "TF-IDF Search Ranking",
        "description": "Learn stop words, term frequency, inverse document frequency and better text search."
    },
    {
        "title": "HTML and CSS Web Design",
        "description": "Learn how to create beautiful web pages using HTML and CSS."
    },
    {
        "title": "Angular GitHub Pages Deployment",
        "description": "Learn how to build and deploy Angular apps on GitHub Pages."
    }
]
search_texts = []
for lesson in lessons:
    combined_text = lesson["title"] + " " + lesson["description"]
    search_texts.append(combined_text)
query = input("What do you want to learn? ")
all_texts = [query] + search_texts
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(all_texts)
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
ranked_results = sorted(
zip(lessons, scores),
key=lambda item: item[1],
reverse=True
)
print("\nSearch query:")
print(query)
print("\nBest lessons:")
for lesson, score in ranked_results:
    if score > 0:
        print("\nTitle:", lesson["title"])
        print("Description:", lesson["description"])
        print("Score:", round(score, 3))
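The project code fits the vectorizer on the query together with the lessons. A possible variation, sketched here under assumptions, fits once on the lesson texts and only transforms each new query, which suits a program that answers many searches. `build_index` and `search_lessons` are hypothetical helper names, and the shortened lesson list is for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Shortened lesson list for illustration
lessons = [
    {"title": "Python Basics", "description": "Learn variables, input and output."},
    {"title": "HTML and CSS Web Design", "description": "Learn how to create web pages."},
    {"title": "TF-IDF Search Ranking", "description": "Learn stop words and better text search."},
]

def build_index(lessons):
    """Fit TF-IDF once on the combined title + description texts."""
    texts = [lesson["title"] + " " + lesson["description"] for lesson in lessons]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix

def search_lessons(query, lessons, vectorizer, matrix, top_k=3):
    """Return up to top_k (title, score) pairs with a score above zero."""
    query_vector = vectorizer.transform([query])  # reuse the fitted vocabulary
    scores = cosine_similarity(query_vector, matrix).flatten()
    ranked = sorted(zip(lessons, scores), key=lambda item: item[1], reverse=True)
    return [(lesson["title"], round(float(score), 3))
            for lesson, score in ranked[:top_k] if score > 0]

vectorizer, matrix = build_index(lessons)
print(search_lessons("web design CSS", lessons, vectorizer, matrix))
print(search_lessons("quantum physics", lessons, vectorizer, matrix))  # []
```

With this design, words in the query that never occur in any lesson are simply ignored, and a query with no known words returns an empty list.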
Classroom Activities
Practice tasks
Activity A: Stop Word Check
- Write 5 English sentences.
- Underline common words like is, the, a, an, of, to.
- Circle topic words like Python, AI, ML, search.
- Explain which words are more useful for search.
Activity B: Lesson Search
- Create 10 lesson titles.
- Create one search query.
- Use TfidfVectorizer.
- Print top 3 ranked results.
Activity C: Compare Two Methods
- Run one program using CountVectorizer.
- Run another using TfidfVectorizer.
- Use the same documents and query.
- Compare the results.
Activity D: Brand Example
- Use Programmers Picnic lesson names.
- Search for “Python AI beginner”.
- Search for “web design CSS”.
- Check whether the correct lesson comes first.
Practice in Python Editor
Run the examples
Paste the Python examples into the embedded editor. Change the documents, lesson titles, and search queries.
Common Mistakes
Important warnings
| Mistake | Problem | Better Thinking |
|---|---|---|
| Thinking TF-IDF understands full human meaning | TF-IDF is still based on words and weights. | It is better than simple count, but not a full AI brain. |
| Forgetting stop words | Common words can disturb search ranking. | Use stop_words="english" for beginner English examples. |
| Expecting perfect results with tiny data | Small document lists may give limited results. | Use more lessons and richer descriptions. |
| Only printing the best result | The user may need more than one result. | Print ranked top results. |
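The first mistake in the table can be demonstrated with a small sketch. It uses plain word counts rather than TF-IDF weights (a simplifying assumption), but the point carries over: when two texts share no words, any word-based cosine score is zero, even if a human sees the same meaning. The function name `cosine` is hypothetical.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between simple word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Same meaning for a human, but no shared words
print(cosine("car repair guide", "automobile maintenance manual"))  # 0.0
# Two of three words shared
print(round(cosine("car repair guide", "car repair manual"), 3))    # 0.667
```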
Revision Test
What is the problem with simple word counting?
It treats all words equally, even common weak words like is, the, a, and of.
What are stop words?
Stop words are common words that usually do not carry strong topic meaning.
Give five examples of stop words.
is, am, are, the, a, an, of, to, in, on.
Why is Python usually more important than is?
Python tells us the topic. The word is mostly supports grammar.
What is TF?
TF means Term Frequency. It measures how often a word appears in a document.
What is IDF?
IDF means Inverse Document Frequency. It measures how rare or special a word is across documents.
What is TF-IDF?
TF-IDF is a word-weighting method that gives importance to useful words and reduces the power of common words.
Why is TF-IDF better than CountVectorizer for search?
Because it does not only count words. It also considers how important or rare the words are.
Which Python class do we use for TF-IDF?
We use TfidfVectorizer from scikit-learn.
What is the mini project in this lesson?
Better Lesson Search using TF-IDF.
50% Checkpoint
Before moving ahead
Students should now be able to explain:
- Why simple word counting is limited.
- What common words and stop words are.
- Why all words are not equally important.
- What TF means.
- What IDF means.
- What TF-IDF means.
- How to use TfidfVectorizer in Python.
- How to build a better lesson search program.
From here, we can move from TF-IDF search to text classification, intent detection, and simple chatbot-style matching.