NLP 41% to 50% • TF-IDF Search

Better Text Search with TF-IDF

In the previous lesson, we compared text using simple word counts. Now we improve the search system by learning stop words, TF, IDF, TF-IDF, TfidfVectorizer, and better ranking with cosine similarity.

41% to 50% Roadmap

10 focused steps

41%: Problem with simple word counting
Every word gets equal treatment, even weak words.
42%: What are common words?
Words that appear everywhere and do not add strong meaning.
43%: Stop words
Words like is, am, are, the, a, an, of, to.
44%: Why all words are not equally important
Some words carry meaning. Some words only support grammar.
45%: What is TF?
Term Frequency means how often a word appears in a document.
46%: What is IDF?
Inverse Document Frequency means how rare or special a word is.
47%: What is TF-IDF?
TF-IDF gives higher importance to useful and meaningful words.
48%: TF-IDF Vectorizer in Python
Use TfidfVectorizer from scikit-learn.
49%: Better search using TF-IDF + Cosine Similarity
Rank documents by meaning-heavy words.
50%: Revision and mini test
Check whether students can explain and code the idea.

The Problem with Simple Word Counting

Why CountVectorizer is not enough

CountVectorizer counts all words

In CountVectorizer, every word is counted. This is simple and useful, but it has a weakness. Common words can appear many times even when they are not important.

Sentence 1:
Python is used for machine learning

Sentence 2:
The book is on the table

Common word:
is

The word "is" appears in both sentences, but it does not tell us the main topic of either one.

Important words carry meaning

Words like Python, machine, learning, NLP, cosine, vector, and similarity carry stronger meaning than words like is, the, a, and of.

TF-IDF helps us give more importance to meaningful words and less importance to very common words.
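
The weakness described above can be seen by counting the two example sentences directly. This is a minimal sketch using scikit-learn's CountVectorizer with its default settings; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Python is used for machine learning",
    "The book is on the table",
]

# Default settings: lowercase the text and count every word of 2+ characters.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences).toarray()

# vocabulary_ maps each word to its column in the count matrix.
vocab = vectorizer.vocabulary_

# Each printed array holds the count in sentence 1, then sentence 2.
for word in ["is", "the", "python"]:
    print(word, "->", counts[:, vocab[word]])
# is -> [1 1]
# the -> [0 2]
# python -> [1 0]
```

In sentence 2 the word "the" gets the largest count of all, although it says nothing about the topic.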

Common Words and Stop Words

42% and 43%

Stop words are very common words that usually do not carry strong topic meaning. They are important for grammar, but often weak for search ranking.

Type	Examples	Meaning Strength for Search
Stop words	is, am, are, the, a, an, of, to, in, on	Usually weak
Topic words	Python, AI, machine, learning, NLP, vector	Usually strong
Action words	search, compare, classify, predict, recommend	Often useful

We do not always remove every common word. But for beginner search examples, removing English stop words often improves results.
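
To see what stop-word removal does in practice, the sketch below checks the stop words from the table above against scikit-learn's built-in English list and then tokenizes one sentence with that list switched on. It assumes scikit-learn's own list; other libraries ship different lists.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# The stop words listed in the table above.
table_words = ["is", "am", "are", "the", "a", "an", "of", "to", "in", "on"]
all_listed = all(word in ENGLISH_STOP_WORDS for word in table_words)
print(all_listed)

# build_analyzer() returns the function the vectorizer uses to split text
# into tokens; with stop_words="english" it also drops the stop words.
analyzer = CountVectorizer(stop_words="english").build_analyzer()
tokens = analyzer("The book is on the table")
print(tokens)  # ['book', 'table']
```

Only the two topic words survive, which is why the option helps in small beginner examples.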

Visual Idea: Word Importance

All words are not equal

In search ranking, some words should have more importance than others. Below is a beginner-friendly visualization.

[Chart: importance of each word for search. the: low, Python: high, learning: high, NLP: high]

TF, IDF, and TF-IDF

45% to 47%

TF

TF means Term Frequency. It asks: how often does this word appear in this document?

TF = word count in current document

If "Python" appears many times in one lesson, that lesson is probably related to Python.
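
As a small illustration of raw term frequency, the sketch below counts words in one made-up document. The sentence and the split-on-spaces tokenizer are assumptions chosen to keep the example short.

```python
from collections import Counter

# A made-up document for illustration.
document = "Python lists and Python loops make Python easy"

# Simplest possible tokenizer for this sketch: lowercase, then split on spaces.
words = document.lower().split()

# Raw term frequency: how many times each word occurs in this one document.
tf = Counter(words)

print(tf["python"])  # 3
print(tf["loops"])   # 1
```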

IDF

IDF means Inverse Document Frequency. It asks: how rare or special is this word across all documents?

Rare word = more important
Common word = less important

A word appearing everywhere becomes less powerful for ranking.
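
One common textbook way to turn this idea into a number is IDF = log(N / df), where N is the number of documents and df is the number of documents that contain the word. The sketch below applies that formula to the four sentences from this lesson's first Python example, lowercased. The helper name idf is only illustrative, and scikit-learn uses a slightly different, smoothed formula.

```python
import math

documents = [
    "python is used for machine learning",
    "machine learning is a part of artificial intelligence",
    "html is used to create web pages",
    "css is used to style web pages",
]

def idf(word, docs):
    """Textbook IDF: log(N / df). Assumes the word occurs in at least one document."""
    containing = sum(1 for doc in docs if word in doc.split())
    return math.log(len(docs) / containing)

for word in ["is", "used", "machine", "python"]:
    print(word, round(idf(word, documents), 3))
# is 0.0
# used 0.288
# machine 0.693
# python 1.386
```

The word "is" appears in every document, so its IDF is zero. "python" appears in only one document, so it gets the highest value.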

TF-IDF

TF-IDF combines both ideas. A word gets a high score when it appears often in the current document and in few of the other documents.

TF-IDF = TF × IDF

This is better than simple word counting for search systems.
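
The multiplication can be worked through by hand on a tiny document set. This sketch assumes the textbook formula IDF = log(N / df) and a split-on-spaces tokenizer; the function name tf_idf is only illustrative. TfidfVectorizer by default uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then scales each document vector to length 1, so its numbers will differ from these.

```python
import math
from collections import Counter

documents = [
    "python is used for machine learning",
    "machine learning is a part of artificial intelligence",
    "html is used to create web pages",
    "css is used to style web pages",
]
tokenized = [doc.split() for doc in documents]

def tf_idf(doc_index):
    """Return {word: TF * IDF} for one document, using IDF = log(N / df)."""
    tf = Counter(tokenized[doc_index])
    n_docs = len(tokenized)
    weights = {}
    for word, count in tf.items():
        # df: number of documents that contain the word at least once.
        df = sum(1 for tokens in tokenized if word in tokens)
        weights[word] = count * math.log(n_docs / df)
    return weights

weights = tf_idf(0)
for word, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(word, round(weight, 3))
```

In the first document, "python" scores highest and "is" scores zero. Note that "for" scores as high as "python" here, because it happens to occur in only one document of this tiny set. That is one reason removing stop words still helps on small data.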

TF-IDF Pipeline

From query to ranked results
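
The pipeline from query to ranked results can be sketched as one small function. This is one possible arrangement, not the only one: it learns the vocabulary from the documents alone and then transforms the query, whereas the examples later in this lesson fit the vectorizer on the query and the documents together. The function name rank_documents is only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_documents(query, documents):
    """Return (score, document) pairs, best match first."""
    # Step 1: learn the vocabulary and IDF weights from the documents.
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(documents)

    # Step 2: turn the query into a vector over the same vocabulary.
    query_vector = vectorizer.transform([query])

    # Step 3: cosine similarity between the query and every document.
    scores = cosine_similarity(query_vector, doc_vectors).flatten()

    # Step 4: sort from highest to lowest score.
    return sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)

documents = [
    "Python is used for machine learning",
    "Machine learning is a part of artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style web pages",
]

ranked = rank_documents("Python machine learning", documents)
for score, document in ranked:
    print(round(score, 3), "->", document)
```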

CountVectorizer vs TF-IDF

Important comparison

CountVectorizer

Counts how many times each word appears.

Python = 1
is = 1
used = 1
for = 1
machine = 1
learning = 1

It may give unnecessary importance to common words.

TF-IDF

Gives weight based on usefulness and rarity.

Python = high
machine = high
learning = high
is = low
for = low

It is usually better for search and ranking.
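
The comparison can be checked in code. The sketch below vectorizes the same four sentences with both classes and prints the values for the first sentence. Stop-word removal is left off on purpose so that "is" stays visible; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "Python is used for machine learning",
    "Machine learning is a part of artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style web pages",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

# Row 0 of each matrix describes the first sentence.
count_row = count_vec.fit_transform(documents).toarray()[0]
tfidf_row = tfidf_vec.fit_transform(documents).toarray()[0]

for word in ["python", "machine", "is"]:
    count = count_row[count_vec.vocabulary_[word]]
    weight = tfidf_row[tfidf_vec.vocabulary_[word]]
    print(word, "count =", count, "tfidf =", round(weight, 3))
```

All three words have a count of 1, but their TF-IDF weights differ: "python" (one document) is highest, "machine" (two documents) is in the middle, and "is" (all four documents) is lowest.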

Python Example 1

TfidfVectorizer basics

This example compares a search query with different documents and finds the best match.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is used for machine learning",
    "Machine learning is a part of artificial intelligence",
    "HTML is used to create web pages",
    "CSS is used to style web pages"
]

query = "Python machine learning"

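# Put the query first so that row 0 of the matrix is the query.
all_texts = [query] + documents

# Build TF-IDF vectors, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(all_texts)

# Compare the query (row 0) with every document (rows 1 onward).
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()

# Position of the highest score.
best_index = scores.argmax()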

print("Query:")
print(query)

print("\nBest matching document:")
print(documents[best_index])

print("\nScore:")
print(scores[best_index])

print("\nAll scores:")
for document, score in zip(documents, scores):
    print(round(score, 3), "->", document)

stop_words="english" tells the vectorizer to remove the words on scikit-learn's built-in English stop-word list before any weights are computed, so those words never influence the score.

Python Example 2

Programmers Picnic lesson search

This example searches through lesson titles from Programmers Picnic AI-ML Classes.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lesson_titles = [
    "Python basics for beginners",
    "HTML and CSS introduction",
    "Machine Learning with Python",
    "Angular deployment on GitHub Pages",
    "NLP text similarity using cosine similarity",
    "TF IDF search ranking in natural language processing",
    "Sorting algorithms in Python",
    "NumPy arrays for AI and ML"
]

query = input("Search for a lesson: ")

all_texts = [query] + lesson_titles

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(all_texts)

scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()

best_index = scores.argmax()
best_lesson = lesson_titles[best_index]
best_score = scores[best_index]

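print("\nYour search:")
print(query)

# argmax returns position 0 even when every score is 0, so check for a real match.
if best_score > 0:
    print("\nBest matching lesson:")
    print(best_lesson)

    print("\nSimilarity score:")
    print(round(best_score, 3))
else:
    print("\nNo matching lesson found.")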

print("\nRanked results:")

ranked = sorted(
    zip(lesson_titles, scores),
    key=lambda item: item[1],
    reverse=True
)

for title, score in ranked:
    print(round(score, 3), "->", title)

This is the beginning of a real search engine for your own lesson website.

Understanding Ranked Results

Why ranking matters

A search system should not only find one result. It should rank all results from most useful to least useful.

Rank	Lesson	Possible Score	Meaning
1	Machine Learning with Python	0.74	Strong match
2	NumPy arrays for AI and ML	0.31	Some relation
3	HTML and CSS introduction	0.00	Not related

In a real website, we can show the highest ranking lessons first.
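
Showing only the highest ranking lessons could look like the sketch below. It keeps at most k results and drops lessons that score zero. The function name top_matches and the default k=3 are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_matches(query, titles, k=3):
    """Return up to k (title, score) pairs with a score above zero, best first."""
    vectorizer = TfidfVectorizer(stop_words="english")
    title_vectors = vectorizer.fit_transform(titles)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, title_vectors).flatten()

    # argsort sorts ascending, so reverse it and keep the first k positions.
    best = np.argsort(scores)[::-1][:k]
    return [(titles[i], float(scores[i])) for i in best if scores[i] > 0]

lesson_titles = [
    "Python basics for beginners",
    "HTML and CSS introduction",
    "Machine Learning with Python",
    "Angular deployment on GitHub Pages",
    "NLP text similarity using cosine similarity",
    "TF IDF search ranking in natural language processing",
    "Sorting algorithms in Python",
    "NumPy arrays for AI and ML",
]

python_results = top_matches("Python", lesson_titles)
for title, score in python_results:
    print(round(score, 3), "->", title)

# A query with no known words matches nothing.
print(top_matches("zebra", lesson_titles))  # []
```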

Mini Project

Better Lesson Search using TF-IDF

Project Goal

The student enters a search query. The program compares it with a list of lesson titles and descriptions. Then it returns the best matching lessons in ranked order.

Step 1

Create a list of lesson dictionaries.

Step 2

Combine title and description into searchable text.

Step 3

Use TF-IDF and cosine similarity to rank results.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lessons = [
    {
        "title": "Python Basics",
        "description": "Learn variables, input, output, arithmetic operators and beginner programming."
    },
    {
        "title": "NumPy for AI and ML",
        "description": "Learn arrays, matrix operations, numerical computing and AI data handling."
    },
    {
        "title": "Machine Learning with Python",
        "description": "Learn supervised learning, training data, prediction and model evaluation."
    },
    {
        "title": "NLP Text Similarity",
        "description": "Learn CountVectorizer, cosine similarity and sentence comparison."
    },
    {
        "title": "TF-IDF Search Ranking",
        "description": "Learn stop words, term frequency, inverse document frequency and better text search."
    },
    {
        "title": "HTML and CSS Web Design",
        "description": "Learn how to create beautiful web pages using HTML and CSS."
    },
    {
        "title": "Angular GitHub Pages Deployment",
        "description": "Learn how to build and deploy Angular apps on GitHub Pages."
    }
]

search_texts = []

for lesson in lessons:
    combined_text = lesson["title"] + " " + lesson["description"]
    search_texts.append(combined_text)

query = input("What do you want to learn? ")

all_texts = [query] + search_texts

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(all_texts)

scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()

ranked_results = sorted(
    zip(lessons, scores),
    key=lambda item: item[1],
    reverse=True
)

print("\nSearch query:")
print(query)

print("\nBest lessons:")

for lesson, score in ranked_results:
    if score > 0:
        print("\nTitle:", lesson["title"])
        print("Description:", lesson["description"])
        print("Score:", round(score, 3))

Classroom Activities

Practice tasks

Activity A: Stop Word Check

Write 5 English sentences.
Underline common words like is, the, a, an, of, to.
Underline topic words like Python, AI, ML, search.
Explain which words are more useful for search.

Activity B: Lesson Search

Create 10 lesson titles.
Create one search query.
Use TfidfVectorizer.
Print top 3 ranked results.

Activity C: Compare Two Methods

Run one program using CountVectorizer.
Run another using TfidfVectorizer.
Use the same documents and query.
Compare the results.

Activity D: Brand Example

Use Programmers Picnic lesson names.
Search for “Python AI beginner”.
Search for “web design CSS”.
Check whether the correct lesson comes first.

Practice in Python Editor

Run the examples

Paste the Python examples into the embedded editor. Change the documents, lesson titles, and search queries.

Common Mistakes

Important warnings

Mistake	Problem	Better Thinking
Thinking TF-IDF understands full human meaning	TF-IDF is still based on words and weights.	It is better than simple count, but not a full AI brain.
Forgetting stop words	Common words can disturb search ranking.	Use `stop_words="english"` for beginner English examples.
Expecting perfect results with tiny data	Small document lists may give limited results.	Use more lessons and richer descriptions.
Only printing the best result	The user may need more than one result.	Print ranked top results.
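
The first mistake in the table can be demonstrated with a query that means nearly the same as a document but shares no words with it. The documents and query below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python programming for beginners",
    "HTML and CSS web design",
]

# Similar in meaning to the first document, but with no word in common.
query = "coding tutorial for newcomers"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).flatten()
print(scores)  # [0. 0.]
```

TF-IDF only matches words that actually appear in both texts, so synonyms such as "coding" and "programming" do not count as a match.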

Revision Test

What is the problem with simple word counting?

It treats all words equally, even common weak words like is, the, a, and of.

What are stop words?

Stop words are common words that usually do not carry strong topic meaning.

Give five examples of stop words.

Any five of: is, am, are, the, a, an, of, to, in, on.

Why is "Python" usually more important than "is"?

"Python" tells us the topic. The word "is" mostly supports grammar.

What is TF?

TF means Term Frequency. It measures how often a word appears in a document.

What is IDF?

IDF means Inverse Document Frequency. It measures how rare or special a word is across documents.

What is TF-IDF?

TF-IDF is a word-weighting method that gives importance to useful words and reduces the power of common words.

Why is TF-IDF better than CountVectorizer for search?

Because it does not only count words. It also considers how important or rare the words are.

Which Python class do we use for TF-IDF?

We use TfidfVectorizer from scikit-learn.

What is the mini project in this lesson?

Better Lesson Search using TF-IDF.

50% Checkpoint

Before moving ahead

Students should now be able to explain:

Why simple word counting is limited.
What common words and stop words are.
Why all words are not equally important.
What TF means.
What IDF means.
What TF-IDF means.
How to use TfidfVectorizer in Python.
How to build a better lesson search program.

Next 51% to 60%:
We can move from TF-IDF search to text classification, intent detection, and simple chatbot-style matching.