Level:

21% to 30%

Previous Lesson:

Text cleaning and preprocessing

This Lesson:

Bag of Words and count vectors

Goal:

Represent text as numbers using word counts.

1. What We Have Learned So Far

In the first part of NLP, we learned that computers do not understand text directly. They understand numbers.

Then we learned how to clean messy text before giving it to an NLP program.

Raw Text ➡ Clean Text ➡ Tokens ➡ Numbers

This lesson focuses on one of the simplest ways to convert text into numbers: Bag of Words.

2. What is Bag of Words?

Bag of Words

Bag of Words is a simple NLP method where we represent text by counting words.

It ignores grammar and word order. It only records which words are present and how many times each one appears.

Sentence:

Python AI Python ML

Bag of Words thinking:

Word	Count
python	2
ai	1
ml	1

Simple meaning: Bag of Words converts a sentence into word counts.
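
The counting idea above can be sketched in a few lines with `collections.Counter` from Python's standard library. Lowercasing before splitting is an assumption carried over from the previous lesson on cleaning, and the variable names are only illustrative.

```python
from collections import Counter

sentence = "Python AI Python ML"

# Lowercase first so that "Python" and "python" count as the same word
words = sentence.lower().split()

# Counter maps each word to the number of times it appears
bag = Counter(words)

print(dict(bag))
```

Looking up `bag["python"]` gives 2, which matches the table above.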

3. Why Is It Called a Bag?

Imagine putting all the words of a sentence into a bag. The bag holds the words and their counts, but it does not remember the order in which they appeared.

Sentence 1

Python teaches AI

Sentence 2

AI teaches Python

Both sentences contain the same words:

python

teaches

ai

Bag of Words is simple and useful, but because it discards word order, two sentences with different meanings can end up with the same representation.
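
A small check, written as a sketch with the standard `collections.Counter`, shows this for the two sentences in this section. The strings differ, but their bags are equal.

```python
from collections import Counter

sentence1 = "Python teaches AI"
sentence2 = "AI teaches Python"

bag1 = Counter(sentence1.lower().split())
bag2 = Counter(sentence2.lower().split())

# The sentences are different strings...
print(sentence1 == sentence2)  # False

# ...but their word counts are identical
print(bag1 == bag2)  # True
```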

4. Vocabulary in NLP

Vocabulary means the list of unique words found in our text collection.

Sentences:

Python teaches AI
Python teaches ML
Champak Roy teaches NLP

Vocabulary:

["python", "teaches", "ai", "ml", "champak", "roy", "nlp"]

Once we have a vocabulary, we can convert every sentence into numbers.

Word	Meaning in Vocabulary
python	A word found in the text collection
teaches	A repeated action word
ai	A topic word
ml	A topic word
nlp	A topic word
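
One way to picture the step from vocabulary to numbers is to give every word a fixed position. The sketch below builds such a lookup with a dictionary. The name `word_to_index` is illustrative and not part of the lesson.

```python
vocabulary = ["python", "teaches", "ai", "ml", "champak", "roy", "nlp"]

# Give every vocabulary word a fixed position (column number)
word_to_index = {word: index for index, word in enumerate(vocabulary)}

print(word_to_index["python"])  # 0
print(word_to_index["nlp"])     # 6
```

Every vector built from this vocabulary has 7 positions, one per word, always in the same order.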

5. Binary Vector

A binary vector uses only 0 and 1.

1 means the word is present.
0 means the word is absent.

Vocabulary:

["python", "ai", "ml", "nlp"]

Sentence:

python nlp

Binary Vector:

[1, 0, 0, 1]

Vocabulary Word	Present in sentence?	Number
python	Yes	1
ai	No	0
ml	No	0
nlp	Yes	1

6. Count Vector

A count vector stores how many times each vocabulary word appears in a sentence.

Vocabulary:

["python", "ai", "ml", "nlp"]

Sentence:

python ai python nlp python

Count Vector:

[3, 1, 0, 1]

Vocabulary Word	Count in sentence
python	3
ai	1
ml	0
nlp	1

Binary vector checks presence. Count vector checks frequency.
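
The two vectors are closely related: a binary vector can be derived from a count vector by turning every non-zero count into 1. A minimal sketch using the count vector from this section:

```python
count_vector = [3, 1, 0, 1]

# Any count above zero becomes 1; zero stays 0
binary_vector = [1 if count > 0 else 0 for count in count_vector]

print(binary_vector)  # [1, 1, 0, 1]
```

The reverse is not possible, because the binary vector no longer knows how many times each word appeared.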

7. Binary Vector vs Count Vector

Feature	Binary Vector	Count Vector
Meaning	Word present or absent	How many times word appears
Values	Only 0 or 1	0, 1, 2, 3, 4...
Example	[1, 0, 1]	[3, 0, 2]
Useful when	We only need to know whether a word exists	Word repetition is important

8. Document-Term Matrix

A document-term matrix is a table where rows are documents and columns are vocabulary words.

Documents:

Doc 1: python ai
Doc 2: python ml
Doc 3: ai nlp

Vocabulary:

["python", "ai", "ml", "nlp"]

Document	python	ai	ml	nlp
Doc 1	1	1	0	0
Doc 2	1	0	1	0
Doc 3	0	1	0	1

This table matters because many of the first machine learning models a beginner meets can take it directly as input: each row is one example and each column is one feature.
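
As a sketch of how to read the table, the code below stores it as a list of rows and adds up each column. A row describes one document. A column total tells how often a word appears in the whole collection.

```python
vocabulary = ["python", "ai", "ml", "nlp"]

matrix = [
    [1, 1, 0, 0],  # Doc 1: python ai
    [1, 0, 1, 0],  # Doc 2: python ml
    [0, 1, 0, 1],  # Doc 3: ai nlp
]

# zip(*matrix) turns rows into columns, so each column can be summed
column_totals = [sum(column) for column in zip(*matrix)]

for word, total in zip(vocabulary, column_totals):
    print(word, "appears", total, "time(s) in the collection")
```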

9. Bag of Words Limitations

Bag of Words is beginner-friendly and useful, but it has limitations.

Limitation	Meaning
Ignores word order	Sentences that contain the same words in a different order get the same vector.
Ignores grammar	It counts words but does not model how they relate to each other grammatically.
Large vocabulary problem	Each unique word adds one position, so a large vocabulary creates long vectors that are mostly zeros.
Meaning problem	Synonyms are treated as unrelated words unless we map them to a common form first.

Even with limitations, Bag of Words is one of the best first steps for understanding NLP.
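
The meaning problem can be seen in a small sketch. The two example phrases below are made up for illustration and do not come from the lesson. They say nearly the same thing, yet their count vectors share no word.

```python
text1 = "good movie"
text2 = "great film"

# Vocabulary built from both texts, sorted alphabetically
vocabulary = sorted(set(text1.split()) | set(text2.split()))

vector1 = [text1.split().count(word) for word in vocabulary]
vector2 = [text2.split().count(word) for word in vocabulary]

print(vocabulary)  # ['film', 'good', 'great', 'movie']
print(vector1)     # [0, 1, 0, 1]
print(vector2)     # [1, 0, 1, 0]

# Count the positions where both vectors have a non-zero value
shared = sum(1 for a, b in zip(vector1, vector2) if a > 0 and b > 0)
print("Shared words:", shared)  # 0
```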

10. Python Examples: 21% to 30%

These examples move from vectors built by hand to vectors built automatically with CountVectorizer. Start with the pure Python examples, then move on to scikit-learn.

Example 1: Build Vocabulary From Sentences

sentences = [
    "python teaches ai",
    "python teaches ml",
    "champak roy teaches nlp"
]

vocabulary = []

for sentence in sentences:
    words = sentence.split()

    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)

Example 2: Create Binary Vector Manually

vocabulary = ["python", "ai", "ml", "nlp"]

sentence = "python nlp"

words = sentence.split()

vector = []

for vocab_word in vocabulary:
    if vocab_word in words:
        vector.append(1)
    else:
        vector.append(0)

print(vector)

Example 3: Create Count Vector Manually

vocabulary = ["python", "ai", "ml", "nlp"]

sentence = "python ai python nlp python"

words = sentence.split()

vector = []

for vocab_word in vocabulary:
    count = words.count(vocab_word)
    vector.append(count)

print(vector)

Example 4: Convert Many Sentences Into Count Vectors

vocabulary = ["python", "ai", "ml", "nlp"]

sentences = [
    "python ai",
    "python ml",
    "ai nlp",
    "python python ai"
]

for sentence in sentences:
    words = sentence.split()
    vector = []

    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))

    print(sentence, "-->", vector)

Example 5: Build Document-Term Matrix Manually

documents = [
    "python ai",
    "python ml",
    "ai nlp",
    "python python ai"
]

vocabulary = []

for document in documents:
    words = document.split()

    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

print("Vocabulary:", vocabulary)

matrix = []

for document in documents:
    words = document.split()
    vector = []

    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))

    matrix.append(vector)

print("Document-Term Matrix:")

for row in matrix:
    print(row)

Example 6: Clean Text Before Bag of Words

import string

documents = [
    "Python teaches AI!!!",
    "Python teaches ML.",
    "Champak Roy teaches NLP?"
]

clean_documents = []

for document in documents:
    document = document.lower()

    clean_text = ""

    for ch in document:
        if ch not in string.punctuation:
            clean_text = clean_text + ch

    clean_text = " ".join(clean_text.split())
    clean_documents.append(clean_text)

print(clean_documents)

Example 7: Bag of Words After Cleaning

import string

documents = [
    "Python teaches AI!!!",
    "Python teaches ML.",
    "Champak Roy teaches NLP?"
]

clean_documents = []

for document in documents:
    document = document.lower()

    clean_text = ""

    for ch in document:
        if ch not in string.punctuation:
            clean_text = clean_text + ch

    clean_text = " ".join(clean_text.split())
    clean_documents.append(clean_text)

vocabulary = []

for document in clean_documents:
    words = document.split()

    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

print("Vocabulary:", vocabulary)

for document in clean_documents:
    words = document.split()
    vector = []

    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))

    print(document, "-->", vector)

Example 8: Find Most Repeated Word

sentence = "python ai python ml python nlp ai"

words = sentence.split()

word_count = {}

for word in words:
    if word in word_count:
        word_count[word] = word_count[word] + 1
    else:
        word_count[word] = 1

print(word_count)

most_repeated_word = ""
highest_count = 0

for word, count in word_count.items():
    if count > highest_count:
        highest_count = count
        most_repeated_word = word

print("Most repeated word:", most_repeated_word)
print("Count:", highest_count)

Example 9: Compare Two Texts Using Common Words

text1 = "python ai ml"
text2 = "python nlp ai"

words1 = set(text1.split())
words2 = set(text2.split())

common_words = words1.intersection(words2)

print("Common words:", common_words)
print("Number of common words:", len(common_words))

Example 10: Simple Similarity Score Using Sets

text1 = "python ai ml"
text2 = "python nlp ai"

words1 = set(text1.split())
words2 = set(text2.split())

common_words = words1.intersection(words2)
all_words = words1.union(words2)

# Shared words divided by all distinct words (known as Jaccard similarity)
similarity = len(common_words) / len(all_words)

print("Similarity:", similarity)

Example 11: CountVectorizer Introduction

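This example uses scikit-learn; install it first if it is not available in your editor. By default, CountVectorizer lowercases the text and returns the vocabulary in alphabetical order, so the columns may appear in a different order than in the manual examples.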

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "python teaches ai",
    "python teaches ml",
    "champak roy teaches nlp"
]

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(documents)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Matrix:")
print(matrix.toarray())

Example 12: CountVectorizer With Student Queries

from sklearn.feature_extraction.text import CountVectorizer

queries = [
    "I want python class",
    "Do you teach AI and ML",
    "I have NLP doubt",
    "What is class timing"
]

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(queries)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Vectors:")
print(matrix.toarray())
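
In the output of Example 12, the word "I" is missing from the vocabulary. With its default settings, CountVectorizer lowercases the text and keeps only tokens of two or more characters. The sketch below imitates that behaviour in pure Python. The regular expression is assumed to match the default `token_pattern` given in the scikit-learn documentation, so treat it as an approximation rather than the library's exact internals.

```python
import re

query = "I want python class"

# Assumption: this pattern mirrors CountVectorizer's documented default,
# which keeps only tokens made of two or more word characters
tokens = re.findall(r"(?u)\b\w\w+\b", query.lower())

print(tokens)  # ['want', 'python', 'class']
```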

Example 13: Binary Bag of Words With CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "python python ai",
    "python ml",
    "ai nlp"
]

vectorizer = CountVectorizer(binary=True)

matrix = vectorizer.fit_transform(documents)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Binary Matrix:")
print(matrix.toarray())

Example 14: Mini Project — Course Page Vectors

from sklearn.feature_extraction.text import CountVectorizer

course_pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(course_pages)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Course Page Vectors:")
print(matrix.toarray())

Example 15: Mini Project — Search Query Vector

from sklearn.feature_extraction.text import CountVectorizer

pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]

query = "text similarity nlp"

all_texts = [query] + pages

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(all_texts)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Query vector:")
print(matrix.toarray()[0])

print("Page vectors:")
for row in matrix.toarray()[1:]:
    print(row)
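
A possible next step, which is not part of the original example, is to score each page against the query. The pure Python sketch below multiplies the query vector and each page vector position by position and adds up the results (a dot product). The helper name `count_vector` is illustrative.

```python
pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]

query = "text similarity nlp"

# Build the vocabulary from the query and the pages together
vocabulary = sorted(set(" ".join([query] + pages).split()))

def count_vector(text):
    # Count how many times each vocabulary word appears in the text
    words = text.split()
    return [words.count(word) for word in vocabulary]

query_vector = count_vector(query)

scores = []
for page in pages:
    page_vector = count_vector(page)
    # Dot product: multiply matching positions and add the results
    score = sum(q * p for q, p in zip(query_vector, page_vector))
    scores.append(score)

best_index = scores.index(max(scores))

print("Scores:", scores)
print("Best match:", pages[best_index])
```

Here the second page gets the highest score because it contains all three query words.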

Classroom flow: First build vectors manually. Then show that CountVectorizer applies the same idea automatically.

11. Practice in Our Python Editor

Use the embedded Programmers Picnic Python editor below to run the Bag of Words and CountVectorizer examples.


12. Complete Beginner Summary

Topic	Meaning	Example
Bag of Words	Represent text by word counts	python: 2, ai: 1
Vocabulary	Unique words in text collection	python, ai, ml, nlp
Binary Vector	Shows word presence or absence	[1, 0, 1]
Count Vector	Shows word frequency	[2, 0, 3]
Document-Term Matrix	Table of documents and word counts	Rows are documents, columns are words
CountVectorizer	Scikit-learn tool for Bag of Words	vectorizer.fit_transform()

13. Practice Questions

What is Bag of Words?
Build vocabulary from these sentences:
```
python ai
python ml
ai nlp
```

Create a binary vector using this vocabulary:

Vocabulary: ["python", "ai", "ml", "nlp"]
Sentence: "python ml"

Create a count vector:

Vocabulary: ["python", "ai", "ml"]
Sentence: "python python ai"

What is the difference between binary vector and count vector?
What is a document-term matrix?
Give one limitation of Bag of Words.

14. Mini Assignment

Create a small Bag of Words project for course search.

Use these course titles:

Python Basics for Beginners
Beginning NLP with Text Similarity
Machine Learning Model Training
Google Search Console for Blogger

Your task:

Convert all titles to lowercase.
Build a vocabulary of unique words.
Create a count vector for each title.
Print the document-term matrix.
Try the same using CountVectorizer.

Teacher tip: Ask students to first do it manually, then with CountVectorizer. This makes the library feel logical instead of magical.