Programmers Picnic AI-ML Classes

NLP Lesson 21% to 30%

Bag of Words, vocabulary building, binary vectors, count vectors, document-term matrix, and beginner-friendly Python examples for text representation.

By Champak Roy

Level: 21% to 30%

Previous Lesson: Text cleaning and preprocessing

This Lesson: Bag of Words and count vectors

Goal: Represent text as numbers using word counts.

1. What We Have Learned So Far

In the first part of NLP, we learned that computers do not understand text directly. They understand numbers.

Then we learned how to clean messy text before giving it to an NLP program.

Raw Text → Clean Text → Tokens → Numbers

This lesson focuses on one of the simplest ways to convert text into numbers: Bag of Words.

2. What is Bag of Words?

Bag of Words is a simple NLP method where we represent text by counting words.

It ignores grammar and word order. It records only which words are present and how many times each one appears.

Sentence:
Python AI Python ML

Bag of Words thinking:

Word | Count
python | 2
ai | 1
ml | 1

Simple meaning: Bag of Words converts a sentence into word counts.
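
The word counts above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the sentence is lowercased and split on spaces. `collections.Counter` is a standard-library helper that the rest of this lesson does not use, so treat it as an optional shortcut.

```python
from collections import Counter

sentence = "Python AI Python ML"

# Lowercase first so "Python" and "python" count as the same word,
# then split on spaces to get the individual words.
words = sentence.lower().split()

# Counter maps each word to the number of times it appears.
word_counts = Counter(words)

for word, count in word_counts.items():
    print(word, count)
```

The manual examples later in the lesson do the same counting with plain lists and loops.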

3. Why Is It Called a Bag?

Imagine putting all the words of a sentence into a bag. The bag holds the words, but it does not remember the order they came in.

Sentence 1

Python teaches AI

Sentence 2

AI teaches Python

Both sentences contain the same words:

python
teaches
ai
Bag of Words is simple and useful, but it cannot tell these two sentences apart, because it ignores word order.
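
A short check makes this concrete. The sketch below assumes a three-word vocabulary and a helper named `count_vector` (a name chosen here for illustration). Both sentences produce the same numbers.

```python
vocabulary = ["python", "teaches", "ai"]


def count_vector(sentence, vocabulary):
    """Return how many times each vocabulary word appears in the sentence."""
    words = sentence.lower().split()
    return [words.count(word) for word in vocabulary]


vector1 = count_vector("Python teaches AI", vocabulary)
vector2 = count_vector("AI teaches Python", vocabulary)

print(vector1)
print(vector2)
print("Same vector:", vector1 == vector2)
```

The two sentences say different things, yet their vectors are identical.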

4. Vocabulary in NLP

Vocabulary means the list of unique words found in our text collection.

Sentences:
Python teaches AI
Python teaches ML
Champak Roy teaches NLP
Vocabulary (after lowercasing):
["python", "teaches", "ai", "ml", "champak", "roy", "nlp"]

Once we have a vocabulary, we can convert every sentence into numbers.

Word | Meaning in Vocabulary
python | A word found in the text collection
teaches | A repeated action word
ai | A topic word
ml | A topic word
nlp | A topic word
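
One way to see how a vocabulary turns a sentence into numbers is to give every word a fixed position. The sketch below is an illustration only. The dictionary name `word_to_index` is an assumption made here, not something defined elsewhere in the lesson.

```python
vocabulary = ["python", "teaches", "ai", "ml", "champak", "roy", "nlp"]

# Map each word to its position in the vocabulary list.
word_to_index = {word: index for index, word in enumerate(vocabulary)}

sentence = "champak roy teaches nlp"

# Start with all zeros, then add 1 at the position of each word found.
vector = [0] * len(vocabulary)
for word in sentence.split():
    if word in word_to_index:
        vector[word_to_index[word]] += 1

print(word_to_index)
print(vector)
```

Each position in the vector always refers to the same vocabulary word, which is what lets us compare different sentences.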

5. Binary Vector

A binary vector uses only 0 and 1.

Vocabulary:
["python", "ai", "ml", "nlp"]
Sentence:
python nlp
Binary Vector:
[1, 0, 0, 1]

Vocabulary Word | Present in sentence? | Number
python | Yes | 1
ai | No | 0
ml | No | 0
nlp | Yes | 1

6. Count Vector

A count vector stores how many times each vocabulary word appears in a sentence.

Vocabulary:
["python", "ai", "ml", "nlp"]
Sentence:
python ai python nlp python
Count Vector:
[3, 1, 0, 1]

Vocabulary Word | Count in sentence
python | 3
ai | 1
ml | 0
nlp | 1

Binary vector checks presence. Count vector checks frequency.

7. Binary Vector vs Count Vector

Feature | Binary Vector | Count Vector
Meaning | Word present or absent | How many times word appears
Values | Only 0 or 1 | 0, 1, 2, 3, 4...
Example | [1, 0, 1] | [3, 0, 2]
Useful when | We only need to know whether a word exists | Word repetition is important
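
The two vector types are closely related: a binary vector is a count vector with every non-zero count replaced by 1. A minimal sketch, reusing the vocabulary and sentence from the count vector section:

```python
vocabulary = ["python", "ai", "ml", "nlp"]
sentence = "python ai python nlp python"

words = sentence.split()

# Count vector: how many times each vocabulary word appears.
count_vector = [words.count(word) for word in vocabulary]

# Binary vector: 1 if the count is above zero, otherwise 0.
binary_vector = [1 if count > 0 else 0 for count in count_vector]

print("Count vector: ", count_vector)
print("Binary vector:", binary_vector)
```

The conversion only works in one direction. Once the counts are reduced to 0 and 1, the original frequencies cannot be recovered.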

8. Document-Term Matrix

A document-term matrix is a table where rows are documents and columns are vocabulary words.

Documents:
Doc 1: python ai
Doc 2: python ml
Doc 3: ai nlp
Vocabulary:
["python", "ai", "ml", "nlp"]

Document | python | ai | ml | nlp
Doc 1 | 1 | 1 | 0 | 0
Doc 2 | 1 | 0 | 1 | 0
Doc 3 | 0 | 1 | 0 | 1

This table matters because many beginner machine learning models can use this kind of table as input.
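
To get a feel for the table as data, the sketch below builds it as a list of rows, then reads off its size and the total count in each column. The variable names are chosen here for illustration.

```python
documents = ["python ai", "python ml", "ai nlp"]
vocabulary = ["python", "ai", "ml", "nlp"]

# One row per document, one column per vocabulary word.
matrix = [
    [document.split().count(word) for word in vocabulary]
    for document in documents
]

num_rows = len(matrix)        # number of documents
num_columns = len(matrix[0])  # number of vocabulary words

# zip(*matrix) groups the values column by column,
# so each sum is the total count of one word across all documents.
column_totals = [sum(column) for column in zip(*matrix)]

print("Matrix:", matrix)
print("Size:", num_rows, "rows x", num_columns, "columns")
print("Column totals:", dict(zip(vocabulary, column_totals)))
```

Reading across a row describes one document. Reading down a column describes one word.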

9. Bag of Words Limitations

Bag of Words is beginner-friendly and useful, but it has limitations.

Limitation | Meaning
Ignores word order | It cannot distinguish sentences that use the same words in a different order.
Ignores grammar | It counts words but does not understand grammar.
Large vocabulary problem | A large vocabulary creates very long vectors.
Meaning problem | It treats synonyms as unrelated words unless we handle them separately.

Even with limitations, Bag of Words is one of the best first steps for understanding NLP.
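
The meaning problem is easy to see with two made-up sentences that say roughly the same thing in different words. The sentences below are hypothetical examples, not taken from the lesson data.

```python
text1 = "this course is good"
text2 = "this class is great"

# Build one shared vocabulary from both texts, keeping first-seen order.
vocabulary = []
for text in (text1, text2):
    for word in text.split():
        if word not in vocabulary:
            vocabulary.append(word)

vector1 = [text1.split().count(word) for word in vocabulary]
vector2 = [text2.split().count(word) for word in vocabulary]

# Words that appear in both texts.
shared_words = [
    word
    for word, count1, count2 in zip(vocabulary, vector1, vector2)
    if count1 > 0 and count2 > 0
]

print("Vocabulary:", vocabulary)
print(text1, "-->", vector1)
print(text2, "-->", vector2)
print("Shared words:", shared_words)
```

Only "this" and "is" match. The words that carry the meaning ("course" and "class", "good" and "great") are counted as completely different.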

10. Python Examples: 21% to 30%

These examples move from manual vectors to automatic CountVectorizer. Start with pure Python examples, then move to scikit-learn.

Example 1: Build Vocabulary From Sentences

sentences = [
    "python teaches ai",
    "python teaches ml",
    "champak roy teaches nlp"
]

vocabulary = []

for sentence in sentences:
    words = sentence.split()

    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)

Example 2: Create Binary Vector Manually

vocabulary = ["python", "ai", "ml", "nlp"]

sentence = "python nlp"

words = sentence.split()

vector = []

for vocab_word in vocabulary:
    if vocab_word in words:
        vector.append(1)
    else:
        vector.append(0)

print(vector)

Example 3: Create Count Vector Manually

vocabulary = ["python", "ai", "ml", "nlp"]

sentence = "python ai python nlp python"

words = sentence.split()

vector = []

for vocab_word in vocabulary:
    count = words.count(vocab_word)
    vector.append(count)

print(vector)

Example 4: Convert Many Sentences Into Count Vectors

vocabulary = ["python", "ai", "ml", "nlp"]

sentences = [
    "python ai",
    "python ml",
    "ai nlp",
    "python python ai"
]

for sentence in sentences:
    words = sentence.split()
    vector = []

    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))

    print(sentence, "-->", vector)

Example 5: Build Document-Term Matrix Manually

documents = [
    "python ai",
    "python ml",
    "ai nlp",
    "python python ai"
]

vocabulary = []

for document in documents:
    words = document.split()

    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

print("Vocabulary:", vocabulary)

matrix = []

for document in documents:
    words = document.split()
    vector = []

    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))

    matrix.append(vector)

print("Document-Term Matrix:")

for row in matrix:
    print(row)

Example 6: Clean Text Before Bag of Words

import string

documents = [
    "Python teaches AI!!!",
    "Python teaches ML.",
    "Champak Roy teaches NLP?"
]

clean_documents = []

for document in documents:
    document = document.lower()

    clean_text = ""

    for ch in document:
        if ch not in string.punctuation:
            clean_text = clean_text + ch

    clean_text = " ".join(clean_text.split())
    clean_documents.append(clean_text)

print(clean_documents)

Example 7: Bag of Words After Cleaning

import string

documents = [
    "Python teaches AI!!!",
    "Python teaches ML.",
    "Champak Roy teaches NLP?"
]

clean_documents = []

for document in documents:
    document = document.lower()

    clean_text = ""

    for ch in document:
        if ch not in string.punctuation:
            clean_text = clean_text + ch

    clean_text = " ".join(clean_text.split())
    clean_documents.append(clean_text)

vocabulary = []

for document in clean_documents:
    words = document.split()

    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

print("Vocabulary:", vocabulary)

for document in clean_documents:
    words = document.split()
    vector = []

    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))

    print(document, "-->", vector)

Example 8: Find Most Repeated Word

sentence = "python ai python ml python nlp ai"

words = sentence.split()

word_count = {}

for word in words:
    if word in word_count:
        word_count[word] = word_count[word] + 1
    else:
        word_count[word] = 1

print(word_count)

most_repeated_word = ""
highest_count = 0

for word, count in word_count.items():
    if count > highest_count:
        highest_count = count
        most_repeated_word = word

print("Most repeated word:", most_repeated_word)
print("Count:", highest_count)

Example 9: Compare Two Texts Using Common Words

text1 = "python ai ml"
text2 = "python nlp ai"

words1 = set(text1.split())
words2 = set(text2.split())

common_words = words1.intersection(words2)

print("Common words:", common_words)
print("Number of common words:", len(common_words))

Example 10: Simple Similarity Score Using Sets

text1 = "python ai ml"
text2 = "python nlp ai"

words1 = set(text1.split())
words2 = set(text2.split())

common_words = words1.intersection(words2)
all_words = words1.union(words2)

# Shared words divided by all distinct words (known as Jaccard similarity).
similarity = len(common_words) / len(all_words)

print("Similarity:", similarity)

Example 11: CountVectorizer Introduction

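This example uses scikit-learn. Install scikit-learn in the editor if needed. CountVectorizer lowercases the text and sorts its vocabulary alphabetically, so the column order can differ from the manual examples above.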

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "python teaches ai",
    "python teaches ml",
    "champak roy teaches nlp"
]

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(documents)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Matrix:")
print(matrix.toarray())

Example 12: CountVectorizer With Student Queries

from sklearn.feature_extraction.text import CountVectorizer

queries = [
    "I want python class",
    "Do you teach AI and ML",
    "I have NLP doubt",
    "What is class timing"
]

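# By default, CountVectorizer lowercases the text and keeps only tokens
# of two or more characters, so the one-letter word "I" is dropped.
vectorizer = CountVectorizer()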

matrix = vectorizer.fit_transform(queries)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Vectors:")
print(matrix.toarray())

Example 13: Binary Bag of Words With CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "python python ai",
    "python ml",
    "ai nlp"
]

vectorizer = CountVectorizer(binary=True)

matrix = vectorizer.fit_transform(documents)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Binary Matrix:")
print(matrix.toarray())

Example 14: Mini Project — Course Page Vectors

from sklearn.feature_extraction.text import CountVectorizer

course_pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(course_pages)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Course Page Vectors:")
print(matrix.toarray())

Example 15: Mini Project — Search Query Vector

from sklearn.feature_extraction.text import CountVectorizer

pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]

query = "text similarity nlp"

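# Put the query first so that row 0 of the matrix is the query vector.
# Fitting on the query and pages together gives them one shared vocabulary.
all_texts = [query] + pages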

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(all_texts)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("Query vector:")
print(matrix.toarray()[0])

print("Page vectors:")
for row in matrix.toarray()[1:]:
    print(row)
Classroom flow: First build vectors manually. Then show that CountVectorizer applies the same idea automatically.
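
Example 15 prints the query vector and the page vectors but stops before comparing them. One possible next step is sketched here in pure Python, so it runs without scikit-learn. The scoring rule (multiply matching positions and add them up) is an assumption chosen for simplicity, and the helper names `build_vocabulary` and `count_vector` are made up for this illustration.

```python
pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]

query = "text similarity nlp"


def build_vocabulary(texts):
    """Collect unique words from all texts, in first-seen order."""
    vocabulary = []
    for text in texts:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def count_vector(text, vocabulary):
    """Return how many times each vocabulary word appears in the text."""
    words = text.split()
    return [words.count(word) for word in vocabulary]


vocabulary = build_vocabulary(pages)
query_vector = count_vector(query, vocabulary)

scores = []
for page in pages:
    page_vector = count_vector(page, vocabulary)
    # Multiply matching positions and add the results.
    # A page scores higher when it repeats more of the query words.
    score = sum(q * p for q, p in zip(query_vector, page_vector))
    scores.append(score)

best_index = scores.index(max(scores))

print("Scores:", scores)
print("Best match:", pages[best_index])
```

With this query, only the NLP page shares any words with the query, so it gets the highest score.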

11. Practice in Our Python Editor

Use the embedded Programmers Picnic Python editor below to run the Bag of Words and CountVectorizer examples.


12. Complete Beginner Summary

Topic | Meaning | Example
Bag of Words | Represent text by word counts | python: 2, ai: 1
Vocabulary | Unique words in text collection | python, ai, ml, nlp
Binary Vector | Shows word presence or absence | [1, 0, 1]
Count Vector | Shows word frequency | [2, 0, 3]
Document-Term Matrix | Table of documents and word counts | Rows are documents, columns are words
CountVectorizer | Scikit-learn tool for Bag of Words | vectorizer.fit_transform()

13. Practice Questions

  1. What is Bag of Words?
  2. Build vocabulary from these sentences:
    python ai
    python ml
    ai nlp
  3. Create a binary vector using this vocabulary:
    Vocabulary: ["python", "ai", "ml", "nlp"]
    Sentence: "python ml"
  4. Create a count vector:
    Vocabulary: ["python", "ai", "ml"]
    Sentence: "python python ai"
  5. What is the difference between binary vector and count vector?
  6. What is a document-term matrix?
  7. Give one limitation of Bag of Words.

14. Mini Assignment

Create a small Bag of Words project for course search.

Use these course titles:

Python Basics for Beginners
Beginning NLP with Text Similarity
Machine Learning Model Training
Google Search Console for Blogger

Your task:

Teacher tip: Ask students to first do it manually, then with CountVectorizer. This makes the library feel logical instead of magical.