21% to 30%
Text cleaning and preprocessing
Bag of Words and count vectors
Represent text as numbers using word counts.
1. What We Have Learned So Far
In the first part of NLP, we learned that computers do not understand text directly. They understand numbers.
Then we learned how to clean messy text before giving it to an NLP program.
2. What is Bag of Words?
Bag of Words is a simple NLP method where we represent text by counting words.
It ignores grammar and sentence order. It only records which words are present and how many times each one appears.
Sentence:
Python AI Python ML
Bag of Words thinking:
| Word | Count |
|---|---|
| python | 2 |
| ai | 1 |
| ml | 1 |
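The counting idea in the table above can be sketched with Python's standard `collections.Counter`. This is only a minimal illustration, assuming the text is lowercased and split on spaces; the variable names `text`, `words` and `bag` are chosen for this sketch.

```python
from collections import Counter

text = "Python AI Python ML"

# Lowercase the text and split it on whitespace to get the words.
words = text.lower().split()

# Counter works like a dictionary: word -> number of occurrences.
bag = Counter(words)

for word, count in bag.items():
    print(word, count)
```

The sentence order is not stored anywhere in `bag`; only the words and their counts remain.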
3. Why Is It Called a Bag?
Imagine putting all the words of a sentence into a bag. The bag contains the words, but it does not remember the order in which they appeared.
Sentence 1
Python teaches AI
Sentence 2
AI teaches Python
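Both sentences contain the same words: python, teaches, ai. Bag of Words therefore gives both sentences the same representation, even though their meanings differ.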
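One way to check the "bag" idea is to compare the word counts of the two sentences. The sketch below assumes lowercasing and splitting on spaces, and uses `collections.Counter` as a stand-in for the bag; the names `bag_1`, `bag_2`, `same_bag` and `same_order` are illustrative only.

```python
from collections import Counter

sentence_1 = "Python teaches AI"
sentence_2 = "AI teaches Python"

words_1 = sentence_1.lower().split()
words_2 = sentence_2.lower().split()

# A Counter keeps words and counts, but not their positions.
bag_1 = Counter(words_1)
bag_2 = Counter(words_2)

same_bag = bag_1 == bag_2        # compares words and counts only
same_order = words_1 == words_2  # compares the word lists, order included

print("Same bag:", same_bag)
print("Same order:", same_order)
```

The two bags are equal while the ordered word lists are not, which is exactly the information a bag throws away.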
4. Vocabulary in NLP
Vocabulary means the list of unique words found in our text collection. For example, take these three sentences:
Python teaches AI
Python teaches ML
Champak Roy teaches NLP
Vocabulary:
["python", "teaches", "ai", "ml", "champak", "roy", "nlp"]
Once we have a vocabulary, we can convert every sentence into numbers.
| Word | Meaning in Vocabulary |
|---|---|
| python | A word found in the text collection |
| teaches | A repeated action word |
| ai | A topic word |
| ml | A topic word |
| nlp | A topic word |
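To convert sentences into numbers, each vocabulary word needs a fixed position. A possible sketch, assuming the vocabulary list shown above, is a dictionary from word to index. The name `word_to_index` is introduced here for illustration and is not part of the lesson's own examples.

```python
vocabulary = ["python", "teaches", "ai", "ml", "champak", "roy", "nlp"]

# Map each word to its position in the vocabulary list.
word_to_index = {word: index for index, word in enumerate(vocabulary)}

sentence = "champak roy teaches nlp"

# Look up the position of every word in the sentence.
positions = [word_to_index[word] for word in sentence.split()]

print(word_to_index)
print(positions)
```

These positions are the slots that a binary or count vector fills in.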
5. Binary Vector
A binary vector uses only 0 and 1.
- 1 means the word is present.
- 0 means the word is absent.
Vocabulary:
["python", "ai", "ml", "nlp"]
Sentence:
python nlp
Binary Vector:
[1, 0, 0, 1]
| Vocabulary Word | Present in sentence? | Number |
|---|---|---|
| python | Yes | 1 |
| ai | No | 0 |
| ml | No | 0 |
| nlp | Yes | 1 |
6. Count Vector
A count vector stores how many times each vocabulary word appears in a sentence.
Vocabulary:
["python", "ai", "ml", "nlp"]
Sentence:
python ai python nlp python
Count Vector:
[3, 1, 0, 1]
| Vocabulary Word | Count in sentence |
|---|---|
| python | 3 |
| ai | 1 |
| ml | 0 |
| nlp | 1 |
7. Binary Vector vs Count Vector
| Feature | Binary Vector | Count Vector |
|---|---|---|
| Meaning | Word present or absent | How many times word appears |
| Values | Only 0 or 1 | 0, 1, 2, 3, 4... |
| Example | [1, 0, 1] | [3, 0, 2] |
| Useful when | We only need to know whether a word exists | Word repetition is important |
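The relationship between the two vector types can be sketched in a few lines: a binary vector is a count vector with every non-zero count replaced by 1. The helper names `count_vector` and `to_binary` are hypothetical, written only for this illustration.

```python
def count_vector(sentence, vocabulary):
    """Return how many times each vocabulary word appears in the sentence."""
    words = sentence.split()
    return [words.count(word) for word in vocabulary]


def to_binary(counts):
    """Turn a count vector into a binary vector (1 = present, 0 = absent)."""
    return [1 if count > 0 else 0 for count in counts]


vocabulary = ["python", "ai", "ml", "nlp"]
counts = count_vector("python ai python nlp python", vocabulary)
binary = to_binary(counts)

print("Count vector: ", counts)
print("Binary vector:", binary)
```

Going from counts to binary loses the repetition information, so the conversion only works in that direction.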
8. Document-Term Matrix
A document-term matrix is a table where rows are documents and columns are vocabulary words.
Doc 1: python ai
Doc 2: python ml
Doc 3: ai nlp
Vocabulary:
["python", "ai", "ml", "nlp"]
| Document | python | ai | ml | nlp |
|---|---|---|---|---|
| Doc 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 1 | 0 | 1 | 0 |
| Doc 3 | 0 | 1 | 0 | 1 |
9. Bag of Words Limitations
Bag of Words is beginner-friendly and useful, but it has limitations.
| Limitation | Meaning |
|---|---|
| Ignores word order | It cannot tell apart sentences that use the same words in a different order. |
| Ignores grammar | It counts words but does not model grammar. |
| Large vocabulary problem | Many unique words create very large vectors that are mostly zeros. |
| Meaning problem | It treats synonyms as unrelated words unless we handle them separately. |
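The large vocabulary problem can be seen even on a tiny collection. The sketch below, which assumes space-separated lowercase text, builds a document-term matrix and counts how many of its cells are zero. The names `total_cells` and `zero_cells` are illustrative.

```python
documents = [
    "python teaches ai",
    "python teaches ml",
    "champak roy teaches nlp",
]

# Build the vocabulary in order of first appearance.
vocabulary = []
for document in documents:
    for word in document.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector per document.
matrix = [
    [document.split().count(word) for word in vocabulary]
    for document in documents
]

total_cells = len(documents) * len(vocabulary)
zero_cells = sum(row.count(0) for row in matrix)

print("Vector length:", len(vocabulary))
print("Zero cells:", zero_cells, "out of", total_cells)
```

Every new unique word adds one more column to every vector, so real text collections quickly produce long vectors with very few non-zero entries.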
10. Python Examples: 21% to 30%
These examples move from manual vectors to automatic CountVectorizer. Start with pure Python examples, then move to scikit-learn.
Example 1: Build Vocabulary From Sentences
sentences = [
    "python teaches ai",
    "python teaches ml",
    "champak roy teaches nlp"
]
vocabulary = []
for sentence in sentences:
    words = sentence.split()
    for word in words:
        # Add each word only the first time it is seen
        if word not in vocabulary:
            vocabulary.append(word)
print(vocabulary)
Example 2: Create Binary Vector Manually
vocabulary = ["python", "ai", "ml", "nlp"]
sentence = "python nlp"
words = sentence.split()
vector = []
for vocab_word in vocabulary:
    # 1 if the vocabulary word is present in the sentence, otherwise 0
    if vocab_word in words:
        vector.append(1)
    else:
        vector.append(0)
print(vector)
Example 3: Create Count Vector Manually
vocabulary = ["python", "ai", "ml", "nlp"]
sentence = "python ai python nlp python"
words = sentence.split()
vector = []
for vocab_word in vocabulary:
    # Count how many times the vocabulary word appears in the sentence
    count = words.count(vocab_word)
    vector.append(count)
print(vector)
Example 4: Convert Many Sentences Into Count Vectors
vocabulary = ["python", "ai", "ml", "nlp"]
sentences = [
    "python ai",
    "python ml",
    "ai nlp",
    "python python ai"
]
for sentence in sentences:
    words = sentence.split()
    vector = []
    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))
    print(sentence, "-->", vector)
Example 5: Build Document-Term Matrix Manually
documents = [
    "python ai",
    "python ml",
    "ai nlp",
    "python python ai"
]
# Step 1: build the vocabulary from all documents
vocabulary = []
for document in documents:
    words = document.split()
    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)
print("Vocabulary:", vocabulary)
# Step 2: build one count vector (row) per document
matrix = []
for document in documents:
    words = document.split()
    vector = []
    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))
    matrix.append(vector)
print("Document-Term Matrix:")
for row in matrix:
    print(row)
Example 6: Clean Text Before Bag of Words
import string
documents = [
    "Python teaches AI!!!",
    "Python teaches ML.",
    "Champak Roy teaches NLP?"
]
clean_documents = []
for document in documents:
    document = document.lower()
    clean_text = ""
    # Keep every character that is not punctuation
    for ch in document:
        if ch not in string.punctuation:
            clean_text = clean_text + ch
    # Collapse repeated spaces into a single space
    clean_text = " ".join(clean_text.split())
    clean_documents.append(clean_text)
print(clean_documents)
Example 7: Bag of Words After Cleaning
import string
documents = [
    "Python teaches AI!!!",
    "Python teaches ML.",
    "Champak Roy teaches NLP?"
]
# Step 1: clean the documents (lowercase, remove punctuation, fix spaces)
clean_documents = []
for document in documents:
    document = document.lower()
    clean_text = ""
    for ch in document:
        if ch not in string.punctuation:
            clean_text = clean_text + ch
    clean_text = " ".join(clean_text.split())
    clean_documents.append(clean_text)
# Step 2: build the vocabulary from the cleaned documents
vocabulary = []
for document in clean_documents:
    words = document.split()
    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)
print("Vocabulary:", vocabulary)
# Step 3: create a count vector for each cleaned document
for document in clean_documents:
    words = document.split()
    vector = []
    for vocab_word in vocabulary:
        vector.append(words.count(vocab_word))
    print(document, "-->", vector)
Example 8: Find Most Repeated Word
sentence = "python ai python ml python nlp ai"
words = sentence.split()
# Count every word using a dictionary
word_count = {}
for word in words:
    if word in word_count:
        word_count[word] = word_count[word] + 1
    else:
        word_count[word] = 1
print(word_count)
# Find the word with the highest count
most_repeated_word = ""
highest_count = 0
for word, count in word_count.items():
    if count > highest_count:
        highest_count = count
        most_repeated_word = word
print("Most repeated word:", most_repeated_word)
print("Count:", highest_count)
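The same result can be reached more briefly with `collections.Counter` from the standard library. This is an alternative sketch, not a replacement for the manual loop in Example 8, which is still the better way to see how the counting works.

```python
from collections import Counter

sentence = "python ai python ml python nlp ai"

# Counter does the dictionary counting for us.
word_count = Counter(sentence.split())

# most_common(1) returns a list with one (word, count) pair.
most_repeated_word, highest_count = word_count.most_common(1)[0]

print("Most repeated word:", most_repeated_word)
print("Count:", highest_count)
```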
Example 9: Compare Two Texts Using Common Words
text1 = "python ai ml"
text2 = "python nlp ai"
words1 = set(text1.split())
words2 = set(text2.split())
common_words = words1.intersection(words2)
print("Common words:", common_words)
print("Number of common words:", len(common_words))
Example 10: Simple Similarity Score Using Sets
text1 = "python ai ml"
text2 = "python nlp ai"
words1 = set(text1.split())
words2 = set(text2.split())
common_words = words1.intersection(words2)
all_words = words1.union(words2)
similarity = len(common_words) / len(all_words)
print("Similarity:", similarity)
Example 11: CountVectorizer Introduction
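This example uses scikit-learn. Install scikit-learn in the editor if it is not already available. By default, CountVectorizer lowercases the text and returns the vocabulary in alphabetical order.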
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "python teaches ai",
    "python teaches ml",
    "champak roy teaches nlp"
]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(documents)
print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("Matrix:")
print(matrix.toarray())
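Example 12: CountVectorizer With Student Queries
With its default settings, CountVectorizer ignores single-character tokens, so the word "I" will not appear in the vocabulary.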
from sklearn.feature_extraction.text import CountVectorizer
queries = [
    "I want python class",
    "Do you teach AI and ML",
    "I have NLP doubt",
    "What is class timing"
]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(queries)
print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("Vectors:")
print(matrix.toarray())
Example 13: Binary Bag of Words With CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "python python ai",
    "python ml",
    "ai nlp"
]
vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(documents)
print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("Binary Matrix:")
print(matrix.toarray())
Example 14: Mini Project — Course Page Vectors
from sklearn.feature_extraction.text import CountVectorizer
course_pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(course_pages)
print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("Course Page Vectors:")
print(matrix.toarray())
Example 15: Mini Project — Search Query Vector
from sklearn.feature_extraction.text import CountVectorizer
pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap"
]
query = "text similarity nlp"
# Put the query first so that row 0 of the matrix is the query vector
all_texts = [query] + pages
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(all_texts)
print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("Query vector:")
print(matrix.toarray()[0])
print("Page vectors:")
for row in matrix.toarray()[1:]:
    print(row)
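Once the query and the pages share one vocabulary, their vectors can be compared. The pure-Python sketch below is one possible follow-up and makes two assumptions that are not in the example above: the vocabulary is built from the pages only, and each page is scored by the dot product of its count vector with the query vector. The names `count_vector`, `scores` and `best_index` are hypothetical.

```python
pages = [
    "python variables loops functions beginner",
    "nlp tokenization vectors text similarity",
    "machine learning dataset model prediction",
    "google search console blogger sitemap",
]
query = "text similarity nlp"

# Vocabulary from the pages only, sorted alphabetically.
vocabulary = sorted({word for page in pages for word in page.split()})


def count_vector(text, vocabulary):
    """Count each vocabulary word in the text; unknown words are ignored."""
    words = text.split()
    return [words.count(word) for word in vocabulary]


page_vectors = [count_vector(page, vocabulary) for page in pages]
query_vector = count_vector(query, vocabulary)

# Assumed scoring rule: dot product of the query vector and each page vector.
scores = [
    sum(q * p for q, p in zip(query_vector, page_vector))
    for page_vector in page_vectors
]
best_index = scores.index(max(scores))

print("Scores:", scores)
print("Best match:", pages[best_index])
```

With count vectors, the dot product grows with every query word that a page contains, so the page sharing the most words with the query gets the highest score.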
11. Practice in Our Python Editor
Use the embedded Programmers Picnic Python editor below to run the Bag of Words and CountVectorizer examples.
12. Complete Beginner Summary
| Topic | Meaning | Example |
|---|---|---|
| Bag of Words | Represent text by word counts | python: 2, ai: 1 |
| Vocabulary | Unique words in text collection | python, ai, ml, nlp |
| Binary Vector | Shows word presence or absence | [1, 0, 1] |
| Count Vector | Shows word frequency | [2, 0, 3] |
| Document-Term Matrix | Table of documents and word counts | Rows are documents, columns are words |
| CountVectorizer | Scikit-learn tool for Bag of Words | vectorizer.fit_transform() |
13. Practice Questions
- What is Bag of Words?
- Build vocabulary from these sentences:
  python ai
  python ml
  ai nlp
- Create a binary vector using this vocabulary:
  Vocabulary: ["python", "ai", "ml", "nlp"]
  Sentence: "python ml"
- Create a count vector:
  Vocabulary: ["python", "ai", "ml"]
  Sentence: "python python ai"
- What is the difference between binary vector and count vector?
- What is a document-term matrix?
- Give one limitation of Bag of Words.
14. Mini Assignment
Create a small Bag of Words project for course search.
Use these course titles:
Python Basics for Beginners
Beginning NLP with Text Similarity
Machine Learning Model Training
Google Search Console for Blogger
Your task:
- Convert all titles to lowercase.
- Build a vocabulary of unique words.
- Create a count vector for each title.
- Print the document-term matrix.
- Try the same using CountVectorizer.