11% to 20%
NLP, tokens, vectors, cosine similarity
Cleaning text before converting it into numbers
Prepare messy text for NLP programs.
1. Why Do We Need Text Cleaning?
In the previous lesson, we learned that NLP converts text into numbers. But real text is usually messy.
Hiii Sir!!! I want to Learn AI, ML & NLP... Is class ONLINE???
A human can understand this message. But for a computer, this text contains many small problems:
- Capital letters and small letters are mixed.
- There are extra punctuation marks.
- Words like Hiii may not match normal words.
- ONLINE and online may be treated differently.
2. The Basic NLP Preprocessing Pipeline
A pipeline is a step-by-step process. In beginner NLP, a simple preprocessing pipeline looks like this:
| Step | Meaning | Example |
|---|---|---|
| Lowercase | Convert all letters to small letters | AI becomes ai |
| Remove punctuation | Remove symbols like ! ? , . | hello!!! becomes hello |
| Tokenize | Break text into words | learn ai becomes learn, ai |
| Remove common words | Remove very common words | is, the, and |
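The four steps in the table can be sketched as one small function. This is only a minimal sketch: the function name `run_pipeline` and the short `COMMON_WORDS` set are assumptions made for illustration, and real projects usually use a longer list of common words.

```python
import string

# Assumed, illustrative set of very common words; real lists are longer.
COMMON_WORDS = {"is", "the", "and", "to", "a", "an"}

def run_pipeline(text):
    """Apply the four table steps in order and return the final tokens."""
    # Step 1: lowercase
    lowered = text.lower()
    # Step 2: remove punctuation
    no_punct = "".join(ch for ch in lowered if ch not in string.punctuation)
    # Step 3: tokenize on whitespace
    tokens = no_punct.split()
    # Step 4: remove common words
    return [token for token in tokens if token not in COMMON_WORDS]

print(run_pipeline("Hello!!! Is the class ONLINE?"))
# ['hello', 'class', 'online']
```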
3. Lowercase Conversion
Computers may treat Python, PYTHON, and python as different words.
Python PYTHON python
After lowercase:
python python python
Lowercase conversion lets the computer treat all three spellings as the same word.
text = "Python PYTHON python"
clean_text = text.lower()
print(clean_text)
4. Removing Punctuation
Punctuation marks are symbols such as comma, full stop, question mark, and exclamation mark.
Do you teach AI, ML, and NLP?
After punctuation removal:
Do you teach AI ML and NLP
This makes tokenization cleaner because words are separated from punctuation marks.
import string
text = "Do you teach AI, ML, and NLP?"
clean_text = ""
for character in text:
    if character not in string.punctuation:
        clean_text = clean_text + character
print(clean_text)
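The loop above builds the new string one character at a time. As an alternative sketch, Python's built-in `str.maketrans` and `str.translate` can delete every punctuation character in one call. The variable name `remove_table` is only illustrative.

```python
import string

text = "Do you teach AI, ML, and NLP?"

# Build a table that marks every punctuation character for deletion.
remove_table = str.maketrans("", "", string.punctuation)

clean_text = text.translate(remove_table)
print(clean_text)  # Do you teach AI ML and NLP
```

Both versions give the same result; the loop is easier to read for beginners, while `translate` is shorter.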
5. Extra Spaces
Sometimes text contains too many spaces. We should convert multiple spaces into one space.
Python     AI      ML
After:
Python AI ML
text = "Python     AI      ML"
clean_text = " ".join(text.split())
print(clean_text)
text.split() with no argument breaks the text into words at any run of spaces, tabs, or newlines.
Then " ".join(...) joins the words using exactly one space.
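To see why this works, it can help to print the intermediate list. The sketch below uses an assumed sample string containing a tab and a newline to show that `split()` treats any run of whitespace as one separator.

```python
text = "  Python \t AI \n  ML  "

words = text.split()          # no argument: split on any run of whitespace
print(words)                  # ['Python', 'AI', 'ML']

clean_text = " ".join(words)  # put exactly one space between words
print(clean_text)             # Python AI ML
```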
6. Normalization
Normalization means converting different forms of text into a common form.
| Input | Normalized Form |
|---|---|
| AI | artificial intelligence |
| ML | machine learning |
| course | class |
| tutorial | lesson |
I want an AI course
After normalization:
i want an artificial intelligence class
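The table lookup above can be sketched with a Python dictionary. The function name `normalize` is hypothetical, and the mapping contains only the four rows from the table.

```python
# Mapping copied from the table above; keys are lowercase.
normalization = {
    "ai": "artificial intelligence",
    "ml": "machine learning",
    "course": "class",
    "tutorial": "lesson",
}

def normalize(text):
    """Lowercase the text and replace each word found in the mapping."""
    words = text.lower().split()
    return " ".join(normalization.get(word, word) for word in words)

print(normalize("I want an AI course"))
# i want an artificial intelligence class
```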
7. Stemming Idea
Stemming means cutting a word down to its rough root form.
| Word | Stem-like Root |
|---|---|
| learning | learn |
| learned | learn |
| teaching | teach |
| classes | class |
8. Lemmatization Idea
Lemmatization also converts a word to its base form, but it uses vocabulary and grammar knowledge instead of simply cutting off endings, so the result (the lemma) is a real dictionary word.
| Word | Lemma |
|---|---|
| running | run |
| better | good |
| children | child |
| studies | study |
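One way to picture lemmatization is as a dictionary lookup. The sketch below uses a tiny hand-written dictionary built from the table above. The names `LEMMAS` and `lemmatize` are hypothetical, and real lemmatizers rely on much larger vocabularies and grammar information.

```python
# Tiny hand-written lemma dictionary (assumption: copied from the table).
LEMMAS = {
    "running": "run",
    "better": "good",
    "children": "child",
    "studies": "study",
}

def lemmatize(word):
    """Return the dictionary lemma, or the lowercased word if it is unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

for word in ["Children", "better", "studies", "python"]:
    print(word, "-->", lemmatize(word))
```

Notice that better becomes good, which no suffix-cutting rule could produce.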
9. Python Example: Complete Text Cleaner
Now let us build a small text cleaner using only basic Python.
import string
text = "Hiii Sir!!! I want to Learn AI, ML & NLP... Is class ONLINE???"
# Step 1: lowercase
text = text.lower()
# Step 2: remove punctuation
clean_text = ""
for character in text:
    if character not in string.punctuation:
        clean_text = clean_text + character
# Step 3: remove extra spaces
clean_text = " ".join(clean_text.split())
print(clean_text)
Possible Output
hiii sir i want to learn ai ml nlp is class online
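The output still contains hiii, the stretched spelling flagged in section 1. One possible extra step, which is not part of the cleaner above, is to shorten any character repeated three or more times in a row. The sketch below uses Python's `re` module; the function name `squeeze_repeats` and the choice to keep a single copy are assumptions.

```python
import re

def squeeze_repeats(text):
    """Replace a character repeated 3 or more times in a row with one copy."""
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(squeeze_repeats("hiii sir i want to learn ai"))
# hi sir i want to learn ai
print(squeeze_repeats("class is sooooo good"))
# class is so good
```

Words with an ordinary double letter, such as class and good, are left alone because only runs of three or more are shortened.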
10. More Code Samples: 11% to 20%
These examples continue the previous lesson. Copy one example at a time into the Python editor and run it.
Example 1: Lowercase Student Message
message = "I Want To Learn PYTHON and AI"
message = message.lower()
print(message)
Example 2: Remove Punctuation From a Question
import string
question = "Do you teach Python, AI, ML, and NLP?"
clean_question = ""
for ch in question:
    if ch not in string.punctuation:
        clean_question = clean_question + ch
print(clean_question)
Example 3: Remove Extra Spaces
text = "Python     AI    ML      NLP"
clean_text = " ".join(text.split())
print(clean_text)
Example 4: Full Cleaning Function
import string
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    result = " ".join(result.split())
    return result
message = "Hello!!! I want to Learn AI & ML..."
print(clean_text(message))
Example 5: Clean and Tokenize
import string
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    result = " ".join(result.split())
    return result
message = "Champak Roy teaches Python, AI, ML, and NLP!"
cleaned = clean_text(message)
tokens = cleaned.split()
print("Cleaned text:", cleaned)
print("Tokens:", tokens)
Example 6: Remove Common Words After Cleaning
import string
common_words = ["is", "am", "are", "the", "a", "an", "and", "to", "in", "of"]
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    result = " ".join(result.split())
    return result
sentence = "Champak Roy is teaching AI and ML in the class."
cleaned = clean_text(sentence)
words = cleaned.split()
important_words = []
for word in words:
    if word not in common_words:
        important_words.append(word)
print("Cleaned:", cleaned)
print("Important words:", important_words)
Example 7: Normalize AI and ML
text = "I want to learn AI and ML"
normalization = {
    "ai": "artificial intelligence",
    "ml": "machine learning"
}
words = text.lower().split()
final_words = []
for word in words:
    if word in normalization:
        final_words.append(normalization[word])
    else:
        final_words.append(word)
print(final_words)
Example 8: Normalize Course Words
text = "I want a tutorial for artificial intelligence"
normalization = {
    "tutorial": "lesson",
    "course": "class",
    "artificial": "ai",
    "intelligence": "ai"
}
words = text.lower().split()
final_words = []
for word in words:
    if word in normalization:
        final_words.append(normalization[word])
    else:
        final_words.append(word)
print(final_words)
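Because this example maps artificial and intelligence separately, the result contains ai twice in a row. If the goal is a single ai, one option is to replace the whole phrase before splitting. This is a sketch under that assumption; `phrase_map` and `word_map` are illustrative names.

```python
text = "I want a tutorial for artificial intelligence"

# Multi-word phrases are replaced first, on the whole string.
phrase_map = {"artificial intelligence": "ai"}
# Single words are replaced afterwards, token by token.
word_map = {"tutorial": "lesson", "course": "class"}

lowered = text.lower()
for phrase, short_form in phrase_map.items():
    lowered = lowered.replace(phrase, short_form)

final_words = [word_map.get(word, word) for word in lowered.split()]
print(final_words)
# ['i', 'want', 'a', 'lesson', 'for', 'ai']
```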
Example 9: Very Simple Stemming Idea
words = ["learning", "learned", "teaching", "classes"]
for word in words:
    if word.endswith("ing"):
        print(word, "-->", word[:-3])
    elif word.endswith("ed"):
        print(word, "-->", word[:-2])
    elif word.endswith("es"):
        print(word, "-->", word[:-2])
    else:
        print(word, "-->", word)
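Suffix stripping this simple can go wrong on short words or on words that double a letter. The sketch below wraps the same three rules in a hypothetical helper `simple_stem` and adds one assumed safeguard, a minimum of three remaining letters, to show both the benefit and the limits.

```python
def simple_stem(word, min_length=3):
    """Strip -ing, -ed, or -es if at least min_length letters remain."""
    for suffix in ("ing", "ed", "es"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_length:
            return word[:-len(suffix)]
    return word

for word in ["learning", "classes", "red", "sing", "running"]:
    print(word, "-->", simple_stem(word))
# learning --> learn
# classes --> class
# red --> red
# sing --> sing
# running --> runn
```

The safeguard protects red and sing, but running still becomes runn, which is why stemming only gives a rough root.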
Example 10: Clean a List of Student Messages
import string
messages = [
    "Hello Sir!!! I want Python class.",
    "Do you teach AI, ML and NLP?",
    "What is the timing???",
    "Is the class ONLINE?"
]
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    result = " ".join(result.split())
    return result
for message in messages:
    print(clean_text(message))
Example 11: Build a Cleaner Search Query
import string
search = "NLP!!! text similarity??? Python examples..."
# Keep the original in search so it can be printed unchanged later
lowered = search.lower()
clean_search = ""
for ch in lowered:
    if ch not in string.punctuation:
        clean_search = clean_search + ch
clean_search = " ".join(clean_search.split())
print("Original search:", search)
print("Clean search:", clean_search)
Example 12: Clean Before Matching FAQ
import string
faqs = {
    "fees": "Please check the course fee details on learnwithchampak.live.",
    "timing": "Please check the latest class timing.",
    "python": "Yes, Python is included.",
    "nlp": "Yes, NLP is part of the AI-ML path."
}
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    return " ".join(result.split())
question = "Sir!!! Do you teach NLP???"
clean_question = clean_text(question)
answer_found = False
# Print the answer for the first keyword found in the cleaned question
for keyword in faqs:
    if keyword in clean_question:
        print(faqs[keyword])
        answer_found = True
        break
if not answer_found:
    print("Please contact Champak Roy for details.")
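The check `keyword in clean_question` looks for the keyword anywhere inside the string, so fees would also match inside a longer word such as coffees. A stricter variant, sketched below with a shortened `faqs` dictionary and a hypothetical helper `find_answer`, compares each keyword against whole words instead.

```python
faqs = {
    "fees": "Please check the course fee details on learnwithchampak.live.",
    "nlp": "Yes, NLP is part of the AI-ML path.",
}

def find_answer(clean_question):
    """Return the first FAQ answer whose keyword is a whole word, else None."""
    tokens = set(clean_question.split())
    for keyword, answer in faqs.items():
        if keyword in tokens:
            return answer
    return None

print(find_answer("sir do you teach nlp"))  # Yes, NLP is part of the AI-ML path.
print(find_answer("do you sell coffees"))   # None
```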
Example 13: Count Important Words After Cleaning
import string
text = "AI is useful. AI is powerful. Python helps AI and ML."
common_words = ["is", "and", "the", "a", "an"]
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    return " ".join(result.split())
cleaned = clean_text(text)
words = cleaned.split()
word_count = {}
for word in words:
    if word not in common_words:
        if word in word_count:
            word_count[word] = word_count[word] + 1
        else:
            word_count[word] = 1
print(word_count)
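The counting loop can also be written with `collections.Counter` from Python's standard library. This sketch assumes the text has already been cleaned and counts the same words as the example above.

```python
from collections import Counter

# Cleaned form of the sentence used in Example 13.
cleaned = "ai is useful ai is powerful python helps ai and ml"
common_words = {"is", "and", "the", "a", "an"}

important = [word for word in cleaned.split() if word not in common_words]
word_count = Counter(important)

print(word_count["ai"])           # 3
print(word_count.most_common(1))  # [('ai', 3)]
```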
Example 14: Mini Project — Clean Blog Titles
import string
titles = [
    "Beginning NLP!!!",
    "Python, AI, ML and Data Science",
    "Google Search Console: Full Guide",
    "Sorting Trace & Algorithm Detection"
]
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    return " ".join(result.split())
for title in titles:
    print(clean_text(title))
Example 15: Mini Project — Student Message Cleaner
import string
def clean_text(text):
    text = text.lower()
    result = ""
    for ch in text:
        if ch not in string.punctuation:
            result = result + ch
    result = " ".join(result.split())
    return result
message = input("Enter student message: ")
cleaned = clean_text(message)
print("Cleaned message:")
print(cleaned)
print("Words:")
print(cleaned.split())
11. Practice in Our Python Editor
Use the embedded Programmers Picnic Python editor below to run the text cleaning examples.
12. Complete Beginner Summary
| Topic | Meaning | Example |
|---|---|---|
| Text Cleaning | Making text simpler before NLP | Hello!!! becomes hello |
| Lowercase | Convert all text to small letters | AI becomes ai |
| Punctuation Removal | Remove symbols | Python, becomes Python |
| Extra Space Removal | Keep only normal spacing | AI   ML (several spaces) becomes AI ML |
| Normalization | Convert different forms to one form | ML becomes machine learning |
| Stemming | Cut word to rough root | learning becomes learn |
| Lemmatization | Find proper base word | children becomes child |
13. Practice Questions
- Why do we clean text before NLP?
- Convert this text to lowercase: Python AI ML NLP
- Remove punctuation: Hello!!! Do you teach AI???
- Remove extra spaces: Python    is    useful
- What is normalization?
- What is the difference between stemming and lemmatization?
- Clean this student message: Sir!!! I want to Learn ML & NLP...
14. Mini Assignment
Create a Python program that takes a student message and cleans it.
Input Example
Hello Sir!!! I want to Learn AI, ML, and NLP...
Your program should do these steps:
- Convert the message to lowercase.
- Remove punctuation.
- Remove extra spaces.
- Break the cleaned text into words.
- Print the final word list.
Expected Output
hello sir i want to learn ai ml and nlp
['hello', 'sir', 'i', 'want', 'to', 'learn', 'ai', 'ml', 'and', 'nlp']