Programmers Picnic AI-ML Classes

NLP Lesson 11% to 20%

Text cleaning, lowercase conversion, punctuation removal, better tokenization, normalization, stemming idea, lemmatization idea, and beginner-friendly Python preprocessing projects.

By Champak Roy

Level: 11% to 20%

Previous Lesson: NLP, tokens, vectors, cosine similarity

This Lesson: Cleaning text before converting it into numbers

Goal: Prepare messy text for NLP programs.

1. Why Do We Need Text Cleaning?

In the previous lesson, we learned that NLP converts text into numbers. But real text is usually messy.

Messy student message:
Hiii Sir!!! I want to Learn AI, ML & NLP... Is class ONLINE???

A human can understand this message. But for a computer, this text contains many small problems:

  • Capital letters and small letters are mixed.
  • There are extra punctuation marks.
  • Words like Hiii may not match normal words.
  • ONLINE and online may be treated differently.
Text cleaning means making text simpler and more regular before processing it.
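
A small sketch of why these problems matter: Python compares strings exactly, so a difference in case or punctuation is enough to make two words count as different. The variable names below are only for illustration.

```python
raw_word = "ONLINE???"
target = "online"

# Exact comparison fails because of the capital letters and the "???".
matches_raw = raw_word == target

# After lowercasing and stripping the question marks, the words match.
matches_clean = raw_word.lower().strip("?") == target

print(matches_raw)    # False
print(matches_clean)  # True
```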

2. The Basic NLP Preprocessing Pipeline

A pipeline is a step-by-step process. In beginner NLP, a simple preprocessing pipeline looks like this:

Raw Text → Lowercase → Remove Punctuation → Tokenize → Remove Common Words → Useful Words

| Step | Meaning | Example |
|---|---|---|
| Lowercase | Convert all letters to small letters | AI becomes ai |
| Remove punctuation | Remove symbols like ! ? , . | hello!!! becomes hello |
| Tokenize | Break text into words | learn ai becomes learn, ai |
| Remove common words | Remove very common words | is, the, and |
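
The pipeline above can be sketched as a chain of small functions, one per step. This is only one possible arrangement: the function names and the tiny COMMON_WORDS set are assumptions made for this illustration, not a fixed standard.

```python
import string

# Assumed, very small list of common words (taken from the table's example).
COMMON_WORDS = {"is", "the", "and"}

def lowercase(text):
    return text.lower()

def remove_punctuation(text):
    # Keep every character that is not ASCII punctuation.
    return "".join(ch for ch in text if ch not in string.punctuation)

def tokenize(text):
    return text.split()

def remove_common_words(tokens):
    return [token for token in tokens if token not in COMMON_WORDS]

def preprocess(text):
    # Apply the steps in the same order as the pipeline.
    text = lowercase(text)
    text = remove_punctuation(text)
    tokens = tokenize(text)
    return remove_common_words(tokens)

print(preprocess("The class is ONLINE, and the fees are low!!!"))
# ['class', 'online', 'fees', 'are', 'low']
```

Keeping each step in its own function makes it easy to test one step at a time or to change the order later.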

3. Lowercase Conversion

Computers may treat Python, PYTHON, and python as different words.

Before:
Python PYTHON python
After lowercase:
python python python

After lowercase conversion, all three strings are identical, so the computer treats them as the same word.

text = "Python PYTHON python"

clean_text = text.lower()

print(clean_text)
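
One way to see the effect is to count how many different words the computer finds before and after lowercasing. This short sketch uses a Python set, which keeps only distinct values; the variable names are chosen just for this example.

```python
words = "Python PYTHON python".split()

# A set keeps only distinct values.
distinct_before = set(words)
distinct_after = set(word.lower() for word in words)

print(len(distinct_before))  # 3
print(len(distinct_after))   # 1
```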

4. Removing Punctuation

Punctuation marks are symbols such as comma, full stop, question mark, and exclamation mark.

Before:
Do you teach AI, ML, and NLP?
After punctuation removal:
Do you teach AI ML and NLP

This makes tokenization cleaner because words are separated from punctuation marks.

import string

text = "Do you teach AI, ML, and NLP?"

clean_text = ""

for character in text:
    if character not in string.punctuation:
        clean_text = clean_text + character

print(clean_text)
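
The loop above is easy to read. As an optional alternative, the same result can be produced with the built-in str.maketrans and str.translate methods. This is a sketch of a shorter style, not a replacement for understanding the loop. Note that string.punctuation contains only ASCII punctuation, so symbols such as curly quotes are not removed by either version.

```python
import string

text = "Do you teach AI, ML, and NLP?"

# Build a table that marks every ASCII punctuation character for deletion.
remove_table = str.maketrans("", "", string.punctuation)

clean_text = text.translate(remove_table)

print(clean_text)  # Do you teach AI ML and NLP
```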

5. Extra Spaces

Sometimes text contains too many spaces. We should convert multiple spaces into one space.

Before:
Python     AI      ML
After:
Python AI ML

text = "Python     AI      ML"

clean_text = " ".join(text.split())

print(clean_text)
The expression text.split() breaks the text into words. Then " ".join(...) joins the words using one space.
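
The same idea also handles tabs, line breaks, and spaces at the start or end of the text, because split() with no argument splits on any run of whitespace. The sample string below is made up for this illustration.

```python
text = "  Python\tAI \n ML  "

# split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and ignores leading and trailing whitespace.
words = text.split()
clean_text = " ".join(words)

print(words)       # ['Python', 'AI', 'ML']
print(clean_text)  # Python AI ML
```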

6. Normalization

Normalization means converting different forms of text into a common form.

| Input | Normalized Form |
|---|---|
| AI | artificial intelligence |
| ML | machine learning |
| course | class |
| tutorial | lesson |

Before normalization:
I want an AI course
After normalization:
i want an artificial intelligence class
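
A minimal sketch of this normalization, assuming the mapping from the table above is stored in a dictionary. The names NORMALIZATION and normalize are made up for this example.

```python
# Mapping copied from the normalization table in this section.
NORMALIZATION = {
    "ai": "artificial intelligence",
    "ml": "machine learning",
    "course": "class",
    "tutorial": "lesson",
}

def normalize(text):
    words = text.lower().split()
    # dict.get(word, word) returns the replacement if there is one,
    # otherwise the word itself.
    return " ".join(NORMALIZATION.get(word, word) for word in words)

print(normalize("I want an AI course"))
# i want an artificial intelligence class
```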

7. Stemming Idea

Stemming means cutting a word down to its rough root form.

| Word | Stem-like Root |
|---|---|
| learning | learn |
| learned | learn |
| teaching | teach |
| classes | class |

Beginner meaning: stemming helps us treat related word forms as almost the same.
Stemming is not always perfect. It may sometimes cut words too much. For beginners, first understand the idea before using advanced tools.
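
To see how stemming can cut too much, here is a deliberately naive sketch that only strips a few endings. The function name rough_stem and the suffix list are assumptions for this illustration; real stemmers use more careful rules.

```python
def rough_stem(word):
    # Strip one common ending; no dictionary and no grammar are used.
    for suffix in ("ing", "ed", "es"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

for word in ["learning", "classes", "running", "sing", "red"]:
    print(word, "-->", rough_stem(word))

# learning --> learn
# classes --> class
# running --> runn
# sing --> s
# red --> r
```

The first two results look right, but "runn", "s", and "r" show why a simple cutting rule is only a rough tool.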

8. Lemmatization Idea

Lemmatization also converts a word to its base form, but it relies on a dictionary and the word's grammatical role instead of cutting off endings, so its result is usually a real word.

| Word | Lemma |
|---|---|
| running | run |
| better | good |
| children | child |
| studies | study |

Simple difference: stemming cuts words; lemmatization tries to find the proper dictionary form.
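
The dictionary idea can be sketched with a tiny hand-made lookup table. This is only a toy: the LEMMAS dictionary and the lookup_lemma function are made up here, and real lemmatizers use much larger vocabularies plus grammar information.

```python
# Tiny hand-made lemma dictionary, copied from the table in this section.
LEMMAS = {
    "running": "run",
    "better": "good",
    "children": "child",
    "studies": "study",
}

def lookup_lemma(word):
    word = word.lower()
    # Unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for word in ["Children", "better", "studies", "python"]:
    print(word, "-->", lookup_lemma(word))

# Children --> child
# better --> good
# studies --> study
# python --> python
```

No cutting rule could turn "better" into "good"; only a lookup of the proper dictionary form can do that.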

9. Python Example: Complete Text Cleaner

Now let us build a small text cleaner using only basic Python.

import string

text = "Hiii Sir!!! I want to Learn AI, ML & NLP... Is class ONLINE???"

# Step 1: lowercase
text = text.lower()

# Step 2: remove punctuation
clean_text = ""

for character in text:
    if character not in string.punctuation:
        clean_text = clean_text + character

# Step 3: remove extra spaces
clean_text = " ".join(clean_text.split())

print(clean_text)

Possible Output

hiii sir i want to learn ai ml nlp is class online
Notice that the text is now simpler. It is ready for tokenization and further NLP processing.

10. More Code Samples: 11% to 20%

These examples continue the previous lesson. Copy one example at a time into the Python editor and run it.

Example 1: Lowercase Student Message

message = "I Want To Learn PYTHON and AI"

message = message.lower()

print(message)

Example 2: Remove Punctuation From a Question

import string

question = "Do you teach Python, AI, ML, and NLP?"

clean_question = ""

for ch in question:
    if ch not in string.punctuation:
        clean_question = clean_question + ch

print(clean_question)

Example 3: Remove Extra Spaces

text = "Python      AI       ML       NLP"

clean_text = " ".join(text.split())

print(clean_text)

Example 4: Full Cleaning Function

import string

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    result = " ".join(result.split())

    return result

message = "Hello!!! I want to Learn AI & ML..."

print(clean_text(message))

Example 5: Clean and Tokenize

import string

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    result = " ".join(result.split())

    return result

message = "Champak Roy teaches Python, AI, ML, and NLP!"

cleaned = clean_text(message)
tokens = cleaned.split()

print("Cleaned text:", cleaned)
print("Tokens:", tokens)

Example 6: Remove Common Words After Cleaning

import string

common_words = ["is", "am", "are", "the", "a", "an", "and", "to", "in", "of"]

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    result = " ".join(result.split())

    return result

sentence = "Champak Roy is teaching AI and ML in the class."

cleaned = clean_text(sentence)
words = cleaned.split()

important_words = []

for word in words:
    if word not in common_words:
        important_words.append(word)

print("Cleaned:", cleaned)
print("Important words:", important_words)

Example 7: Normalize AI and ML

text = "I want to learn AI and ML"

normalization = {
    "ai": "artificial intelligence",
    "ml": "machine learning"
}

words = text.lower().split()

final_words = []

for word in words:
    if word in normalization:
        final_words.append(normalization[word])
    else:
        final_words.append(word)

print(final_words)

Example 8: Normalize Course Words

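text = "I want a tutorial for artificial intelligence"

# Multi-word phrases are replaced first, on the whole text.
# Mapping "artificial" and "intelligence" separately would produce "ai ai".
phrase_normalization = {
    "artificial intelligence": "ai"
}

# Single words are replaced after splitting.
word_normalization = {
    "tutorial": "lesson",
    "course": "class"
}

text = text.lower()

for phrase in phrase_normalization:
    text = text.replace(phrase, phrase_normalization[phrase])

final_words = []

for word in text.split():
    if word in word_normalization:
        final_words.append(word_normalization[word])
    else:
        final_words.append(word)

print(final_words)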

Example 9: Very Simple Stemming Idea

words = ["learning", "learned", "teaching", "classes"]

for word in words:
    if word.endswith("ing"):
        print(word, "-->", word[:-3])
    elif word.endswith("ed"):
        print(word, "-->", word[:-2])
    elif word.endswith("es"):
        print(word, "-->", word[:-2])
    else:
        print(word, "-->", word)

Example 10: Clean a List of Student Messages

import string

messages = [
    "Hello Sir!!! I want Python class.",
    "Do you teach AI, ML and NLP?",
    "What is the timing???",
    "Is the class ONLINE?"
]

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    result = " ".join(result.split())

    return result

for message in messages:
    print(clean_text(message))

Example 11: Build a Cleaner Search Query

import string

search = "NLP!!! text similarity??? Python examples..."

# Keep the original search unchanged so that it can be printed later.
lowered = search.lower()

clean_search = ""

for ch in lowered:
    if ch not in string.punctuation:
        clean_search = clean_search + ch

clean_search = " ".join(clean_search.split())

print("Original search:", search)
print("Clean search:", clean_search)

Example 12: Clean Before Matching FAQ

import string

faqs = {
    "fees": "Please check the course fee details on learnwithchampak.live.",
    "timing": "Please check the latest class timing.",
    "python": "Yes, Python is included.",
    "nlp": "Yes, NLP is part of the AI-ML path."
}

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    return " ".join(result.split())

question = "Sir!!! Do you teach NLP???"

clean_question = clean_text(question)

answer_found = False

for keyword in faqs:
    if keyword in clean_question:
        print(faqs[keyword])
        answer_found = True
        break

if not answer_found:
    print("Please contact Champak Roy for details.")

Example 13: Count Important Words After Cleaning

import string

text = "AI is useful. AI is powerful. Python helps AI and ML."

common_words = ["is", "and", "the", "a", "an"]

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    return " ".join(result.split())

cleaned = clean_text(text)
words = cleaned.split()

word_count = {}

for word in words:
    if word not in common_words:
        if word in word_count:
            word_count[word] = word_count[word] + 1
        else:
            word_count[word] = 1

print(word_count)
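
Once the counting loop is clear, the standard library's collections.Counter can do the same counting in one step. The sketch below starts from a word list that is already cleaned and tokenized; the list is written out by hand here so that the example runs on its own.

```python
from collections import Counter

# Cleaned and tokenized words, written out by hand for this example.
words = ["ai", "is", "useful", "ai", "is", "powerful",
         "python", "helps", "ai", "and", "ml"]
common_words = ["is", "and", "the", "a", "an"]

important = [word for word in words if word not in common_words]

# Counter counts how many times each item appears.
word_count = Counter(important)

print(word_count["ai"])           # 3
print(word_count.most_common(1))  # [('ai', 3)]
```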

Example 14: Mini Project — Clean Blog Titles

import string

titles = [
    "Beginning NLP!!!",
    "Python, AI, ML and Data Science",
    "Google Search Console: Full Guide",
    "Sorting Trace & Algorithm Detection"
]

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    return " ".join(result.split())

for title in titles:
    print(clean_text(title))

Example 15: Mini Project — Student Message Cleaner

import string

def clean_text(text):
    text = text.lower()

    result = ""

    for ch in text:
        if ch not in string.punctuation:
            result = result + ch

    result = " ".join(result.split())

    return result

message = input("Enter student message: ")

cleaned = clean_text(message)

print("Cleaned message:")
print(cleaned)

print("Words:")
print(cleaned.split())
Classroom flow: In this lesson, students should understand that good NLP starts with clean text. Bad input usually gives bad output.

11. Practice in Our Python Editor

Use the embedded Programmers Picnic Python editor below to run the text cleaning examples.


12. Complete Beginner Summary

| Topic | Meaning | Example |
|---|---|---|
| Text Cleaning | Making text simpler before NLP | Hello!!! becomes hello |
| Lowercase | Convert all text to small letters | AI becomes ai |
| Punctuation Removal | Remove symbols | Python, becomes Python |
| Extra Space Removal | Keep only normal spacing | AI     ML becomes AI ML |
| Normalization | Convert different forms to one form | ML becomes machine learning |
| Stemming | Cut word to rough root | learning becomes learn |
| Lemmatization | Find proper base word | children becomes child |

13. Practice Questions

  1. Why do we clean text before NLP?
  2. Convert this text to lowercase:
    Python AI ML NLP
  3. Remove punctuation:
    Hello!!! Do you teach AI???
  4. Remove extra spaces:
    Python      is      useful
  5. What is normalization?
  6. What is the difference between stemming and lemmatization?
  7. Clean this student message:
    Sir!!! I want to Learn ML & NLP...

14. Mini Assignment

Create a Python program that takes a student message and cleans it.

Input Example

Hello Sir!!! I want to Learn AI, ML, and NLP...

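Your program should do these steps:

  1. Convert the message to lowercase.
  2. Remove punctuation.
  3. Remove extra spaces.
  4. Split the cleaned message into words.
  5. Print the cleaned message and the list of words.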

Expected output idea:
hello sir i want to learn ai ml and nlp
['hello', 'sir', 'i', 'want', 'to', 'learn', 'ai', 'ml', 'and', 'nlp']