1. Tokenization
What is Tokenization? Tokenization means breaking text down into smaller units called tokens, such as words, punctuation marks, or sentences.
Example:
import nltk
# The tokenizer models are needed on first use: nltk.download('punkt')
# Sample text
text = "This is a sample sentence with playing children and small houses."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
print(tokens)
Output:
['This', 'is', 'a', 'sample', 'sentence', 'with', 'playing', 'children', 'and', 'small', 'houses', '.']
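Tokenization can also split text into sentences rather than words. A minimal sketch using NLTK's sent_tokenize (the two-sentence string here is made up for illustration):
import nltk
# Split a short made-up text into sentences instead of words
text = "This is the first sentence. Here is another one."
sentences = nltk.sent_tokenize(text)
print(sentences)
# Output: ['This is the first sentence.', 'Here is another one.']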
2. Stemming
What is Stemming? Stemming reduces words to their base or root form. It chops off suffixes to normalize words.
Example:
from nltk.stem import PorterStemmer
# Create a stemmer object
stemmer = PorterStemmer()
# Stem some words
print(stemmer.stem("playing")) # Output: "play"
print(stemmer.stem("houses")) # Output: "hous"
3. Lemmatization
What is Lemmatization? Lemmatization is like stemming but smarter: it reduces a word to its dictionary form (the lemma) using its meaning and part of speech, so the result is a real word.
Example:
from nltk.stem import WordNetLemmatizer
# The WordNet data is needed on first use: nltk.download('wordnet')
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()
# Lemmatize some words
print(lemmatizer.lemmatize("playing", pos="v")) # Output: "play"
print(lemmatizer.lemmatize("houses", pos="n")) # Output: "house"
Example of Combined Tokenization, Stemming, and Lemmatization
Putting it all together:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
# First run requires: nltk.download('punkt') and nltk.download('wordnet')
# Sample text
text = "This is a sample sentence with playing children and small houses."
# Tokenize the text
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)
# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print("Stemmed Tokens:", stemmed_tokens)
# Lemmatization (note: pos='v' treats every token as a verb)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]
print("Lemmatized Tokens:", lemmatized_tokens)
Output:
Tokens: ['This', 'is', 'a', 'sample', 'sentence', 'with', 'playing', 'children', 'and', 'small', 'houses', '.']
Stemmed Tokens: ['thi', 'is', 'a', 'sampl', 'sentenc', 'with', 'play', 'children', 'and', 'small', 'hous', '.']
Lemmatized Tokens: ['This', 'be', 'a', 'sample', 'sentence', 'with', 'play', 'children', 'and', 'small', 'house', '.']
Explanation:
First, we broke the text into individual words (tokens). Then we simplified them with stemming, which blindly chops suffixes (turning "playing" into "play" and "houses" into "hous"). Finally, we refined them with lemmatization, which uses vocabulary and part of speech to produce real words ("playing" becomes "play", "houses" becomes "house"). Notice what forcing pos='v' did: the verb "is" was correctly mapped to "be", but the noun "children" was left unchanged, because the lemmatizer was told to treat it as a verb. A part-of-speech-aware pipeline, sketched below, avoids this.
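As a minimal sketch of that idea (the wordnet_pos helper below is our own illustration, not an NLTK function; first run also requires nltk.download('averaged_perceptron_tagger')):
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet pos constant (our helper, not NLTK's)
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # fall back to the lemmatizer's default

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("This is a sample sentence with playing children and small houses.")

# Tag each token, then lemmatize it with the mapped part of speech
tagged = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(token, wordnet_pos(tag)) for token, tag in tagged]
print(lemmas)
With tagging in place, "is" still lemmatizes to "be" (it is tagged as a verb), while "children" now correctly becomes "child" (tagged as a noun).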
These techniques are crucial in Natural Language Processing (NLP) for tasks like text analysis, where understanding the meaning of words is important. They help in making text data more structured and useful for machines to process.
[…]
A- Text preprocessing (tokenization, stemming, lemmatization)
B- Feature extraction (bag-of-words, TF-IDF, word embeddings)
C- NLP tasks (sentiment analysis, named entity recognition, text classification)
[…]