1. Bag-of-Words (BoW)
What is Bag-of-Words? Bag-of-Words represents a piece of text as a vector of word counts, ignoring grammar and word order.
Example using scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer

# Sample texts
texts = ["This is a sample text.", "This is another sample text."]

# Create a bag-of-words vectorizer
vectorizer = CountVectorizer()

# Fit and transform the texts
bow_matrix = vectorizer.fit_transform(texts)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
print("Feature Names:", feature_names)

# Print the bag-of-words matrix
print("Bag-of-Words Matrix:\n", bow_matrix.toarray())
Output:
Feature Names: ['another' 'is' 'sample' 'text' 'this']
Bag-of-Words Matrix:
 [[0 1 1 1 1]
 [1 1 1 1 1]]
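To see what CountVectorizer is doing, here is a minimal sketch that builds the same matrix by hand with only the standard library. The tokenize helper below is a simplified stand-in for CountVectorizer's default tokenization (lowercase, keep tokens of two or more word characters, which is why "a" is dropped):

import re
from collections import Counter

texts = ["This is a sample text.", "This is another sample text."]

# Simplified stand-in for CountVectorizer's default tokenization
def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted vocabulary over all documents
vocab = sorted({token for text in texts for token in tokenize(text)})

# One row of counts per document; Counter returns 0 for absent words
matrix = [[Counter(tokenize(text))[word] for word in vocab] for text in texts]

print("Feature Names:", vocab)
print("Bag-of-Words Matrix:")
for row in matrix:
    print(row)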
2. TF-IDF (Term Frequency-Inverse Document Frequency)
What is TF-IDF? TF-IDF scores how important a word is to a document within a corpus: a word's weight rises with its frequency in that document and falls with the number of documents that contain it.
Example using scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample texts
texts = ["This is a sample text.", "This is another sample text."]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the texts
tfidf_matrix = vectorizer.fit_transform(texts)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
print("Feature Names:", feature_names)

# Print the TF-IDF matrix
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
Output:
Feature Names: ['another' 'is' 'sample' 'text' 'this']
TF-IDF Matrix:
 [[0.         0.5        0.5        0.5        0.5       ]
 [0.57496187 0.4090901  0.4090901  0.4090901  0.4090901 ]]
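These values follow from scikit-learn's defaults: with smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t; each row of raw counts is multiplied by the IDF and then L2-normalized. Notice that "another", the only word unique to one document, gets the highest weight. A minimal NumPy sketch mirroring those defaults reproduces the matrix:

import numpy as np

# Raw counts over ['another', 'is', 'sample', 'text', 'this']
counts = np.array([[0, 1, 1, 1, 1],
                   [1, 1, 1, 1, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)            # document frequency of each term

# Smoothed IDF, as in TfidfVectorizer(smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1

tfidf = counts * idf                     # term frequency times IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(tfidf)
# [[0.         0.5        0.5        0.5        0.5       ]
#  [0.57496187 0.4090901  0.4090901  0.4090901  0.4090901 ]]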
3. Word Embeddings
What are Word Embeddings? Word embeddings are dense vector representations of words that capture meaning and context: words that appear in similar contexts end up with similar vectors.
Example using gensim (Word2Vec):
from gensim.models import Word2Vec

# Load pre-trained Word2Vec embeddings (example)
model = Word2Vec.load("path/to/your/embeddings.model")

# Get the vector representation of a word
vector = model.wv["example"]
print("Vector for 'example':", vector)

# Find the most similar words
similar_words = model.wv.most_similar(positive=["example"])
print("Most Similar Words to 'example':", similar_words)
Note: Replace "path/to/your/embeddings.model" with the actual path to your Word2Vec model file.
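If you don't have a saved model on disk, a quick alternative is to train a tiny Word2Vec model yourself. The sketch below uses gensim 4's API (vector_size and epochs replaced the older size and iter parameters) on a made-up toy corpus, so the resulting vectors are only illustrative; meaningful embeddings need far more text:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (made up for illustration)
sentences = [
    ["this", "is", "a", "sample", "text"],
    ["this", "is", "another", "sample", "text"],
    ["word", "embeddings", "capture", "meaning", "and", "context"],
]

# Train a small model: 50-dimensional vectors, context window of 2,
# keep every word (min_count=1), and make several passes over the data
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)

# Vector for a word and its nearest neighbors in this toy space
print(model.wv["sample"])
print(model.wv.most_similar("sample"))

# Save so it can be reloaded with Word2Vec.load(...) as in the example above
model.save("toy_word2vec.model")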
These techniques are essential across Natural Language Processing (NLP) tasks. Bag-of-Words and TF-IDF are simpler and work well for tasks like text classification and information retrieval. Word embeddings such as Word2Vec or GloVe are more advanced, capturing semantic relationships that power tasks like machine translation and sentiment analysis.
Understanding these techniques helps in preprocessing text data effectively and leveraging advanced NLP models like Large Language Models (LLMs) for complex tasks.
