1. Bag-of-Words (BoW)
What is Bag-of-Words? Bag-of-Words represents a piece of text as a vector of word counts, ignoring grammar and word order.
Example using scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer

# Sample texts
texts = ["This is a sample text.", "This is another sample text."]

# Create a bag-of-words vectorizer
vectorizer = CountVectorizer()

# Fit and transform the texts
bow_matrix = vectorizer.fit_transform(texts)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
print("Feature Names:", feature_names)

# Print the bag-of-words matrix
print("Bag-of-Words Matrix:\n", bow_matrix.toarray())
Output:
Feature Names: ['another' 'is' 'sample' 'text' 'this']
Bag-of-Words Matrix:
 [[0 1 1 1 1]
 [1 1 1 1 1]]
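To see what CountVectorizer is doing, here is a minimal sketch that builds the same matrix by hand with only the standard library. The tokenize helper below is a simplified stand-in for CountVectorizer's default tokenization (lowercase, keep tokens of two or more word characters, which is why "a" is dropped):

import re
from collections import Counter

texts = ["This is a sample text.", "This is another sample text."]

# Simplified stand-in for CountVectorizer's default tokenization
def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted vocabulary over all documents
vocab = sorted({token for text in texts for token in tokenize(text)})

# One row of counts per document; Counter returns 0 for absent words
matrix = [[Counter(tokenize(text))[word] for word in vocab] for text in texts]

print("Feature Names:", vocab)
print("Bag-of-Words Matrix:")
for row in matrix:
    print(row)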
2. TF-IDF (Term Frequency-Inverse Document Frequency)
What is TF-IDF? TF-IDF scores how important a word is to a document within a corpus: a word's weight rises with its frequency in that document and falls with the number of documents that contain it.
Example using scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample texts
texts = ["This is a sample text.", "This is another sample text."]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the texts
tfidf_matrix = vectorizer.fit_transform(texts)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
print("Feature Names:", feature_names)

# Print the TF-IDF matrix
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
Output:
Feature Names: ['another' 'is' 'sample' 'text' 'this']
TF-IDF Matrix:
 [[0.         0.5        0.5        0.5        0.5       ]
 [0.57496187 0.4090901  0.4090901  0.4090901  0.4090901 ]]
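These values follow from scikit-learn's defaults: with smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t; each row of raw counts is multiplied by the IDF and then L2-normalized. Notice that "another", the only word unique to one document, gets the highest weight. A minimal NumPy sketch mirroring those defaults reproduces the matrix:

import numpy as np

# Raw counts over ['another', 'is', 'sample', 'text', 'this']
counts = np.array([[0, 1, 1, 1, 1],
                   [1, 1, 1, 1, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)            # document frequency of each term

# Smoothed IDF, as in TfidfVectorizer(smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1

tfidf = counts * idf                     # term frequency times IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(tfidf)
# [[0.         0.5        0.5        0.5        0.5       ]
#  [0.57496187 0.4090901  0.4090901  0.4090901  0.4090901 ]]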
3. Word Embeddings
What are Word Embeddings? Word embeddings are dense vector representations of words that capture meaning and context: words that appear in similar contexts end up with similar vectors.
Example using gensim (Word2Vec):
from gensim.models import Word2Vec

# Load pre-trained Word2Vec embeddings (example)
model = Word2Vec.load("path/to/your/embeddings.model")

# Get the vector representation of a word
vector = model.wv["example"]
print("Vector for 'example':", vector)

# Find the most similar words
similar_words = model.wv.most_similar(positive=["example"])
print("Most Similar Words to 'example':", similar_words)
Note: Replace "path/to/your/embeddings.model" with the actual path to your Word2Vec model file.
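If you don't have a saved model on disk, a quick alternative is to train a tiny Word2Vec model yourself. The sketch below uses gensim 4's API (vector_size and epochs replaced the older size and iter parameters) on a made-up toy corpus, so the resulting vectors are only illustrative; meaningful embeddings need far more text:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (made up for illustration)
sentences = [
    ["this", "is", "a", "sample", "text"],
    ["this", "is", "another", "sample", "text"],
    ["word", "embeddings", "capture", "meaning", "and", "context"],
]

# Train a small model: 50-dimensional vectors, context window of 2,
# keep every word (min_count=1), and make several passes over the data
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)

# Vector for a word and its nearest neighbors in this toy space
print(model.wv["sample"])
print(model.wv.most_similar("sample"))

# Save so it can be reloaded with Word2Vec.load(...) as in the example above
model.save("toy_word2vec.model")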
These techniques are essential across Natural Language Processing (NLP) tasks. Bag-of-Words and TF-IDF are simpler and work well for tasks like text classification and information retrieval. Word embeddings such as Word2Vec or GloVe are more advanced, capturing semantic relationships that power tasks like machine translation and sentiment analysis.
Understanding these techniques helps in preprocessing text data effectively and leveraging advanced NLP models like Large Language Models (LLMs) for complex tasks.
