1. Bag-of-Words (BoW)
What is Bag-of-Words? Bag-of-Words is a way to represent text as a count of words, ignoring grammar and word order.
Example using scikit-learn:
```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample texts
texts = ["This is a sample text.", "This is another sample text."]

# Create a bag-of-words vectorizer
vectorizer = CountVectorizer()

# Fit and transform the texts
bow_matrix = vectorizer.fit_transform(texts)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
print("Feature Names:", feature_names)

# Print the bag-of-words matrix
print("Bag-of-Words Matrix:\n", bow_matrix.toarray())
```
Output:
```
Feature Names: ['another', 'is', 'sample', 'text', 'this']
Bag-of-Words Matrix:
 [[0 1 1 1 1]
  [1 1 1 1 1]]
```
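To make the representation concrete, the same counts can be reproduced by hand. This is a minimal pure-Python sketch (not scikit-learn's actual implementation) that lowercases, tokenizes, and counts; the token pattern keeps only words of two or more characters, which is why "a" is absent from the vocabulary above:

```python
import re
from collections import Counter

texts = ["This is a sample text.", "This is another sample text."]

# Tokenize: lowercase and keep tokens of 2+ word characters,
# mirroring CountVectorizer's default token pattern (drops "a")
def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: sorted set of all tokens across the corpus
vocab = sorted({tok for t in texts for tok in tokenize(t)})

# Count matrix: one row per document, one column per vocabulary word
matrix = [[Counter(tokenize(t))[word] for word in vocab] for t in texts]

print("Vocabulary:", vocab)   # ['another', 'is', 'sample', 'text', 'this']
print("Counts:", matrix)      # [[0, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
```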
2. TF-IDF (Term Frequency-Inverse Document Frequency)
What is TF-IDF? TF-IDF measures how important a word is to a document in a corpus, considering its frequency in the document and across all documents.
Example using scikit-learn:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample texts
texts = ["This is a sample text.", "This is another sample text."]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the texts
tfidf_matrix = vectorizer.fit_transform(texts)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
print("Feature Names:", feature_names)

# Print the TF-IDF matrix
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
```
Output:
```
Feature Names: ['another', 'is', 'sample', 'text', 'this']
TF-IDF Matrix (values rounded):
 [[0.     0.5    0.5    0.5    0.5   ]
  [0.575  0.4091 0.4091 0.4091 0.4091]]
```
Note that the first row is simply the L2-normalized count vector (each of its four words appears in both documents, so they share the same IDF), while "another", which appears in only one document, gets a higher weight in the second row.
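To see where these numbers come from, here is a minimal pure-Python recomputation (a sketch, assuming TfidfVectorizer's defaults: smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 row normalization), applied to the second document:

```python
import math

n_docs = 2
# Document frequencies in the two-sentence corpus above
df = {"another": 1, "is": 2, "sample": 2, "text": 2, "this": 2}

# Smoothed IDF, as used by TfidfVectorizer by default:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

# Raw TF-IDF for doc 2 ("This is another sample text."):
# each term occurs exactly once, so the row is just the IDF values
row = [idf[t] for t in ["another", "is", "sample", "text", "this"]]

# L2-normalize the row, matching norm="l2" (the default)
norm = math.sqrt(sum(x * x for x in row))
row = [x / norm for x in row]

print([round(x, 4) for x in row])  # [0.575, 0.4091, 0.4091, 0.4091, 0.4091]
```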
3. Word Embeddings
What are Word Embeddings? Word embeddings are dense vector representations of words that capture their meaning and context.
Example using gensim (Word2Vec):
```python
from gensim.models import Word2Vec

# Load pre-trained Word2Vec embeddings (example)
model = Word2Vec.load("path/to/your/embeddings.model")

# Get the vector representation of a word
vector = model.wv["example"]
print("Vector for 'example':", vector)

# Find the most similar words
similar_words = model.wv.most_similar(positive=["example"])
print("Most Similar Words to 'example':", similar_words)
```
Note: Replace "path/to/your/embeddings.model" with the actual path to your Word2Vec model file.
These techniques are essential in Natural Language Processing (NLP) for various tasks. Bag-of-Words and TF-IDF are simpler, useful for tasks like text classification and information retrieval. Word embeddings, like Word2Vec or GloVe, are more advanced, capturing semantic relationships for tasks like language translation and sentiment analysis.
Understanding these techniques helps in preprocessing text data effectively and leveraging advanced NLP models like Large Language Models (LLMs) for complex tasks.
