Bag of Words
This interactive demo shows how text is converted into numerical representations using the bag-of-words approach. Enter sentences to see how each word maps to a position in a dictionary of more than 370,000 words.
The demo uses one-hot encoding to create sparse vectors representing each sentence. Compare two sentences to see how bag-of-words treats text as an unordered collection of words: "john ate the horse" and "the horse ate john" produce identical representations.
Bag of Words Representation:
The bag-of-words model represents text as a collection of words, disregarding grammar and word order. Each unique word in the vocabulary is assigned an index position.
One-Hot Encoding:
For a vocabulary of size $n$, each sentence is represented as a binary vector of length $n$, where position $i$ is 1 if word $w_i$ appears in the sentence, and 0 otherwise.
For vocabulary $V = \{w_1, w_2, \ldots, w_n\}$ and sentence $S$:
$$\mathbf{v} = (v_1, v_2, \ldots, v_n)$$
where $v_i = 1$ if $w_i \in S$, else $v_i = 0$.
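A minimal sketch of this encoding in Python, using a hypothetical four-word vocabulary in place of the demo's full dictionary:

```python
# Toy vocabulary standing in for the demo's 370,103-word dictionary.
vocabulary = ["ate", "horse", "john", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def encode(sentence):
    """Return a binary vector v with v[i] = 1 iff vocabulary word i is in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in word_to_index:  # words outside the vocabulary are ignored
            vector[word_to_index[word]] = 1
    return vector

print(encode("john ate the horse"))  # [1, 1, 1, 1]
print(encode("the horse ate john"))  # [1, 1, 1, 1]  (same vector: order is discarded)
```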
Cosine Similarity:
To compare two vectors $\mathbf{a}$ and $\mathbf{b}$:
$$\text{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
Because bag-of-words vectors are binary and never negative, this gives a value between 0 and 1, which we convert to a percentage.
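A sketch of that computation, assuming plain Python lists as vectors (the demo's actual implementation may differ):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Binary vectors for "i love cats" and "i love dogs" over the
# hypothetical vocabulary ["cats", "dogs", "i", "love"]:
a = [1, 0, 1, 1]
b = [0, 1, 1, 1]
print(f"{cosine_similarity(a, b):.0%}")  # 67%
```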
How to Use:
1. Enter a sentence in the Sentence 1 input field (e.g., "john ate the horse")
2. Click "Analyze Sentence 1" to see word-to-index mappings and the sparse vector representation
3. Enter a different sentence in Sentence 2 (e.g., "the horse ate john")
4. Click "Analyze Sentence 2" to process the second sentence
5. Click "Compare Sentences" to calculate cosine similarity between the two vectors
Try these examples:
• "john ate the horse" vs "the horse ate john" - should show 100% similarity (same words, different order)
• "i love cats" vs "i love dogs" - moderate similarity (two shared words out of three)
• "the cat sleeps" vs "dogs run fast" - low similarity (no shared words)
Understanding Bag of Words:
• Order Independence: Word order is completely ignored, so "john ate the horse" and "the horse ate john" map to the same representation
• Loss of Meaning: Because order is discarded, sentences with very different meanings, such as the pair above, can have identical bag-of-words representations
• Sparse Vectors: With 370,103 words in the dictionary, most vector positions are 0. A typical sentence only activates 3-8 positions
• Unknown Words: Words not in the dictionary are highlighted in red and ignored in the vector representation
• Case Insensitive: All words are converted to lowercase before lookup
• Punctuation: Automatically removed during tokenization, as shown in the sketch after this list
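A sketch of the preprocessing steps just listed, assuming straightforward lowercasing and punctuation stripping (the demo's exact tokenizer may differ):

```python
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()  # case-insensitive dictionary lookup
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(tokenize("John ate the horse!"))  # ['john', 'ate', 'the', 'horse']
```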
Real-World Applications:
• Document classification (spam detection, topic categorization)
• Information retrieval (search engines)
• Sentiment analysis (positive/negative classification)
• Text clustering (grouping similar documents)