Bag of Words
This interactive demo shows how text is converted into numerical representations using the bag-of-words approach. Enter sentences to see how each word maps to a position in a dictionary of more than 370,000 words.
The demo uses one-hot encoding to create sparse vectors representing each sentence. Compare two sentences to see how bag-of-words treats text as an unordered collection of words: "john ate the horse" and "the horse ate john" produce identical representations.
Bag of Words Representation:
The bag-of-words model represents text as a collection of words, disregarding grammar and word order. Each unique word in the vocabulary is assigned an index position.
One-Hot Encoding:
For a vocabulary of size $n$, each sentence is represented as a binary vector of length $n$, where position $i$ is 1 if word $w_i$ appears in the sentence, and 0 otherwise.
For vocabulary $V = \{w_1, w_2, \ldots, w_n\}$ and sentence $S$:
$$\mathbf{v} = (v_1, v_2, \ldots, v_n)$$
where $v_i = 1$ if $w_i \in S$, else $v_i = 0$.
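A minimal sketch of this encoding in Python, using a hypothetical four-word vocabulary in place of the demo's full dictionary:

```python
# Toy vocabulary standing in for the demo's 370,103-word dictionary.
vocabulary = ["ate", "horse", "john", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def encode(sentence):
    """Return a binary vector v with v[i] = 1 iff vocabulary word i is in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in word_to_index:  # words outside the vocabulary are ignored
            vector[word_to_index[word]] = 1
    return vector

print(encode("john ate the horse"))  # [1, 1, 1, 1]
print(encode("the horse ate john"))  # [1, 1, 1, 1]  (same vector: order is discarded)
```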
Cosine Similarity:
To compare two vectors $\mathbf{a}$ and $\mathbf{b}$:
$$\text{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
Because bag-of-words vectors are binary and never negative, this gives a value between 0 and 1, which we convert to a percentage.
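A sketch of that computation, assuming plain Python lists as vectors (the demo's actual implementation may differ):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Binary vectors for "i love cats" and "i love dogs" over the
# hypothetical vocabulary ["cats", "dogs", "i", "love"]:
a = [1, 0, 1, 1]
b = [0, 1, 1, 1]
print(f"{cosine_similarity(a, b):.0%}")  # 67%
```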
How to Use:
1. Enter a sentence in the Sentence 1 input field (e.g., "john ate the horse")
2. Click "Analyze Sentence 1" to see word-to-index mappings and the sparse vector representation
3. Enter a different sentence in Sentence 2 (e.g., "the horse ate john")
4. Click "Analyze Sentence 2" to process the second sentence
5. Click "Compare Sentences" to calculate cosine similarity between the two vectors
Try these examples:
• "john ate the horse" vs "the horse ate john" - should show 100% similarity (same words, different order)
• "i love cats" vs "i love dogs" - moderate similarity (two shared words out of three)
• "the cat sleeps" vs "dogs run fast" - low similarity (no shared words)
Understanding Bag of Words:
• Order Independence: Word order is completely ignored, so "john ate the horse" and "the horse ate john" map to the same representation
• Loss of Meaning: Because order is discarded, sentences with very different meanings, such as the pair above, can have identical bag-of-words representations
• Sparse Vectors: With 370,103 words in the dictionary, most vector positions are 0. A typical sentence only activates 3-8 positions
• Unknown Words: Words not in the dictionary are highlighted in red and ignored in the vector representation
• Case Insensitive: All words are converted to lowercase before lookup
• Punctuation: Automatically removed during tokenization, as shown in the sketch after this list
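A sketch of the preprocessing steps just listed, assuming straightforward lowercasing and punctuation stripping (the demo's exact tokenizer may differ):

```python
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()  # case-insensitive dictionary lookup
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(tokenize("John ate the horse!"))  # ['john', 'ate', 'the', 'horse']
```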
Real-World Applications:
• Document classification (spam detection, topic categorization)
• Information retrieval (search engines)
• Sentiment analysis (positive/negative classification)
• Text clustering (grouping similar documents)