Bag of Words

This interactive demo shows how text is converted into numerical representations using the bag-of-words approach. Enter sentences to see how each word maps to a position in a 370,000+ word dictionary.

The demo uses one-hot encoding to create sparse vectors representing each sentence. Compare two sentences to see how bag-of-words treats text as an unordered collection of words: "john ate the horse" and "the horse ate john" produce identical representations.

Bag of Words Representation:

The bag-of-words model represents text as a collection of words, disregarding grammar and word order. Each unique word in the vocabulary gets assigned an index position.

One-Hot Encoding:
For a vocabulary of size $n$, each sentence is represented as a binary vector of length $n$, where position $i$ is 1 if word $w_i$ appears in the sentence, and 0 otherwise.

For vocabulary $V = \{w_1, w_2, ..., w_n\}$ and sentence $s$:
$$\mathrm{BoW}(s) = [b_1, b_2, \ldots, b_n]$$
where $b_i = 1$ if $w_i \in s$, else $b_i = 0$
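The definition above can be sketched in a few lines of Python. The four-word vocabulary is a toy stand-in for the demo's 370,103-word dictionary:

```python
# Toy vocabulary standing in for the demo's 370,103-word dictionary.
vocab = ["ate", "horse", "john", "the"]

def bow(sentence):
    """Binary vector of length len(vocab): position i is 1 if
    vocab[i] appears in the sentence, 0 otherwise."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(bow("john ate the horse"))  # [1, 1, 1, 1]
print(bow("the horse ate john"))  # [1, 1, 1, 1]  (order doesn't matter)
```

Because membership is tested against a set of words, reordering a sentence can never change its vector.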

Cosine Similarity:
To compare two vectors $\mathbf{A}$ and $\mathbf{B}$:
$$\mathrm{similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert\mathbf{A}\rVert\, \lVert\mathbf{B}\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$
Because the vectors are binary, every component is non-negative, so the similarity falls between 0 and 1; the demo displays it as a percentage.
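The formula translates directly into code. This sketch uses a toy shared vocabulary ["cats", "dogs", "i", "love"] for the two sentences "i love cats" and "i love dogs":

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Binary vectors over the toy vocabulary ["cats", "dogs", "i", "love"]:
a = [1, 0, 1, 1]  # "i love cats"
b = [0, 1, 1, 1]  # "i love dogs"
print(round(cosine_similarity(a, b), 4))  # 0.6667 -> shown as 66.67%
```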

How to Use:

1. Enter a sentence in the Sentence 1 input field (e.g., "john ate the horse")
2. Click "Analyze Sentence 1" to see word-to-index mappings and the sparse vector representation
3. Enter a different sentence in Sentence 2 (e.g., "the horse ate john")
4. Click "Analyze Sentence 2" to process the second sentence
5. Click "Compare Sentences" to calculate cosine similarity between the two vectors

Try these examples:
• "john ate the horse" vs "the horse ate john" - should show 100% similarity (same words, different order)
• "i love cats" vs "i love dogs" - moderate similarity (two shared words out of three)
• "the cat sleeps" vs "dogs run fast" - low similarity (no shared words)
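The three example pairs above can be checked with a compact set-based version of the same computation: for binary vectors, the dot product is the number of shared words and each norm is the square root of the sentence's word count.

```python
import math

def bow_cosine(s1, s2):
    """Cosine similarity of two binary bag-of-words vectors,
    computed directly from the word sets."""
    w1, w2 = set(s1.split()), set(s2.split())
    shared = len(w1 & w2)
    return shared / math.sqrt(len(w1) * len(w2))

print(bow_cosine("john ate the horse", "the horse ate john"))  # 1.0
print(round(bow_cosine("i love cats", "i love dogs"), 2))      # 0.67
print(bow_cosine("the cat sleeps", "dogs run fast"))           # 0.0
```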

Understanding Bag of Words:

• Order Independence: Word order is ignored entirely; "john ate the horse" and "the horse ate john" have the same representation
• Loss of Meaning: Those two sentences mean very different things, yet their bag-of-words representations are identical
• Sparse Vectors: With 370,103 words in the dictionary, most vector positions are 0; a typical sentence activates only 3-8 positions
• Unknown Words: Words not in the dictionary are highlighted in red and excluded from the vector representation
• Case Insensitivity: All words are converted to lowercase before lookup
• Punctuation: Automatically removed during tokenization
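The lowercasing and punctuation handling described above can be sketched as follows. Note this is an illustrative approximation: the demo's exact tokenizer may differ in detail.

```python
import string

def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on
    whitespace, mirroring the preprocessing described above."""
    lowered = text.lower()
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

print(tokenize("John ate the horse!"))  # ['john', 'ate', 'the', 'horse']
```

Each resulting token would then be looked up in the dictionary; tokens with no entry are the "unknown words" shown in red.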

Real-World Applications:
• Document classification (spam detection, topic categorization)
• Information retrieval (search engines)
• Sentiment analysis (positive/negative classification)
• Text clustering (grouping similar documents)
