25 Sep 2024
Tokenizers
_Co-authored by Tamil Arasan, Selvakumar Murugan and Malaikannan Sankarasubbu_
In Natural Language Processing (NLP), one of the foundational steps is transforming human language into a format that computational models can understand. This is where tokenizers come into play. Tokenizers are specialized tools that break down text into smaller units called tokens, and convert these tokens into numerical data that models can process.
Imagine you have the sentence:
Artificial intelligence is revolutionizing technology.
To a human, this sentence is clear and meaningful. But we do not
understand the whole sentence in one shot(okay may be you did, but I am
sure if I gave you a paragraph or a even better an essay, you will not
be able to understand them in one shot), but we make sense of parts of
it like words and then phrases and understand the whole sentence as a
composition of meanings from its parts. It is just how things work,
regardless whether we are trying to make a machine mimic our language
understanding or not. This has nothing to do with the reason ML models
or even computers in general work with numbers. It is purely how
language works and there is no going around it.
ML models, like everything else we run on computers, can only work with
numbers, so we need to transform the text into a number, or rather a
series of numbers, since we have more than one word. We have a lot of
freedom in how we transform text into numbers, and as always, with
freedom comes complexity. At its core, though, tokenization is a
two-step process: finding all the tokens, and assigning a unique
number, an ID, to each token.
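As a toy illustration of those two steps, splitting on whitespace only for simplicity (real tokenizers do better, as we will see):

```python
sentence = "Artificial intelligence is revolutionizing technology."

tokens = sentence.split()                                    # step 1: find the tokens
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]                         # step 2: assign each token an ID

print(tokens)  # ['Artificial', 'intelligence', 'is', 'revolutionizing', 'technology.']
print(ids)     # [0, 1, 2, 3, 4]
```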
There are many ways we can segment a sentence or paragraph into pieces:
phrases, words, sub-words or even individual characters. Understanding
why a particular tokenization scheme is better requires a grasp of how
embeddings work. If you're familiar with NLP, you'd ask, "Why?
Tokenization comes before the embedding, right?" Yes, you're right, but
NLP is paradoxical like that. Don't worry, we will cover that as we go.
Background
Before we venture any further, let's understand the difference between
neural networks and typical computer programs. We all know by now that
in traditional computer programs we write the rules into code by hand,
whereas neural networks learn the rules (the mapping between input and
output) from data through a process called training. Unlike normal
programming, where we have a plethora of data structures that can store
information in any shape or form we want, along with algorithms that
jump back and forth through a set of instructions we call code, neural
networks do not allow arbitrary control flow. In a neural network there
is only one direction the "program" can run: left to right.
Unlike traditional programs, which we can feed input in complicated
ways, a neural network accepts input in only a fixed number of ways,
usually as vectors (a fancy name for lists of numbers) of fixed size
(or dimension, more precisely). In most DNNs, the input and output
sizes are fixed regardless of the problem being solved. In CNNs, the
input (usually image) size and number of channels are fixed. In RNNs,
the embedding dimensions, input vocabulary size, number of output
labels (for classification problems, e.g. sentiment classification)
and/or output vocabulary size (for text generation problems, e.g. QA,
translation) are all fixed. In Transformer networks, even the sentence
length is fixed. This is not a bad thing; constraints like these help
the network compress and capture the necessary information.
Also note that there are only a few tools to test "equality",
"relevance" or "correctness" for things inside the network, because the
only things that dwell inside the network are vectors. Cosine
similarity and attention scores are popular. You can think of vectors
as variables that keep track of state inside the neural network
program. But unlike traditional programs, where you can declare
variables as you like and print them for troubleshooting, in networks
the vector-variables are meaningful only at the boundaries of the
layers (not entirely true, but close enough).
Let's take the simplest example to understand why pulling a vector from
an arbitrary point in the network will not be of any value to us. In
the following code, three functions perform an identical calculation
even though their code differs slightly. The intentionally named
variables temp and growth_factor need not be created at all, as shown
by the first function, which directly embodies the compound interest
formula, $A = P(1+\frac{R}{100})^{T}$. Compared to temp, the variable
growth_factor carries a more meaningful interpretation: it represents
how much the money will grow due to compounding interest over time. For
more complicated formulae and functions we might create intermediate
variables so that the code goes easy on the eye, but they have no
significance to the operation of the function.
import math

def compound_interest_1(P, R, T):
    A = P * (math.pow((1 + (R / 100)), T))
    CI = A - P
    return CI

def compound_interest_2(P, R, T):
    temp = (1 + (R / 100))
    A = P * (math.pow(temp, T))
    CI = A - P
    return CI

def compound_interest_3(P, R, T):
    growth_factor = (math.pow((1 + (R / 100)), T))
    A = P * growth_factor
    CI = A - P
    return CI
Another example, this time from an operations perspective: clock
arithmetic. Let's assign the numbers 0 through 6 to the weekdays,
Sunday through Saturday.
Table 1
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| --- | --- | --- | --- | --- | --- | --- |
| 0   | 1   | 2   | 3   | 4   | 5   | 6   |
John Conway suggested a mnemonic device for thinking of the days of
the week as Noneday, Oneday, Twosday, Treblesday, Foursday, Fiveday,
and Six-a-day.
So suppose you want to know what day it is 137 days from today, if
today is, say, Thursday (i.e. 4). We can compute $(4+137) \bmod 7 = 1$,
i.e. Monday. As you can see, adding numbers (days) in clock arithmetic
produces a meaningful output: you can add days together to get another
day. Okay, let's ask the question: can we multiply two days together?
Is it in any way meaningful to multiply days? Just because we can
multiply any two numbers mathematically, is it useful to do so in our
clock arithmetic?
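Here is the same day-addition example in a couple of lines of Python, using the numbering from Table 1:

```python
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

today = 4      # Thursday
offset = 137   # days from today
print(days[(today + offset) % 7])  # Mon
```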
All of this digression is to emphasize that the embedding is meant to
capture the meaning of words, and the vector from the last layer is
meant to capture the meaning of, say, a sentence. But when you take a
vector from somewhere within the layers (just because you can), it does
not refer to any meaningful unit such as a word, phrase or sentence as
we understand it.
A little bit of history
If you're old enough, you might remember that before transformers
became the standard paradigm in NLP, we had another one: EEAP (Embed,
Encode, Attend, Predict). I am grossly oversimplifying here, but you
can think of it as follows.
- Embedding
  - Captures the meaning of words. A matrix of size $N \times D$, where
    - $N$ is the size of the vocabulary, i.e. the number of unique words in the language
    - $D$ is the dimension of the embedding, the vector corresponding to each word.
  - Look up the word vector (embedding) for each word (see the lookup sketch after this list).
- Encoding
  - Finds the meaning of a sentence by using the meaning captured in the embeddings of the constituent words, with the help of RNNs like LSTM and GRU, or transformers like BERT and GPT, which take the embeddings and produce vector(s) for the whole sequence.
- Attention
  - Helps with prediction by focusing on what is important right now, drawing a probability distribution (normalized attention scores) over all the words. Words with a high score are deemed important.
- Prediction
  - Depending on the task at hand, either assigns a label to the input sentence or generates another sentence word by word.
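For concreteness, here is a minimal sketch of the embedding lookup step; the sizes and the random matrix are placeholders, since in a real model the weights are learned during training:

```python
import numpy as np

N, D = 20000, 300                    # vocabulary size and embedding dimension (illustrative)
embedding = np.random.randn(N, D)    # stands in for a learned embedding matrix

token_ids = [42, 7, 1337]            # IDs produced by the tokenizer
word_vectors = embedding[token_ids]  # one D-dimensional vector per token
print(word_vectors.shape)            # (3, 300)
```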
As you can see above, $N$ is the vocabulary size, i.e. the number of
unique words in the language. A handful of years ago, "language"
usually meant the corpus at hand (on the order of a few thousand
sentences), and datasets like CNN/DailyMail were considered huge. There
were clever tricks like anonymizing named entities to force the ML
models to focus on language-specific features like grammar instead of
open-world words such as names of places, presidents, corporations and
countries. Good times they were! The point is, the corpus in your
possession might not contain all the words of the language. As we have
seen, the size of the embedding must be fixed before training the
network. If, by good fortune, you stumble upon a new dataset and hence
new words, adding them to your model is not easy, because the embedding
needs to be extended to accommodate these new (OOV) words, and that
requires retraining the whole network. OOV means Out Of the current
model's Vocabulary. And this is why simply segmenting the text on empty
spaces will not work.
With that background, let's dive in.
Tokenization
Tokenization is the process of segmenting text into individual pieces
(usually words) so that an ML model can digest them. It is the very
first step in any NLP system and influences everything that follows. To
understand the impact of tokenization, we need to understand how
embeddings and sentence length influence the model. We will call the
sentence length the sequence length from here on, because a sentence is
understood to be a sequence of words, and we will experiment with
sequences of different things, not just words, which we will call tokens.
Tokens can be anything
- Words -
"telephone" "booth" "is" "nearby" "the" "post" "office"
- Multiword Expressions (MWEs) -
"telephone booth" "is" "nearby" "the" "post office"
- Sub-words -
"tele" "#phone" "booth" "is" "near " "#by" "the" "post" "office"
- Characters -
"t" "e" "l" "e" "p" ... "c" "e"
We know that segmenting text on empty spaces will not work, because the
vocabulary will keep growing. What about punctuation? Surely that will
help with words like don't, won't, aren't, o'clock, Wendy's,
co-operation, etc. The same reasoning applies here too. Moreover,
segmenting at punctuation creates different problems, e.g.
I.S.R.O > I, S, R, O, which is not ideal.
Objectives of Tokenization
The primary objectives of tokenization are:
- Handling OOV
  - Tokenizers should segment text into pieces that can represent any
    word in the language, whether or not it appears in the dataset:
    words we might conjure up in the foreseeable future, technical or
    domain-specific terminology that scientists might utter to sound
    intelligent, or words everyone uses in day-to-day life. An ideal
    tokenizer should be able to deal with any and all of them.
- Efficiency
  - Reducing the size (length) of the input text to make computation
    feasible and faster.
- Meaningful Representation
  - Capturing the semantic essence of the text so that the model can
    learn effectively. We will discuss this a bit later.
Simple Tokenization Methods
Go through the code below and see if you can make any inferences from
the table it produces. It reads the book The Republic, counts tokens at
the character, word and sentence levels, and also reports the number of
unique tokens in the whole book.
Code
```python
from collections import Counter
from nltk.tokenize import sent_tokenize

with open('plato.txt') as f:
    text = f.read()

words = text.split()
sentences = sent_tokenize(text)

char_counter = Counter()
word_counter = Counter()
sent_counter = Counter()

char_counter.update(text)
word_counter.update(words)
sent_counter.update(sentences)

print('|Type|Vocabulary Size|Sequence Length|')
print(f'|Unique Characters|{len(char_counter)}|{len(text)}|')
print(f'|Unique Words|{len(word_counter)}|{len(words)}|')
print(f'|Unique Sentences|{len(sent_counter)}|{len(sentences)}|')
```
**Table 2**
| Type | Vocabulary Size | Sequence Length |
| ----------------- | --------------- | --------------- |
| Unique Characters | 115 | 1,213,712 |
| Unique Words | 20,710 | 219,318 |
| Unique Sentences | 7,777 | 8,714 |
## Study
Character-Level Tokenization
: In this most elementary method, text is broken down into individual
  characters.

  *"data"* > `"d" "a" "t" "a"`

Word-Level Tokenization
: This is the simplest and most widely used (before sub-word methods
  became popular) method of tokenization, where text is split into
  individual words based on spaces and punctuation. It is still useful
  in some applications and as a pedagogical launch pad into other
  tokenization techniques.

  *"Machine learning models require data."* >
  `"Machine", "learning", "models", "require", "data", "."`

Sentence-Level Tokenization
: This approach segments text into sentences, which is useful for tasks
  like machine translation or text summarization. Sentence tokenization
  is not as popular as we'd like it to be.

  *"Tokenizers convert text. They are essential in NLP."* >
  `"Tokenizers convert text.", "They are essential in NLP."`

n-gram Tokenization
: Instead of using sentences as tokens, what if you could use phrases
  of fixed length? The following shows the n-grams for n=2, i.e. 2-gram
  or bigram. Yes, the `n` in n-grams stands for how many words are
  chosen. n-grams can also be built from characters instead of words,
  though they are not as useful as word-level n-grams.

  *"Data science is fun"* >
  `"Data science", "science is", "is fun"`
**Table 3**
| Tokenization | Advantages | Disadvantages |
| ------------ | -------------------------------------- | ---------------------------------------------------- |
| Character | Minimal vocabulary size | Very long token sequences |
| | Handles any possible input | Require huge amount of compute |
| Word | Easy to implement and understand | Large vocabulary size |
| | Preserves meaning of words | Cannot cover the whole language |
| Sentence | Preserves the context within sentences | Less granular; may miss important word-level details |
| | Sentence-level semantics | Sentence boundary detection is challenging |
As you can see from the table, vocabulary size and sequence length are
inversely correlated. Neural networks require that a token appear in
many places and many times; that is how networks come to understand
words. Remember when you don't know the meaning of a word, you ask
someone to use it in sentences? Same thing here: the more sentences a
token appears in, the better the network can understand it. But in the
case of sentence tokenization, you can see there are almost as many
tokens in the vocabulary as in the tokenized corpus. It is safe to say
that each token occurs only once, and that is not a healthy diet for a
network. This problem occurs in word-level tokenization too, but it is
subtler: the out-of-vocabulary (OOV) problem. To deal with OOV we need
to stay between character-level and word-level tokens. Enter sub-words.
# Advanced Tokenization Methods
Subword tokenization is an advanced technique that breaks text into
units smaller than words. It helps in handling rare or unseen words by
decomposing them into known subword units. The hope is that the
sub-words decomposed from the text can be recombined to form new,
unseen words and so act as tokens for those unseen words. Common
algorithms include Byte Pair Encoding (BPE), WordPiece and SentencePiece.

*"unhappiness"* > `"un", "happi", "ness"`

BPE was originally a technique for data compression. It has been
repurposed to compress a text corpus by merging frequently occurring
pairs of characters or subwords. Think of it as asking how small a set
of unique tokens you need to recreate the whole book, given that you
are free to arrange those tokens in a line as many times as you want.
Algorithm
: 1. *Initialization*: Start with a list of characters (the initial
     vocabulary) from the text (the whole corpus).
  2. *Frequency Counting*: Count all occurrences of consecutive
     character/subword pairs.
  3. *Pair Merging*: Find the most frequent pair and merge it into a
     single new subword.
  4. *Update Text*: Replace all occurrences of the pair in the text
     with the new subword.
  5. *Repeat*: Continue the process until reaching the desired
     vocabulary size or until merging no longer provides significant
     compression (a toy implementation of this loop follows below).
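Here is a minimal, illustrative sketch of that merge loop in Python, run on a tiny hand-built corpus (the word counts and the number of merges are arbitrary for the example):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE trainer: corpus maps each word (as a tuple of symbols) to its count."""
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        # Steps 1-2: count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Step 3: pick the most frequent pair and record the merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 4: replace every occurrence of the pair with the merged symbol
        new_vocab = {}
        for word, count in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = new_vocab.get(tuple(merged), 0) + count
        vocab = new_vocab
    return merges

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
print(bpe_train(corpus, 3))  # e.g. [('w', 'e'), ...] depending on tie-breaking
```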
Advantages
: - Reduces the vocabulary size significantly.
- Handles rare and complex words effectively.
- Balances between word-level and character-level tokenization.
Disadvantages
: - Tokens may not be meaningful standalone units.
- Slightly more complex to implement.
## Trained Tokenizers
WordPiece and SentencePiece are extensions of BPE in which the
vocabulary is not built merely by merging the most frequent pair. These
variants evaluate whether a given merge is useful by measuring how much
it increases the likelihood of the corpus. In simple terms: take two
vocabularies, one before and one after a merge, train two language
models, and if the model trained on the post-merge vocabulary has lower
perplexity (think of it as loss), then we assume the merge was useful.
We would need to repeat this for every candidate merge, which is not
practical, and hence there are mathematical tricks to make it tractable
that we will discuss in a future post.
This iterative merging process is the training of the tokenizer, and it
is distinct from the training of the actual model. There are Python
libraries for training your own tokenizer, but when you plan to use a
pretrained language model, it is better to stick with the pretrained
tokenizer associated with that model. In the following sections we see
how to train a simple BPE tokenizer and a SentencePiece tokenizer, and
how to use the BERT tokenizer that comes with Hugging Face's
`transformers` library.
## Tokenization Techniques Used in Popular Language Models
### Byte Pair Encoding (BPE) in GPT Models
GPT models, such as GPT-2 and GPT-3, utilize Byte Pair Encoding (BPE)
for tokenization.
``` {.python results="output code" exports="both"}
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
vocab_size=30000)
files = ["plato.txt"]
tokenizer.train(files, trainer)
tokenizer.model.save('.', 'bpe_tokenizer')
output = tokenizer.encode("Tokenization is essential first step for any NLP model.")
print("Tokens:", output.tokens)
print("Token IDs:", output.ids)
print("Length: ", len(output.ids))
Tokens: ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']
Token IDs: [50, 6436, 2897, 127, 3532, 399, 1697, 184, 256, 44, 42, 46, 3017, 15]
Length: 14
SentencePiece in T5
T5 models use a Unigram Language Model for tokenization, implemented via
the SentencePiece library. This approach treats tokenization as a
probabilistic model over all possible tokenizations.
```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train('--input=plato.txt --model_prefix=unigram_tokenizer --vocab_size=3000 --model_type=unigram')
```
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("unigram_tokenizer.model")

text = "Tokenization is essential first step for any NLP model."
pieces = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)

print("Pieces:", pieces)
print("Piece IDs:", ids)
print("Length: ", len(ids))
```
```
Pieces: ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']
Piece IDs: [436, 191, 128, 931, 141, 11, 1945, 123, 962, 39, 65, 17, 499, 1054, 1441, 1925, 8]
Length:  17
```
WordPiece Tokenization in BERT
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Tokenization is essential first step for any NLP model."
encoded_input = tokenizer(text, return_tensors='pt')

print("Tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]))
print("Token IDs:", encoded_input['input_ids'][0].tolist())
print("Length: ", len(encoded_input['input_ids'][0].tolist()))
```
Summary of Tokenization Methods
Table 4
| Method | Length | Tokens |
| ------ | ------ | ------ |
| BPE | 14 | ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.'] |
| SentencePiece | 17 | ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.'] |
| WordPiece (BERT) | 12 | ['token', '##ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'nl', '##p', 'model', '.'] |
Different tokenization methods give different results for the same input
sentence. As we add more data to the tokenizer training, the differences
between WordPiece and SentencePiece might decrease, but they will not
vanish, because of the difference in their training process.
Table 5
| Model | Tokenization Method | Library | Key Features |
| ----- | ------------------- | ------- | ------------ |
| GPT | Byte Pair Encoding | tokenizers | Balances vocabulary size and granularity |
| BERT | WordPiece | transformers | Efficient vocabulary, handles morphology |
| T5 | Unigram Language Model | sentencepiece | Probabilistic, flexible across languages |
Tokenization and Non English Languages
Tokenizing text is complex, especially when dealing with diverse
languages and scripts. Various challenges can impact the effectiveness
of tokenization.
Tokenization Issues with Complex Languages: With a focus on Tamil
Tokenizing text in languages like Tamil presents unique challenges due
to their linguistic and script characteristics. Understanding these
challenges is essential for developing effective NLP applications that
handle Tamil text accurately.
Challenges in Tokenizing the Tamil Language

1. Agglutinative Morphology

   Tamil is an agglutinative language, meaning it forms words by
   concatenating morphemes (roots, suffixes, prefixes) to convey
   grammatical relationships and meanings. A single word may express
   what would be a full sentence in English.

   Impact on Tokenization
   - Words can be very lengthy and contain many morphemes.
   - போகமுடியாதவர்களுக்காவேயேதான்

2. Punarchi and Phonology

   Tamil has specific rules on how two words can be combined, and the
   resulting word may not be phonologically identical to its parts.
   These phonological transformations can cause problems for TTS/STT
   systems too.

   Impact on Tokenization
   - Surface forms of words may change when combined, making boundary
     detection challenging.
   - மரம் + வேர் > மரவேர்
   - தமிழ் + இனிது > தமிழினிது
3. Complex Script and Orthography

   The Tamil alphabet's representation in Unicode is suboptimal for
   everything except standardized storage. Even simple operations that
   are intuitive for a native Tamil speaker are harder to implement
   because of this. Techniques like BPE applied naively to Tamil text
   will break words at completely inappropriate points, for example
   cutting an uyirmei letter into its consonant and diacritic,
   resulting in meaningless output.

   தமிழ் > த ம ி ழ ்
Strategies for Effective Tokenization of Tamil Text

Language-Specific Tokenizers

Train Tamil-specific subword tokenizers with initial seed tokens
prepared by better preprocessing techniques, to avoid cases like the
complex script and orthography problem above. Use morphological
analyzers to decompose words into roots and affixes, aiding in
understanding and processing complex word forms.
Choosing the Right Tokenization Method
Challenges in Tokenization
- Ambiguity: Words can have multiple meanings, and tokenizers cannot
capture context. Example: The word "lead" can be a verb or a
noun.
- Handling Special Characters and Emojis: Modern text often includes
emojis, URLs, and hashtags, which require specialized handling.
- Multilingual Texts: Tokenizing text that includes multiple languages
or scripts adds complexity, necessitating adaptable tokenization
strategies.
Best Practices for Effective Tokenization
- Understand Your Data: Analyze the text data to choose the most
suitable tokenization method.
- Consider the Task Requirements: Different NLP tasks may benefit from
different tokenization granularities.
- Use Pre-trained Tokenizers When Possible: Leveraging existing
tokenizers associated with pre-trained models can save time and
improve performance.
- Normalize Text Before Tokenization: Cleaning and standardizing text
  (e.g. Unicode normalization, consistent casing, whitespace handling)
  before tokenization reduces vocabulary noise and improves consistency.
31 Aug 2024
Vector Databases 101
_Co-authored by Angu S KrishnaKumar, Kamal raj Kanagarajan and Malaikannan Sankarasubbu_
Introduction
In the world of Large Language Models (LLMs), vector databases play a pivotal role in Retrieval Augmented Generation (RAG) applications. These specialized databases are designed to store and retrieve high-dimensional vectors, which represent complex data structures like text, images, and audio. By leveraging vector databases, LLMs can access vast amounts of information and generate more informative and accurate responses. Retrieval Augmented Generation (RAG) is a technique that combines the power of LLMs with external knowledge bases: by retrieving relevant information from a knowledge base and incorporating it into the LLM’s generation process, RAG can produce more comprehensive and contextually appropriate outputs.
How RAG Works:
- User Query: A user submits a query or prompt to the RAG system.
- Information Retrieval: The system retrieves relevant information from a knowledge base based on the query. VectorDBs play a key role in this. Embeddings aka vectors are stored in VectorDB and retrieval is done using similarity measures.
- Language Model Generation: The retrieved information is fed into a language model, which generates a response based on the query and the retrieved context.
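To make the flow concrete, here is a minimal, hypothetical sketch of that loop. The names embed, vector_db.search and llm.generate are placeholders for whichever embedding model, vector database client and LLM you use, not a specific library API.

def answer(query, embed, vector_db, llm, k=5):
    """Hypothetical RAG loop: embed the query, retrieve similar chunks, generate with context."""
    query_vector = embed(query)                       # 1. User query -> embedding
    chunks = vector_db.search(query_vector, top_k=k)  # 2. Similarity search in the VectorDB
    context = "\n".join(chunk.text for chunk in chunks)
    prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)                       # 3. LLM generates the final response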
In this blog series, we will delve into the intricacies of vector databases, exploring their underlying principles, key features, and real-world applications. We will also discuss the advantages they offer over traditional databases and how they are transforming the way we store, manage, and retrieve data.
What is a Vector?
A vector is a sequence of numbers that forms a group. For example:
- (3) is a one-dimensional vector.
- (2, 8) is a two-dimensional vector.
- (12, 6, 7, 4) is a four-dimensional vector.
A vector can be represented by plotting it on a graph. Let's take a 2D example.
We can only visualize up to three dimensions; anything beyond that we can describe numerically but not visualize.
Below is an example of a 4-dimensional vector representation of the word king.
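The values below are made up purely for illustration; a trained embedding model would produce its own numbers (and usually far more dimensions).

import numpy as np

# Illustrative only: a 4-dimensional vector standing in for the word "king"
king = np.array([0.62, -0.13, 0.48, 0.91])
print(king.shape)  # (4,)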
What is a Vector Database?
A Vector Database (VectorDB) is a specialized database system designed to store, manage, and efficiently query high-dimensional vector data. Unlike traditional relational databases that work with structured data in tables, VectorDBs are optimized for handling vector embeddings – numerical representations of data in multi-dimensional space.
In a VectorDB:
- Each item (like a document, image, or concept) is represented as a vector – a list of numbers that describe the item’s features or characteristics.
- These vectors are stored in a way that allows for fast similarity searches and comparisons.
- The database is optimized for operations like finding the nearest neighbors to a given vector, which is crucial for many AI and machine learning applications.
VectorDBs are particularly useful in scenarios where you need to find similarities or relationships between large amounts of complex data, such as in recommendation systems, image recognition, or natural language processing tasks.
Key Concepts
- Vector Embeddings
- Vector embeddings are numerical representations of data in a multi-dimensional space.
- They capture semantic meaning and relationships between different pieces of information.
- In natural language processing, word embeddings are a common type of vector embedding. Each word is represented by a vector of real numbers, where words with similar meanings are closer in the vector space.
- For the detailed concepts behind embeddings, please refer to our earlier blog post on Embeddings.
Let’s look at an example of Word Vector output generated by Word2Vec
from gensim.models import Word2Vec
# Example corpus (a list of sentences, where each sentence is a list of words)
sentences = [
["machine", "learning", "is", "fascinating"],
["gensim", "is", "a", "useful", "library", "for", "word", "embeddings"],
["vector", "representations", "are", "important", "for", "NLP", "tasks"]
]
# Train a Word2Vec model with 300-dimensional vectors
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)
# Get the 300-dimensional vector for a specific word
word_vector = model.wv['machine']
# Print the vector
print(f"Vector for 'machine': {word_vector}")
Sample Output for 300 dimension vector
Vector for 'machine': [ 2.41737941e-03 -1.42750892e-03 -4.85344668e-03 3.12493594e-03, 4.84531874e-03 -1.00165956e-03 3.41092921e-03 -3.41384278e-03, 4.22888929e-03 1.44586214e-03 -1.35438916e-03 -3.27448458e-03
4.70721726e-03 -4.50850562e-03 2.64214014e-03 -3.29884756e-03, -3.13906092e-03 1.09677911e-03 -4.94637461e-03 3.32896863e-03,2.03538216e-03 -1.52456785e-03 2.28793684e-03 -1.43519988e-03, 4.34566711e-03 -1.94705374e-03 1.93231280e-03 4.34081139e-03
...
3.40303702e-03 1.58637420e-03 -3.31261402e-03 2.01543484e-03,4.39879852e-03 2.54576413e-03 -3.30528596e-03 3.01509819e-03,2.15555660e-03 1.64605413e-03 3.02376228e-03 -2.62048110e-03
3.80181967e-03 -3.14147812e-03 2.23554621e-03 2.68812295e-03,1.80951719e-03 1.74256027e-03 -2.47024545e-03 4.06702763e-03,2.30203426e-03 -4.75471295e-03 -3.66776927e-03 2.06539119e-03]
- High Dimensional Space
- Vector databases typically work with vectors that have hundreds or thousands of dimensions. This high dimensionality allows for rich and nuanced representations of data.
- For example:
- A word might be represented by 300 dimensions
- An image could be represented by 1000 dimensions
- A user’s preferences might be captured in 500 dimensions
Why do you need a Vector Database when there are RDBMS like PostgreSQL, or NoSQL DBs like Elasticsearch or MongoDB?
RDBMS
RDBMS are designed to store and manage structured data in a tabular format. They are based on the relational model, which defines data as a collection of tables, where each table represents a relation.
Key components of RDBMS:
- Tables: A collection of rows and columns, where each row represents a record and each column represents an attribute.
- Rows: Also known as records, they represent instances of an entity.
- Columns: Also known as attributes, they define the properties of an entity.
- Primary key: A unique identifier for each row in a table.
- Foreign key: A column in one table that references the primary key of another table, establishing a relationship between the two tables.
- Normalization: A process of organizing data into tables to minimize redundancy and improve data integrity.
Why RDBMS don’t apply to storing vectors:
- Data Representation:
- RDBMS store data in a tabular format, where each row represents an instance of an entity and each column represents an attribute.
- Vectors are represented as a sequence of numbers, which doesn’t fit well into the tabular structure of RDBMS.
- Query Patterns:
- RDBMS are optimized for queries based on joining tables and filtering data based on specific conditions.
- Vector databases are optimized for similarity search, which involves finding vectors that are closest to a given query vector. This type of query doesn’t align well with the traditional join-based queries of RDBMS.
- Data Relationships:
- RDBMS define relationships between entities using foreign keys and primary keys.
- In vector databases, relationships are implicitly defined by the proximity of vectors in the vector space. There’s no explicit need for foreign keys or primary keys.
- Performance Considerations:
- RDBMS are often optimized for join operations and range queries.
- Vector databases are optimized for similarity search, which requires efficient indexing and partitioning techniques.
Let’s also look at a table for a comparison of features
| Feature | VectorDB | RDBMS |
| ------- | -------- | ----- |
| Dimensional Efficiency | Designed to handle high-dimensional data efficiently | Performance degrades rapidly as dimensions increase |
| Similarity Search | Implement specialized algorithms for fast approximate nearest neighbor (ANN) searches | Lack native support for ANN algorithms, making similarity searches slow and computationally expensive |
| Indexing for Vector Spaces | Use index structures optimized for vector data (e.g., HNSW, IVF) | Rely on B-trees and hash indexes, which become ineffective in high-dimensional spaces |
| Vector Operations | Provide built-in, optimized support for vector operations | Require complex, often inefficient SQL queries to perform vector computations |
| Scalability for Vector Data | Designed to distribute vector data and parallelize similarity searches across multiple nodes efficiently | While scalable for traditional data, they’re not optimized for distributing and querying vector data at scale |
| Real-time Processing | Optimized for fast insertions and queries of vector data, supporting real-time applications | May struggle with the speed requirements of real-time vector processing, especially at scale |
| Storage Efficiency | Use compact, specialized formats for storing dense vector data | Less efficient when storing high-dimensional vectors, often requiring more space and slower retrieval |
| Machine Learning Integration | Seamlessly integrate with ML workflows, supporting operations common in AI applications | Require additional processing and transformations to work effectively with ML pipelines |
| Approximate Query Support | Often support approximate queries, trading off some accuracy for significant speed improvements | Primarily designed for exact queries, lacking native support for approximate vector searches |
In a nutshell, RDBMS are well-suited for storing and managing structured data, but they are not optimized for storing and querying vectors. Vector databases, on the other hand, are specifically designed for handling vectors and performing similarity search operations.
NoSQL Databases
NoSQL databases are designed to handle large datasets and unstructured or semi-structured data that don’t fit well into the relational model. They offer flexibility in data structures, scalability, and high performance.
Common types of NoSQL databases include:
- Key-value stores: Store data as key-value pairs.
- Document stores: Store data as documents, often in JSON or BSON format.
- Wide-column stores: Store data in wide columns, where each column can have multiple values.
- Graph databases: Store data as nodes and relationships, representing connected data.
Key characteristics of NoSQL databases:
- Flexibility: NoSQL databases offer flexibility in data structures, allowing for dynamic schema changes and accommodating evolving data requirements.
- Scalability: Many NoSQL databases are designed to scale horizontally, allowing for better performance and scalability as data volumes grow.
- High performance: NoSQL databases often provide high performance, especially for certain types of workloads.
- Eventual consistency: Some NoSQL databases prioritize availability and performance over strong consistency, offering eventual consistency guarantees.
Why NoSQL Databases Might Not Be Ideal for Storing and Retrieving Vectors
While NoSQL databases offer many advantages, they might not be the best choice for storing and retrieving vectors due to the following reasons:
- Data Representation: NoSQL databases, while flexible, might not be specifically optimized for storing and querying high-dimensional vectors. The data structures used in NoSQL databases might not be the most efficient for vector-based operations.
- Query Patterns: NoSQL databases are often designed for different query patterns than vector-based operations. While they can handle complex queries, they might not be as efficient for similarity search, which is a core operation for vector databases.
- Performance Considerations:
- Indexing: NoSQL databases often use different indexing techniques than RDBMS. While they can be efficient for certain types of queries, they might not be as optimized for vector-based similarity search.
- Memory requirements: For vector-based operations, especially in large-scale applications, the memory requirements can be significant. NoSQL databases like Elasticsearch, which are often used for full-text search and analytics, might require substantial memory resources to handle large vector datasets efficiently.
Elasticsearch as an Example:
Elasticsearch is a popular NoSQL database often used for full-text search and analytics. While it can be used to store and retrieve vectors, there are some considerations:
- Memory requirements: Storing and indexing large vector datasets in Elasticsearch can be memory-intensive, especially for high-dimensional vectors.
- Query performance: The performance of vector-based queries in Elasticsearch can depend on factors like the number of dimensions, the size of the dataset, and the indexing strategy used.
- Specialized plugins: Elasticsearch offers plugins like the knn plugin that can be used to optimize vector-based similarity search. However, these plugins might have additional performance and memory implications.
In a nutshell, while NoSQL databases offer many advantages, their suitability for storing and retrieving vectors depends on specific use cases and requirements. For applications that heavily rely on vector-based similarity search and require high performance, specialized vector databases might be a more appropriate choice.
A Deeper Dive into Similarity Search in Vector Databases
Similarity search is a fundamental operation in vector databases, involving finding the closest matches to a given query vector from a large dataset of vectors. This is crucial for applications like recommendation systems, image search, and natural language processing.
Similarity measures, algorithms, and data structures are crucial for efficient similarity search. Similarity measures (e.g., cosine, Euclidean) quantify the closeness between vectors. Algorithms (e.g., brute force, LSH, HNSW) determine how vectors are compared and retrieved. Data structures (e.g., inverted indexes, hierarchical graphs) optimize storage and retrieval. The choice of these components depends on factors like dataset size, dimensionality, and desired accuracy. By selecting appropriate measures, algorithms, and data structures, you can achieve efficient and accurate similarity search in various applications. Let’s look in details about the different similarity measures and algorithms/datastructures in the below section.
Understanding Similarity Measures
- Cosine Similarity: Measures the angle between two vectors. It’s suitable when the magnitude of the vectors doesn’t matter (e.g., document similarity based on word counts).
import numpy as np
def cosine_similarity(v1, v2):
"""Calculates the cosine similarity between two vectors.
Args:
v1: The first vector.
v2: The second vector.
Returns:
The cosine similarity between the two vectors.
"""
dot_product = np.dot(v1, v2)
norm_v1 = np.linalg.norm(v1)
norm_v2 = np.linalg.norm(v2)
return dot_product / (norm_v1 * norm_v2)
# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(similarity)
- Euclidean Distance: Measures the straight-line distance between two points in Euclidean space. It’s suitable when the magnitude of the vectors is important (e.g., image similarity based on pixel values).
import numpy as np
def euclidean_distance(v1, v2):
"""Calculates the Euclidean distance between two vectors.
Args:
v1: The first vector.
v2: The second vector.
Returns:
The Euclidean distance between the two vectors.
"""
return np.linalg.norm(v1 - v2)
# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = euclidean_distance(vector1, vector2)
print(distance)
- Hamming Distance: Measures the number of positions where two binary vectors differ. It’s useful for comparing binary data.
import numpy as np
def hamming_distance(v1, v2):
"""Calculates the Hamming distance between two binary vectors.
Args:
v1: The first binary vector.
v2: The second binary vector.
Returns:
The Hamming distance between the two vectors.
"""
return np.sum(v1 != v2)
# Example usage
vector1 = np.array([0, 1, 1, 0])
vector2 = np.array([1, 1, 0, 1])
distance = hamming_distance(vector1, vector2)
print(distance)
- Manhattan Distance: Also known as L1 distance, it measures the sum of absolute differences between corresponding elements of two vectors.
import numpy as np
def manhattan_distance(v1, v2):
"""Calculates the Manhattan distance between two vectors.
Args:
v1: The first vector.
v2: The second vector.
Returns:
The Manhattan distance between the two vectors.
"""
return np.sum(np.abs(v1 - v2))
# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = manhattan_distance(vector1, vector2)
print(distance)
Algorithms and Data Structures
Brute Force: is a straightforward but computationally expensive algorithm for finding the nearest neighbors in a dataset. It involves comparing the query vector with every other vector in the dataset to find the closest matches.
How Brute Force Works
- Iterate through the dataset: For each vector in the dataset, calculate its distance to the query vector.
- Maintain a list of closest neighbors: Keep track of the closest vectors found so far.
- Update the list: If the distance between the current vector and the query vector is smaller than the distance to the farthest neighbor in the list, replace the farthest neighbor with the current vector.
- Repeat: Continue this process until all vectors in the dataset have been compared.
Advantages and Disadvantages
- Advantages:
- Simple to implement.
- Guaranteed to find the exact nearest neighbors.
- Disadvantages:
- Extremely slow for large datasets.
- Inefficient for high-dimensional data.
Python Code Example
import numpy as np
def brute_force_search(query_vector, vectors, k=10):
"""Performs brute force search for the nearest neighbors.
Args:
query_vector: The query vector.
vectors: The dataset of vectors.
k: The number of nearest neighbors to find.
Returns:
A list of indices of the nearest neighbors.
"""
distances = np.linalg.norm(vectors - query_vector, axis=1)
nearest_neighbors = np.argsort(distances)[:k]
return nearest_neighbors
# Example usage
query_vector = np.random.rand(128)
vectors = np.random.rand(1000, 128)
nearest_neighbors = brute_force_search(query_vector, vectors, k=10)
Brute Force is generally not suitable for large datasets or high-dimensional data due to its computational complexity. For these scenarios, more efficient algorithms like LSH, HNSW, or IVF-Flat are typically used. However, it can be useful for small datasets or as a baseline for comparison with other algorithms.
Locality Sensitive Hashing (LSH): is a technique used to efficiently find similar items in large datasets. It works by partitioning the vector space into buckets and hashing similar vectors into the same bucket. This makes it possible to quickly find approximate nearest neighbors without having to compare every vector in the dataset.
How LSH Works
- Hash Function Selection: Choose a hash function that is sensitive to the distance between vectors. This means that similar vectors are more likely to be hashed into the same bucket.
- Hash Table Creation: Create multiple hash tables, each using a different hash function.
- Vector Hashing: For each vector, hash it into each hash table.
- Query Processing: When a query vector is given, hash it into each hash table.
- Candidate Selection: Retrieve all vectors that are in the same buckets as the query vector.
- Similarity Calculation: Calculate the actual similarity between the query vector and the candidate vectors.
LSH Families
- Random Projection: Projects vectors onto random hyperplanes.
- MinHash: Used for comparing sets of items.
- SimHash: Used for comparing documents based on their shingles.
LSH Advantages and Disadvantages
- Advantages:
- Efficient for large datasets.
- Can be used for approximate nearest neighbor search.
- Can be parallelized.
- Disadvantages:
- Can introduce false positives or negatives.
- Accuracy can be affected by the choice of hash functions and the number of hash tables.
Python Code Example using Annoy
import numpy as np
from annoy import AnnoyIndex

# Create an Annoy index; 'angular' distance approximates cosine similarity
annoy_index = AnnoyIndex(128, 'angular')

# Add vectors to the index
for i in range(1000):
    vector = np.random.rand(128)
    annoy_index.add_item(i, vector)

# Build the index with 10 trees
annoy_index.build(10)

# Search for the 10 nearest neighbors
query_vector = np.random.rand(128)
nns = annoy_index.get_nns_by_vector(query_vector, 10)
Note: The number of trees passed to build() determines how many random projection trees Annoy constructs. A larger number of trees generally improves accuracy but can increase memory usage and build time.
By understanding the fundamentals of LSH and carefully selecting the appropriate parameters, you can effectively use it for similarity search in your applications.
Hierarchical Navigable Small World (HNSW): is a highly efficient algorithm for approximate nearest neighbor search in high-dimensional spaces. It constructs a hierarchical graph structure that allows for fast and accurate retrieval of similar items.
How HNSW Works
- Initialization: The algorithm starts by creating a single layer with all data points.
- Layer Creation: New layers are added iteratively. Each new point is connected to a subset of existing points based on their distance.
- Hierarchical Structure: The layers form a hierarchical structure, with higher layers having fewer connections and lower layers having more connections.
- Search: To find the nearest neighbors of a query point, the search starts from the top layer and gradually moves down the hierarchy, following the connections to find the most promising candidates.
Advantages of HNSW
- High Accuracy: HNSW often achieves high accuracy, even for high-dimensional data.
- Efficiency: It is very efficient for large datasets and can handle dynamic updates.
- Flexibility: The algorithm can be adapted to different distance metrics and data distributions.
Python Code Example using NMSLIB
import numpy as np
import nmslib

# Create an HNSW index using cosine similarity
nmslib_index = nmslib.init(method='hnsw', space='cosinesimil')

# Add vectors to the index
vectors = np.random.rand(1000, 128).astype(np.float32)
nmslib_index.addDataPointBatch(vectors)

# Build the index
nmslib_index.createIndex()

# Search for the 10 nearest neighbors
query_vector = np.random.rand(128).astype(np.float32)
ids, distances = nmslib_index.knnQuery(query_vector, k=10)
Note: The space parameter in nmslib specifies the distance metric used (e.g., cosinesimil for cosine similarity). You can also customize other parameters, such as the number of connections per node and efConstruction, to optimize performance for your specific application.
HNSW is a powerful algorithm for approximate nearest neighbor search, offering a good balance between accuracy and efficiency. It’s particularly well-suited for high-dimensional data and can be used in various applications, such as recommendation systems, image search, and natural language processing.
IVF-Flat (Inverted File with Flat search): is a technique that partitions the vector space into clusters and searches exhaustively only within the clusters closest to the query, combining an inverted index with exact search inside each cluster.
How IVF-Flat Works
- Quantization: The dataset is divided into n quantized subspaces (quantization cells). Each vector is assigned to a cell based on its similarity to a representative point (centroid) of the cell.
- Inverted File: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
- Flat Index: For each quantized cell, a flat index (e.g., a linear scan or a tree-based structure) is built to store the vectors assigned to that cell.
- Query Processing: When a query vector is given, it’s first quantized to find the corresponding cell. Then, the flat index for that cell is searched for the nearest neighbors.
- Refinement: The top candidates from the flat index can be further refined using exact nearest neighbor search or other techniques to improve accuracy.
Advantages of IVF-Flat
- Efficiency: IVF-Flat can be significantly faster than brute-force search for large datasets.
- Accuracy: It can achieve good accuracy, especially when combined with refinement techniques.
- Scalability: It can handle large datasets and high-dimensional vectors.
- Flexibility: The number of quantized cells and the type of flat index can be adjusted to balance accuracy and efficiency.
Python Code Example using Faiss
import faiss
import numpy as np

dim, nlist, nprobe, k = 128, 100, 10, 10
vectors = np.random.rand(10000, dim).astype('float32')

# Create an IVF-Flat index with a flat L2 quantizer
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.nprobe = nprobe

# IVF indexes must be trained (to learn the cell centroids) before adding vectors
index.train(vectors)
index.add(vectors)

# Search for nearest neighbors; queries must be 2D float32 arrays
query_vector = np.random.rand(1, dim).astype('float32')
distances, indices = index.search(query_vector, k)
In this example:
- dim is the dimensionality of the vectors.
- nlist is the number of quantized cells.
- nprobe is the number of cells to query during search.
IVF-Flat is a powerful technique for approximate nearest neighbor search in vector databases, offering a good balance between efficiency and accuracy. By carefully tuning the parameters, you can optimize its performance for your specific application.
ScaNN (Scalable Nearest Neighbors): is a library from Google Research for efficient approximate nearest neighbor search that combines partitioning, quantization and rescoring.
How ScaNN Works
- Quantization: The dataset is divided into quantized subspaces (quantization cells). Each vector is assigned to a cell based on its similarity to a representative point (centroid) of the cell.
- Inverted Index: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
- Scan: During query processing, the query vector is quantized to find the corresponding cell. Then, the vectors in that cell are scanned to find the nearest neighbors.
- Refinement: The top candidates from the scan can be further refined using exact nearest neighbor search or other techniques to improve accuracy.
Advantages of ScaNN
- Scalability: ScaNN can handle large datasets and high-dimensional vectors efficiently.
- Efficiency: It uses inverted indexes to reduce the search space, making it faster than brute-force search.
- Accuracy: ScaNN can achieve good accuracy, especially when combined with refinement techniques.
- Flexibility: The number of quantized cells and the refinement strategy can be adjusted to balance accuracy and efficiency.
Python Code Example using ScaNN
import numpy as np
import scann

# Illustrative dataset and tuning values; adjust the parameters for your data
dataset = np.random.rand(10000, 128).astype(np.float32)

# Build a ScaNN searcher: partition the space, score candidates, then rescore the best ones
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=10000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

# Search for the 10 nearest neighbors
query_vector = np.random.rand(128).astype(np.float32)
neighbors, distances = searcher.search(query_vector)
In this example:
- dataset holds the database vectors to index, and the second builder argument (10) is the number of neighbors to return.
- num_leaves is the number of partitions (analogous to quantized cells), and num_leaves_to_search is how many of them are scanned per query.
- reorder(100) rescores the top 100 candidates exactly to improve accuracy.
ScaNN is a powerful algorithm for approximate nearest neighbor search in large-scale applications. It offers a good balance between efficiency and accuracy, making it a popular choice for various tasks, such as recommendation systems, image search, and natural language processing.
Disk-ANN: is a scalable approximate nearest neighbor search algorithm designed for very large datasets that don’t fit entirely in memory. It combines inverted files with on-disk storage to efficiently handle large-scale vector search.
How Disk-ANN Works
- Quantization: The dataset is divided into quantized subspaces (quantization cells), similar to IVF-Flat.
- Inverted Index: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
- On-Disk Storage: The inverted index and the vectors themselves are stored on disk, allowing for efficient handling of large datasets.
- Query Processing: When a query vector is given, it’s quantized to find the corresponding cell. The inverted index is used to retrieve the vectors in that cell from disk.
- Refinement: The retrieved vectors can be further refined using exact nearest neighbor search or other techniques to improve accuracy.
Advantages of Disk-ANN
- Scalability: Disk-ANN can handle extremely large datasets that don’t fit in memory.
- Efficiency: It uses inverted indexes and on-disk storage to optimize performance for large-scale search.
- Accuracy: Disk-ANN can achieve good accuracy, especially when combined with refinement techniques.
- Flexibility: The number of quantized cells and the refinement strategy can be adjusted to balance accuracy and efficiency.
Python Code Example using Faiss
import faiss
# Create a Disk-ANN index
index = faiss.IndexDiskANN(faiss.IndexFlatL2(dim), filename, nlist, nprobe)
# Add vectors to the index
index.add(vectors)
# Search for nearest neighbors
query_vector = np.random.rand(dim)
distances, indices = index.search(query_vector, k)
In this example:
- filename is the path to the disk file where the index will be stored.
- Other parameters are the same as in IVF-Flat.
Disk-ANN is a powerful algorithm for approximate nearest neighbor search in very large datasets. It provides a scalable and efficient solution for handling massive amounts of data while maintaining good accuracy.
Vector Database Comparison: Features, Use Cases, and Selection Guide
Just as the RDBMS and NoSQL worlds offer many database choices, there are quite a few vector databases, and choosing the right one for your application matters. The table below compares key features, best-fit use cases and selection guidance.
| VectorDB | Key Features | Best For | When to Choose |
| -------- | ------------ | -------- | -------------- |
| Pinecone | Fully managed service, real-time updates, hybrid search (vector + metadata), serverless | Production-ready applications, rapid development, scalable solutions | When you need a fully managed solution, for applications requiring real-time updates, when combining vector search with metadata filtering |
| Milvus | Open-source, scalable to billions of vectors, supports multiple index types, hybrid search capabilities | Large-scale vector search, on-premises deployments, customizable solutions | When you need an open-source solution, for very large-scale vector search applications, when you require fine-grained control over indexing |
| Qdrant | Open-source, Rust-based for high performance, supports filtering with payload, on-prem and cloud options | High-performance vector search, applications with complex filtering needs | When performance is critical, for applications requiring advanced filtering, when you need both cloud and on-prem options |
| Weaviate | Open-source, GraphQL API, multi-modal data support, AI-first database | Semantic search applications, multi-modal data storage and retrieval | When working with multiple data types (text, images, etc.), if you prefer GraphQL for querying, for AI-centric applications |
| Faiss (Facebook AI Similarity Search) | Open-source, highly efficient for dense vectors, GPU support | Research and experimentation, large-scale similarity search | When you need low-level control, for integration into custom systems, when GPU acceleration is beneficial |
| Elasticsearch with vector search | Full-text search + vector capabilities, mature ecosystem and extensive analytics features | Applications combining traditional search and vector search | When you need rich text analytics, when you’re already using Elasticsearch, for hybrid search applications (text + vector), when you need advanced analytics alongside vector search |
| pgvector (PostgreSQL extension) | Vector similarity search in PostgreSQL, integrates with existing PostgreSQL databases | Adding vector capabilities to existing PostgreSQL systems, small to medium-scale applications | When you’re already heavily invested in PostgreSQL, for projects that don’t require specialized vector DB features, when simplicity and familiarity are priorities |
| Vespa | Open-source, combines full-text search, vector search, and structured data, real-time indexing and serving | Complex search and recommendation systems, applications requiring structured, text, and vector data | For large-scale, multi-modal search applications, when you need a unified platform for different data types, for real-time, high-volume applications |
| AWS OpenSearch | Fully managed AWS service, combines traditional full-text search with vector-based similarity search | Searching both text-based content and vectors | When you need to perform real-time searches and analytics on large datasets, when you want to leverage the broader AWS ecosystem, for applications that require processing billions of vectors |
Conclusion
At my previous startup, Datalog dot ai, where I (Malaikannan Sankarasubbu) built a low-code virtual assistant platform, we leveraged FAISS heavily for intent similarity; from that point to now, many more vector database options have emerged.
Vector databases have emerged as a powerful tool for handling unstructured and semi-structured data, offering efficient similarity search capabilities and supporting a wide range of applications. By understanding the fundamentals of vector databases, including similarity measures, algorithms, and data structures, you can select the right approach for your specific needs.
In future blog posts, we will delve deeper into performance considerations, including indexing techniques, hardware optimization, and best practices for scaling vector databases. We will also explore real-world use cases and discuss the challenges and opportunities that lie ahead in the field of vector databases.
05 Aug 2024
Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline
As the field of Natural Language Processing (NLP) continues to evolve, the combination of retrieval-based and generative models has emerged as a powerful approach for enhancing various NLP applications. One of the key techniques that significantly improves the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) is chunking. In this blog, we will explore what chunking is, why it is important in RAG, the different ways to implement chunking, including content-aware and recursive chunking, how to evaluate the performance of chunking, chunking alternatives, and how it can be applied to optimize NLP systems.
What is Retrieval-Augmented Generation (RAG)?
Before diving into chunking, let’s briefly understand RAG. Retrieval-Augmented Generation is a framework that combines the strengths of retrieval-based models and generative models. It involves retrieving relevant information from a large corpus based on a query and using this retrieved information as context for a generative model to produce accurate and contextually relevant responses or content.
What is Chunking?
Chunking is the process of breaking down large text documents or datasets into smaller, manageable pieces, or “chunks.” These chunks can then be individually processed, indexed, and retrieved, making the overall system more efficient and effective. Chunking helps in dealing with large volumes of text by dividing them into smaller, coherent units that are easier to handle.
Why Do We Need Chunking?
Chunking is essential in RAG for several reasons:
Efficiency
- Computational cost: Processing smaller chunks of text requires less computational power compared to handling entire documents.
- Storage: Chunking allows for more efficient storage and indexing of information.
Accuracy
- Relevance: By breaking down documents into smaller units, it’s easier to identify and retrieve the most relevant information for a given query.
- Context preservation: Careful chunking can help maintain the original context of the text within each chunk.
Speed
- Retrieval time: Smaller chunks can be retrieved and processed faster, leading to quicker response times.
- Model processing: Language models can process smaller inputs more efficiently.
Limitations of Large Language Models
- Context window: LLMs have limitations on the amount of text they can process at once. Chunking helps to overcome this limitation.
In essence, chunking optimizes the RAG process by making it more efficient, accurate, and responsive.
Different Ways to Implement Chunking
There are various methods to implement chunking, depending on the specific requirements and structure of the text data. Here are some common approaches:
1. **Fixed-Length Chunking**: Divide the text into chunks of fixed length, typically based on a predetermined number of words or characters.
```python
def chunk_text_fixed_length(text, chunk_size=200, by='words'):
    if by == 'words':
        words = text.split()
        return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    elif by == 'characters':
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        raise ValueError("Parameter 'by' must be either 'words' or 'characters'.")

text = "The process is more important than the results. And if you take care of the process, you will get the results."
word_chunks = chunk_text_fixed_length(text, chunk_size=5, by='words')
character_chunks = chunk_text_fixed_length(text, chunk_size=5, by='characters')

print(word_chunks)
# ['The process is more important', 'than the results. And if', 'you take care of the', 'process, you will get the', 'results.']

print(character_chunks)
# ['The p', 'roces', 's is ', 'more ', 'impor', 'tant ', 'than ', 'the r', 'esult', 's. An', 'd if ', 'you t', 'ake c', 'are o', 'f the', ' proc', 'ess, ', 'you w', 'ill g', 'et th', 'e res', 'ults.']
```
2. **Sentence-Based Chunking**: Split the text into chunks based on complete sentences. This method ensures that each chunk contains coherent and complete thoughts.
```python
import nltk
nltk.download('punkt')

def chunk_text_sentences(text, max_sentences=3):
    sentences = nltk.sent_tokenize(text)
    return [' '.join(sentences[i:i + max_sentences]) for i in range(0, len(sentences), max_sentences)]

text = """Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language. NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world."""

sentence_chunks = chunk_text_sentences(text, max_sentences=2)
for i, chunk in enumerate(sentence_chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")
```
Chunk 1:
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language.
Chunk 2:
NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP.
Chunk 3:
It's a powerful tool in the modern data-driven world.
3. **Paragraph-Based Chunking**: Divide the text into chunks based on paragraphs. This approach is useful when the text is naturally structured into paragraphs that represent distinct sections or topics.
```python
def chunk_text_paragraphs(text):
    # Assumes the input text separates paragraphs with blank lines.
    paragraphs = text.split('\n\n')
    return [paragraph for paragraph in paragraphs if paragraph.strip()]

paragraph_chunks = chunk_text_paragraphs(text)
for i, chunk in enumerate(paragraph_chunks, 1):
    print(f"Paragraph {i}:\n{chunk}\n")
```
Paragraph 1:
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
Paragraph 2:
It deals with the interaction between computers and humans through natural language.
Paragraph 3:
NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand.
Paragraph 4:
Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world.
4. **Thematic or Semantic Chunking**: Use NLP techniques to identify and group related sentences or paragraphs into chunks based on their thematic or semantic content. This can be done using topic modeling or clustering algorithms.
```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

nltk.download('punkt')

def chunk_text_thematic(text, n_clusters=5):
    sentences = nltk.sent_tokenize(text)
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    clusters = kmeans.predict(X)
    chunks = [[] for _ in range(n_clusters)]
    for i, sentence in enumerate(sentences):
        chunks[clusters[i]].append(sentence)
    return [' '.join(chunk) for chunk in chunks]

thematic_chunks = chunk_text_thematic(text, n_clusters=3)
for i, chunk in enumerate(thematic_chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")
```
5. **Sliding Window Chunking**: Use a sliding window approach to create overlapping chunks. This method ensures that important information near the boundaries of chunks is not missed.
```python
def chunk_text_sliding_window(text, chunk_size=200, overlap=50, unit='word'):
    """Chunks text using a sliding window.

    Args:
        text: The input text.
        chunk_size: The desired size of each chunk.
        overlap: The overlap between consecutive chunks.
        unit: The chunking unit ('word' or 'char').

    Returns:
        A list of text chunks.
    """
    if unit == 'word':
        data = text.split()
    elif unit == 'char':
        data = text
    else:
        # Token-level chunking is left unimplemented in this sketch.
        raise NotImplementedError("Only 'word' and 'char' units are implemented.")

    chunks = []
    for i in range(0, len(data), chunk_size - overlap):
        if unit == 'word':
            chunk = ' '.join(data[i:i + chunk_size])
        else:
            chunk = data[i:i + chunk_size]
        chunks.append(chunk)
    return chunks
```
6. **Content-Aware Chunking**: This advanced method involves using more sophisticated NLP techniques to chunk the text based on its content and structure. Content-aware chunking can take into account factors such as topic continuity, coherence, and discourse markers. It aims to create chunks that are not only manageable but also meaningful and contextually rich.
Example of Content-Aware Chunking using Sentence Transformers:
```python
import nltk
from sentence_transformers import SentenceTransformer, util

def content_aware_chunking(text, max_chunk_size=200):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(text)
    embeddings = model.encode(sentences, convert_to_tensor=True)
    clusters = util.community_detection(embeddings, min_community_size=1)
    chunks = []
    for cluster in clusters:
        chunk = ' '.join([sentences[i] for i in cluster])
        if len(chunk.split()) <= max_chunk_size:
            chunks.append(chunk)
        else:
            sub_chunks = chunk_text_fixed_length(chunk, max_chunk_size)
            chunks.extend(sub_chunks)
    return chunks
```
7. **Recursive Chunking**: Recursive chunking involves repeatedly breaking down chunks into smaller sub-chunks until each chunk meets a desired size or level of detail. This method ensures that very large texts are reduced to manageable and meaningful units at each level of recursion, making it easier to process and retrieve information.
Example of Recursive Chunking:
```python
import nltk

def recursive_chunk(text, max_chunk_size):
    """Recursively chunks text into smaller chunks.

    Args:
        text: The input text.
        max_chunk_size: The maximum desired chunk size (in characters).

    Returns:
        A list of text chunks.
    """
    if len(text) <= max_chunk_size:
        return [text]

    # Split on paragraphs first; fall back to sentences for a single paragraph.
    paragraphs = text.split('\n\n')
    if len(paragraphs) > 1:
        chunks = []
        for paragraph in paragraphs:
            chunks.extend(recursive_chunk(paragraph, max_chunk_size))
        return chunks
    else:
        # Single paragraph: split into sentences and greedily pack them
        # into chunks no longer than max_chunk_size.
        sentences = nltk.sent_tokenize(text)
        chunks, current = [], ''
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > max_chunk_size:
                chunks.append(current.strip())
                current = ''
            current += ' ' + sentence
        if current.strip():
            chunks.append(current.strip())
        return chunks
```
8. **Agentic Chunking**: Agentic chunking is a sophisticated technique that uses an LLM to dynamically determine chunk boundaries based on the content and context of the text. Below is an example prompt for agentic chunking.
**Example Prompt**:
``` Prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
## You are an agentic chunker. You will be provided with a piece of content.
Decompose the content into clear and simple propositions, ensuring they are interpretable out of context.
1. Split compound sentences into simple sentences. Maintain the original phrasing from the input whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
3. Decontextualize each proposition by adding necessary modifiers to nouns or the entire sentence and replacing pronouns (e.g. it, he, she, they, this, that) with the full name of the entities they refer to.
4. Present the results as a list of strings, formatted in JSON.
<|eot_id|><|start_header_id|>user<|end_header_id|>
Here is the content: {content}
Strictly follow the instructions provided and output in the desired format only.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
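To show how such a prompt might be wired up, here is a minimal sketch that sends it to an instruction-tuned model through the Hugging Face transformers text-generation pipeline. The model name, the generation settings, and the assumption that the model answers with a clean JSON list are illustrative only, not part of the original post.

```python
# Sketch: driving the agentic-chunking prompt with an instruction-tuned LLM.
# The model name and generation settings are assumptions for illustration.
import json
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

AGENTIC_PROMPT = "<|begin_of_text|>...{content}..."  # the full prompt shown above

def agentic_chunk(content: str):
    prompt = AGENTIC_PROMPT.format(content=content)
    output = generator(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)
    # The model is expected to answer with a JSON list of proposition strings.
    return json.loads(output[0]["generated_text"])
```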
Chunk Size and Overlapping in Chunking
Determining the appropriate chunk size and whether to use overlapping chunks are critical decisions in the chunking process. These factors significantly impact the efficiency and effectiveness of the retrieval and generation stages in RAG systems.
Chunk Size
- Choosing Chunk Size: The ideal chunk size depends on the specific application and the nature of the text. Smaller chunks can provide more precise context but may miss broader information, while larger chunks capture more context but may introduce noise or irrelevant information.
- Small Chunks: Typically 100-200 words. Suitable for fine-grained retrieval where specific details are crucial.
- Medium Chunks: Typically 200-500 words. Balance between detail and context, suitable for most applications.
- Large Chunks: Typically 500-1000 words. Useful for capturing broader context but may be less precise.
- Impact of Chunk Size: The chunk size affects the retrieval accuracy and computational efficiency. Smaller chunks generally lead to higher retrieval precision but may require more chunks to cover the same amount of text, increasing computational overhead. Larger chunks reduce the number of chunks but may lower retrieval precision.
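To make the size trade-off concrete, here is a quick illustration that reuses the chunk_text_fixed_length helper defined earlier; the 1000-word toy document is an arbitrary assumption.

```python
# How chunk size affects the number of chunks, using chunk_text_fixed_length from above.
document = "word " * 1000  # a toy 1000-word document (arbitrary for illustration)

for size in (100, 250, 500, 1000):
    chunks = chunk_text_fixed_length(document, chunk_size=size, by='words')
    print(f"chunk_size={size:4d} -> {len(chunks)} chunks")
```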
Overlapping Chunks
- Purpose of Overlapping: Overlapping chunks ensure that important information near the boundaries of chunks is not missed. This approach is particularly useful when the text has high semantic continuity, and critical information may span across chunk boundaries.
- Degree of Overlap: The overlap size should be carefully chosen to balance redundancy and completeness. Common overlap sizes range from 10% to 50% of the chunk size.
- Small Overlap: 10-20% of the chunk size. Minimizes redundancy but may still miss some boundary information.
- Medium Overlap: 20-30% of the chunk size. Good balance between coverage and redundancy.
- Large Overlap: 30-50% of the chunk size. Ensures comprehensive coverage but increases redundancy and computational load.
- Example of Overlapping Chunking:
```python
def chunk_text_sliding_window(text, chunk_size=200, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = words[i:i + chunk_size]
        chunks.append(' '.join(chunk))
    return chunks
```
Evaluating Chunking Performance
Evaluating the performance of chunking is crucial to ensure that the chosen method effectively enhances the retrieval and generation processes. Here are some key metrics and approaches for evaluating chunking performance:
Retrieval Metrics
- Precision@K: Measures the proportion of relevant chunks among the top K retrieved chunks.
```python
def precision_at_k(retrieved_chunks, relevant_chunks, k):
    return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / k
```
- Recall@K: Measures the proportion of relevant chunks retrieved among the top K chunks.
```python
def recall_at_k(retrieved_chunks, relevant_chunks, k):
    return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / len(relevant_chunks)
```
- F1 Score: Harmonic mean of Precision@K and Recall@K, providing a balance between precision and recall.
```python
def f1_score_at_k(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)
```
- MAP: Mean Average Precision (MAP) is primarily used in information retrieval and object detection tasks to evaluate the ranking of retrieved items.
```python
import numpy as np

def calculate_ap(y_true, y_score):
    """Calculates average precision for a single query.

    Args:
        y_true: Ground truth labels (0 or 1).
        y_score: Predicted scores.

    Returns:
        Average precision.
    """
    # Sort y_score and corresponding y_true in descending order
    y_score, y_true = zip(*sorted(zip(y_score, y_true), key=lambda x: x[0], reverse=True))
    if sum(y_true) == 0:
        return 0.0  # no relevant items for this query
    correct_hits = 0
    sum_precision = 0
    for i, y in enumerate(y_true):
        if y == 1:
            correct_hits += 1
            precision = correct_hits / (i + 1)
            sum_precision += precision
    return sum_precision / sum(y_true)

def calculate_map(y_true, y_score):
    """Calculates mean average precision.

    Args:
        y_true: Ground truth labels (list of lists).
        y_score: Predicted scores (list of lists).

    Returns:
        Mean average precision.
    """
    aps = []
    for i in range(len(y_true)):
        ap = calculate_ap(y_true[i], y_score[i])
        aps.append(ap)
    return np.mean(aps)
```
- NDCG: NDCG is a metric used to evaluate the quality of a ranking of items. It measures how well the most relevant items are ranked at the top of the list. In the context of chunking, we can potentially apply NDCG by ranking chunks based on a relevance score and evaluating how well the most relevant chunks are placed at the beginning of the list.
```python
import numpy as np

def calculate_dcg(rel):
    """Calculates Discounted Cumulative Gain (DCG).

    Args:
        rel: Relevance scores of items.

    Returns:
        DCG value.
    """
    rel = np.asarray(rel, dtype=float)
    return np.sum(rel / np.log2(np.arange(len(rel)) + 2))

def calculate_idcg(rel):
    """Calculates Ideal Discounted Cumulative Gain (IDCG).

    Args:
        rel: Relevance scores of items.

    Returns:
        IDCG value.
    """
    rel = np.sort(rel)[::-1]
    return calculate_dcg(rel)

def calculate_ndcg(rel):
    """Calculates Normalized Discounted Cumulative Gain (NDCG).

    Args:
        rel: Relevance scores of items.

    Returns:
        NDCG value.
    """
    dcg = calculate_dcg(rel)
    idcg = calculate_idcg(rel)
    return dcg / idcg if idcg > 0 else 0.0

# Example usage
relevance_scores = [3, 2, 1, 0]
ndcg_score = calculate_ndcg(relevance_scores)
print(ndcg_score)  # 1.0, since the scores are already in ideal order
```
Generation Metrics
- BLEU Score: Measures the overlap between the generated text and reference text, considering n-grams.
```python
from nltk.translate.bleu_score import sentence_bleu

def bleu_score(reference, generated):
    return sentence_bleu([reference.split()], generated.split())
```
- ROUGE Score: Measures the overlap of n-grams, longest common subsequence (LCS), and skip-bigram between the generated text and reference text.
```python
from rouge import Rouge

rouge = Rouge()

def rouge_score(reference, generated):
    scores = rouge.get_scores(generated, reference)
    return scores[0]['rouge-l']['f']
```
- Human Evaluation: Involves subjective evaluation by human judges to assess the relevance, coherence, and overall quality of the generated responses. Human evaluation can provide insights that automated metrics might miss.
Chunking Alternatives
While chunking is an effective method for improving the efficiency and effectiveness of RAG systems, there are alternative techniques that can also be considered:
- Hierarchical Indexing: Instead of chunking the text, hierarchical indexing organizes the data into a tree structure where each node represents a topic or subtopic. This allows for efficient retrieval by navigating through the tree based on the query's context.
```python
class HierarchicalIndex:
    def __init__(self):
        self.tree = {}

    def add_document(self, doc_id, topics):
        current_level = self.tree
        for topic in topics:
            if topic not in current_level:
                current_level[topic] = {}
            current_level = current_level[topic]
        current_level['doc_id'] = doc_id

    def retrieve(self, query_topics):
        current_level = self.tree
        for topic in query_topics:
            if topic in current_level:
                current_level = current_level[topic]
            else:
                return []
        return current_level.get('doc_id', [])
```
- Summarization: Instead of retrieving chunks, the system generates summaries of documents or sections that are relevant to the query. This can be done using extractive or abstractive summarization techniques.
```python
from transformers import BartTokenizer, BartForConditionalGeneration

def generate_summary(text):
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```
- Dense Passage Retrieval (DPR): DPR uses dense vector representations for both questions and passages, allowing for efficient similarity search using vector databases like FAISS.
```python
import torch
from transformers import (DPRQuestionEncoder, DPRContextEncoder,
                          DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer)
from sklearn.metrics.pairwise import cosine_similarity

question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def encode_texts(texts, tokenizer, encoder):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return encoder(**inputs).pooler_output.numpy()

question_embeddings = encode_texts(["What is chunking?"], question_tokenizer, question_encoder)
context_embeddings = encode_texts(["Chunking is a process...", "Another context..."], context_tokenizer, context_encoder)

similarities = cosine_similarity(question_embeddings, context_embeddings)
```
- Graph-Based Representations: Instead of breaking the text into chunks, graph-based representations model the relationships between different parts of the text. Nodes represent entities, concepts, or chunks of text, and edges represent the relationships between them. This approach allows for more flexible and context-aware retrieval.
```python
import networkx as nx

# `compute_similarity` is assumed to be defined elsewhere
# (e.g., cosine similarity between text embeddings).

def build_graph(texts, threshold=0.5):
    graph = nx.Graph()
    for i, text in enumerate(texts):
        graph.add_node(i, text=text)
        # Add edges based on some similarity metric
        for j in range(i + 1, len(texts)):
            similarity = compute_similarity(text, texts[j])
            if similarity > threshold:
                graph.add_edge(i, j, weight=similarity)
    return graph

def retrieve_from_graph(graph, query, threshold=0.5, k=3):
    query_node = len(graph.nodes)
    graph.add_node(query_node, text=query)
    for i in range(query_node):
        similarity = compute_similarity(query, graph.nodes[i]['text'])
        if similarity > threshold:
            graph.add_edge(query_node, i, weight=similarity)
    # Retrieve the k neighbours with the highest similarity to the query node
    neighbors = sorted(graph[query_node], key=lambda x: graph[query_node][x]['weight'], reverse=True)
    return [graph.nodes[n]['text'] for n in neighbors[:k]]
```
Graph-based representations can capture complex relationships and provide a more holistic view of the text, making them a powerful alternative to chunking.
Conclusion
Chunking plays a pivotal role in enhancing the efficiency and effectiveness of Retrieval-Augmented Generation systems. By breaking down large texts into manageable chunks, we can improve retrieval speed, contextual relevance, scalability, and the overall quality of generated responses. Evaluating the performance of chunking methods involves considering retrieval and generation metrics, as well as efficiency and cost metrics. As NLP continues to advance, techniques like chunking will remain essential for optimizing the performance of RAG and other language processing systems. Additionally, exploring alternatives such as hierarchical indexing, summarization, dense passage retrieval, and graph-based representations can further enhance the capabilities of RAG systems.
Embark on your journey to harness the power of chunking in RAG and unlock new possibilities in the world of Natural Language Processing!
If you found this blog post helpful, please consider citing it in your work:
@misc{malaikannan2024chunking,
author = {Sankarasubbu, Malaikannan},
title = {Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline},
year = {2024},
url = {https://malaikannan.github.io/2024/08/05/Chunking/},
note = {Accessed: 2024-08-12}
}
31 Jul 2024
Computers are meant to crunch numbers; it goes back to the original design of these machines. Representing text as numbers is the holy grail of Natural Language Processing (NLP), but how do we do that? Over the years, various techniques have been developed to achieve this. Early methods like n-grams (bigrams, trigrams) and TF-IDF could convert words into numbers, and not just one number but a collection of them: each word is represented by a collection of numbers called a vector, which has a fixed size called its dimension. Though useful, these methods had their limitations. The most important one is that the vector for each word stands alone; we could not do meaningful mathematical operations like addition or subtraction between vectors (we could, but the resulting vector would not represent any word). That is where embeddings come in. An embedding is also a vector, so each word gets a corresponding vector, but now we can compute King - Man + Woman and get a vector that is close to the vector for Queen. Why is this useful? That is what we are going to explore in this article.
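As a quick illustration of that vector arithmetic, here is a small sketch using pretrained GloVe vectors loaded through Gensim's downloader; the specific model name (glove-wiki-gigaword-50) is an assumption chosen for its small size.

```python
# King - Man + Woman ≈ Queen, illustrated with pretrained GloVe vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected to rank 'queen' (or a close relative) at the top.
```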
What are Embeddings?
Embeddings are numerical representations of text data where words or phrases from the vocabulary are mapped to vectors of real numbers. This mapping is crucial because it allows us to quantify and manipulate textual data in a way that machines can understand and process.
We understand what a word is; now let's see what a vector is. A vector is a sequence of numbers that forms a group. For example:
- (3) is a one-dimensional vector.
- (2, 8) is a two-dimensional vector.
- (12, 6, 7, 4) is a four-dimensional vector.
A vector can be represented by plotting it on a graph. Let's take a 2D example.
We can only visualize up to 3 dimensions; anything beyond that we can describe but not picture.
Below is an example of a 4-dimensional vector representation of the word king.
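Since the original figure is not reproduced here, a hypothetical 4-dimensional vector for king might look like the following; the numbers are made up purely for illustration.

```python
import numpy as np

# A hypothetical 4-dimensional embedding for the word "king" (values are invented).
king = np.array([0.52, 0.83, -0.45, 0.21])
print(king.shape)  # (4,)
```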
One of the seminal papers to come out of Google introduced Word2Vec. Let's walk through how Word2Vec works to get a conceptual understanding of how embeddings are learned.
How Word2Vec works
For an input text, it looks at each word and the context of words around it. It trains on the text, recognizing the order of words and the structure of the sentences. At the end of training, each word is represented by a vector of N dimensions (N is mostly in the 100 to 300 range).
Suppose we train the Word2Vec algorithm on the example text "SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis".
Let's assume that it outputs 2-dimensional vectors for each word, since we can't visualize anything with more than 3 dimensions.
- SanFrancisco (6, 6)
- beautiful (-13, -4)
- California (10, 8)
- city (2, 10)
- LosAngeles (6.5, 5)
- lovely (-12, -7)
- metropolis (2.5, 8)
Below is a 2D Plot of vectors
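Since the original plot image is not reproduced here, the sketch below recreates it with matplotlib from the coordinates listed above.

```python
# Recreate the 2D plot of the toy word vectors listed above.
import matplotlib.pyplot as plt

words = {
    "SanFrancisco": (6, 6), "beautiful": (-13, -4), "California": (10, 8),
    "city": (2, 10), "LosAngeles": (6.5, 5), "lovely": (-12, -7), "metropolis": (2.5, 8),
}

for word, (x, y) in words.items():
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y))
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.show()
```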
You can see in the plot what the Word2Vec algorithm inferred from the input text: SanFrancisco and LosAngeles are grouped together, beautiful and lovely are grouped together, and city and metropolis are grouped together. The beauty of this is that Word2Vec deduced these groupings purely from data, without being explicitly taught English or geography.
You will see more embedding approaches in the sections below.
Key Characteristics of Embeddings:
- Dimensionality: Embeddings are vectors of fixed size. Common sizes range from 50 to 300 dimensions, though they can be larger depending on the complexity of the task.
- Continuous Space: Unlike traditional one-hot encoding, embeddings are dense and reside in a continuous vector space, making them more efficient and informative.
- Semantic Proximity: Words with similar meanings tend to have vectors that are close to each other in the embedding space.
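To make the contrast with one-hot encoding concrete, here is a small sketch; the toy vocabulary, the 4-dimensional embeddings, and their values are invented for illustration.

```python
import numpy as np

# One-hot: sparse, as wide as the vocabulary, and every pair of words is equally distant.
vocab = ["king", "queen", "apple"]
one_hot = np.eye(len(vocab))  # shape (3, 3)

# Dense embeddings: low-dimensional, and similar words end up close together.
# (The values below are invented for illustration.)
embeddings = {
    "king":  np.array([0.50, 0.80, -0.40, 0.20]),
    "queen": np.array([0.48, 0.75, -0.35, 0.25]),
    "apple": np.array([-0.60, 0.10, 0.90, -0.30]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high similarity
print(cosine(embeddings["king"], embeddings["apple"]))  # low similarity
```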
The Evolution of Embeddings
Embeddings have evolved significantly over the years. Here are some key milestones:
- Word2Vec (2013): Developed by Mikolov et al. at Google, Word2Vec was one of the first algorithms to create word embeddings. It uses two architectures, Continuous Bag of Words (CBOW) and Skip-gram, to learn word associations.
- GloVe (2014): Developed by the Stanford NLP Group, GloVe (Global Vectors for Word Representation) improves upon Word2Vec by incorporating global statistical information of the corpus.
- FastText (2016): Developed by Facebook's AI Research (FAIR) lab, FastText extends Word2Vec by considering subword information, which helps in handling out-of-vocabulary words and capturing morphological details.
- ELMo (2018): Developed by the Allen Institute for AI, ELMo (Embeddings from Language Models) generates context-sensitive embeddings, meaning the representation of a word changes based on its context in a sentence.
- BERT (2018): Developed by Google, BERT (Bidirectional Encoder Representations from Transformers) revolutionized embeddings by using transformers to understand the context of a word bidirectionally. This model significantly improved performance on various NLP tasks.
From Word Embeddings to Sentence Embeddings
While word embeddings provide a way to represent individual words, they do not capture the meaning of entire sentences or documents. This limitation led to the development of sentence embeddings, which are designed to represent longer text sequences.
Word Embeddings
Word embeddings, such as those created by Word2Vec, GloVe, and FastText, map individual words to vectors. These embeddings capture semantic similarities between words based on their context within a large corpus of text. For example, the words “king” and “queen” might be close together in the embedding space because they often appear in similar contexts.
Sentence Embeddings
Sentence embeddings extend the concept of word embeddings to entire sentences or even paragraphs. These embeddings aim to capture the meaning of a whole sentence, taking into account the context and relationships between words within the sentence. There are several methods to create sentence embeddings:
- Averaging Word Embeddings: One of the simplest methods is to average the word embeddings of all words in a sentence (see the sketch after this list). While this method is straightforward, it often fails to capture the nuances and syntactic structures of sentences.
- Doc2Vec: Developed by Mikolov and Le, Doc2Vec extends Word2Vec to larger text segments by considering the paragraph as an additional feature during training. This method generates embeddings for sentences or documents that capture more context compared to averaging word embeddings.
- Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) networks, can be used to generate sentence embeddings by processing the sequence of words in a sentence. The hidden state of the RNN after processing the entire sentence can serve as the sentence embedding.
- Transformers (BERT, GPT, etc.): Modern approaches like BERT and GPT use transformer architectures to generate context-aware embeddings for sentences. These models can process a sentence bidirectionally, capturing dependencies and relationships between words more effectively than previous methods.
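Here is a minimal sketch of the averaging approach, assuming a trained Gensim Word2Vec model like the one trained later in this post; the helper name and the zero-vector fallback are choices made for illustration.

```python
# Average-of-word-embeddings sentence embedding (a deliberately simple baseline).
# Assumes `model` is a trained gensim Word2Vec model, as in the training example below.
import numpy as np

def average_sentence_embedding(sentence, model):
    words = [w for w in sentence.split() if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)  # fallback for out-of-vocabulary sentences
    return np.mean([model.wv[w] for w in words], axis=0)
```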
Example: BERT Sentence Embeddings
BERT (Bidirectional Encoder Representations from Transformers) has set a new standard for generating high-quality sentence embeddings. By processing a sentence in both directions, BERT captures the full context of each word in relation to the entire sentence. The embeddings generated by BERT can be fine-tuned for various NLP tasks, such as sentiment analysis, question answering, and text classification.
To create a sentence embedding with BERT, you can use the hidden states of the transformer model. Typically, the hidden state corresponding to the [CLS] token (which stands for “classification”) is used as the sentence embedding.
How to Generate Embeddings
Generating embeddings involves training a model on a large corpus of text data. Here’s a step-by-step guide to generating word and sentence embeddings:
Generating Word Embeddings with Word2Vec
- Data Preparation: Collect and preprocess a large text corpus. This involves tokenizing the text, removing stop words, and handling punctuation.
- Training the Model: Use the Word2Vec algorithm to train the model. You can choose between the CBOW or Skip-gram architecture. Libraries like Gensim in Python provide easy-to-use implementations of Word2Vec.
from gensim.models import Word2Vec
# Example sentences
sentences = [["I", "love", "machine", "learning"], ["Word2Vec", "is", "great"]]
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
- Using the Embeddings: Once the model is trained, you can use it to get the embedding for any word in the vocabulary.
word_embedding = model.wv['machine']
Generating Sentence Embeddings with BERT
- Install Transformers Library: Use the Hugging Face Transformers library to easily work with BERT.
- Load Pretrained BERT Model: Load a pretrained BERT model and tokenizer.
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
- Tokenize Input Text: Tokenize your input text and convert it to input IDs and attention masks.
sentence = "BERT is amazing for sentence embeddings."
inputs = tokenizer(sentence, return_tensors='pt')
- Generate Embeddings: Pass the inputs through the BERT model to get the embeddings.
with torch.no_grad():
    outputs = model(**inputs)
# The [CLS] token embedding
sentence_embedding = outputs.last_hidden_state[0][0]
- Using the Embeddings: The `sentence_embedding` can now be used for various NLP tasks.
Data Needed for Training Embeddings
The quality of embeddings heavily depends on the data used for training. Here are key considerations regarding the data needed:
- Size of the Corpus: A large corpus is generally required to capture the diverse contexts in which words can appear. For example, training Word2Vec or BERT models typically requires billions of words. The larger the corpus, the better the embeddings can capture semantic nuances.
- Diversity of the Corpus: The corpus should cover a wide range of topics and genres to ensure that the embeddings are generalizable. This means including text from various domains such as news articles, books, social media, academic papers, and more.
- Preprocessing: Proper preprocessing of the corpus is essential (a minimal sketch follows this list). This includes:
- Tokenization: Splitting text into words or subwords.
- Lowercasing: Converting all text to lowercase to reduce the vocabulary size.
- Removing Punctuation and Stop Words: Cleaning the text by removing unnecessary punctuation and common stop words that do not contribute to the meaning.
- Handling Special Characters: Dealing with special characters, numbers, and other non-alphabetic tokens appropriately.
- Domain-Specific Data: For specialized applications, it is beneficial to include domain-specific data. For instance, medical embeddings should be trained on medical literature to capture the specialized vocabulary and context of the field.
- Balanced Dataset: Ensuring that the dataset is balanced and not biased towards a particular topic or genre helps in creating more neutral and representative embeddings.
- Data Augmentation: In cases where data is limited, data augmentation techniques such as back-translation, paraphrasing, and synthetic data generation can be used to enhance the corpus.
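Below is a minimal preprocessing sketch using NLTK for the steps listed above; the stop-word list and tokenizer choices are common defaults rather than anything prescribed in this post.

```python
# Minimal corpus preprocessing: tokenize, lowercase, drop punctuation and stop words.
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())  # tokenize + lowercase
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

print(preprocess("Embeddings are numerical representations of text data!"))
# ['embeddings', 'numerical', 'representations', 'text', 'data']
```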
Applications of Sentence Embeddings
Sentence embeddings have a wide range of applications in NLP:
- Text Classification: Sentence embeddings are used to represent sentences for classification tasks, such as identifying the topic of a sentence or determining the sentiment expressed in a review.
- Semantic Search: By comparing sentence embeddings, search engines can retrieve documents that are semantically similar to a query, even if the exact keywords are not matched (see the sketch after this list).
- Summarization: Sentence embeddings help in generating summaries by identifying the most important sentences in a document based on their semantic content.
- Translation: Sentence embeddings improve machine translation systems by providing a richer representation of the source sentence, leading to more accurate translations.
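As a quick illustration of semantic search with sentence embeddings, here is a sketch using the sentence-transformers library; the model name and the toy document set are assumptions for illustration.

```python
# Toy semantic search: rank documents by cosine similarity to the query embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

documents = [
    "Embeddings map words and sentences to vectors.",
    "The weather in Chennai is hot and humid.",
    "Vector databases store and search embeddings efficiently.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query = "How do I search over sentence vectors?"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(documents[best], float(scores[best]))
```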
Embedding Dimension Reduction Methods
High-dimensional embeddings can be computationally expensive and may contain redundant information. Dimension reduction techniques help in simplifying these embeddings while preserving their essential characteristics. Here are some common methods:
- Principal Component Analysis (PCA): PCA is a linear method that reduces the dimensionality of data by transforming it into a new coordinate system where the greatest variances by any projection of the data come to lie on the first coordinates (principal components).
from sklearn.decomposition import PCA
# Assuming 'embeddings' is a numpy array of shape (n_samples, n_features)
pca = PCA(n_components=50)
reduced_embeddings = pca.fit_transform(embeddings)
- t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a nonlinear technique primarily used for visualizing high-dimensional data by reducing it to two or three dimensions.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)
- Uniform Manifold Approximation and Projection (UMAP): UMAP is another nonlinear technique that is faster and often more effective than t-SNE for dimension reduction, especially for larger datasets.
import umap
reducer = umap.UMAP(n_components=2)
reduced_embeddings = reducer.fit_transform(embeddings)
- Autoencoders: Autoencoders are a type of neural network used to learn efficient codings of input data. An autoencoder consists of an encoder and a decoder. The encoder compresses the input into a lower-dimensional latent space, and the decoder reconstructs the input from this latent space.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# Define encoder
input_dim = embeddings.shape[1]
encoding_dim = 50 # Size of the reduced dimension
input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)
# Define decoder
decoded = Dense(input_dim, activation='sigmoid')(encoded)
# Build the autoencoder model
autoencoder = Model(input_layer, decoded)
encoder = Model(input_layer, encoded)
# Compile and train the autoencoder
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256, shuffle=True)
# Get the reduced embeddings
reduced_embeddings = encoder.predict(embeddings)
- Random Projection: Random projection is a simple and computationally efficient technique to reduce dimensionality. It is based on the Johnson-Lindenstrauss lemma, which states that high-dimensional data can be embedded into a lower-dimensional space with minimal distortion.
from sklearn.random_projection import SparseRandomProjection
transformer = SparseRandomProjection(n_components=50)
reduced_embeddings = transformer.fit_transform(embeddings)
Evaluating Embeddings
Evaluating embeddings is crucial to ensure that they capture meaningful relationships and semantics. Here are some common methods to evaluate embeddings:
- Intrinsic Evaluation: These methods evaluate the quality of embeddings based on predefined linguistic tasks or properties without involving downstream tasks.
- Word Similarity: Measure the cosine similarity between word pairs and compare with human-annotated similarity scores. Popular datasets include WordSim-353 and SimLex-999.
from scipy.spatial.distance import cosine
similarity = 1 - cosine(embedding1, embedding2)
- Analogy Tasks: Evaluate embeddings based on their ability to solve word analogy tasks, such as “king - man + woman = queen.” Datasets like Google Analogy dataset are commonly used.
def analogy(model, word1, word2, word3):
    vec = model[word1] - model[word2] + model[word3]
    return model.most_similar([vec])[0][0]
- Extrinsic Evaluation: These methods evaluate embeddings based on their performance on downstream NLP tasks.
- Text Classification: Use embeddings as features for text classification tasks and measure performance using metrics like accuracy, precision, recall, and F1 score.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(train_embeddings, train_labels)
predictions = model.predict(test_embeddings)
accuracy = accuracy_score(test_labels, predictions)
- Named Entity Recognition (NER): Evaluate embeddings by their performance on NER tasks, measuring precision, recall, and F1 score.
# Example using spaCy for NER
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")
nlp.get_pipe("ner").add_label("ORG")
train_docs = [nlp(text) for text in train_texts]
train_db = DocBin(docs=train_docs)
- Machine Translation: Assess the quality of embeddings by their impact on machine translation tasks, using BLEU or METEOR scores.
- Clustering and Visualization: Visualizing embeddings using t-SNE or UMAP can provide qualitative insights into the structure and quality of embeddings.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)

plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]))
plt.show()
Similarity vs. Retrieval Embeddings
Embeddings can be tailored for different purposes, such as similarity or retrieval tasks. Understanding the distinction between these two types of embeddings is crucial for optimizing their use in various applications.
Similarity Embeddings
Similarity embeddings are designed to capture the semantic similarity between different pieces of text. The primary goal is to ensure that semantically similar texts have similar embeddings.
Use Cases:
- Semantic Search: Finding documents or sentences that are semantically similar to a query.
- Recommendation Systems: Recommending items (e.g., articles, products) that are similar to a given item.
- Paraphrase Detection: Identifying sentences or phrases that convey the same meaning.
Evaluation:
- Cosine Similarity: Measure the cosine similarity between embeddings to evaluate their closeness.
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embedding1], [embedding2])
- Clustering: Grouping similar items together using clustering algorithms like K-means.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(embeddings)
Retrieval Embeddings
Retrieval embeddings are optimized for information retrieval tasks, where the goal is to retrieve the most relevant documents from a large corpus based on a query.
Use Cases:
- Search Engines: Retrieving relevant web pages or documents based on user queries.
- Question Answering Systems: Finding relevant passages or documents that contain the answer to a user’s question.
- Document Retrieval: Retrieving documents that are most relevant to a given query.
Evaluation:
- Precision and Recall: Measure the accuracy of retrieved documents using precision, recall, and F1 score.
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')
- Mean Reciprocal Rank (MRR): Evaluate the rank of the first relevant document.
import numpy as np

def mean_reciprocal_rank(rs):
    """Score is reciprocal of the rank of the first relevant item.
    First element is 'rank 1'. Relevance is binary (nonzero is relevant).
    Example from information retrieval with binary relevance:
    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.61111111111111105
    """
    rs = (np.asarray(r).nonzero()[0] for r in rs)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
Symmetric vs. Asymmetric Embeddings
Symmetric and asymmetric embeddings are designed to handle different types of relationships in data, and understanding their differences can help in choosing the right approach for specific tasks.
Symmetric Embeddings
Symmetric embeddings are used when the relationship between two items is mutual. The similarity between two items is expected to be the same regardless of the order in which they are compared.
Use Cases:
- Similarity Search: Comparing the similarity between two items, such as text or images, where the similarity score should be the same in both directions.
- Collaborative Filtering: Recommending items
based on mutual user-item interactions, where the relationship is bidirectional.
Evaluation:
- Cosine Similarity: Symmetric embeddings often use cosine similarity to measure the closeness of vectors.
similarity = cosine_similarity([embedding1], [embedding2])
Asymmetric Embeddings
Asymmetric embeddings are used when the relationship between two items is directional. The similarity or relevance of one item to another may not be the same when the order is reversed.
Use Cases:
- Information Retrieval: Retrieving relevant documents for a query, where the relevance of a document to a query is not necessarily the same as the relevance of the query to the document.
- Knowledge Graph Embeddings: Representing entities and relationships in a knowledge graph, where the relationship is directional (e.g., parent-child, teacher-student).
Evaluation:
- Rank-Based Metrics: Asymmetric embeddings often use rank-based metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate performance.
def mean_reciprocal_rank(rs):
    rs = (np.asarray(r).nonzero()[0] for r in rs)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
The Future of Embeddings
The field of embeddings is rapidly evolving. Researchers are exploring new ways to create more efficient and accurate representations, such as using unsupervised learning and combining embeddings with other techniques like graph networks. The ongoing advancements in this area promise to further enhance the capabilities of NLP systems.
Conclusion
Embeddings have revolutionized the field of NLP, providing a robust and efficient way to represent and process textual data. From word embeddings to sentence embeddings, these techniques have enabled significant advancements in how machines understand and interact with human language. With the help of dimension reduction methods, evaluation techniques, and tailored similarity and retrieval embeddings, embeddings can be optimized for a wide range of NLP tasks. Understanding the differences between symmetric and asymmetric embeddings further allows for more specialized applications. As we continue to develop more sophisticated models and techniques, embeddings will undoubtedly play a crucial role in advancing our understanding and interaction with human language.