Malaikannan The Deep Learning way of life

Chunking

Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline

As the field of Natural Language Processing (NLP) continues to evolve, the combination of retrieval-based and generative models has emerged as a powerful approach for enhancing various NLP applications. One of the key techniques that significantly improves the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) is chunking. In this blog, we will explore what chunking is, why it is important in RAG, the different ways to implement chunking, including content-aware and recursive chunking, how to evaluate the performance of chunking, chunking alternatives, and how it can be applied to optimize NLP systems.

What is Retrieval-Augmented Generation (RAG)?

Before diving into chunking, let’s briefly understand RAG. Retrieval-Augmented Generation is a framework that combines the strengths of retrieval-based models and generative models. It involves retrieving relevant information from a large corpus based on a query and using this retrieved information as context for a generative model to produce accurate and contextually relevant responses or content.

What is Chunking?

Chunking is the process of breaking down large text documents or datasets into smaller, manageable pieces, or “chunks.” These chunks can then be individually processed, indexed, and retrieved, making the overall system more efficient and effective. Chunking helps in dealing with large volumes of text by dividing them into smaller, coherent units that are easier to handle.

Why Do We Need Chunking?

Chunking is essential in RAG for several reasons:

Efficiency

  • Computational cost: Processing smaller chunks of text requires less computational power compared to handling entire documents.
  • Storage: Chunking allows for more efficient storage and indexing of information.

Accuracy

  • Relevance: By breaking down documents into smaller units, it’s easier to identify and retrieve the most relevant information for a given query.
  • Context preservation: Careful chunking can help maintain the original context of the text within each chunk.

Speed

  • Retrieval time: Smaller chunks can be retrieved and processed faster, leading to quicker response times.
  • Model processing: Language models can process smaller inputs more efficiently.

Limitations of Large Language Models

  • Context window: LLMs have limitations on the amount of text they can process at once. Chunking helps to overcome this limitation.

In essence, chunking optimizes the RAG process by making it more efficient, accurate, and responsive.

Different Ways to Implement Chunking

There are various methods to implement chunking, depending on the specific requirements and structure of the text data. Here are some common approaches:

  1. Fixed-Length Chunking: Divide the text into chunks of fixed length, typically based on a predetermined number of words or characters.

    def chunk_text_fixed_length(text, chunk_size=200, by='words'):
        if by == 'words':
            words = text.split()
            return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        elif by == 'characters':
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        else:
            raise ValueError("Parameter 'by' must be either 'words' or 'characters'.")
    
    text = "The process is more important than the results. And if you take care of the process, you will get the results."
    word_chunks = chunk_text_fixed_length(text, chunk_size=5, by='words')  
    character_chunks = chunk_text_fixed_length(text, chunk_size=5, by='characters')  
       
    
    print(word_chunks)
    ['The process is more important', 'than the results. And if', 'you take care of the', 'process, you will get the', 'results.']
    
    print(character_chunks)
    ['The p', 'roces', 's is ', 'more ', 'impor', 'tant ', 'than ', 'the r', 'esult', 's. An', 'd if ', 'you t', 'ake c', 'are o', 'f the', ' proc', 'ess, ', 'you w', 'ill g', 'et th', 'e res', 'ults.']
    
  2. Sentence-Based Chunking: Split the text into chunks based on complete sentences. This method ensures that each chunk contains coherent and complete thoughts.

    import nltk
    nltk.download('punkt')
       
    def chunk_text_sentences(text, max_sentences=3):
        sentences = nltk.sent_tokenize(text)
        return [' '.join(sentences[i:i + max_sentences]) for i in range(0, len(sentences), max_sentences)]
    
    text = """Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language. NLP techniques are used to apply algorithms to identify and extract the natural language rules such that 
    the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world."""
       
       
    
    sentence_chunks = chunk_text_sentences(text, max_sentences=2)
       
       
    for i, chunk in enumerate(sentence_chunks, 1):
        print(f"Chunk {i}:\n{chunk}\n")
    
    Chunk 1:
    Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language.
       
    Chunk 2:
    NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP.
       
    Chunk 3:
    It's a powerful tool in the modern data-driven world.
    
  3. Paragraph-Based Chunking: Divide the text into chunks based on paragraphs. This approach is useful when the text is naturally structured into paragraphs that represent distinct sections or topics.

    def chunk_text_paragraphs(text):
        paragraphs = text.split('\n\n')
        return [paragraph for paragraph in paragraphs if paragraph.strip()]
    
    # Assumes `text` separates its paragraphs with blank lines ('\n\n'), as in the output shown below
    paragraph_chunks = chunk_text_paragraphs(text)
       
       
    for i, chunk in enumerate(paragraph_chunks, 1):
        print(f"Paragraph {i}:\n{chunk}\n")
    
    Paragraph 1:
    Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
       
    Paragraph 2:
    It deals with the interaction between computers and humans through natural language.
       
    Paragraph 3:
    NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand.
       
    Paragraph 4:
    Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world.
    
  4. Thematic or Semantic Chunking: Use NLP techniques to identify and group related sentences or paragraphs into chunks based on their thematic or semantic content. This can be done using topic modeling or clustering algorithms.

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
       
    nltk.download('punkt')
       
    def chunk_text_thematic(text, n_clusters=5):
        sentences = nltk.sent_tokenize(text)
        vectorizer = TfidfVectorizer(stop_words='english')
        X = vectorizer.fit_transform(sentences)
        kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
        clusters = kmeans.predict(X)
           
        chunks = [[] for _ in range(n_clusters)]
        for i, sentence in enumerate(sentences):
            chunks[clusters[i]].append(sentence)
           
        return [' '.join(chunk) for chunk in chunks]
       
       
       
    thematic_chunks = chunk_text_thematic(text, n_clusters=3)
       
       
    for i, chunk in enumerate(thematic_chunks, 1):
        print(f"Chunk {i}:\n{chunk}\n")
       
    
  5. Sliding Window Chunking: Use a sliding window approach to create overlapping chunks. This method ensures that important information near the boundaries of chunks is not missed.

    def chunk_text_sliding_window(text, chunk_size=200, overlap=50, unit='word'):
        """Chunks text using a sliding window.

        Args:
            text: The input text.
            chunk_size: The desired size of each chunk.
            overlap: The overlap between consecutive chunks.
            unit: The chunking unit ('word' or 'char'; 'token' is not implemented here).

        Returns:
            A list of text chunks.
        """
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size.")

        if unit == 'word':
            data = text.split()
        elif unit == 'char':
            data = text
        else:
            # Other units (e.g. model tokens) need a tokenizer, which is not implemented here
            raise NotImplementedError(f"Unsupported unit: {unit}")

        chunks = []
        step = chunk_size - overlap  # how far the window advances on each iteration
        for i in range(0, len(data), step):
            window = data[i:i + chunk_size]
            chunks.append(' '.join(window) if unit == 'word' else window)

        return chunks
    
    
  6. Content-Aware Chunking: This advanced method involves using more sophisticated NLP techniques to chunk the text based on its content and structure. Content-aware chunking can take into account factors such as topic continuity, coherence, and discourse markers. It aims to create chunks that are not only manageable but also meaningful and contextually rich.

    Example of Content-Aware Chunking using Sentence Transformers:

    import nltk
    from sentence_transformers import SentenceTransformer, util
    
    def content_aware_chunking(text, max_chunk_size=200):
        model = SentenceTransformer('all-MiniLM-L6-v2')
        sentences = nltk.sent_tokenize(text)
        embeddings = model.encode(sentences, convert_to_tensor=True)
        clusters = util.community_detection(embeddings, min_community_size=1)
           
        chunks = []
        for cluster in clusters:
            chunk = ' '.join([sentences[i] for i in cluster])
            if len(chunk.split()) <= max_chunk_size:
                chunks.append(chunk)
            else:
                sub_chunks = chunk_text_fixed_length(chunk, max_chunk_size)
                chunks.extend(sub_chunks)
           
        return chunks
    
  7. Recursive Chunking: Recursive chunking involves repeatedly breaking down chunks into smaller sub-chunks until each chunk meets a desired size or level of detail. This method ensures that very large texts are reduced to manageable and meaningful units at each level of recursion, making it easier to process and retrieve information.

    Example of Recursive Chunking:

    def recursive_chunk(text, max_chunk_size):
        """Recursively chunks text into smaller chunks.

        Args:
            text: The input text.
            max_chunk_size: The maximum desired chunk size.

        Returns:
            A list of text chunks.
        """
        if len(text) <= max_chunk_size:
            return [text]

        # Choose a splitting point based on paragraphs, sentences, or other criteria
        # For example:
        paragraphs = text.split('\n\n')
        if len(paragraphs) > 1:
            chunks = []
            for paragraph in paragraphs:
                chunks.extend(recursive_chunk(paragraph, max_chunk_size))
            return chunks
        else:
            # Handle single paragraph chunking, e.g., by sentence splitting
            # …

    # …
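The recursive example above leaves the single-paragraph case open. One way to complete the idea, as a minimal sketch rather than a definitive implementation, is to try progressively finer separators and fall back to a hard character cut. The function name `recursive_chunk_sketch`, the separator order, and the fallback are all assumptions made for illustration.

```python
def recursive_chunk_sketch(text, max_chunk_size, separators=("\n\n", ". ", " ")):
    """Split text on progressively finer separators until every chunk fits.

    Assumptions for this sketch: sizes are measured in characters, the
    separator order is paragraphs -> sentences -> words, and text that
    still does not fit is cut at fixed character offsets.
    """
    text = text.strip()
    if len(text) <= max_chunk_size:
        return [text] if text else []

    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]

    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        if piece.strip():
            # Each piece is re-checked against the size limit with the finer separators.
            chunks.extend(recursive_chunk_sketch(piece, max_chunk_size, rest))
    return chunks


sample = "First paragraph is short.\n\nSecond paragraph is a bit longer. It has two sentences."
for chunk in recursive_chunk_sketch(sample, max_chunk_size=40):
    print(repr(chunk))
```

Note that this sketch drops the separator it splits on and never merges small neighbouring pieces back together, so a production version would likely add a merge step.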


  8. Agentic Chunking: Agentic chunking uses an LLM to determine chunk boundaries dynamically, based on the content and context of the text. Below is an example prompt for agentic chunking.

    Example Prompt:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    ## You are an agentic chunker. You will be provided with content.
    Decompose the content into clear and simple propositions, ensuring they are interpretable out of context.
    1. Split compound sentences into simple sentences. Maintain the original phrasing from the input whenever possible.
    2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
    3. Decontextualize each proposition by adding necessary modifiers to nouns or entire sentences and replacing pronouns (e.g. it, he, she, they, this, that) with the full name of the entities they refer to.
    4. Present the results as a list of strings, formatted in JSON.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the content : {content}
    strictly follow the instructions provided and output in the desired format only.
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
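The prompt asks the model to return a JSON list of strings, so the calling code has to fill in `{content}` and then parse the reply defensively. The sketch below shows only those two steps; the helper names `build_user_message` and `parse_propositions` are hypothetical, and the actual LLM call is deliberately left out because it depends on the serving stack you use.

```python
import json


def build_user_message(content):
    """Fill the {content} placeholder used in the user turn of the prompt."""
    return (
        f"Here is the content : {content}\n"
        "strictly follow the instructions provided and output in the desired format only."
    )


def parse_propositions(raw_response):
    """Extract the JSON list of propositions from a model reply.

    Models sometimes wrap JSON in extra text, so this looks for the
    outermost [...] span before parsing. Returns [] if nothing valid is found.
    """
    start, end = raw_response.find("["), raw_response.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(raw_response[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only non-empty strings; each one becomes a chunk.
    return [p.strip() for p in parsed if isinstance(p, str) and p.strip()]


# A made-up reply, standing in for what an LLM might return.
reply = 'Sure:\n["Chunking splits text into pieces.", "RAG retrieves relevant chunks."]'
print(parse_propositions(reply))
```

Each returned proposition can then be embedded and indexed as its own chunk, or grouped with related propositions before indexing.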

Chunk Size and Overlapping in Chunking

Determining the appropriate chunk size and whether to use overlapping chunks are critical decisions in the chunking process. These factors significantly impact the efficiency and effectiveness of the retrieval and generation stages in RAG systems.

Chunk Size
  1. Choosing Chunk Size: The ideal chunk size depends on the specific application and the nature of the text. Smaller chunks can provide more precise context but may miss broader information, while larger chunks capture more context but may introduce noise or irrelevant information.
    • Small Chunks: Typically 100-200 words. Suitable for fine-grained retrieval where specific details are crucial.
    • Medium Chunks: Typically 200-500 words. Balance between detail and context, suitable for most applications.
    • Large Chunks: Typically 500-1000 words. Useful for capturing broader context but may be less precise.
  2. Impact of Chunk Size: The chunk size affects the retrieval accuracy and computational efficiency. Smaller chunks generally lead to higher retrieval precision but may require more chunks to cover the same amount of text, increasing computational overhead. Larger chunks reduce the number of chunks but may lower retrieval precision.
Overlapping Chunks
  1. Purpose of Overlapping: Overlapping chunks ensure that important information near the boundaries of chunks is not missed. This approach is particularly useful when the text has high semantic continuity, and critical information may span across chunk boundaries.

  2. Degree of Overlap: The overlap size should be carefully chosen to balance redundancy and completeness. Common overlap sizes range from 10% to 50% of the chunk size.
    • Small Overlap: 10-20% of the chunk size. Minimizes redundancy but may still miss some boundary information.
    • Medium Overlap: 20-30% of the chunk size. Good balance between coverage and redundancy.
    • Large Overlap: 30-50% of the chunk size. Ensures comprehensive coverage but increases redundancy and computational load.
  3. Example of Overlapping Chunking:
    def chunk_text_sliding_window(text, chunk_size=200, overlap=50):
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunk = words[i:i + chunk_size]
            chunks.append(' '.join(chunk))
        return chunks
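To make the redundancy trade-off concrete, the sketch below counts how many chunks a word-level sliding window produces and how many words end up stored in total for a given overlap. The helper `sliding_window_stats` is a hypothetical name introduced here; it mirrors the loop above (`range(0, n_words, chunk_size - overlap)`) without needing real text.

```python
import math


def sliding_window_stats(n_words, chunk_size, overlap):
    """Return (number_of_chunks, total_words_stored) for a word-level sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    n_chunks = math.ceil(n_words / step) if n_words else 0
    # Windows start at 0, step, 2*step, ... and each holds at most chunk_size words.
    total = sum(min(chunk_size, n_words - start) for start in range(0, n_words, step))
    return n_chunks, total


for overlap in (0, 20, 50, 100):
    chunks, stored = sliding_window_stats(n_words=1000, chunk_size=200, overlap=overlap)
    print(f"overlap={overlap:>3}: {chunks} chunks, {stored} words stored")
```

For a 1000-word document with 200-word chunks, a 50-word overlap gives 7 chunks holding 1300 words in total, compared with 5 chunks and 1000 words without overlap. With large overlaps the simple loop can also emit a final chunk that is fully contained in the previous one, which you may want to drop.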
    

Evaluating the Performance of Chunking

Evaluating the performance of chunking is crucial to ensure that the chosen method effectively enhances the retrieval and generation processes. Here are some key metrics and approaches for evaluating chunking performance:

Retrieval Metrics
  1. Precision@K: Measures the proportion of relevant chunks among the top K retrieved chunks.
    def precision_at_k(retrieved_chunks, relevant_chunks, k):
        return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / k
    
  2. Recall@K: Measures the proportion of relevant chunks retrieved among the top K chunks.
    def recall_at_k(retrieved_chunks, relevant_chunks, k):
        return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / len(relevant_chunks)
    
  3. F1 Score: Harmonic mean of Precision@K and Recall@K, providing a balance between precision and recall.
    def f1_score_at_k(precision, recall):
        if precision + recall == 0:
            return 0
        return 2 * (precision * recall) / (precision + recall)
    
  4. MAP: Mean Average Precision (MAP) is used in information retrieval and object detection to evaluate the ranking of retrieved items. It averages, over all queries, the precision measured at each rank where a relevant item appears.

    import numpy as np

    def calculate_ap(y_true, y_score):
        """Calculates average precision for a single query.

        Args:
            y_true: Ground truth labels (0 or 1).
            y_score: Predicted scores.

        Returns:
            Average precision.
        """
        # Sort the (score, label) pairs in descending order of score
        ranked = sorted(zip(y_score, y_true), key=lambda x: x[0], reverse=True)

        correct_hits = 0
        sum_precision = 0.0
        for i, (_, y) in enumerate(ranked):
            if y == 1:
                correct_hits += 1
                sum_precision += correct_hits / (i + 1)
        # A query with no relevant items contributes an AP of 0
        return sum_precision / correct_hits if correct_hits else 0.0

    def calculate_map(y_true, y_score):
        """Calculates mean average precision.

        Args:
            y_true: Ground truth labels (list of lists).
            y_score: Predicted scores (list of lists).

        Returns:
            Mean average precision.
        """
        aps = [calculate_ap(t, s) for t, s in zip(y_true, y_score)]
        return np.mean(aps)
    
    
    
  5. NDCG: Normalized Discounted Cumulative Gain (NDCG) is a metric used to evaluate the quality of a ranking of items. It measures how well the most relevant items are ranked at the top of the list. In the context of chunking, we can apply NDCG by ranking chunks based on a relevance score and evaluating how well the most relevant chunks are placed at the beginning of the list.

    import numpy as np

    def calculate_dcg(rel):
        """Calculates Discounted Cumulative Gain (DCG).

        Args:
            rel: Relevance scores of items, in ranked order.

        Returns:
            DCG value.
        """
        rel = np.asarray(rel, dtype=float)
        # The item at 0-based position i is discounted by log2(i + 2)
        return np.sum(rel / np.log2(np.arange(len(rel)) + 2))

    def calculate_idcg(rel):
        """Calculates Ideal Discounted Cumulative Gain (IDCG).

        Args:
            rel: Relevance scores of items.

        Returns:
            IDCG value.
        """
        # The ideal ranking places the highest relevance scores first
        rel = np.sort(rel)[::-1]
        return calculate_dcg(rel)

    def calculate_ndcg(rel):
        """Calculates Normalized Discounted Cumulative Gain (NDCG).

        Args:
            rel: Relevance scores of items.

        Returns:
            NDCG value.
        """
        dcg = calculate_dcg(rel)
        idcg = calculate_idcg(rel)
        # If no item is relevant, the ideal DCG is 0; NDCG is treated as 0 in that case
        return dcg / idcg if idcg > 0 else 0.0

    # Example usage
    relevance_scores = [3, 2, 1, 0]
    ndcg_score = calculate_ndcg(relevance_scores)
    print(ndcg_score)
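Returning to Precision@K, Recall@K, and F1 from earlier in this list, it helps to run them on one concrete ranking. The sketch below repeats the three functions so it runs on its own; the chunk IDs and the set of relevant chunks are made up for illustration.

```python
def precision_at_k(retrieved_chunks, relevant_chunks, k):
    return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / k


def recall_at_k(retrieved_chunks, relevant_chunks, k):
    return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / len(relevant_chunks)


def f1_score_at_k(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)


# Hypothetical retrieval result (best match first) and ground-truth relevant chunks.
retrieved = ["c3", "c7", "c1", "c9", "c4"]
relevant = ["c1", "c4", "c8"]

for k in (3, 5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.3f} recall={r:.3f} f1={f1_score_at_k(p, r):.3f}")
```

At k=3 only one relevant chunk (c1) is in the top results, so precision and recall are both 1/3. At k=5 two of the three relevant chunks are found, which raises recall to 2/3 while precision is 2/5. Comparing these numbers across chunking strategies on the same query set is one way to choose between them.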


Generation Metrics
  1. BLEU Score: Measures the overlap between the generated text and reference text, considering n-grams.
    from nltk.translate.bleu_score import sentence_bleu
    
    def bleu_score(reference, generated):
        return sentence_bleu([reference.split()], generated.split())
    
  2. ROUGE Score: Measures the overlap of n-grams, longest common subsequence (LCS), and skip-bigram between the generated text and reference text.
    from rouge import Rouge
    
    rouge = Rouge()
    
    def rouge_score(reference, generated):
        scores = rouge.get_scores(generated, reference)
        return scores[0]['rouge-l']['f']
    
  3. Human Evaluation: Involves subjective evaluation by human judges to assess the relevance, coherence, and overall quality of the generated responses. Human evaluation can provide insights that automated metrics might miss.

Chunking Alternatives

While chunking is an effective method for improving the efficiency and effectiveness of RAG systems, there are alternative techniques that can also be considered:

  1. Hierarchical Indexing: Instead of chunking the text, hierarchical indexing organizes the data into a tree structure where each node represents a topic or subtopic. This allows for efficient retrieval by navigating through the tree based on the query’s context.

    class HierarchicalIndex:
        def __init__(self):
            self.tree = {}

        def add_document(self, doc_id, topics):
            # Walk (and create) the path of topics, then record the document at the leaf
            current_level = self.tree
            for topic in topics:
                current_level = current_level.setdefault(topic, {})
            # Store a list so several documents can share the same topic path
            current_level.setdefault('doc_ids', []).append(doc_id)

        def retrieve(self, query_topics):
            current_level = self.tree
            for topic in query_topics:
                if topic not in current_level:
                    return []
                current_level = current_level[topic]
            return current_level.get('doc_ids', [])

    index = HierarchicalIndex()
    index.add_document('doc1', ['nlp', 'rag', 'chunking'])
    print(index.retrieve(['nlp', 'rag', 'chunking']))  # ['doc1']
    
  2. Summarization: Instead of retrieving chunks, the system generates summaries of documents or sections that are relevant to the query. This can be done using extractive or abstractive summarization techniques.
    from transformers import BartTokenizer, BartForConditionalGeneration
    
    def generate_summary(text):
        tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
        model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    
        inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
        summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)
        return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
  3. Dense Passage Retrieval (DPR): DPR uses dense vector representations for both questions and passages, allowing for efficient similarity search using vector databases like FAISS.
    from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    
    question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    
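    def encode_texts(texts, tokenizer, encoder):
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        # Detach from the autograd graph so scikit-learn can convert the tensor to a NumPy array
        return encoder(**inputs).pooler_output.detach().numpy()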
    
    question_embeddings = encode_texts(["What is chunking?"], question_tokenizer, question_encoder)
    context_embeddings = encode_texts(["Chunking is a process...", "Another context..."], context_tokenizer, context_encoder)
    
    similarities = cosine_similarity(question_embeddings, context_embeddings)
    
  4. Graph-Based Representations: Instead of breaking the text into chunks, graph-based representations model the relationships between different parts of the text. Nodes represent entities, concepts, or chunks of text, and edges represent the relationships between them. This approach allows for more flexible and context-aware retrieval.
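    import networkx as nx

    # compute_similarity(a, b) -> float is supplied by the caller;
    # the default threshold and k below are illustrative values only.
    def build_graph(texts, compute_similarity, threshold=0.5):
        graph = nx.Graph()
        for i, text in enumerate(texts):
            graph.add_node(i, text=text)
        # Add edges based on some similarity metric
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                similarity = compute_similarity(texts[i], texts[j])
                if similarity > threshold:
                    graph.add_edge(i, j, weight=similarity)
        return graph

    def retrieve_from_graph(graph, query, compute_similarity, threshold=0.5, k=3):
        # Temporarily add the query as a node and connect it to similar texts
        query_node = len(graph.nodes)
        graph.add_node(query_node, text=query)
        for i in range(query_node):
            similarity = compute_similarity(query, graph.nodes[i]['text'])
            if similarity > threshold:
                graph.add_edge(query_node, i, weight=similarity)
        # Retrieve nodes with highest similarity
        neighbors = sorted(graph[query_node], key=lambda x: graph[query_node][x]['weight'], reverse=True)
        results = [graph.nodes[n]['text'] for n in neighbors[:k]]
        graph.remove_node(query_node)  # leave the graph unchanged for the next query
        return results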

Graph-based representations can capture complex relationships and provide a more holistic view of the text, making them a powerful alternative to chunking.
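The graph example relies on a `compute_similarity` function that is not defined in this post. As a stand-in for experimentation, here is a minimal sketch using Jaccard word overlap; the name `jaccard_similarity` and the choice of metric are assumptions, and an embedding-based cosine similarity would usually be a better fit for real data.

```python
def jaccard_similarity(text_a, text_b):
    """Word-overlap similarity in [0, 1]: shared words divided by all distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


print(jaccard_similarity("chunking splits text", "chunking splits documents"))
print(jaccard_similarity("graph based retrieval", "sentence embeddings"))
```

Because it only compares surface words, this metric treats synonyms as unrelated, so the edge threshold needs to be set much lower than it would be for embedding similarity.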

Conclusion

Chunking plays a pivotal role in enhancing the efficiency and effectiveness of Retrieval-Augmented Generation systems. By breaking down large texts into manageable chunks, we can improve retrieval speed, contextual relevance, scalability, and the overall quality of generated responses. Evaluating chunking methods involves retrieval metrics, generation metrics, and human evaluation. As NLP continues to advance, techniques like chunking will remain essential for optimizing the performance of RAG and other language processing systems. Additionally, exploring alternatives such as hierarchical indexing, summarization, dense passage retrieval, and graph-based representations can further enhance the capabilities of RAG systems.

Embark on your journey to harness the power of chunking in RAG and unlock new possibilities in the world of Natural Language Processing!

If you found this blog post helpful, please consider citing it in your work:

    @misc{malaikannan2024chunking,
      author = {Sankarasubbu, Malaikannan},
      title = {Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline},
      year = {2024},
      url = {https://malaikannan.github.io/2024/08/05/Chunking/},
      note = {Accessed: 2024-08-12}
    }

Embeddings

Computers are meant to crunch numbers; that goes back to the original design of these machines. Representing text as numbers is the holy grail of Natural Language Processing (NLP), but how do we do that? Over the years, various techniques have been developed to achieve this. Early methods such as n-grams (bigrams, trigrams) and TF-IDF were able to convert words into numbers: not a single number, but a collection of them. Each word is represented by this collection of numbers, which is called a vector, and its fixed size is called the dimension of the vector. Useful as they were, these methods had limitations. The most important one is that the vector for each word stands alone, i.e., mathematical operations such as addition or subtraction between vectors are not meaningful (we can compute them, but the resulting vector will not represent any word). That is where embeddings come in. An embedding is also a vector, so each word still gets a corresponding vector, but now King - Man + Woman gives a vector that is close to the vector for Queen. Why is this useful? That is what we are going to explore in this article.

What are Embeddings?

Embeddings are numerical representations of text data where words or phrases from the vocabulary are mapped to vectors of real numbers. This mapping is crucial because it allows us to quantify and manipulate textual data in a way that machines can understand and process.

We understand what a word is; let's see what a vector is. A vector is an ordered sequence of numbers. For example:

  • (3) is a one dimensional vector.
  • (2,8) is a two dimensional vector.
  • (12,6,7,4) is a four dimensional vector.

A vector can be represented by plotting it on a graph. Let's take a 2D example.

2D Plot

We can only visualize up to three dimensions; anything beyond that can be described but not drawn.

Below is an example of a four-dimensional vector representation of the word king.

King Vector

One of the seminal papers to come out of Google is Word2Vec. Let's see how Word2Vec works to get a conceptual understanding of how embeddings work.

How Word2Vec works

For an input text, it looks at each word and the words that surround it (its context). By training to predict a word from its context, or the context from a word, it learns which words appear in similar surroundings. At the end of training, each word is represented by a vector of N dimensions (usually in the 100 to 300 range).

Word2Vec

Suppose we train the Word2Vec algorithm on the example text “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis”.

Let's assume that it outputs two-dimensional vectors for each word, since we can’t visualize anything beyond three dimensions.

  • SanFrancisco (6,6)
  • beautiful (-13,-4)
  • California (10,8)
  • city (2,10)
  • LosAngeles (6.5,5)
  • lovely (-12,-7)
  • metropolis (2.5,8)

Below is a 2D Plot of vectors

2DPlot

You can see in the plot what the Word2Vec algorithm inferred from the input text: SanFrancisco and LosAngeles are grouped together, beautiful and lovely are grouped together, and city and metropolis are grouped together. The beauty of this is that Word2Vec deduced it purely from data, without being explicitly taught English or geography.
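The grouping can also be checked numerically. The sketch below takes the seven illustrative 2D vectors listed above and finds each word's nearest neighbour by Euclidean distance. The vectors are the assumed values from this example, not the output of a trained model, and `nearest_neighbor` is a helper name introduced here.

```python
import math

# Illustrative 2D vectors from the example above (not real Word2Vec output).
vectors = {
    "SanFrancisco": (6, 6),
    "beautiful": (-13, -4),
    "California": (10, 8),
    "city": (2, 10),
    "LosAngeles": (6.5, 5),
    "lovely": (-12, -7),
    "metropolis": (2.5, 8),
}


def nearest_neighbor(word, vectors):
    """Return the other word whose vector is closest in Euclidean distance."""
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: math.dist(vectors[word], vectors[w]))


for word in vectors:
    print(f"{word} -> {nearest_neighbor(word, vectors)}")
```

SanFrancisco and LosAngeles pick each other, as do beautiful and lovely, and city and metropolis, which matches the clusters visible in the plot.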

You will see more embedding approaches in the sections below.

Key Characteristics of Embeddings:
  1. Dimensionality: Embeddings are vectors of fixed size. Common sizes range from 50 to 300 dimensions, though they can be larger depending on the complexity of the task.
  2. Continuous Space: Unlike traditional one-hot encoding, embeddings are dense and reside in a continuous vector space, making them more efficient and informative.
  3. Semantic Proximity: Words with similar meanings tend to have vectors that are close to each other in the embedding space.
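The contrast between one-hot encoding and dense embeddings is easy to see with a dot product. In the sketch below, the three-word vocabulary and the dense vectors are made-up values chosen only to illustrate the point.

```python
def one_hot(word, vocabulary):
    """Vector with a single 1.0 at the word's position; its length equals the vocabulary size."""
    return [1.0 if w == word else 0.0 for w in vocabulary]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


vocabulary = ["king", "queen", "apple"]

# One-hot vectors of different words never overlap, so every pair scores 0.
print(dot(one_hot("king", vocabulary), one_hot("queen", vocabulary)))

# Hypothetical dense 2D embeddings: related words point in similar directions.
dense = {"king": [0.9, 0.8], "queen": [0.85, 0.9], "apple": [-0.7, 0.1]}
print(dot(dense["king"], dense["queen"]) > dot(dense["king"], dense["apple"]))
```

One-hot vectors also grow with the vocabulary, while a dense embedding keeps the same fixed dimension no matter how many words are added.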

The Evolution of Embeddings

Embeddings have evolved significantly over the years. Here are some key milestones:

  1. Word2Vec (2013): Developed by Mikolov et al. at Google, Word2Vec was one of the first algorithms to create word embeddings. It uses two architectures—Continuous Bag of Words (CBOW) and Skip-gram—to learn word associations.

  2. GloVe (2014): Developed by the Stanford NLP Group, GloVe (Global Vectors for Word Representation) improves upon Word2Vec by incorporating global statistical information of the corpus.

  3. FastText (2016): Developed by Facebook’s AI Research (FAIR) lab, FastText extends Word2Vec by considering subword information, which helps in handling out-of-vocabulary words and capturing morphological details.

  4. ELMo (2018): Developed by the Allen Institute for AI, ELMo (Embeddings from Language Models) generates context-sensitive embeddings, meaning the representation of a word changes based on its context in a sentence.

  5. BERT (2018): Developed by Google, BERT (Bidirectional Encoder Representations from Transformers) revolutionized embeddings by using transformers to understand the context of a word bidirectionally. This model significantly improved performance on various NLP tasks.

From Word Embeddings to Sentence Embeddings

While word embeddings provide a way to represent individual words, they do not capture the meaning of entire sentences or documents. This limitation led to the development of sentence embeddings, which are designed to represent longer text sequences.

Word Embeddings

Word embeddings, such as those created by Word2Vec, GloVe, and FastText, map individual words to vectors. These embeddings capture semantic similarities between words based on their context within a large corpus of text. For example, the words “king” and “queen” might be close together in the embedding space because they often appear in similar contexts.
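The King - Man + Woman arithmetic mentioned earlier can be sketched with toy vectors. The three dimensions below (roughly "royalty", "maleness", "fruitiness") and all the values are invented for illustration; real embeddings have hundreds of dimensions with no such clean interpretation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded, as is common when evaluating analogies.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hypothetical toy embeddings: [royalty, maleness, fruitiness]
toy = {
    "king": [0.9, 0.9, 0.0],
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

print(analogy("king", "man", "woman", toy))
```

With these values the result is queen: subtracting man removes the "maleness" component and adding woman leaves the "royalty" component untouched.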

Sentence Embeddings

Sentence embeddings extend the concept of word embeddings to entire sentences or even paragraphs. These embeddings aim to capture the meaning of a whole sentence, taking into account the context and relationships between words within the sentence. There are several methods to create sentence embeddings:

  1. Averaging Word Embeddings: One of the simplest methods is to average the word embeddings of all words in a sentence. While this method is straightforward, it often fails to capture the nuances and syntactic structures of sentences.

  2. Doc2Vec: Developed by Mikolov and Le, Doc2Vec extends Word2Vec to larger text segments by considering the paragraph as an additional feature during training. This method generates embeddings for sentences or documents that capture more context compared to averaging word embeddings.

  3. Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) networks, can be used to generate sentence embeddings by processing the sequence of words in a sentence. The hidden state of the RNN after processing the entire sentence can serve as the sentence embedding.

  4. Transformers (BERT, GPT, etc.): Modern approaches like BERT and GPT use transformer architectures to generate context-aware embeddings for sentences. Encoder models such as BERT attend to the words on both sides of each token, while GPT-style decoders read left to right; in both cases self-attention captures dependencies and relationships between words more effectively than previous methods.
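
The first method in the list, averaging word embeddings, can be sketched in a few lines. The vectors below are toy values and `average_embedding` is a hypothetical helper, not part of any library. The example also shows the weakness mentioned above: two sentences with the same words in a different order get exactly the same embedding.

```python
def average_embedding(tokens, word_vectors):
    """Mean of the vectors of all in-vocabulary tokens (None if none are known)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy 2-dimensional word vectors, chosen only for illustration.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

a = average_embedding(["dog", "bites", "man"], toy_vectors)
b = average_embedding(["man", "bites", "dog"], toy_vectors)

# Word order is lost: both sentences map to the same vector.
print(a == b)
```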

Example: BERT Sentence Embeddings

BERT (Bidirectional Encoder Representations from Transformers) has set a new standard for generating high-quality sentence embeddings. By processing a sentence in both directions, BERT captures the full context of each word in relation to the entire sentence. The embeddings generated by BERT can be fine-tuned for various NLP tasks, such as sentiment analysis, question answering, and text classification.

To create a sentence embedding with BERT, you can use the hidden states of the transformer model. Typically, the hidden state corresponding to the [CLS] token (which stands for “classification”) is used as the sentence embedding.

How to Generate Embeddings

Generating embeddings involves training a model on a large corpus of text data. Here’s a step-by-step guide to generating word and sentence embeddings:

Generating Word Embeddings with Word2Vec
  1. Data Preparation: Collect and preprocess a large text corpus. This involves tokenizing the text, removing stop words, and handling punctuation.

  2. Training the Model: Use the Word2Vec algorithm to train the model. You can choose between the CBOW or Skip-gram architecture. Libraries like Gensim in Python provide easy-to-use implementations of Word2Vec.
    from gensim.models import Word2Vec
    
    # Example sentences
    sentences = [["I", "love", "machine", "learning"], ["Word2Vec", "is", "great"]]
    
    # Train Word2Vec model
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    
  3. Using the Embeddings: Once the model is trained, you can use it to get the embedding for any word in the vocabulary.
    word_embedding = model.wv['machine']
    
Generating Sentence Embeddings with BERT
  1. Install Transformers Library: Use the Hugging Face Transformers library to easily work with BERT.
    pip install transformers torch
    
  2. Load Pretrained BERT Model: Load a pretrained BERT model and tokenizer.
    from transformers import BertTokenizer, BertModel
    import torch
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    
  3. Tokenize Input Text: Tokenize your input text and convert it to input IDs and attention masks.
    sentence = "BERT is amazing for sentence embeddings."
    inputs = tokenizer(sentence, return_tensors='pt')
    
  4. Generate Embeddings: Pass the inputs through the BERT model to get the embeddings.
    with torch.no_grad():
        outputs = model(**inputs)
    
    # The [CLS] token embedding
    sentence_embedding = outputs.last_hidden_state[0][0]
    
  5. Using the Embeddings: The sentence_embedding can now be used for various NLP tasks.
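
The [CLS] vector is one pooling choice. Another common one is mean pooling, which averages the hidden states of all real tokens and skips padding. The sketch below shows the idea with plain Python lists standing in for one row of `outputs.last_hidden_state` and its attention mask; `mean_pool` is a hypothetical helper and the numbers are made up.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy stand-in for one sequence: 4 token positions, 3 hidden dimensions.
# The last position is padding and must not influence the result.
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)
```

Which pooling works better depends on the model and the task, so it is worth trying both on your own data.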

Data Needed for Training Embeddings

The quality of embeddings heavily depends on the data used for training. Here are key considerations regarding the data needed:

  1. Size of the Corpus: A large corpus is generally required to capture the diverse contexts in which words can appear. For example, training Word2Vec or BERT models typically requires billions of words. The larger the corpus, the better the embeddings can capture semantic nuances.

  2. Diversity of the Corpus: The corpus should cover a wide range of topics and genres to ensure that the embeddings are generalizable. This means including text from various domains such as news articles, books, social media, academic papers, and more.

  3. Preprocessing: Proper preprocessing of the corpus is essential. This includes:
    • Tokenization: Splitting text into words or subwords.
    • Lowercasing: Converting all text to lowercase to reduce the vocabulary size.
    • Removing Punctuation and Stop Words: Cleaning the text by removing unnecessary punctuation and common stop words that do not contribute to the meaning.
    • Handling Special Characters: Dealing with special characters, numbers, and other non-alphabetic tokens appropriately.
  4. Domain-Specific Data: For specialized applications, it is beneficial to include domain-specific data. For instance, medical embeddings should be trained on medical literature to capture the specialized vocabulary and context of the field.

  5. Balanced Dataset: Ensuring that the dataset is balanced and not biased towards a particular topic or genre helps in creating more neutral and representative embeddings.

  6. Data Augmentation: In cases where data is limited, data augmentation techniques such as back-translation, paraphrasing, and synthetic data generation can be used to enhance the corpus.
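
The preprocessing steps listed above (tokenization, lowercasing, removing punctuation and stop words) can be sketched with the standard library alone. The stop-word list here is deliberately tiny and the `preprocess` helper is hypothetical; real pipelines usually rely on a tokenizer from an NLP library.

```python
import re

# Tiny illustrative stop-word list; real pipelines use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, keep runs of letters/digits as tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

tokens = preprocess("The quality of embeddings depends on the data!")
print(tokens)
```

Note that this simple pattern splits contractions such as "don't" into two tokens, which may or may not be what you want.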

Applications of Sentence Embeddings

Sentence embeddings have a wide range of applications in NLP:

  1. Text Classification: Sentence embeddings are used to represent sentences for classification tasks, such as identifying the topic of a sentence or determining the sentiment expressed in a review.
  2. Semantic Search: By comparing sentence embeddings, search engines can retrieve documents that are semantically similar to a query, even if the exact keywords are not matched.
  3. Summarization: Sentence embeddings help in generating summaries by identifying the most important sentences in a document based on their semantic content.
  4. Translation: Sentence embeddings improve machine translation systems by providing a richer representation of the source sentence, leading to more accurate translations.

Embedding Dimension Reduction Methods

High-dimensional embeddings can be computationally expensive and may contain redundant information. Dimension reduction techniques help in simplifying these embeddings while preserving their essential characteristics. Here are some common methods:

  1. Principal Component Analysis (PCA): PCA is a linear method that projects the data onto a new set of orthogonal axes (principal components), ordered so that the first components capture the greatest variance in the data. Keeping only the first few components reduces the dimensionality while retaining most of that variance.
    from sklearn.decomposition import PCA
    
    # Assuming 'embeddings' is a numpy array of shape (n_samples, n_features)
    pca = PCA(n_components=50)
    reduced_embeddings = pca.fit_transform(embeddings)
    
  2. t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a nonlinear technique primarily used for visualizing high-dimensional data by reducing it to two or three dimensions.
    from sklearn.manifold import TSNE
    
    tsne = TSNE(n_components=2)
    reduced_embeddings = tsne.fit_transform(embeddings)
    
  3. Uniform Manifold Approximation and Projection (UMAP): UMAP is another nonlinear technique that is faster and often more effective than t-SNE for dimension reduction, especially for larger datasets.
    import umap
    
    reducer = umap.UMAP(n_components=2)
    reduced_embeddings = reducer.fit_transform(embeddings)
    
  4. Autoencoders: Autoencoders are a type of neural network used to learn efficient codings of input data. An autoencoder consists of an encoder and a decoder. The encoder compresses the input into a lower-dimensional latent space, and the decoder reconstructs the input from this latent space.
    from tensorflow.keras.layers import Input, Dense
    from tensorflow.keras.models import Model
    
    # Define encoder
    input_dim = embeddings.shape[1]
    encoding_dim = 50  # Size of the reduced dimension
    input_layer = Input(shape=(input_dim,))
    encoded = Dense(encoding_dim, activation='relu')(input_layer)
    
    # Define decoder (linear output, since embedding values are not confined to [0, 1])
    decoded = Dense(input_dim, activation='linear')(encoded)
    
    # Build the autoencoder model
    autoencoder = Model(input_layer, decoded)
    encoder = Model(input_layer, encoded)
    
    # Compile and train the autoencoder
    autoencoder.compile(optimizer='adam', loss='mean_squared_error')
    autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256, shuffle=True)
    
    # Get the reduced embeddings
    reduced_embeddings = encoder.predict(embeddings)
    
  5. Random Projection: Random projection is a simple and computationally efficient technique to reduce dimensionality. It is based on the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be mapped into a much lower-dimensional space while approximately preserving the pairwise distances between them.
    from sklearn.random_projection import SparseRandomProjection
    
    transformer = SparseRandomProjection(n_components=50)
    reduced_embeddings = transformer.fit_transform(embeddings)
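
To see the Johnson-Lindenstrauss idea at work, the sketch below projects a few random 300-dimensional points down to 100 dimensions and compares pairwise distances before and after. It uses a dense Gaussian matrix written in plain Python rather than scikit-learn's sparse variant, and the sizes and seeds are arbitrary choices for illustration.

```python
import math
import random

def random_projection_matrix(n_features, n_components, seed=0):
    """Gaussian matrix scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]

def project(vector, matrix):
    """Multiply the projection matrix by one vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Five random points standing in for 300-dimensional embeddings.
rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(5)]

matrix = random_projection_matrix(n_features=300, n_components=100)
projected = [project(p, matrix) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    euclidean(projected[i], projected[j]) / euclidean(points[i], points[j])
    for i in range(5) for j in range(i + 1, 5)
]
print(min(ratios), max(ratios))
```

The ratios should stay reasonably close to 1, and they tighten as the number of components grows.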
    

Evaluating Embeddings

Evaluating embeddings is crucial to ensure that they capture meaningful relationships and semantics. Here are some common methods to evaluate embeddings:

  1. Intrinsic Evaluation: These methods evaluate the quality of embeddings based on predefined linguistic tasks or properties without involving downstream tasks.

    • Word Similarity: Measure the cosine similarity between word pairs and compare with human-annotated similarity scores. Popular datasets include WordSim-353 and SimLex-999.
      from scipy.spatial.distance import cosine
      
      similarity = 1 - cosine(embedding1, embedding2)
      
    • Analogy Tasks: Evaluate embeddings based on their ability to solve word analogy tasks, such as “king - man + woman = queen.” Datasets like Google Analogy dataset are commonly used.
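      def analogy(model, word1, word2, word3):
          # model is assumed to be gensim KeyedVectors (e.g. Word2Vec(...).wv).
          # Solves word1 - word2 + word3, e.g. king - man + woman -> queen;
          # most_similar excludes the input words from the results.
          return model.most_similar(positive=[word1, word3], negative=[word2], topn=1)[0][0]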
      
  2. Extrinsic Evaluation: These methods evaluate embeddings based on their performance on downstream NLP tasks.

    • Text Classification: Use embeddings as features for text classification tasks and measure performance using metrics like accuracy, precision, recall, and F1 score.
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      
      model = LogisticRegression()
      model.fit(train_embeddings, train_labels)
      predictions = model.predict(test_embeddings)
      accuracy = accuracy_score(test_labels, predictions)
      
    • Named Entity Recognition (NER): Evaluate embeddings by their performance on NER tasks, measuring precision, recall, and F1 score.
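      # Example using spaCy (v3) for NER
      import spacy
      from spacy.tokens import DocBin
      
      nlp = spacy.load("en_core_web_sm")
      # Register an extra entity label on the pipeline's NER component
      nlp.get_pipe("ner").add_label("ORG")
      
      # Run the pipeline and pack the docs for later scoring or training
      train_docs = [nlp(text) for text in train_texts]
      train_db = DocBin(docs=train_docs)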
      
    • Machine Translation: Assess the quality of embeddings by their impact on machine translation tasks, using BLEU or METEOR scores.
  3. Clustering and Visualization: Visualizing embeddings using t-SNE or UMAP can provide qualitative insights into the structure and quality of embeddings.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    
    # Assuming 'words' lists the labels for the rows of 'embeddings'
    tsne = TSNE(n_components=2)
    reduced_embeddings = tsne.fit_transform(embeddings)
    
    plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
    for i, word in enumerate(words):
        plt.annotate(word, xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]))
    plt.show()
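
For the word-similarity evaluation described earlier in this list, the usual way to compare model scores with human-annotated scores is a rank correlation such as Spearman's. Below is a minimal standard-library sketch; the human ratings and model similarities are made-up numbers, not values from WordSim-353 or SimLex-999.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up numbers: human ratings and model cosine similarities for 5 word pairs.
human_scores = [9.0, 7.5, 6.0, 3.0, 1.0]
model_scores = [0.82, 0.75, 0.40, 0.45, 0.10]

rho = spearman(human_scores, model_scores)
print(round(rho, 2))
```

A value near 1 means the embeddings order word pairs much as the human annotators do.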
    

Similarity vs. Retrieval Embeddings

Embeddings can be tailored for different purposes, such as similarity or retrieval tasks. Understanding the distinction between these two types of embeddings is crucial for optimizing their use in various applications.

Similarity Embeddings

Similarity embeddings are designed to capture the semantic similarity between different pieces of text. The primary goal is to ensure that semantically similar texts have similar embeddings.

Use Cases:

  • Semantic Search: Finding documents or sentences that are semantically similar to a query.
  • Recommendation Systems: Recommending items (e.g., articles, products) that are similar to a given item.
  • Paraphrase Detection: Identifying sentences or phrases that convey the same meaning.

Evaluation:

  • Cosine Similarity: Measure the cosine similarity between embeddings to evaluate their closeness.
    from sklearn.metrics.pairwise import cosine_similarity
    
    # cosine_similarity returns a 2D array; [0][0] extracts the single score
    similarity = cosine_similarity([embedding1], [embedding2])[0][0]
    
  • Clustering: Grouping similar items together using clustering algorithms like K-means.
    from sklearn.cluster import KMeans
    
    kmeans = KMeans(n_clusters=5)
    clusters = kmeans.fit_predict(embeddings)
    
Retrieval Embeddings

Retrieval embeddings are optimized for information retrieval tasks, where the goal is to retrieve the most relevant documents from a large corpus based on a query.

Use Cases:

  • Search Engines: Retrieving relevant web pages or documents based on user queries.
  • Question Answering Systems: Finding relevant passages or documents that contain the answer to a user’s question.
  • Document Retrieval: Retrieving documents that are most relevant to a given query.

Evaluation:

  • Precision and Recall: Measure the accuracy of retrieved documents using precision, recall, and F1 score.
    from sklearn.metrics import precision_score, recall_score, f1_score
    
    precision = precision_score(true_labels, predicted_labels, average='weighted')
    recall = recall_score(true_labels, predicted_labels, average='weighted')
    f1 = f1_score(true_labels, predicted_labels, average='weighted')
    
  • Mean Reciprocal Rank (MRR): Evaluate the rank of the first relevant document.
    import numpy as np
    
    def mean_reciprocal_rank(rs):
        """Score is reciprocal of the rank of the first relevant item
        First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
        Example from information retrieval with binary relevance:
        >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
        >>> mean_reciprocal_rank(rs)
        0.61111111111111105
        """
        rs = (np.asarray(r).nonzero()[0] for r in rs)
        return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
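
For ranked retrieval, precision and recall are often computed over only the top k results. Here is a small sketch under that assumption; the document ids, the relevance set, and the helper names `precision_at_k` and `recall_at_k` are all hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

retrieved = ["d3", "d1", "d7", "d2", "d9"]   # hypothetical ranked result ids
relevant = {"d1", "d2", "d4"}                # hypothetical ground-truth set

p3 = precision_at_k(retrieved, relevant, k=3)
r3 = recall_at_k(retrieved, relevant, k=3)
p5 = precision_at_k(retrieved, relevant, k=5)
r5 = recall_at_k(retrieved, relevant, k=5)
print(p3, r3, p5, r5)
```

Looking at several values of k shows how quickly relevant documents surface in the ranking.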
    

Symmetric vs. Asymmetric Embeddings

Symmetric and asymmetric embeddings are designed to handle different types of relationships in data, and understanding their differences can help in choosing the right approach for specific tasks.

Symmetric Embeddings

Symmetric embeddings are used when the relationship between two items is mutual. The similarity between two items is expected to be the same regardless of the order in which they are compared.

Use Cases:

  • Similarity Search: Comparing the similarity between two items, such as text or images, where the similarity score should be the same in both directions.
  • Collaborative Filtering: Recommending items based on mutual user-item interactions, where the relationship is bidirectional.

Evaluation:

  • Cosine Similarity: Symmetric embeddings often use cosine similarity to measure the closeness of vectors.
    similarity = cosine_similarity([embedding1], [embedding2])
    
Asymmetric Embeddings

Asymmetric embeddings are used when the relationship between two items is directional. The similarity or relevance of one item to another may not be the same when the order is reversed.

Use Cases:

  • Information Retrieval: Retrieving relevant documents for a query, where the relevance of a document to a query is not necessarily the same as the relevance of the query to the document.
  • Knowledge Graph Embeddings: Representing entities and relationships in a knowledge graph, where the relationship is directional (e.g., parent-child, teacher-student).

Evaluation:

  • Rank-Based Metrics: Asymmetric embeddings often use rank-based metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate performance.
    import numpy as np
    
    def mean_reciprocal_rank(rs):
        rs = (np.asarray(r).nonzero()[0] for r in rs)
        return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
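
NDCG, the other rank-based metric mentioned above, can be sketched as follows. This version uses the relevance grade directly as the gain (some formulations use 2^rel - 1 instead), and the function names are illustrative rather than taken from a library.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ranked = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k is not None else ideal
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the results, listed in the order the system returned them.
perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
print(perfect, reversed_order)
```

A perfectly ordered list scores 1.0, and the score drops as highly relevant items slide down the ranking.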
    

The Future of Embeddings

The field of embeddings is rapidly evolving. Researchers are exploring new ways to create more efficient and accurate representations, such as using unsupervised learning and combining embeddings with other techniques like graph networks. The ongoing advancements in this area promise to further enhance the capabilities of NLP systems.

Conclusion

Embeddings have revolutionized the field of NLP, providing a robust and efficient way to represent and process textual data. From word embeddings to sentence embeddings, these techniques have enabled significant advancements in how machines understand and interact with human language. With the help of dimension reduction methods, evaluation techniques, and tailored similarity and retrieval embeddings, embeddings can be optimized for a wide range of NLP tasks. Understanding the differences between symmetric and asymmetric embeddings further allows for more specialized applications. As we continue to develop more sophisticated models and techniques, embeddings will undoubtedly play a crucial role in advancing our understanding and interaction with human language.

How to find a home with a kitchen you like with deep learning?

AI in clinical trials

The American Tamil Entrepreneurs Association invited me to give a talk on AI. Below is the video of the talk; it was a fun talk loaded with a lot of memes.

Learned index

Jeff Dean and co-authors published a seminal paper on whether indexes can be learned using neural networks. I gave a talk on this topic in the Saama Tech Talk Series. I guess I am moving more towards talking than writing nowadays.