Blog Prep (#5)
* Update readme

* Update text
osw282 authored Jan 17, 2025
1 parent f8219be commit a824bd0
Showing 2 changed files with 60 additions and 6 deletions.
58 changes: 54 additions & 4 deletions README.md
@@ -1,8 +1,58 @@
<h1 align="center">
rag_semantic_chunking 🚧
</h1>
# L&L Blog Series - RAG Semantic Chunking Experiment

Experiment with semantic chunking for a RAG application.
In this project, we will experiment with and compare the results of a RAG application using semantic chunking versus simple fixed-size chunking.

The [experiment notebook](./rag_semantic_chunking/experiment_rag_semantic_chunking.ipynb) will implement and compare the simplest chunking method, fixed-size chunking, with the improvements offered by semantic chunking.

# 🏃 How do I get started?
If you haven't already done so, please read [DEVELOPMENT.md](DEVELOPMENT.md) for instructions on how to set up your virtual environment using Poetry.

## 💻 Run Locally

```bash
poetry shell
poetry install
poetry run jupyter notebook
```

Once the notebook is up, update the `FILE_PATH` parameter value. With the correct file path set, click the `Run -> Run all cells` option.

Everything takes about 5 minutes to complete if you have an NVIDIA GPU; otherwise, expect roughly 20-30 minutes.

Jump to the `Comparison` cell and toggle between different dropdown options to compare the results from various approaches.

# 💡 Background - Why do we chunk text?

When building a RAG (Retrieval-Augmented Generation) system, the first step is to create a knowledge base. This involves processing our data, which typically comes in the form of PDFs or books, and storing it in a database for answering user questions later. If we simply ingest the raw text from these documents, we’re left with massive blocks of text.

Here’s the problem: language models have limits. They can’t process unlimited amounts of text at once, for two key reasons: every model has a fixed context window, so oversized text simply won’t fit; and retrieval quality degrades on long passages, because embedding a huge block of text dilutes its meaning and makes it harder to match to a specific question.

## Chunking Methods

### Fixed-size chunking

If you’re new to chunking and don’t know much about different strategies, a simple approach is to chunk text by a fixed character or word length. For example, if you have a document with 1,000 words, you could divide it into 10 chunks of 100 words each.

It’s probably the simplest method.
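As a minimal sketch (plain Python, standard library only; the function name is illustrative, not from the notebook), chunking by a fixed word count might look like this:

```python
def fixed_size_chunks(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# A 1,000-word document becomes 10 chunks of 100 words each.
```

The obvious drawback is that the boundaries are arbitrary: a chunk can end mid-thought, splitting related sentences across chunks.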

### Semantic Chunking

Instead of relying on arbitrary limits, we take an embedding-based approach, similar to how we build our database for document retrieval. Initially, we chunk the text using a naive method, then embed each chunk. The key idea is that we evaluate the embedding distances between chunks. If two chunks have embeddings that are close in distance, we group them together. If not, we leave them as separate chunks.

## The Process of Semantic Chunking

There isn’t a go-to formula for semantic chunking, just as there isn’t a single best chunking strategy; it’s all about experimentation and iteration. The goal of semantic chunking is to make your data more valuable to your language model for your specific tasks.

However, we can start with a simple approach (sketched in code after this list):

1. Split the document into sentences using punctuation (e.g., `.`, `?`, `!`), or use tools like spaCy or NLTK for more nuanced breaks.
2. Calculate distances between sentence embeddings.
3. Group similar sentences together or split sentences that aren’t similar.
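
Here is a minimal from-scratch sketch of those three steps. It assumes the `sentence-transformers` package; both the model choice and the 0.75 similarity threshold are illustrative, not taken from the notebook:

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences whose embeddings are similar."""
    # 1. Naive sentence split on ., ?, ! (spaCy or NLTK give more nuanced breaks).
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    if not sentences:
        return []

    # 2. Embed each sentence; with normalised embeddings, the dot product of
    #    neighbouring vectors is their cosine similarity.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences, normalize_embeddings=True)
    similarities = np.sum(embeddings[:-1] * embeddings[1:], axis=1)

    # 3. Merge a sentence into the current chunk if it is similar enough to
    #    its predecessor; otherwise start a new chunk.
    chunks, current = [], [sentences[0]]
    for sentence, similarity in zip(sentences[1:], similarities):
        if similarity >= threshold:
            current.append(sentence)
        else:
            chunks.append(" ".join(current))
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks
```

Raising the threshold produces more, smaller chunks; lowering it merges more aggressively. Tuning it against your own documents is part of the experimentation described above.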

What you will find in the [notebook](./rag_semantic_chunking/experiment_rag_semantic_chunking.ipynb) is an implementation of semantic chunking from scratch, so that we can better understand how it works. You will also find an implementation of semantic chunking using LangChain, which is probably what you will use in your own project.
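
For reference, the LangChain version (per the how-to guide linked in Further Reading) looks roughly like this; it assumes `langchain_experimental` and `langchain_openai` are installed and an OpenAI API key is configured:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# SemanticChunker embeds sentences and splits where the embedding distance
# between neighbours exceeds a breakpoint (percentile-based by default).
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)

docs = text_splitter.create_documents(["...your document text here..."])
for doc in docs:
    print(doc.page_content)
```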

# Further Reading

- [Lunch & Learn Blog Series - Reranking]()
- [How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)
- [Chunking Strategies for LLM Applications](https://www.pinecone.io/learn/chunking-strategies/)
8 changes: 6 additions & 2 deletions rag_semantic_chunking/experiment_rag_semantic_chunking.ipynb
@@ -4,7 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# RAG Semantic Chunking Experiment\n",
"# L&L Blog Series - RAG Semantic Chunking Experiment\n",
"\n",
"Did the lunch & learn blog series bring you here? If not, you should definitely check it out [here]().\n",
"\n",
"In this experiment, we'll compare the simplest chunking method, fixed-size chunking, with the improvements offered by semantic chunking.\n",
"\n",
@@ -3063,7 +3065,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the mean chunk distance is indeed larger for semantic chunks."
"We can see that the mean chunk distance is indeed larger for semantic chunks.\n",
"\n",
"Meaning that each chunk is now further apart to the neighbouring chunk after semantically chunked.\n"
]
}
],
