LangChain Text Splitters & Chunking

Analytics Vidhya · Beginner · 🧠 Large Language Models · 3mo ago

Key Takeaways

The video discusses document splitters and chunkers in LangChain and why they matter for building high-performance RAG systems. It surveys several splitting mechanisms: the recursive character text splitter, the character text splitter, and token-based splitters built on libraries such as tiktoken, spaCy, and Sentence Transformers.

Full Transcript

Welcome back. In this video, we will look at document splitters and chunkers: what they are, how a splitter works, and some of the popular built-in document splitters in LangChain. Let's get started.
So, what are document splitters and chunkers? LangChain supports various document splitting and chunking mechanisms for transforming documents. The whole idea is that you have a large document and you split it into smaller, more meaningful and more manageable chunks, for example chunks of paragraphs. This is typically done so that you can fit more relevant chunks into your LLM's context window. Most large language models have a limited context window, so if you try to fit really large documents, and a bunch of them, you will get a token-limit-exceeded error. Documents can be split based on various methods, like relevant sections, character counts, token counts and so on.
Now, how does a splitter work? At a high level, the text splitter works as follows. Let's say you have the complete document text. You split the text into small, semantically meaningful chunks, and then you combine these small chunks into a larger chunk until you reach a certain size, based on the character count or the token count. As we saw in the previous video on document loaders, some of the more complex document loaders could also do the chunking, where we could say, okay, let every chunk be 3,000 characters in length. Once you reach that size, you make that chunk its own piece of text and then start creating a new chunk of text, with or without chunk overlap. We will talk about some of these concepts when we get to the hands-on as well.
So how do you split the document into these chunks? 
The typical splitting strategy can be based on the number of characters. Let's say your chunk size is 3,000 characters: it will try to fit as many words as possible such that the total length in characters of all those words is not more than 3,000. It could also be based on the number of tokens or words. Let's say the chunk size is 100 tokens, which means it will roughly try to fit anywhere between 90 and 100 words. Similarly, you can do semantic or section-based chunking, as we saw with the unstructured data loaders, where you can chunk the main sections based on the relevant titles or headings of the documents. And chunk size is typically measured in character or token counts.
Now, what are the popular built-in document splitters in LangChain? You have the recursive character text splitter, which is the most widely used and most popular text splitter. It recursively splits text into larger chunks based on well-defined characters, and it tries to keep related pieces of text next to each other. It is LangChain's recommended way to start splitting your text into more manageable chunks. The character text splitter is like a specialized version of the recursive character text splitter. It splits text based on just one single user-defined character. It's one of the simpler text splitters. tiktoken helps in splitting text based on tokens, using the tokenizers of pre-trained or fine-tuned LLMs, such as ChatGPT based on GPT-3.5, ChatGPT based on GPT-4 and so on. spaCy is a popular NLP library, and you can also use spaCy to split text using its tokenizer. Sentence Transformers is another popular library, so you can split text based on tokens using the tokenizers of trained open large language models available from the Sentence Transformers library. And of course, as we have seen in the previous videos, you can also use unstructured.io. 
So the unstructured library allows for various splitting and chunking strategies including splitting text based on characters, key sections, titles and so on. In the next video, we will see how to apply some of these splitting and chunking strategies using some hands-on tech. Thank you and I'll see you in the next
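
The split-then-merge process described in the transcript can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: the function name `split_into_chunks` and its parameters are hypothetical, sizes are measured in characters, and splitting on a single space is chosen only for simplicity.

```python
def split_into_chunks(text, chunk_size=80, chunk_overlap=20, separator=" "):
    """Greedy split-then-merge chunking, measured in characters.

    Hypothetical helper for illustration only. A single piece longer
    than chunk_size is emitted as its own oversized chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Step 1: split the text into small pieces.
    pieces = [p for p in text.split(separator) if p]

    chunks = []
    current = []  # pieces of the chunk currently being built
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Step 2: the chunk is full, so emit it as its own piece of text.
            chunks.append(separator.join(current))

            # Step 3: carry trailing pieces over as overlap for the next chunk.
            overlap = []
            for prev in reversed(current):
                if len(separator.join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Drop overlap pieces if they would push the next chunk over the limit.
            while overlap and len(separator.join(overlap + [piece])) > chunk_size:
                overlap.pop(0)
            current = overlap
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    for chunk in split_into_chunks(sample, chunk_size=15, chunk_overlap=5):
        print(repr(chunk))
```

Setting `chunk_overlap=0` gives chunks with no shared text; a larger overlap repeats the tail of one chunk at the head of the next, which is the "with or without chunk overlap" choice mentioned in the transcript.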

Original Description

Why can’t you just upload a 100-page PDF to an LLM? In this video, we explore the critical step of Document Splitting and Chunking, the secret to building high-performance RAG systems that never hit token limits. Even if your LLM has a large context window, breaking data into smaller, semantically meaningful "chunks" is essential for accurate retrieval and cost-efficiency. We explain exactly how these splitters work and which strategies you should use for different types of data.
What we cover in this lesson:
  - The Goal of Chunking: Fitting data into the LLM context window and improving search relevance.
  - How Text Splitters Work: The process of splitting, combining, and creating overlaps for context retention.
  - Splitting Strategies: Choosing between character counts, token counts, and semantic/sectional chunking.
  - LangChain’s Built-in Splitters:
      - Recursive Character Text Splitter: Why this is the "gold standard" for most use cases.
      - Character Text Splitter: A simple, specialized approach.
      - Token-based Splitting: Using tiktoken (OpenAI), spaCy, and Sentence Transformers.
      - Unstructured.io: Advanced chunking based on document titles and headers.
Mastering chunking is the difference between an AI that gives generic answers and one that finds the exact needle in the haystack.
#RAG #LangChain #TextChunking #GenerativeAI #NLP #Python #OpenAI #VectorSearch #MachineLearning #AIEngineering

The video teaches the importance of document splitters and chunkers in LangChain for building high-performance RAG systems, and covers various splitting mechanisms and libraries. It provides a foundation for understanding how to split documents into smaller, more manageable chunks to avoid token limit errors.

Key Takeaways
  1. Understand the concept of document splitters and chunkers
  2. Learn how to split documents into smaller chunks using various methods
  3. Apply chunking strategies to fit LLM context windows
  4. Use built-in document splitters in LangChain
  5. Explore libraries like tiktoken, spaCy, and Sentence Transformers for token-based splitting
💡 Document splitters and chunkers are crucial for building high-performance RAG systems, as they enable the splitting of large documents into smaller, more manageable chunks that can fit within LLM context windows.
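
As a rough illustration of the idea behind a recursive character splitter (try coarse separators first, and fall back to finer ones only for pieces that are still too large), here is a simplified sketch. It is a toy under stated assumptions, not LangChain's code: the name `recursive_split`, the default separator list, and the character-based size check are all hypothetical choices, and chunk overlap is left out to keep it short.

```python
def recursive_split(text, chunk_size=40, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones
    only for pieces that still exceed chunk_size.

    Hypothetical helper for illustration only (no chunk overlap).
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]

    chunks = []
    current = ""  # chunk currently being merged from small pieces
    for piece in pieces:
        if len(piece) > chunk_size:
            # Too big even on its own: flush, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep related pieces together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = (
        "Intro paragraph.\n\n"
        "Second paragraph is quite a bit longer than the first one.\n\n"
        "End."
    )
    for chunk in recursive_split(doc, chunk_size=30):
        print(repr(chunk))
```

In this sketch, short paragraphs stay intact because the paragraph separator is tried first, and only the long paragraph is broken further at word boundaries.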
