LangChain Text Splitters & Chunking
Key Takeaways
The video explains document splitters and chunkers in LangChain and why they matter for building high-performance RAG systems. It covers several splitting mechanisms: the recursive character text splitter, the character text splitter, and token-based splitters built on tiktoken, spaCy, and Sentence Transformers.
Full Transcript
Welcome back. In this video, we will look at what document splitters and chunkers are, understand how a splitter works, and go through some of the popular built-in document splitters in LangChain. Let's get started.
So, what are document splitters and chunkers? LangChain supports various document splitting and chunking mechanisms for transforming documents. The whole idea of splitting is that you have a large document and you split it into smaller, more meaningful, more manageable chunks, for example paragraphs. This is typically done so that you can fit more relevant chunks into your LLM's context window. Most large language models have a limited context window, so if you try to fit really large documents into it, and a bunch of them, you will get a token-limit-exceeded error. Documents can be split based on various methods such as relevant sections, character counts, token counts, and so on.
Now, how does a splitter work? At a high level, a text splitter works as follows. Say you have the complete document text. You split the text into small, semantically meaningful chunks. Then you combine these small chunks into a larger chunk until you reach a certain size, measured by character count or token count. As we saw in the previous video on document loaders, some of the more complex document loaders can also do the chunking, where we could define, say, that every chunk should be 3,000 characters in length. Once you reach that size, you make that chunk its own piece of text and start creating a new chunk, with or without chunk overlap. We will talk about some of these concepts in the hands-on as well. So how do you split the document into these chunks?
The typical splitting strategy can be based on the number of characters. For example, if your chunk size is 3,000 characters, the splitter will try to fit as many words as possible such that their total length is not more than 3,000 characters. It could also be based on the number of tokens. If you say the chunk size is 100 tokens, it will fit fewer than 100 words, because a token is usually shorter than a word; with OpenAI-style tokenizers a common rule of thumb is about 75 English words per 100 tokens, and the exact ratio depends on the tokenizer. Similarly, you can do semantic or section-based chunking, as we saw with the unstructured data loaders, where you chunk the main sections based on the relevant titles or headings of the document. Chunk size is typically measured in characters or tokens.
Now, what are the popular built-in document splitters in LangChain? You have the recursive character text splitter, which is the most widely used and most popular text splitter. It recursively splits text on a list of well-defined separator characters until the chunks are small enough, trying to keep related pieces of text next to each other, and it is LangChain's recommended way to start splitting your text into more manageable chunks. The character text splitter is like a simpler, specialized version of the recursive character text splitter: it splits text based on just one single user-defined character. tiktoken helps in splitting text based on tokens, using the tokenizers of OpenAI models such as GPT-3.5 and GPT-4, the models behind ChatGPT. spaCy is a popular NLP library, and you can also use the tokenizer from spaCy to split text. Sentence Transformers is another popular library, so you can split text based on tokens using the tokenizers of the open models available from the Sentence Transformers library. And of course, as we have seen in the previous videos, you can use unstructured.io.
So the unstructured library allows for various splitting and chunking strategies including splitting text based on characters, key sections, titles and so on. In the next video, we will see how to apply some of these splitting and chunking strategies using some hands-on tech. Thank you and I'll see you in the next
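The split-then-merge process described in the transcript can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the function name `merge_pieces` and its parameters are hypothetical, the text is assumed to be already split into small pieces (here, words), and size is measured with a pluggable `length_fn` so the same logic works for character counts or token counts.

```python
from typing import Callable, List


def merge_pieces(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int = 0,
    length_fn: Callable[[str], int] = len,  # swap in a token counter if needed
    joiner: str = " ",
) -> List[str]:
    """Greedily pack small pieces into chunks of at most chunk_size units.

    Hypothetical helper for illustration only. A single piece longer than
    chunk_size is emitted as its own oversized chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks: List[str] = []
    current: List[str] = []  # pieces in the chunk being built

    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and length_fn(candidate) > chunk_size:
            # The chunk is full: emit it as its own piece of text.
            chunks.append(joiner.join(current))
            # Seed the next chunk with trailing pieces that fit in the overlap.
            tail: List[str] = []
            for prev in reversed(current):
                if length_fn(joiner.join([prev] + tail)) > chunk_overlap:
                    break
                tail.insert(0, prev)
            # Drop overlap pieces if they would push the new chunk over size.
            while tail and length_fn(joiner.join(tail + [piece])) > chunk_size:
                tail.pop(0)
            current = tail
        current.append(piece)

    if current:
        chunks.append(joiner.join(current))
    return chunks


words = "the quick brown fox jumps over the lazy dog".split()
with_overlap = merge_pieces(words, chunk_size=15, chunk_overlap=5)
no_overlap = merge_pieces(words, chunk_size=15)
print(with_overlap)
print(no_overlap)
```

With overlap enabled, the last word of each chunk is repeated at the start of the next one, which is how overlap preserves context across chunk boundaries. LangChain's own splitters expose comparable `chunk_size`, `chunk_overlap`, and `length_function` parameters, but their internal merging logic may differ from this sketch.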
Original Description
Why can’t you just upload a 100-page PDF to an LLM? In this video, we explore Document Splitting and Chunking, a critical step in building high-performance RAG systems that stay within token limits.
Even if your LLM has a large context window, breaking data into smaller, semantically meaningful "chunks" is essential for accurate retrieval and cost-efficiency. We explain exactly how these splitters work and which strategies you should use for different types of data.
What we cover in this lesson:
The Goal of Chunking: Fitting data into the LLM context window and improving search relevance.
How Text Splitters Work: The process of splitting, combining, and creating overlaps for context retention.
Splitting Strategies: Choosing between character counts, token counts, and semantic/sectional chunking.
LangChain’s Built-in Splitters:
Recursive Character Text Splitter: Why this is the "gold standard" for most use cases.
Character Text Splitter: A simple, specialized approach.
Token-based Splitting: Using TikToken (OpenAI), spaCy, and Sentence Transformers.
Unstructured.io: Advanced chunking based on document titles and headers.
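To make the recursive idea behind the Recursive Character Text Splitter listed above concrete, here is a simplified sketch in plain Python. It is an assumption-laden illustration rather than LangChain's code: the function name `recursive_split` is hypothetical, size is measured in characters only, and chunk overlap is left out. It tries the coarsest separator first (paragraph breaks), and only falls back to finer separators (line breaks, then spaces) for parts that are still too large.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split on the coarsest separator first; recurse on oversized parts.

    Hypothetical helper for illustration only (no overlap handling).
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size chars.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep related text together in one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Intro line.\n\n"
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the limit."
)
result = recursive_split(doc, chunk_size=30)
print(result)
```

Short paragraphs survive intact, and only the long paragraph is broken at word boundaries. LangChain's `RecursiveCharacterTextSplitter` follows the same general pattern with a configurable `separators` list, and additionally supports chunk overlap and custom length functions.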
Mastering chunking is the difference between an AI that gives generic answers and one that finds the exact needle in the haystack.
#RAG #LangChain #TextChunking #GenerativeAI #NLP #Python #OpenAI #VectorSearch #MachineLearning #AIEngineering