LangChain Text Splitters & Chunking

Analytics Vidhya · Beginner · 🧠 Large Language Models · 3mo ago

Skills: RAG Basics (90%)

Key Takeaways

The video discusses document splitters and chunkers in LangChain, highlighting their importance in building high-performance RAG systems, and explores various splitting mechanisms, including the recursive character text splitter, the character text splitter, and token-based splitters using libraries like tiktoken, spaCy, and Sentence Transformers.

Full Transcript

Welcome back. In this video, we will understand what document splitters and chunkers are. We'll see how a splitter works and look at some of the popular built-in document splitters in LangChain. Let's get started.

So, what are document splitters and chunkers? LangChain supports various document splitting and chunking mechanisms for transforming documents. The whole idea of splitting documents is that you have a large document and you split it into smaller, more meaningful, more manageable chunks, such as paragraphs. This is typically done so that you can fit more relevant chunks into your LLM's context window. Your large language model has a limited context window, which is true in most cases, so if you try to fit really large documents, and a bunch of them, you will get a token-limit-exceeded error. Documents can be split based on various methods like relevant sections, character counts, token counts and so on.

Now, how does a splitter work? At a high level, the text splitter works as follows. Let's say you have the complete document text. You split the text into small, semantically meaningful chunks. Then you combine these small chunks into a larger chunk until you reach a certain size, based on the character count or the token count. As we saw in the previous video on document loaders, with some of the more complex document loaders we could also do the chunking there, where we could define that every chunk should be 3,000 characters in length. Once you reach that size, you make that chunk its own piece of text and then start creating a new chunk of text, with or without chunk overlap. We will talk about some of these concepts when we get into the hands-on also.

So how do you split the document into these chunks?
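
To make the split-then-combine process described above concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the function name `merge_pieces` and its parameters are hypothetical, and it assumes the text has already been split into small pieces (here, words) and that size is measured in characters.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap=0, joiner=" "):
    """Greedily pack small pieces into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. A single piece that is longer
    than chunk_size is emitted as its own oversized chunk.
    """
    chunks = []
    current = []  # pieces that make up the chunk being built

    def length(parts):
        return len(joiner.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            # The chunk is full: make it its own piece of text.
            chunks.append(joiner.join(current))
            # Keep trailing pieces (up to chunk_overlap characters) as the
            # start of the next chunk, dropping more if the new piece
            # still would not fit.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(joiner.join(current))
    return chunks


words = "one two three four five six seven eight".split()
print(merge_pieces(words, chunk_size=13))
# ['one two three', 'four five six', 'seven eight']
print(merge_pieces(words, chunk_size=13, chunk_overlap=5))
```

With `chunk_overlap=0` each chunk starts where the previous one ended. With a non-zero overlap, the last few pieces of one chunk are repeated at the start of the next, which is the "with or without chunk overlap" choice mentioned above.
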
The typical splitting strategy can be based on the number of characters. Let's say your chunk size is 3,000 characters: it will try to fit as many words as possible such that the total length in characters of all those words is not more than 3,000. It could also be based on the number of tokens. Let's say the chunk size is 100 tokens: it will fit at most around 100 words, and often fewer, because the exact number depends on the tokenizer. Similarly, you can do semantic or section-based chunking, as we saw with the unstructured data loaders, where you can chunk the main sections based on the relevant titles or headings of the documents. Chunk size is typically measured in character or token counts.

Now, what are the popular built-in document splitters in LangChain?

You have the recursive character text splitter, which is the most widely used and most popular text splitter. It recursively splits text on a list of well-defined characters, trying the separators that give larger pieces first, and it tries to keep related pieces of text next to each other. It is LangChain's recommended way to start splitting your text into more manageable chunks.

Character text splitter is like a specialized version of the recursive character text splitter. It splits text based on just one single user-defined character. It's one of the simpler text splitters.

tiktoken helps in splitting text based on tokens, using the tokenizers of pre-trained or fine-tuned LLMs such as ChatGPT based on GPT-3.5, ChatGPT based on GPT-4, and so on.

spaCy is a popular NLP library, and you can also use spaCy to split text using the tokenizer from that library.

Sentence Transformers is another popular library. You can split text based on tokens using the tokenizers of trained open language models available from the Sentence Transformers library. And of course, as we have seen in the previous videos, you can also use unstructured.io.
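
The recursive idea can be illustrated with a short plain-Python sketch. This is an assumption-laden illustration, not LangChain's code: `recursive_split` and its default separator list are hypothetical names chosen here, and the sketch only does the splitting half (it does not merge small pieces back up to the chunk size or add overlap).

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters.

    Coarse separators (paragraph breaks) are tried first; a finer separator
    is used only for pieces that are still too long. Hypothetical sketch.
    """
    if len(text) <= chunk_size:
        # Small enough: keep it whole (skip empty or whitespace-only pieces).
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        # Parts that fit are returned unchanged by the recursive call;
        # parts that do not fit are split again with the next separator.
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


doc = (
    "Short paragraph.\n\n"
    "This second paragraph is longer than the limit.\n"
    "Second line."
)
for piece in recursive_split(doc, chunk_size=30):
    print(repr(piece))
```

In this example the first paragraph and the last line fit within 30 characters and stay intact, while the long middle line falls back to word-level pieces. In LangChain, the class that corresponds to this strategy is `RecursiveCharacterTextSplitter`, which additionally merges small pieces back together up to `chunk_size` and supports `chunk_overlap`.
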
The unstructured library allows for various splitting and chunking strategies, including splitting text based on characters, key sections, titles and so on. In the next video, we will see how to apply some of these splitting and chunking strategies using some hands-on tech. Thank you and I'll see you in the next
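
Title-based chunking, as mentioned for the unstructured library, can be pictured with the following minimal sketch. It does not use the unstructured API: the function `chunk_by_headings` and the `(kind, text)` element format are hypothetical, and the sketch assumes the document has already been parsed into elements labelled as either `"title"` or `"text"`.

```python
def chunk_by_headings(elements):
    """Group (kind, text) elements into sections, one section per title.

    Hypothetical helper for illustration. Text that appears before the
    first title goes into a section whose title is None.
    """
    sections = []
    current = None
    for kind, text in elements:
        if kind == "title" or current is None:
            # A title (or the very first element) opens a new section.
            current = {"title": text if kind == "title" else None, "body": []}
            sections.append(current)
            if kind == "title":
                continue
        current["body"].append(text)
    return sections


elements = [
    ("text", "Preamble."),
    ("title", "Introduction"),
    ("text", "First paragraph."),
    ("text", "Second paragraph."),
    ("title", "Methods"),
    ("text", "Third paragraph."),
]
for section in chunk_by_headings(elements):
    print(section["title"], "->", " ".join(section["body"]))
```

Each resulting section keeps the text that belongs under one heading together, so a chunk never mixes content from two different sections. A real pipeline would usually also cap the section length, for example by splitting oversized sections by character or token count.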

Original Description

Why can’t you just upload a 100-page PDF to an LLM? In this video, we explore the critical step of Document Splitting and Chunking, the secret to building high-performance RAG systems that never hit token limits. Even if your LLM has a large context window, breaking data into smaller, semantically meaningful "chunks" is essential for accurate retrieval and cost-efficiency. We explain exactly how these splitters work and which strategies you should use for different types of data.

What we cover in this lesson:

The Goal of Chunking: Fitting data into the LLM context window and improving search relevance.
How Text Splitters Work: The process of splitting, combining, and creating overlaps for context retention.
Splitting Strategies: Choosing between character counts, token counts, and semantic/sectional chunking.
LangChain’s Built-in Splitters:
  Recursive Character Text Splitter: Why this is the "gold standard" for most use cases.
  Character Text Splitter: A simple, specialized approach.
  Token-based Splitting: Using tiktoken (OpenAI), spaCy, and Sentence Transformers.
  Unstructured.io: Advanced chunking based on document titles and headers.

Mastering chunking is the difference between an AI that gives generic answers and one that finds the exact needle in the haystack.

#RAG #LangChain #TextChunking #GenerativeAI #NLP #Python #OpenAI #VectorSearch #MachineLearning #AIEngineering

The video teaches the importance of document splitters and chunkers in LangChain for building high-performance RAG systems, and covers various splitting mechanisms and libraries. It provides a foundation for understanding how to split documents into smaller, more manageable chunks to avoid token limit errors.

Key Takeaways

Understand the concept of document splitters and chunkers
Learn how to split documents into smaller chunks using various methods
Apply chunking strategies to fit LLM context windows
Use built-in document splitters in LangChain
Explore libraries like tiktoken, spaCy, and Sentence Transformers for token-based splitting

💡 Document splitters and chunkers are crucial for building high-performance RAG systems, as they enable the splitting of large documents into smaller, more manageable chunks that can fit within LLM context windows.
