Documents Splitting For Chat With Our Data Using LangChain Text Splitters

Hello Learners…

Welcome to the blog…

Table Of Contents

  • Introduction
  • Documents Splitting For Chat With Our Data Using LangChain Text Splitters
  • Summary
  • References

Introduction

In this post, we learn how to split documents for chatting with our own data using LangChain text splitters. After loading the documents from our data, we have to split them into smaller chunks.

I hope you have read our earlier post on loading documents, which is a prerequisite for document splitting.

Documents Splitting For Chat With Our Data Using LangChain Text Splitters

After loading documents into a standard format using LangChain document loaders, we have to split them into smaller chunks.

This may sound easy, but there are many things that we have to take care of that can make a significant impact on the results.

Document splitting happens after the data has been loaded into the document format. How we split the documents into smaller chunks is very important.


Let’s take an example:

Document: The hardware configuration of my computer is 4 GB RAM with 500 GB SSD. The cores in my computer are 12 with 2 GHz. The operating system I use is Linux.

Chunk-1: The hardware configuration of my computer is 4 GB RAM

Chunk-2: with 500 GB SSD. The cores in my computer are 12

Chunk-3: with 2 GHz. The operating system I use is Linux.

Question: What is the hardware configuration of my computer?

If we answer this question using only Chunk-1 or only Chunk-2, we don’t get the complete answer, because the hardware details are spread across several chunks.

So, how we size the chunks matters a lot for our use case, and we have to choose a chunk size that gives the best results for our questions.

At their core, all the text splitters in LangChain split text into chunks of some chunk size, with some overlap between neighbouring chunks.
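To make chunk size and chunk overlap concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation; the helper name sliding_chunks is made up for this illustration, and it simply slides a fixed-size window over the characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window shares chunk_overlap characters with the previous one.
    Hypothetical helper for illustration only, not a LangChain function.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


document = "The hardware configuration of my computer is 4 GB RAM with 500 GB SSD."
chunks = sliding_chunks(document, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

The last 10 characters of each chunk are repeated at the start of the next one, so a fact that sits on a chunk boundary still appears whole in at least one chunk when the overlap is large enough.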

LangChain provides many different types of text splitters:

CharacterTextSplitter()

  • Implementation of splitting text that looks at characters

MarkdownHeaderTextSplitter()

  • Implementation of splitting markdown files based on specified headers

TokenTextSplitter()

  • Implementation of splitting that looks at tokens

SentenceTransformersTokenTextSplitter()

  • Implementation of splitting text that looks at tokens, using the tokenizer of a sentence-transformers model

RecursiveCharacterTextSplitter()

  • Implementation of splitting text that looks at characters.
  • It recursively tries to split by different characters to find one that works.

Language

  • An enum of supported programming and markup languages (CPP, Python, Ruby, Markdown, etc.) that is used to pick language-specific separators.

NLTKTextSplitter()

  • Implementation of splitting text that looks at sentences using NLTK.

SpacyTextSplitter()

  • Implementation of splitting text that looks at sentences using spaCy.
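To give a feel for the header-based idea mentioned in the list above, here is a small pure-Python sketch. It is only an approximation of the concept, not the code of MarkdownHeaderTextSplitter; the function split_by_headers and the sample document are made up for this example.

```python
import re


def split_by_headers(markdown: str) -> list[dict]:
    """Group markdown lines under the most recent '#' or '##' header.

    Hypothetical helper for illustration only.
    """
    sections = []
    current = {"header": None, "content": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,2})\s+(.*)$", line)
        if match:
            # A new header starts a new section; save the previous one first
            if current["header"] is not None or current["content"]:
                sections.append(current)
            current = {"header": match.group(2).strip(), "content": []}
        elif line.strip():
            current["content"].append(line.strip())
    if current["header"] is not None or current["content"]:
        sections.append(current)
    return [{"header": s["header"], "text": " ".join(s["content"])} for s in sections]


markdown_doc = """# Hardware
4 GB RAM with 500 GB SSD.

## CPU
12 cores at 2 GHz.

# Software
The operating system is Linux."""

sections = split_by_headers(markdown_doc)
for section in sections:
    print(section)
```

Each chunk keeps the header it belongs to, so the chunk about the CPU still tells us which part of the document it came from.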

Below, we look at some of these text splitters that we can use for our use cases.

Let’s Implement This Using Python

First, we have to install the required Python libraries:

pip install langchain tiktoken

CharacterTextSplitter()

We use the following paragraph of text as an example:

text="NLP stands for Natural Language Processing. \
 It is a subfield of artificial intelligence (AI) and linguistics that focuses on the interaction \
 between computers and human language. \
 NLP aims to enable computers to understand, interpret, \
 and generate natural language in a way that is meaningful and useful to humans."

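Python code for CharacterTextSplitter():

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

# Maximum number of characters per chunk, and how many characters
# neighbouring chunks are allowed to share
chunk_size = 50
chunk_overlap = 4

# No separator is passed, so the default separator ("\n\n") is used
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
print(c_splitter.split_text(text))

The output of the above code,

['NLP stands for Natural Language Processing.  It is a subfield of artificial intelligence (AI) and linguistics that focuses on the interaction  between computers and human language.  NLP aims to enable computers to understand, interpret,  and generate natural language in a way that is meaningful and useful to humans.']

CharacterTextSplitter splits on a single separator, which defaults to a double newline ("\n\n"). Our text contains no double newline, so it cannot be broken up and the whole paragraph comes back as one chunk, even though it is longer than chunk_size.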

# Split on single spaces instead of the default double newline
c_splitter = CharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    separator=' '
)

print(c_splitter.split_text(text))

Here we set the separator to a space, and we get the output below,

['NLP stands for Natural Language Processing. It is a subfield of artificial intelligence (AI) and linguistics that focuses on the interaction between computers and human language. NLP aims to enable computers to understand, interpret, and generate natural language in a way that is meaningful and', 'useful to humans.']
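The role of the separator can be sketched in a few lines of plain Python. This is a simplified illustration of the "split on one separator, then pack the pieces" idea, not LangChain's actual code: the helper split_and_merge is hypothetical and it ignores chunk overlap.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on separator, then greedily pack pieces into chunks.

    A chunk is at most chunk_size characters, unless a single piece is
    already longer than that (it cannot be broken up any further).
    Hypothetical helper for illustration only; chunk overlap is ignored.
    """
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = ("NLP stands for Natural Language Processing. "
          "It is a subfield of artificial intelligence (AI).")

# The separator never occurs in the text, so we get one oversized chunk
one_chunk = split_and_merge(sample, separator="\n\n", chunk_size=50)
print(len(one_chunk))

# Splitting on spaces gives pieces that can be packed into small chunks
space_chunks = split_and_merge(sample, separator=" ", chunk_size=50)
print(space_chunks)
```

When the separator does not occur in the text, there is only one piece, so nothing can be packed into smaller chunks. With a space as the separator, every word is a piece and the chunks can respect the size limit.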

RecursiveCharacterTextSplitter()

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 50
chunk_overlap = 4

# No separators are passed, so the defaults ["\n\n", "\n", " ", ""] are used
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
print(r_splitter.split_text(text))

The output of the above code,

['NLP stands for Natural Language Processing.  It is', 'is a subfield of artificial intelligence (AI) and', 'and linguistics that focuses on the interaction', 'between computers and human language.  NLP aims', 'to enable computers to understand, interpret,', 'and generate natural language in a way that is', 'is meaningful and useful to humans.']

Now let’s change the chunk size and pass the separators explicitly.

We pass in a list of separators. These are the default separators, but we include them in the code to better show what is going on.

The list contains a double newline, a single newline, a space, and finally an empty string. This means that when we split a piece of text, the splitter first tries to split it on double newlines.

If the individual chunks still need to be split further, it moves on to single newlines, and after that to spaces.

Finally, if it really needs to, it splits character by character.
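The fallback order described above can be sketched in plain Python as follows. This is a simplified illustration under our own assumptions (no chunk overlap, separators are dropped at chunk boundaries), not the code of RecursiveCharacterTextSplitter; the function recursive_split is hypothetical.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator; recurse on pieces that are still too long.

    Hypothetical helper for illustration only; chunk overlap is ignored.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to try, keep the oversized piece
    separator, rest = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut the text into fixed-size character windows
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: try the next, finer separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = ("First paragraph line one.\nFirst paragraph line two.\n\n"
          "Second paragraph is a single long line of words.")
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in result:
    print(repr(chunk))
```

In this sample, the first paragraph is too long as a whole, so it is split on the single newline, while the second paragraph has no newline and falls through to splitting on spaces.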

Example 1:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

print(r_splitter.split_text(text))

Here we pass the separators explicitly, and we get the output below,

['NLP stands for Natural Language Processing.  It is a subfield of artificial intelligence (AI) and linguistics that focuses on the interaction  between computers and human language.  NLP aims to enable computers to understand, interpret,  and generate natural language in a way that is meaningful and', 'useful to humans.']
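
Example 2:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    # r"(?<=\. )" is a regex look-behind: split right after a period followed by a space
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    # Assumption: recent LangChain releases treat separators as regular expressions
    # only when is_separator_regex=True; older releases may not accept this argument
    is_separator_regex=True
)

print(r_splitter.split_text(text))

The output of the above code,

['NLP stands for Natural Language Processing.',
 'It is a subfield of artificial intelligence (AI) and linguistics that focuses on the interaction  between computers and human language.',
 'NLP aims to enable computers to understand, interpret,  and generate natural language in a way that is meaningful and useful to humans.']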

Text splitters can vary in how they split the chunks and in which characters go into each chunk. They can also vary in how they measure the length of a chunk:

  • by characters
  • by tokens
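The difference between the two measures can be sketched by making the length function a parameter. In this illustration, whitespace-separated words stand in for tokens, which is only an assumption for the demo; a real tokenizer (for example tiktoken) usually counts differently. The helper names below are hypothetical.

```python
from typing import Callable


def char_length(text: str) -> int:
    """Measure length in characters."""
    return len(text)


def rough_token_length(text: str) -> int:
    """Assumption: one whitespace-separated word counts as one token."""
    return len(text.split())


def pack_words(text: str, chunk_size: int,
               length_function: Callable[[str], int]) -> list[str]:
    """Pack words into chunks whose measured length stays within chunk_size.

    Hypothetical helper for illustration only.
    """
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = ("NLP aims to enable computers to understand, interpret, "
          "and generate natural language.")

by_chars = pack_words(sample, chunk_size=40, length_function=char_length)
by_tokens = pack_words(sample, chunk_size=5, length_function=rough_token_length)
print(by_chars)
print(by_tokens)
```

The same text and the same packing logic give different chunks, only because chunk_size means "40 characters" in one call and "5 tokens" in the other.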

Another important part of splitting into chunks is the metadata.

We want to keep the same metadata across all chunks of a document and also add new pieces of metadata when relevant, so some text splitters focus on that.
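A minimal sketch of this idea in plain Python is shown below. The Chunk class, the split_with_metadata function, and the metadata keys chunk_index and start_char are all made up for this illustration and are not LangChain names.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus the metadata that describes it (hypothetical class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(text: str, metadata: dict, chunk_size: int) -> list[Chunk]:
    """Cut text into fixed-size chunks that each carry the document metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunk_metadata = dict(metadata)          # copy the shared metadata
        chunk_metadata["chunk_index"] = index    # new, chunk-specific metadata
        chunk_metadata["start_char"] = start
        chunks.append(Chunk(text[start:start + chunk_size], chunk_metadata))
    return chunks


doc_text = "The operating system I use is Linux."
doc_metadata = {"source": "notes.txt"}

doc_chunks = split_with_metadata(doc_text, doc_metadata, chunk_size=20)
for chunk in doc_chunks:
    print(chunk)
```

Every chunk still knows which source it came from, and the extra fields tell us where in the document the chunk starts.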

How we split into chunks often depends on the type of document we are working with, and this is especially visible when we split code rather than plain text.

So, LangChain has a language text splitter with different sets of separators for languages like Python, Ruby, and C. When it splits such documents, it takes the language and its relevant separators into account.
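As a rough illustration of language-aware splitting, the sketch below cuts Python source code at top-level def and class lines. The boundary pattern is our own simplified assumption and not the separator list that LangChain uses; split_python_source is a hypothetical helper.

```python
import re


def split_python_source(source: str) -> list[str]:
    """Split Python source right before each top-level 'def' or 'class' line.

    Hypothetical helper for illustration only.
    """
    # The look-ahead keeps the 'def' / 'class' keyword inside the next piece
    pieces = re.split(r"^(?=class |def )", source, flags=re.MULTILINE)
    return [piece.strip() for piece in pieces if piece.strip()]


source = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

code_chunks = split_python_source(source)
for chunk in code_chunks:
    print(chunk)
    print("-----")
```

The indented method inside the class is not a split point, so the class stays together in one chunk, which a plain character-based split would not guarantee.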

Summary

Also, visit the below URL to learn more about LangChain,

Happy Learning And Keep Learning…

Thank You…

References
