Documents Loading For Chat With Our Data Using LangChain Document Loaders

Hello Learners…

Welcome to the blog…

Table Of Contents

  • Introduction
  • Documents Loading For Chat With Our Data Using LangChain Document Loaders
  • Summary

Introduction

In this post, we learn about loading documents for chatting with our data using LangChain Document Loaders. In particular, when we are working with different types of documents, how can we use LangChain to load them?

LangChain Document Loaders are responsible for loading documents into the LangChain system. They handle various types of documents, including PDFs, and convert them into a format that can be processed by the LangChain system.

Documents Loading For Chat With Our Data Using LangChain Document Loaders

To create an application where we can chat with our data, we first have to load the data into a format that LLMs can work with. This is where LangChain document loaders come into play.

LangChain has 80+ different types of document loaders. Document loaders deal with the specifics of accessing and converting data from a variety of different formats and sources into a standardized format.

There are many different places we may want to load data from, such as websites, databases, and YouTube, and these documents can come in many different data types, such as PDF, HTML, or JSON.

The whole purpose of document loaders is to take this variety of data sources and load them into a standard Document object, which consists of content and associated metadata.

There are a lot of different types of document loaders in LangChain. Many of them deal with loading unstructured data from public data sources such as YouTube, Twitter, and Hacker News, and there are even more that deal with loading structured data from proprietary data sources that we or our company might have, such as Figma or Notion.
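The standard document object can be sketched in plain Python like this. This is only an illustrative stand-in, not LangChain's actual class (which lives in the `langchain` package); it just shows the shape every loader produces: text content plus a metadata dictionary.

```python
from dataclasses import dataclass, field

# A minimal stand-in for LangChain's Document object:
# every loader returns items with page_content and metadata.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Illustrative example: one page of a PDF, with the loader
# recording where the content came from.
doc = Document(
    page_content="Natural language processing with Python.",
    metadata={"source": "example.pdf", "page": 0},
)
print(doc.metadata["source"])
```

Whatever the original source was, downstream components only ever see this one shape, which is what makes the 80+ loaders interchangeable.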

Some of the most commonly used Document Loaders include,

  • PDFs
  • YouTube
  • URLs

PDFs

The first type of document that we are going to work with is a PDF.

Installation of Required Libraries

pip install langchain pypdf

Let’s import the relevant document loader from LangChain. We are going to use the PyPDFLoader.

from langchain.document_loaders import PyPDFLoader

Now load the pdf file using the file path.

loader = PyPDFLoader("./Natural Language Processing with Python-11-20.pdf")

pages = loader.load()

print(len(pages))

#output
10 #number of pages in pdf file

Each page is a document. A Document contains text (page_content) and metadata.

Here we can load the documents by just calling the load method.

So let’s try to understand what exactly we have loaded.

By default, it loads a list of documents; in this case, there are 10 different pages in the PDF. Every page is a separate document with its own metadata.

first_page = pages[0]

Let’s take the first one and see what it consists of. It has some page content, which is the actual content of the page.

This can be a bit long so let’s just print out the first 500 characters.

print(first_page.page_content[:500])

The other piece of information that is really important is the metadata associated with each document.

first_page.metadata

We can see that there are two different pieces: one is the source information, which is the path of the PDF file that we passed in, and the second is the page of the PDF that the document was loaded from.
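Concretely, the metadata for the first page looks roughly like this. The field names come from PyPDFLoader; the values below are just from this example.

```python
# Illustrative metadata for the first page, as returned by PyPDFLoader:
# "source" is the path we passed in, "page" is the zero-based page index.
first_page_metadata = {
    "source": "./Natural Language Processing with Python-11-20.pdf",
    "page": 0,
}
print(first_page_metadata)
```

This metadata becomes important later when we chat with our data, because it lets us cite which file and page an answer came from.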

YouTube

When we don’t want to watch a full YouTube video and instead want to directly get answers to our questions, we can use the LangChain YouTube loader.

It is very useful for those who don’t have time to watch full YouTube videos but still want to ask related questions and get answers.

First, we have to install the required libraries (note that yt_dlp also needs the ffmpeg tool installed on our system to extract the audio),

pip install yt_dlp pydub

The first key part is the YoutubeAudioLoader, which loads an audio file from a YouTube video.

The other key part is the OpenAIWhisperParser. This uses OpenAI’s Whisper model, a speech-to-text model, to convert the YouTube audio into a text format that we can work with.

We can now specify a URL, specify a directory in which to save the audio file, and then create the generic loader as a combination of the YoutubeAudioLoader and the OpenAIWhisperParser.

Then we can call loader.load() to load the documents corresponding to this YouTube video.

Import Required Libraries

from langchain.document_loaders.generic import GenericLoader

from langchain.document_loaders.parsers import OpenAIWhisperParser

from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

Note: This code takes several minutes to complete.

from dotenv import load_dotenv, find_dotenv
import openai
import os

_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

URL = "https://www.youtube.com/shorts/LZ0Z8PE7dWo"
save_dir = "./audio/"

loader = GenericLoader(YoutubeAudioLoader([URL], save_dir), OpenAIWhisperParser())

Now we load the YouTube data,

docs=loader.load()
print(docs)

Print the first 100 characters:

docs[0].page_content[:100]
#Output 

'Hello learners. Welcome to the Galaxy of AI. Galaxy of AI is a blog about artificial intelligence, m'


Summary

In this post, we saw how LangChain document loaders convert data from many different sources into a standard Document object consisting of content and metadata. We walked through loading a PDF with PyPDFLoader and transcribing a YouTube video with the YoutubeAudioLoader combined with the OpenAIWhisperParser.

