
LangChain model to extract key information from a PDF

I was looking for a solution to extract key information from a PDF based on my instructions.

Here's what I've done:

  1. Extract the PDF text using OCR (a sketch of this step follows the list)
  2. Use LangChain's CharacterTextSplitter to split the text into chunks
  3. Use LangChain with FAISS and OpenAIEmbeddings to extract information based on the instruction
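For reference, here is a minimal sketch of what the OCR step can look like with pdf2image and pytesseract. The actual extraction happens inside utils.extract_text_from_pdf_using_ocr, which isn't shown, so this is an assumed reconstruction rather than the original helper:

from pdf2image import convert_from_path
import pytesseract

def extract_text_from_pdf_using_ocr(pdf_path):
    # Render each PDF page to an image, then run Tesseract OCR on it.
    # Assumes poppler (needed by pdf2image) and the tesseract binary are installed.
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)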

The problems that I faced are:

  1. Sometimes the first several items in the doc are skipped.
  2. It only returns a few items instead of all of them. Say there are 1,000 items: because of ChatGPT's limit on response length, I first split the output into batches of 20 products. How can I continue to grab the rest of the products so that I can combine them later?
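On problem 1, a likely cause (an assumption, since it depends on your index): similarity_search only returns the top-k most similar chunks, and k defaults to 4 in LangChain's FAISS wrapper, so the chunks holding the first items may simply never be handed to the model. A minimal sketch of asking for more chunks:

# Retrieve more chunks so early pages aren't dropped by the top-k cutoff.
# k=20 is an arbitrary example value; for exhaustive extraction you may
# need k equal to the total number of chunks (len(chunks)).
docs = document_search.similarity_search(query, k=20)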

I am using gpt-3.5-turbo for now.

The goals are:

  1. It should return all goods listed in the PDF document, and the goods could be in the thousands (remember there's a GPT token limit on the response). See the paging sketch after this list.
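One way to work around the response-length limit is to page through the list: ask for a fixed window of items per call and merge the results. This is only a sketch, not code from the post; PAGED_INSTRUCTION, the window size, and the stopping condition are all assumptions, and it reuses the post's start_ai_processing function shown further down:

# Hypothetical paged instruction: asks only for items {start}..{end}.
PAGED_INSTRUCTION = (
    "Follow the goods-extraction rules, but return only goods items {start} "
    "through {end} in document order as JSON: {{\"goods\": [...]}}. "
    "Return {{\"goods\": []}} if there are no items in that range."
)

def extract_all_goods(document_search, page_size=20, max_items=5000):
    all_goods = []
    for start in range(1, max_items + 1, page_size):
        query = PAGED_INSTRUCTION.format(start=start, end=start + page_size - 1)
        batch = start_ai_processing(document_search, query)  # defined in the code below
        goods = batch.get("goods", [])
        if not goods:
            break  # the model reported no items left in this range
        all_goods.extend(goods)
    return {"goods": all_goods}

Caveat: LLMs are not reliable at counting items across calls, so windows can overlap or skip entries; extracting per chunk with a map_reduce chain (sketched after the code below) is usually sturdier.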

My question is:

  1. What is the best LangChain model or method to achieve my goals?

(I'm quite new to this LangChain world.)

This is the code:

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import faiss
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import (
    StuffDocumentsChain, LLMChain, ConversationalRetrievalChain
)
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os
import sentry_sdk
from flask_cors import CORS, cross_origin
from instructions import (GOODS_INSTRUCTION_V2,
                          GOODS_INSTRUCTION_WITH_LIMIT_20_V2)
from flask import Flask, request, jsonify, abort
import json
import concurrent.futures
from pdf2image import convert_from_path
import pytesseract
import requests
from datetime import datetime
from urllib.parse import unquote
from pathlib import Path
from typing_extensions import Concatenate
from utils import Utils

app = Flask(__name__)
CORS(app)
utils = Utils()

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo",
                 openai_api_key="abc123", max_tokens=2000)


@app.route('/upload', methods=['POST'])
def upload_file():
    start_time = datetime.now()
    if 'file' not in request.files:
        return "No file part"
    file = request.files['file']
    if file.filename == '':
        return "No selected file"

    # Save the uploaded file and get its filename
    filename = utils.save_uploaded_file(file)
    embeddings = OpenAIEmbeddings()

    # Construct the expected text file path (cached OCR output)
    expected_text_file = os.path.splitext(filename)[0] + ".txt"
    expected_text_file = expected_text_file.replace("uploads/", "output/")
    print("Expected text file:", expected_text_file)

    file_size = 0
    if os.path.exists(expected_text_file):
        # Use a name other than `file` so the uploaded file isn't shadowed
        with open(expected_text_file, 'r', encoding='utf-8') as text_file:
            extracted_text = text_file.read()
        print("Text loaded from existing file")
    else:
        # Check file size
        file_size = os.path.getsize(filename)
        # Extract text from the PDF or use OCR if needed
        extracted_text = utils.extract_text_from_pdf_using_ocr(filename)

    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=800,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = text_splitter.split_text(extracted_text)
    print("Chunks Length:", len(chunks))

    document_search = faiss.FAISS.from_texts(chunks, embeddings)

    ai_response = start_ai_processing(
        document_search, GOODS_INSTRUCTION_WITH_LIMIT_20_V2, 'gpt-3.5-turbo')

    end_time = datetime.now()
    utils.send_slack_message(
        filename, file_size, start_time, end_time, ai_response, 'gpt-3.5-turbo')
    return jsonify({'data': ai_response})


def start_ai_processing(document_search, instruction, gpt_model='gpt-3.5-turbo'):
    # "stuff" packs all retrieved chunks into a single prompt
    chain = load_qa_chain(llm, chain_type="stuff")
    query = instruction
    docs = document_search.similarity_search(query)
    result = chain.run(input_documents=docs, question=query)
    # Strip the markdown fence the model sometimes wraps around the JSON
    result = result.replace("```json\n", "").replace("\n```", "").replace("\n", "")
    print("Result:", result)
    parsed_response = json.loads(result)
    return parsed_response


if __name__ == '__main__':
    app.run(debug=True, port=6000, threaded=False)
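A note on start_ai_processing: because the goal is to extract every item rather than answer a narrow question, letting similarity search pick a handful of chunks is fragile. One alternative sketch, using load_qa_chain's map_reduce mode over all chunks (more LLM calls and tokens, but no retrieval cutoff; the helper name is made up):

from langchain_core.documents import Document

def extract_from_all_chunks(chunks, instruction):
    # map_reduce applies the instruction to each chunk separately, then
    # combines the partial answers, so no chunk is skipped by top-k retrieval.
    chain = load_qa_chain(llm, chain_type="map_reduce")
    docs = [Document(page_content=chunk) for chunk in chunks]
    return chain.run(input_documents=docs, question=instruction)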

This is the instruction:

GOODS_INSTRUCTION_WITH_LIMIT_20_V2 = ("""
Task: Extract Goods Information from Shipping Invoice and Format as JSON

Objective: Analyze a shipping invoice and exclusively extract the list of goods, presenting the details in a structured JSON format with camelCase key names. Maintain the specific document order.

Details to Extract for Each Good:
Product Code, HS Code / Item Code, Product Description, Quantity, Unit Price / Net, Total Price / Extension, Nett Weight, Gross Weight, Total Volume

Extraction and Formatting Instructions:
1. Sequentially retrieve data from the first page to the end.
2. List items exactly as they appear without combining them.
3. For quantity, you can get it from total price / unit price.
4. Return data in a well-structured JSON format. Any JSON formatting error will result in task failure.
5. Maximum 20 items in the goods list; if there are more than 20 items, just return 20 items and stop processing the rest.

The structure of the JSON should be like this:
{"goods": [
    {
        "productCode": "",
        "hsCode": "",
        "productDescription": "",
        "quantity": 0.0,
        "unitPrice": 0.0,
        "totalPrice": 0.0,
        "nettWeight": 0.0,
        "grossWeight": 0.0,
        "totalVolume": 0.0
    }
]}
""")
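Since rule 4 makes valid JSON mandatory, it can also help to validate the parsed result before using it downstream. A small sketch (the field list is copied from the instruction above; the helper name is made up):

REQUIRED_FIELDS = {"productCode", "hsCode", "productDescription", "quantity",
                   "unitPrice", "totalPrice", "nettWeight", "grossWeight",
                   "totalVolume"}

def validate_goods(parsed):
    # Ensure every extracted item carries the keys the instruction asks for.
    for item in parsed.get("goods", []):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"goods item missing fields: {sorted(missing)}")
    return parsed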

Any help will be appreciated! Thanks.

