
LangChain model to extract key information from a PDF

I was looking for a solution to extract key information from a PDF based on my instructions.

Here's what I've done:

  1. Extract the PDF text using OCR (a sketch of this step follows the list)
  2. Use LangChain's CharacterTextSplitter to split the text into chunks
  3. Use LangChain with FAISS and OpenAIEmbeddings to extract information based on the instruction
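For reference, here is a minimal sketch of what the OCR step can look like with pdf2image and pytesseract. The actual extraction happens inside utils.extract_text_from_pdf_using_ocr, which isn't shown, so this is an assumed reconstruction rather than the original helper:

from pdf2image import convert_from_path
import pytesseract

def extract_text_from_pdf_using_ocr(pdf_path):
    # Render each PDF page to an image, then run Tesseract OCR on it.
    # Assumes poppler (needed by pdf2image) and the tesseract binary are installed.
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)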

The problems that I faced are:

  1. Sometimes the first several items in the doc are skipped.
  2. It only returns a few items instead of all of them. Say there are 1,000 items: because of ChatGPT's limit on response length, I first split the output into batches of 20 products. How can I continue to grab the rest of the products so that I can combine them later?
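On problem 1, a likely cause (an assumption, since it depends on your index): similarity_search only returns the top-k most similar chunks, and k defaults to 4 in LangChain's FAISS wrapper, so the chunks holding the first items may simply never be handed to the model. A minimal sketch of asking for more chunks:

# Retrieve more chunks so early pages aren't dropped by the top-k cutoff.
# k=20 is an arbitrary example value; for exhaustive extraction you may
# need k equal to the total number of chunks (len(chunks)).
docs = document_search.similarity_search(query, k=20)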

I am using gpt-3.5-turbo for now.

The goals are:

  1. It should return all goods listed in the PDF document, and the goods could be in the thousands (remember there's a GPT token limit on the response). See the paging sketch after this list.
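One way to work around the response-length limit is to page through the list: ask for a fixed window of items per call and merge the results. This is only a sketch, not code from the post; PAGED_INSTRUCTION, the window size, and the stopping condition are all assumptions, and it reuses the post's start_ai_processing function shown further down:

# Hypothetical paged instruction: asks only for items {start}..{end}.
PAGED_INSTRUCTION = (
    "Follow the goods-extraction rules, but return only goods items {start} "
    "through {end} in document order as JSON: {{\"goods\": [...]}}. "
    "Return {{\"goods\": []}} if there are no items in that range."
)

def extract_all_goods(document_search, page_size=20, max_items=5000):
    all_goods = []
    for start in range(1, max_items + 1, page_size):
        query = PAGED_INSTRUCTION.format(start=start, end=start + page_size - 1)
        batch = start_ai_processing(document_search, query)  # defined in the code below
        goods = batch.get("goods", [])
        if not goods:
            break  # the model reported no items left in this range
        all_goods.extend(goods)
    return {"goods": all_goods}

Caveat: LLMs are not reliable at counting items across calls, so windows can overlap or skip entries; extracting per chunk with a map_reduce chain (sketched after the code below) is usually sturdier.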

My question is:

  1. What is the best LangChain model or method to achieve my goals?

(I'm quite new to this LangChain world.)

This is the code:

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import faiss
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import (
    StuffDocumentsChain, LLMChain, ConversationalRetrievalChain
)
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os
import sentry_sdk
from flask_cors import CORS, cross_origin
from instructions import (GOODS_INSTRUCTION_V2,
                          GOODS_INSTRUCTION_WITH_LIMIT_20_V2)
from flask import Flask, request, jsonify, abort
import json
import concurrent.futures
from pdf2image import convert_from_path
import pytesseract
import requests
from datetime import datetime
from urllib.parse import unquote
from pathlib import Path
from typing_extensions import Concatenate
from utils import Utils

app = Flask(__name__)
CORS(app)
utils = Utils()

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo",
                 openai_api_key="abc123", max_tokens=2000)


@app.route('/upload', methods=['POST'])
def upload_file():
    start_time = datetime.now()
    if 'file' not in request.files:
        return "No file part"
    file = request.files['file']
    if file.filename == '':
        return "No selected file"

    # Save the uploaded file and get its filename
    filename = utils.save_uploaded_file(file)
    embeddings = OpenAIEmbeddings()

    # Construct the expected text file path (cached OCR output)
    expected_text_file = os.path.splitext(filename)[0] + ".txt"
    expected_text_file = expected_text_file.replace("uploads/", "output/")
    print("Expected text file:", expected_text_file)

    file_size = 0
    if os.path.exists(expected_text_file):
        # Use a name other than `file` so the uploaded file isn't shadowed
        with open(expected_text_file, 'r', encoding='utf-8') as text_file:
            extracted_text = text_file.read()
        print("Text loaded from existing file")
    else:
        # Check file size
        file_size = os.path.getsize(filename)
        # Extract text from the PDF or use OCR if needed
        extracted_text = utils.extract_text_from_pdf_using_ocr(filename)

    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=800,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = text_splitter.split_text(extracted_text)
    print("Chunks Length:", len(chunks))

    document_search = faiss.FAISS.from_texts(chunks, embeddings)

    ai_response = start_ai_processing(
        document_search, GOODS_INSTRUCTION_WITH_LIMIT_20_V2, 'gpt-3.5-turbo')

    end_time = datetime.now()
    utils.send_slack_message(
        filename, file_size, start_time, end_time, ai_response, 'gpt-3.5-turbo')
    return jsonify({'data': ai_response})


def start_ai_processing(document_search, instruction, gpt_model='gpt-3.5-turbo'):
    # "stuff" packs all retrieved chunks into a single prompt
    chain = load_qa_chain(llm, chain_type="stuff")
    query = instruction
    docs = document_search.similarity_search(query)
    result = chain.run(input_documents=docs, question=query)
    # Strip the markdown fence the model sometimes wraps around the JSON
    result = result.replace("```json\n", "").replace("\n```", "").replace("\n", "")
    print("Result:", result)
    parsed_response = json.loads(result)
    return parsed_response


if __name__ == '__main__':
    app.run(debug=True, port=6000, threaded=False)
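A note on start_ai_processing: because the goal is to extract every item rather than answer a narrow question, letting similarity search pick a handful of chunks is fragile. One alternative sketch, using load_qa_chain's map_reduce mode over all chunks (more LLM calls and tokens, but no retrieval cutoff; the helper name is made up):

from langchain_core.documents import Document

def extract_from_all_chunks(chunks, instruction):
    # map_reduce applies the instruction to each chunk separately, then
    # combines the partial answers, so no chunk is skipped by top-k retrieval.
    chain = load_qa_chain(llm, chain_type="map_reduce")
    docs = [Document(page_content=chunk) for chunk in chunks]
    return chain.run(input_documents=docs, question=instruction)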

This is the instruction:

GOODS_INSTRUCTION_WITH_LIMIT_20_V2 = ("""
Task: Extract Goods Information from Shipping Invoice and Format as JSON

Objective: Analyze a shipping invoice and exclusively extract the list of goods, presenting the details in a structured JSON format with camelCase key names. Maintain the specific document order.

Details to Extract for Each Good:
Product Code, HS Code / Item Code, Product Description, Quantity, Unit Price / Net, Total Price / Extension, Nett Weight, Gross Weight, Total Volume

Extraction and Formatting Instructions:
1. Sequentially retrieve data from the first page to the end.
2. List items exactly as they appear without combining them.
3. For quantity, you can get it from total price / unit price.
4. Return data in a well-structured JSON format. Any JSON formatting error will result in task failure.
5. Maximum 20 items in the goods list; if there are more than 20 items, just return 20 items and stop processing the rest.

The structure of the JSON should be like this:
{"goods": [
    {
        "productCode": "",
        "hsCode": "",
        "productDescription": "",
        "quantity": 0.0,
        "unitPrice": 0.0,
        "totalPrice": 0.0,
        "nettWeight": 0.0,
        "grossWeight": 0.0,
        "totalVolume": 0.0
    }
]}
""")
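Since rule 4 makes valid JSON mandatory, it can also help to validate the parsed result before using it downstream. A small sketch (the field list is copied from the instruction above; the helper name is made up):

REQUIRED_FIELDS = {"productCode", "hsCode", "productDescription", "quantity",
                   "unitPrice", "totalPrice", "nettWeight", "grossWeight",
                   "totalVolume"}

def validate_goods(parsed):
    # Ensure every extracted item carries the keys the instruction asks for.
    for item in parsed.get("goods", []):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"goods item missing fields: {sorted(missing)}")
    return parsed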

Any help will be appreciated! Thanks.

