Good afternoon,
I have been setting up some code for extracting text with the fitz library (PyMuPDF). The module has been correctly installed via a Lambda layer and is working as expected, but when I try to use the official fitz utilities script I get:
ModuleNotFoundError: No module named 'fitz'
Example code:

    def extract_text(pdf_stream):
        try:
            pdf_doc = fitz.open(stream=pdf_stream, filetype='pdf')

            # Save the PDF document to a file
            # /tmp is the file destination required to save the file; everything else is read-only in Lambda
            pdf_doc.save('/tmp/file.pdf')
            logger.info("PDF file saved. Running fitzcli.py.")

            cmd_args = ["python", "fitzcli.py", "gettext", "-input", "file.pdf", "-output", "tmp/extracted_text.txt", "-mode", "layout"]
            subprocess.run(cmd_args, check=True)

            with open('extracted_text.txt', 'r') as open_file:
                read_file = open_file.read()

            # Assuming extract_top_rows function is defined elsewhere in your code
            headers_text = extract_top_rows(read_file)
            return headers_text
        except Exception as e:
            logger.error(f"An error occurred: {e}")
            raise
Link to the script: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/fitzcli.py
I cannot alter the script's code because of licensing constraints.
I have tried copying the Lambda execution environment and running the subprocess with that environment:
    env = os.environ.copy()
    cmd_args = ["python", "fitzcli.py", "gettext", "-input", "file.pdf", "-output", "tmp/extracted_text.txt", "-mode", "layout"]
    subprocess.run(cmd_args, check=True, env=env)
I was expecting the subprocess to run fitzcli.py successfully.
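
One thing that might help narrow this down is checking which interpreter and module search path the child process actually gets. Below is a minimal diagnostic sketch, not a fix. `PROBE` and `child_python_info` are hypothetical names of my own and are not part of PyMuPDF or fitzcli.py. It assumes the bare `python` command may resolve to a different interpreter than the one running the handler, which I have not confirmed for Lambda.

```python
import json
import subprocess
import sys

# Hypothetical probe (not part of PyMuPDF or fitzcli.py): asks a child
# interpreter which executable it is, what its sys.path is, and whether
# it can locate the fitz module without importing it.
PROBE = (
    "import importlib.util, json, sys\n"
    "print(json.dumps({\n"
    "    'executable': sys.executable,\n"
    "    'path': sys.path,\n"
    "    'has_fitz': importlib.util.find_spec('fitz') is not None,\n"
    "}))\n"
)


def child_python_info(python_cmd, env=None):
    """Run PROBE with python_cmd and return the decoded report as a dict."""
    result = subprocess.run(
        [python_cmd, "-c", PROBE],
        check=True,
        capture_output=True,
        text=True,
        env=env,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Compare the bare "python" command with the interpreter running this code.
    for cmd in ("python", sys.executable):
        try:
            info = child_python_info(cmd)
        except (FileNotFoundError, subprocess.CalledProcessError) as exc:
            print(f"{cmd}: could not run probe ({exc})")
            continue
        print(f"{cmd}: executable={info['executable']} has_fitz={info['has_fitz']}")
```

If the two reports differ (for example `has_fitz` is true for `sys.executable` but false for `python`), that would suggest the subprocess is started with an interpreter or search path that does not include the layer.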