I'm working on a Python project and I'm having difficulty extracting text from a PDF file. I've tried pdfminer, pypdf, and pdftotext (via subprocess), but none of them has worked.
For pypdf
    from pypdf import PdfReader

    reader = PdfReader("1-100-1.pdf")
    number_of_pages = len(reader.pages)
    page = reader.pages[0]
    text = page.extract_text()
For pdfminer
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open('1-100-1.pdf', 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    print(output_string.getvalue())
For pdftotext (run on Linux through subprocess)
pdftotext 1-100-1.pdf
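For completeness, here is a minimal sketch of how the command above can be called from Python through subprocess. The helper names `pdftotext_command` and `run_pdftotext` are hypothetical, and the sketch assumes the `pdftotext` binary (from poppler-utils) is on the PATH. Passing `-` as the output file makes pdftotext write to stdout instead of creating a `.txt` file.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build the argument list; '-' sends the extracted text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def run_pdftotext(pdf_path: str) -> str:
    """Run pdftotext and return whatever text it extracted.

    Assumes the pdftotext binary is installed; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With this, `run_pdftotext("1-100-1.pdf")` would return the extracted text as a string, which makes it easy to compare against the output of the two libraries.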
None of these approaches produces the expected text, so the PDF file itself seems to be causing the problem.
Google Drive link for the PDF: click here
I expect to get the names of the participants, such as Aakash Bist, from 1-100-1.pdf, but I get an empty string instead.
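To narrow this down, here is a minimal sketch that checks every page rather than only the first one and reports which pages yield no text. The helper names `empty_pages` and `extract_page_texts` are hypothetical; the pypdf calls are the same ones used above. If every page comes back blank with all three tools, that may indicate the PDF holds scanned images rather than a text layer, but that is only an assumption about this file.

```python
from typing import Iterable, List, Optional


def empty_pages(page_texts: Iterable[Optional[str]]) -> List[int]:
    """Return the 0-based indices of pages whose extracted text is blank."""
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]


def extract_page_texts(path: str) -> List[str]:
    """Extract the text of every page with pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # Normalise a missing result to an empty string so callers can strip it.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling `empty_pages(extract_page_texts("1-100-1.pdf"))` would list the pages for which pypdf finds no text at all.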
How can I properly extract text from a PDF in Python? If you know of any alternative libraries or methods, I'd greatly appreciate the help.