Channel: Active questions tagged python - Stack Overflow

Trouble extracting text from PDF in Python [closed]


I'm working on a project in Python and I'm having difficulty extracting text from a PDF file. I've tried pdfminer, pypdf, and pdftotext (via subprocess), but none of them has worked.

For pypdf

from pypdf import PdfReader

reader = PdfReader("1-100-1.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

For pdfminer

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('1-100-1.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

For pdftotext (run on Linux via subprocess)

pdftotext 1-100-1.pdf 
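For reference, when `pdftotext` is given only an input file it writes its result to a `.txt` file next to the PDF rather than to the terminal, so a subprocess call that reads stdout would see nothing even on a healthy PDF. Below is a minimal sketch of calling it from Python so that the text comes back on stdout. The function names `build_pdftotext_command` and `run_pdftotext` are hypothetical helpers introduced for illustration, and the sketch assumes poppler's `pdftotext` is installed and on the PATH.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for poppler's pdftotext, writing to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # keep the physical layout of the page
    # A trailing "-" asks pdftotext to write the text to stdout instead of
    # creating a .txt file next to the PDF.
    cmd += [pdf_path, "-"]
    return cmd


def run_pdftotext(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Calling `run_pdftotext("1-100-1.pdf")` would then return whatever text layer the tool can find. If that is still empty, the problem is more likely in the PDF than in the way the command is invoked.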

None of these approaches works as expected, so the PDF file itself seems to be causing the issue.

Google Drive link for the PDF: click here

I expect to get the names of the participants, such as Aakash Bist, from 1-100-1.pdf, but I get an empty string instead.
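To narrow down where the empty string comes from, a quick per-page check may help. The sketch below uses pypdf to list the pages that yield no text at all. The helper names `extract_page_texts` and `pages_without_text` are hypothetical and not part of pypdf. If every page comes back blank, one possible explanation (only an assumption here, since it depends on the file) is that the pages are scanned images without a text layer, in which case an OCR step would be needed rather than a different text extractor.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of every page, using pypdf."""
    # Imported here so that pages_without_text() can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() yields an empty string for pages with no text layer.
    return [page.extract_text() or "" for page in reader.pages]


def pages_without_text(page_texts):
    """Return the 0-based indices of pages whose extracted text is blank."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]
```

For example, `pages_without_text(extract_page_texts("1-100-1.pdf"))` would give the indices of the pages where nothing could be extracted.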

How can I properly extract text from a PDF in Python? If you know of any alternative libraries or methods, I'd greatly appreciate the help.

