Channel: Active questions tagged python - Stack Overflow

How to convert from PDF to TXT without unintended line breaks?


I am trying to convert a very clean PDF file into a txt file using Python. I have tried PyPDF2 and PDFMiner; both recognized the text perfectly.

However, because the lines are wrapped in the PDF, the extracted .txt file has an unintended line break at the end of each line, e.g. line 1: "is an account of the Elder \n Days, ". There should not be a line break between "Elder" and "Days".

[Image: the extracted txt file]

[Image: the PDF file]

When the PDF is edited in Acrobat, it is clear that the original text contains no hard line breaks: it can be edited as a whole paragraph rather than as single lines.

[Image: the paragraph being edited in Acrobat]

The code I have tried (adapted from an answer to "convert from pdf to text: lines and words are broken"):

    import io
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage

    # Converts a PDF and returns its text content as a string
    def convert(fname, pages=None):
        if not pages:
            pagenums = set()
        else:
            pagenums = set(pages)
        output = io.StringIO()
        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(manager, converter)
        infile = open(fname, 'rb')
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
        infile.close()
        converter.close()
        text = output.getvalue()
        output.close()
        return text

    # Raw strings keep the backslashes in the Windows paths literal
    path = r'D:\Folder\File.pdf'
    a = convert(path)
    f = open(r'D:\Folder\File.txt', 'a', encoding='utf-8')
    f.write(a)
    f.close()
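
To make the desired result concrete, here is a minimal sketch of one possible post-processing step applied to the extracted string. It is not part of the code above and is not a PDFMiner feature. It assumes that real paragraph boundaries appear as blank lines in the extracted text and that every single line break is a soft wrap; both assumptions may not hold for every PDF. The function name `unwrap_lines` and the second sample paragraph are made up for illustration.

```python
import re

def unwrap_lines(text):
    """Join soft-wrapped lines while keeping blank-line paragraph breaks.

    Assumption: a blank line marks a real paragraph boundary and a single
    newline is only a wrap point. Hyphenated words split across lines are
    NOT rejoined by this sketch.
    """
    # Normalize Windows and old Mac line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split into paragraphs on blank (or whitespace-only) lines
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse every whitespace run (newlines
    # included) into a single space; this also strips the ends
    joined = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and restore one blank line between the rest
    return "\n\n".join(p for p in joined if p)

# First fragment is from the question; the second paragraph is invented
sample = "is an account of the Elder \nDays, \n\nNext paragraph\ncontinues here.\n"
print(unwrap_lines(sample))
```

With the code above, this could be applied as `a = unwrap_lines(convert(path))` before writing the file. Collapsing whitespace also removes any deliberate multiple spaces, so it may not suit text where column alignment matters.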
