I am trying to convert a very clean PDF file into a .txt file using Python. I have tried both PyPDF2 and PDFMiner, and both recognized the text perfectly.
However, because the lines are wrapped in the PDF, the extracted .txt file has an unintended line break at the end of each line, e.g. line 1: "is an account of the Elder \n Days, ". There should be no line break between "Elder" and "Days".
When the PDF is edited in Acrobat, it is clear that the original text contains no hard line breaks: it can be edited as a paragraph rather than as single lines.
The code I have tried (adapted from an answer to the question "convert from pdf to text: lines and words are broken"):
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


# Converts a PDF and returns its text content as a string
def convert(fname, pages=None):
    # An empty set makes PDFPage.get_pages process every page
    pagenums = set(pages) if pages else set()

    output = io.StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text


# Raw strings keep the backslashes in the Windows paths literal
path = r'D:\Folder\File.pdf'
a = convert(path)
with open(r'D:\Folder\File.txt', 'a', encoding='utf-8') as f:
    f.write(a)
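
To make the desired result concrete, here is a minimal sketch of a post-processing step that joins the wrapped lines. It rests on an assumption that has not been verified for this PDF: that the extracted text separates paragraphs with blank lines and that every other line break is a soft wrap. The helper name `unwrap_lines` is hypothetical and not part of PDFMiner or PyPDF2.

```python
import re


def unwrap_lines(text):
    """Join soft-wrapped lines so each paragraph becomes a single line."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    # Replace each remaining line break (plus surrounding spaces) with one space.
    unwrapped = [re.sub(r"\s*\n\s*", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and restore a blank line between the rest.
    return "\n\n".join(p for p in unwrapped if p)


sample = "is an account of the Elder \nDays, \n\nA second paragraph."
print(unwrap_lines(sample))
```

With the `convert` function above, this would be applied as `unwrap_lines(convert(path))` before writing the file. The sketch does not rejoin words that were hyphenated across a line break, and it would merge lines that are meant to stay separate, such as verse or lists.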

