I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested:
One using PyPDF2:
    import PyPDF2
    import pandas as pd

    src = 'xxxx.pdf'
    input1 = PyPDF2.PdfFileReader(open(src, "rb"))
    nPages = input1.getNumPages()
    df_comments = pd.DataFrame()

    for i in range(nPages):
        annotation = []
        page = []
        page0 = input1.getPage(i)
        try:
            # collect every annotation object on this page
            for annot in page0['/Annots']:
                annotation.append(annot.getObject())
            # one row per annotation, tagged with its 1-based page number
            page = [i+1] * len(annotation)
            page = pd.DataFrame(page)
            annotation = pd.DataFrame(annotation)
            df_temp = pd.concat([page, annotation], axis=1)
            df_comments = pd.concat([df_comments, df_temp], ignore_index=True)
        except KeyError:
            # there are no annotations on this page
            pass
and the other using fitz:
    import fitz

    doc = fitz.open(src)
    for i in range(doc.pageCount):
        page = doc[i]
        for annot in page.annots():
            print(annot.info)
Both snippets extract the comments, but when I compare the output with the PDF, the comments do not come out in the order in which they appear on the page. I have tried ordering them by other fields such as the creation date and the modification date, but that does not help.
Is there a way to extract them in the order in which they appear in the PDF? Or can I also extract the text from the PDF that each comment is attached to?
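To make the order I am after concrete, here is a minimal sketch. It assumes each comment has already been reduced to a plain dict with a page number and a bounding box; the `sort_reading_order` helper, the `page`/`rect`/`content` keys, and the sample data are all hypothetical and not part of either library. It also assumes the box is `(x0, y0, x1, y1)` with y growing downwards from the top of the page.

```python
# Hypothetical records: one dict per comment, with a 1-based page number and a
# bounding box (x0, y0, x1, y1) whose y axis grows downwards from the page top.
def sort_reading_order(comments):
    """Return comments ordered by page, then top-to-bottom, then left-to-right."""
    return sorted(
        comments,
        key=lambda c: (c["page"], c["rect"][1], c["rect"][0]),
    )


# Made-up sample data, deliberately out of order.
comments = [
    {"page": 2, "rect": (50, 100, 120, 115), "content": "third"},
    {"page": 1, "rect": (50, 400, 120, 415), "content": "second"},
    {"page": 1, "rect": (50, 80, 120, 95), "content": "first"},
]

ordered = [c["content"] for c in sort_reading_order(comments)]
print(ordered)
```

If the boxes came from the raw `/Rect` entry instead, the default PDF coordinate system has its origin at the bottom left, so the vertical part of the sort key would have to be reversed. What I do not know is whether this kind of position-based ordering is the right approach, or whether the libraries offer something better.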