Channel: Active questions tagged python - Stack Overflow

Extracting comments/annotations from PDF sequentially - Python


I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested:

One using PyPDF2:

import pandas as pd
import PyPDF2

src = 'xxxx.pdf'
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
df_comments = pd.DataFrame()

for i in range(nPages):
    annotation = []
    page0 = input1.getPage(i)
    try:
        # collect the annotation dictionaries found on this page
        for annot in page0['/Annots']:
            annotation.append(annot.getObject())
        # one 1-based page number per annotation
        page = pd.DataFrame([i + 1] * len(annotation))
        annotation = pd.DataFrame(annotation)
        df_temp = pd.concat([page, annotation], axis=1)
        df_comments = pd.concat([df_comments, df_temp], ignore_index=True)
    except KeyError:
        # there are no annotations on this page
        pass

and the other using fitz:

import fitz

doc = fitz.open(src)
for i in range(doc.pageCount):
    page = doc[i]
    for annot in page.annots():
        print(annot.info)

Both snippets extract the comments, but not in the order in which they appear in the PDF. I have also looked at other fields, such as the creation date and modification date, but they do not help me recover that order.

Is there a way to extract the comments in the order in which they appear in the PDF? Alternatively, can I also extract the text in the PDF that each comment is attached to?
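
To make "in the order in which they appear" concrete, here is a minimal sketch of the ordering I have in mind, assuming it means reading order: by page, then top to bottom, then left to right. The `Comment` record and the `reading_order` helper are hypothetical names used only for illustration and belong to neither library. The sketch also assumes coordinates with a top-left origin, as in fitz; the `/Rect` values returned by PyPDF2 use a bottom-left origin, so the vertical value would need to be flipped there.

```python
from typing import List, NamedTuple


class Comment(NamedTuple):
    """Hypothetical record for one extracted annotation."""
    page: int    # 1-based page number
    x0: float    # left edge of the annotation rectangle
    y0: float    # top edge (assumes top-left origin, y grows downward)
    text: str    # the comment text


def reading_order(comments: List[Comment]) -> List[Comment]:
    """Sort by page, then top-to-bottom, then left-to-right."""
    return sorted(comments, key=lambda c: (c.page, c.y0, c.x0))


# Made-up sample data, listed in the order a PDF might store it.
extracted = [
    Comment(page=2, x0=50.0, y0=100.0, text="third"),
    Comment(page=1, x0=300.0, y0=400.0, text="second"),
    Comment(page=1, x0=72.0, y0=90.0, text="first"),
]

for c in reading_order(extracted):
    print(c.page, c.text)
```

With fitz, such records could presumably be filled from `annot.rect` and `annot.info["content"]`, and in recent PyMuPDF versions the text under a highlight might be read with `page.get_text("text", clip=annot.rect)`, though I have not confirmed that this matches what a PDF viewer shows.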

