I would like to use Python to retrieve metadata stored in PDF files. I am using the Python xmptools package, but I find that I cannot extract all of the metadata. For example, this paper is available in PDF format. I have the following script that tries to extract its metadata:
    from xmptools import XMPMetadata, DC

    xmp = XMPMetadata.fromFile("Leonard_2015_Comment_on_‘Dimensionless_units_in_the_SI’.pdf")[0]
    print( xmp.getContainerItems(DC.publisher) )

This works fine. The result is [rdflib.term.Literal('IOP Publishing')]. However, if I change the last line to

    print( xmp.getContainerItems(DC.identifier) )

then I get None as a result.
I think this may be due to the structure of the XML inside the PDF file. The XML relevant to these two queries is:

    <dc:publisher><rdf:Bag><rdf:li>IOP Publishing</rdf:li></rdf:Bag></dc:publisher>
    <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>

In the case of publisher, the information is wrapped in RDF tags, but that is not the case for identifier.
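To make the structural difference concrete, here is a minimal sketch that reads both shapes using only the standard library's xml.etree.ElementTree. It is not an xmptools feature and does not touch the PDF itself. The rdf:RDF / rdf:Description wrapper and the helper name read_property are assumptions added so that the two lines above parse as a standalone document; the namespace URIs are the standard RDF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and Dublin Core elements.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumption: the wrapper elements are added here only so the fragment
# from the question is a well-formed document on its own.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:publisher><rdf:Bag><rdf:li>IOP Publishing</rdf:li></rdf:Bag></dc:publisher>
    <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
  </rdf:Description>
</rdf:RDF>"""


def read_property(root, qname):
    """Hypothetical helper: return the values of a property as a list,
    whether it is wrapped in an RDF container or is a plain literal."""
    elem = root.find(f".//{qname}", NS)
    if elem is None:
        return []
    # Container form: values live in nested rdf:li elements.
    items = elem.findall(".//rdf:li", NS)
    if items:
        return [li.text for li in items]
    # Simple form: the value is the element's own text.
    text = (elem.text or "").strip()
    return [text] if text else []


root = ET.fromstring(SAMPLE)
print(read_property(root, "dc:publisher"))   # container (rdf:Bag) form
print(read_property(root, "dc:identifier"))  # plain literal form
```

With this sample, the first call takes the container branch and the second falls through to the plain-text branch, which is the case that getContainerItems appears not to cover.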
Is there a way for xmptools to read simple entries where RDF tags have not been used?