Channel: Active questions tagged python - Stack Overflow

Converting PDF to Markdown in Python with structure preservation


I need to convert a PDF text document to Markdown while preserving its structure (i.e. numbered headers and subheaders should get the corresponding number of hash signs (#) in Markdown, so that the structure tree stays the same). I have explored pdfminer.six on my own, but I am basically just extracting text, and I don't see any functionality for mapping the structure tree to Markdown. Or am I wrong?

What matters to me is converting the document to text while retaining the hierarchy of the structure tree. Whether that takes one step or two makes no difference to me.

Are there any Python libraries or best practices that have proven effective in similar scenarios? I am looking for a solution that can scale to hundreds of documents, so ideally nothing hardcoded, even though the documents will share most of their structure and numbering.
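To make the desired mapping concrete, here is a minimal sketch of the second step only, assuming the text has already been extracted from the PDF with one heading per line. The function name `numbered_text_to_markdown`, the regular expression, and the rule that a heading is a dotted number followed by a capitalised title are all assumptions made for illustration, not features of any library. It is a heuristic and will likely need tuning for real documents.

```python
import re

# Assumed heading shape: "1 Title", "1. Title", "2.3 Title", "2.3.1 Title".
# Requiring a capital letter after the number is a heuristic to skip
# ordinary sentences that merely start with a number.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def numbered_text_to_markdown(text: str, max_depth: int = 6) -> str:
    """Prefix numbered headings with '#' characters matching their depth."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING_RE.match(stripped)
        if match:
            # "1" has depth 1, "1.2" has depth 2, "1.2.3" has depth 3.
            # Markdown supports at most six heading levels.
            depth = min(match.group(1).count(".") + 1, max_depth)
            out.append("#" * depth + " " + stripped)
        else:
            # Non-heading lines are passed through unchanged.
            out.append(line)
    return "\n".join(out)


sample = "1. Introduction\nSome body text.\n1.1 Scope\n1.1.1 Details\n3 apples were eaten."
print(numbered_text_to_markdown(sample))
```

The input string could come from something like `extract_text` in `pdfminer.high_level`, although how cleanly headings end up on their own lines depends on the PDF layout. A more robust variant might also look at font size or weight to separate headings from body text.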

