I have one complete, static HTML file (it makes no calls to the internet) that is under 900 KB, and I am currently using pdfkit to create a single PDF from it that ends up being about 100 pages long. The PDF is 30-40 MB, which is way too large considering each page of the PDF is just text and a small image repeated 4 times.
The way I create the PDF is pretty simple.
installation:

    apt-get install wkhtmltopdf -y
    pip install pdfkit==1.0.0
    pip install pypdf2==2.10.5

code:

    import pdfkit

    def html_to_pdf(html_path: str, pdf_path: str):
        pdfkit.from_file(
            input=html_path,
            output_path=pdf_path,
            configuration=pdfkit.configuration(),
            options={
                'zoom': '0.9588',  # seemed to be the right zoom through trial and error
                'disable-smart-shrinking': '',
                'page-size': 'Letter',
                'orientation': 'Landscape',
                'margin-top': '0',
                'margin-right': '0',
                'margin-left': '0',
                'margin-bottom': '0',
                'encoding': 'UTF-8',
            })

    html_to_pdf(".my_html_file.html", "my_pdf_file.pdf")

As for the image I pull in: I've tried resizing it down to about 30% of its original size, but there was no change at all in the size of the resulting .pdf.
What I notice about the PDFs I generate with pdfkit is that they're not really PDFs: you can't search the text, highlight text blocks, etc. Each page acts like it's essentially one big image. When I print the same HTML from my browser and save that as a PDF, I can do all of those things.
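The "no searchable text" observation could be confirmed without opening a viewer. The following is only a rough sketch, assuming PyPDF2 2.x (the version installed above) and its `PdfReader` / `extract_text` calls; `count_nonempty` and `extract_page_texts` are hypothetical helper names used for illustration.

```python
from typing import Iterable, List


def count_nonempty(page_texts: Iterable[str]) -> int:
    """Count pages whose extracted text contains non-whitespace characters."""
    return sum(1 for text in page_texts if text and text.strip())


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the extracted text of every page (assumes PyPDF2 2.x)."""
    # Imported lazily so count_nonempty stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() yields an empty string for pages with no text layer.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling `count_nonempty(extract_page_texts("my_pdf_file.pdf"))` and comparing the result with the page count should show whether the pages carry a text layer at all; a count near zero would be consistent with each page being rendered as an image.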
I need to build this programmatically, so the whole process has to be automated. Is there some setting I'm missing in pdfkit?
Also worth noting: I have access to the actual string I use to make the HTML, so I don't have to read an HTML file. Would that make a difference?
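For reference, the string-based variant would presumably look like the sketch below. It assumes `pdfkit.from_string` accepts the same `options` and `configuration` arguments as `from_file`; `PDF_OPTIONS` and `html_string_to_pdf` are hypothetical names, and the options are simply copied from the file-based call.

```python
from typing import Dict

# Same wkhtmltopdf options as the file-based call in the question.
PDF_OPTIONS: Dict[str, str] = {
    "zoom": "0.9588",
    "disable-smart-shrinking": "",
    "page-size": "Letter",
    "orientation": "Landscape",
    "margin-top": "0",
    "margin-right": "0",
    "margin-left": "0",
    "margin-bottom": "0",
    "encoding": "UTF-8",
}


def html_string_to_pdf(html: str, pdf_path: str) -> None:
    """Render an in-memory HTML string to a PDF file via pdfkit."""
    # Imported lazily so this module loads even where pdfkit is missing.
    import pdfkit

    pdfkit.from_string(
        input=html,
        output_path=pdf_path,
        configuration=pdfkit.configuration(),
        options=PDF_OPTIONS,
    )
```

One caveat with this route: a string has no location on disk, so relative image paths inside the HTML may not resolve the way they do when a file is read.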
I'm also open to not using pdfkit at all. I just need something that doesn't require a license.