PDF Generation

Original link: https://editor.leonh.space/2022/pdf/

For ordinary users, saving a document as a PDF usually means clicking "Export", "Save as", or "Print". Doing the same thing programmatically is not so easy.

How to generate PDF

The typical web app scenario is converting documents, reports, tickets, receipts, and bills that already exist as HTML into PDFs. There are generally a few ways to do this:

Generated with browser

There are many PDF generation libraries for the front end, but you cannot guarantee that every user's browser behaves the same way, so layout, fonts, and colors may differ.

Generated on the backend

Taking Python as an example, the common choices are WeasyPrint and Python-PDFKit.

PDFKit is a wrapper around wkhtmltopdf, which in turn is built on Qt WebKit. Who knows which era of WebKit that Qt WebKit dates from: its support for contemporary CSS is extremely poor, and neither flexbox nor grid is available. I can't help admiring a former colleague who still managed to produce decent labels with wkhtmltopdf. Kudos to them!

WeasyPrint takes the route of parsing HTML itself, using parsers from other Python packages. The advantage is that it is lightweight and fast. The drawback is that its parser is no match for a real browser, so its fault tolerance and standards support fall short. For example:

  • CSS flexbox is supported, but CSS grid is not supported at all.
  • justify-content and align-items are supported, but place-content and place-items are not.

WeasyPrint is by far the better of the two, and its beautifully crafted examples make it tempting to use.

Generated with a headless browser on the backend

Personally, I prefer Playwright for this, although it is a bit overkill. The problem with this type of solution is that it has to install a heavyweight browser, the browser takes time to start, and it is a memory hog, which is a big overhead for my "286" host. But when none of the other options work, there seems to be no choice.
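
Since each headless-browser render is expensive, one way to keep a small host alive is to cap how many renders run at once. The following is only a minimal sketch, assuming renders run in worker threads; the names render_with_limit and MAX_CONCURRENT_RENDERS and the limit of 2 are hypothetical, not part of Playwright or of the original setup.

```python
import threading
from typing import Callable

# Hypothetical limit: at most two renders at once; tune it to the host's memory.
MAX_CONCURRENT_RENDERS = 2
_render_slots = threading.BoundedSemaphore(MAX_CONCURRENT_RENDERS)


def render_with_limit(render: Callable[[str], bytes], html_str: str) -> bytes:
    """Run render(html_str) while holding one of the limited render slots."""
    # Blocks until a slot is free, and releases it even if render() raises.
    with _render_slots:
        return render(html_str)
```

Any function that turns an HTML string into PDF bytes can be passed as render. Requests beyond the limit simply wait their turn, which trades response time for a bounded memory footprint.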

Considerations for generating PDFs

Converting HTML to PDF is not as simple as saving a new file. There are also these considerations:

  • Security: where does the source HTML come from, and is it trustworthy? Does it contain user input? Could it be used for an injection attack?
  • Do the PDF pages need headers and footers? Does the generation tool support them?
  • Is the paper size standard A4? If it is a custom size, does the generator support it?
  • Does the generator support CSS page break declarations? When a table spans several pages, do its column headers need to repeat on each page?
  • Is the content a complex data table? Can the backend actually assemble it?
  • Will heavy use of the backend headless browser eat up the host's resources? Is a queue needed? Does the response have to be immediate?
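
For the security point, a minimal sketch of escaping user input before it reaches the HTML that gets rendered. It uses only the standard library; the function name build_receipt_html and its fields are hypothetical examples, not taken from the original project.

```python
from html import escape


def build_receipt_html(customer_name: str, note: str) -> str:
    """Build a small HTML fragment, escaping every user-supplied value."""
    return (
        '<section class="receipt">'
        f"<h1>Receipt for {escape(customer_name)}</h1>"
        f"<p>{escape(note)}</p>"
        "</section>"
    )
```

With a template engine such as Jinja2, enabling autoescape on the environment serves the same purpose. Either way, the goal is that user input can never introduce its own tags or scripts into the page being printed.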

CSS in action

On the CSS side, the print style can be set with certain properties:

    @page {
      size: 11.3cm 4.3cm;
      margin: 0;
    }

    @media print {
      body, div {
        outline-width: 0px !important;
      }
    }

    br#page-breaker {
      break-after: page;
      display: none;
    }

Explanation:

  • The @page block sets the page size and margins.
  • The @media print block sets other print-only styles.
  • break-after: page forces a page break after the element. Here it is applied to an invisible element that serves as a page breaker.
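
When the paper size varies by document type, the @page rule can be generated on the backend rather than hard-coded. This is just a sketch; page_css is a hypothetical helper, and it assumes all dimensions are given in centimetres.

```python
def page_css(width_cm: float, height_cm: float, margin_cm: float = 0) -> str:
    """Return an @page rule for a custom paper size given in centimetres."""
    if width_cm <= 0 or height_cm <= 0 or margin_cm < 0:
        raise ValueError("page size must be positive and margin non-negative")
    # The 'g' format drops trailing zeros, so 11.3 stays "11.3" and 0 stays "0".
    return (
        f"@page {{ size: {width_cm:g}cm {height_cm:g}cm; "
        f"margin: {margin_cm:g}cm; }}"
    )
```

For the label size used above, page_css(11.3, 4.3) produces an equivalent rule, which can then be placed in a style element of the HTML template.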

WeasyPrint in action

Straight to the code:

    import weasyprint

    def html_to_pdf(html_str: str):
        html = weasyprint.HTML(string=html_str)
        pdf_bytes: bytes = html.write_pdf()
        return pdf_bytes

    html_raw = template.render()
    pdf_bytes = html_to_pdf(html_raw)

The template.render() call above is Jinja2's HTML rendering function. I won't go into details here; in short, it returns an HTML string.

weasyprint.HTML() then parses that HTML string into WeasyPrint's own HTML object, and write_pdf() generates the PDF as bytes.
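
Whichever generator produces the bytes, a cheap sanity check before saving or returning them can catch a failed render early. A minimal sketch, relying on the fact that PDF files start with the header %PDF-; save_pdf and PDF_MAGIC are hypothetical names.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files start with this header


def save_pdf(pdf_bytes: bytes, path: str) -> Path:
    """Write generated PDF bytes to path after a basic sanity check."""
    if not pdf_bytes.startswith(PDF_MAGIC):
        raise ValueError("data does not look like a PDF")
    target = Path(path)
    target.write_bytes(pdf_bytes)
    return target
```

This only confirms that the output looks like a PDF; it does not validate the layout or the content.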

Playwright in action

Again, straight to the code:

    from playwright.sync_api import sync_playwright

    def html_to_pdf(html_str: str):
        pdf_bytes: bytes
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.set_content(html=html_str)
            pdf_bytes = page.pdf(
                height='4.3cm',
                margin={'top': '0', 'right': '0', 'bottom': '0', 'left': '0'},
                width='11.3cm',
            )
            browser.close()
        return pdf_bytes

    html_raw = template.render()
    pdf_bytes = html_to_pdf(html_raw)

Because Playwright has to launch a browser, the code uses a with block to manage Playwright's life cycle, and it closes the browser once the work is done.

WeasyPrint converts the parsed HTML directly into a PDF, whereas Playwright works by printing the page. That is why the printing parameters have to be set here, mainly the paper size and the print margins.

Playwright problems on Windows in asynchronous environments

The Playwright website says:

Incompatible with SelectorEventLoop of asyncio on Windows

Unfortunately, both Jupyter and uvicorn run under asyncio. In the environments they control, running Playwright raised NotImplementedError no matter whether I used sync_playwright or async_playwright, and no matter whether the loop was SelectorEventLoop or ProactorEventLoop. The last resort is to spawn a separate process to do the work.

The modifications after using multiprocessing are as follows:

    import multiprocessing as mp
    from io import BytesIO

    from playwright.sync_api import sync_playwright

    def html_to_pdf(q: mp.Queue, html_str: str):
        pdf_bytes: bytes
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.set_content(html=html_str)
            pdf_bytes = page.pdf(
                height='4.3cm',
                margin={'top': '0', 'right': '0', 'bottom': '0', 'left': '0'},
                width='11.3cm',
            )
            browser.close()
        # Send the result back to the parent process.
        q.put(obj=pdf_bytes)

    html_raw = template.render()
    q = mp.Queue()
    process = mp.Process(target=html_to_pdf, args=(q, html_raw,))
    process.start()
    # Read the result before join() so the child can flush the queue and exit.
    pdf_bytes: bytes = q.get()
    process.join()
    file_like = BytesIO(initial_bytes=pdf_bytes)

Where:

  • q is a Queue() object used for communication between processes: the child process sends the result with q.put(), and the parent process receives it with q.get().
  • process is the child process that runs Playwright. start() does what it says and launches the process. The final join() is less obvious: despite its name, it does not "join" anything together. It blocks until the child process has finished and then lets the main process continue, similar to await in asynchronous code.

Epilogue

After trying them in practice, WeasyPrint and Playwright proved to be the more reliable options. WeasyPrint is weaker in CSS support, but flexbox is enough to build a usable layout, and it is not as resource-hungry as Playwright. Playwright, on the other hand, has better CSS support but is noticeably slower at generating PDFs. I have not yet settled on one of the two. As for wkhtmltopdf, which many articles mention: forget it.

