In part 1, I explored how to use Puppeteer to generate a PDF from HTML. Once we have a PDF, though, how can we manipulate it? For example, what if we need to automatically generate an index? We would need to identify which keywords appear on which pages, turn that into a list, then append that list to the end of the PDF. Is there a way to do all that in Node?
These questions led me to HummusJS, a Node module for parsing and manipulating PDFs. Unlike other Node.js PDF manipulation modules such as pdf-merge or node-pdftk, HummusJS doesn’t require installing PDFtk, so all you need to get started is the HummusJS Node module itself.
Here’s how I used Hummus to append an index to a PDF:
Step 1: parse the PDF
parser.js:
extractText is the result of require()ing this file, part of a text extraction sample from the HummusJS author. The module exported by parser.js simply wraps a PDFReader around a buffer containing the PDF data and passes that reader to extractText(), which returns an array of text information for each page.
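As a rough sketch of what such a wrapper can look like (not necessarily the exact file), the following uses HummusJS's createReader() and PDFRStreamForBuffer to open a reader over an in-memory buffer. The factory name createParser is hypothetical. Passing in the hummus module and extractText as arguments, rather than require()ing them, is a choice made for this sketch so that the wrapper can be tried without the native module installed.

```javascript
// Sketch only: `createParser` is a hypothetical name, and `hummus` and
// `extractText` are injected instead of being require()d at the top.
function createParser(hummus, extractText) {
  return {
    // Wrap the PDF bytes in a HummusJS read stream, open a PDFReader on it,
    // and hand that reader to the text extraction helper.
    parsePDFBuffer(pdfBuffer) {
      const stream = new hummus.PDFRStreamForBuffer(pdfBuffer);
      const pdfReader = hummus.createReader(stream);
      return extractText(pdfReader);
    },
  };
}
```

In a real project this would be wired up with something like `createParser(require('hummus'), require('./text-extraction'))`, where `./text-extraction` is a placeholder path for wherever the extraction sample lives.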
Step 2: use parsed text to build index as an HTML list
index.js
Here, appendIndex() accepts the PDF to index and a list of keywords to include. It passes the result of parser.parsePDFBuffer() to locateIndexKeywords(), which finds the first occurrence of each keyword in the PDF and returns a dictionary mapping each keyword to its page number. Then getIndexHTML() transforms that dictionary into an HTML list.
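To make the two helpers concrete, here is a minimal sketch of how they might be written. It assumes the parsed result is an array with one entry per page, each entry being an array of objects with a text string property. That shape is an assumption for illustration, so adjust it to whatever extractText() actually returns. Matching is case-insensitive, and a keyword split across two text placements will not be found.

```javascript
// Assumed input shape: pages = [[{ text: '...' }, ...], ...], one array per page.
function locateIndexKeywords(pages, keywords) {
  const index = {};
  pages.forEach((placements, pageIdx) => {
    // Flatten the page into one lowercase string for a simple substring search.
    const pageText = placements.map((p) => p.text).join(' ').toLowerCase();
    for (const keyword of keywords) {
      const seen = Object.prototype.hasOwnProperty.call(index, keyword);
      if (!seen && pageText.includes(keyword.toLowerCase())) {
        index[keyword] = pageIdx + 1; // report pages starting from 1
      }
    }
  });
  return index;
}

// Escape characters that would otherwise be read as HTML markup.
function escapeHTML(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Turn { keyword: pageNumber } into an alphabetised HTML list.
function getIndexHTML(index) {
  const items = Object.keys(index)
    .sort((a, b) => a.localeCompare(b))
    .map((keyword) => `<li>${escapeHTML(keyword)}: ${index[keyword]}</li>`)
    .join('');
  return `<h1>Index</h1><ul>${items}</ul>`;
}
```

Keywords that never appear are simply left out of the dictionary, so they do not show up in the rendered list.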
Step 3: make a new PDF from the index HTML and append it to the original PDF
combiner.js
combinePDFBuffers() takes two buffers containing PDF data and appends the second to the first.
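One way to do this with HummusJS, sketched here under some assumptions, is to open the first PDF with createWriterToModify(), append the second PDF's pages with appendPDFPagesFromPDF(), and collect the output in a PDFWStreamForBuffer. This is not necessarily how the original combiner.js is written. In particular, it assumes a HummusJS version that ships PDFWStreamForBuffer, and it takes the hummus module as its first argument purely so the sketch can be exercised without the native module.

```javascript
// Sketch only: `hummus` is injected as a parameter for this example.
function combinePDFBuffers(hummus, firstBuffer, secondBuffer) {
  // In-memory write stream; the result is exposed on its `buffer` property.
  const outStream = new hummus.PDFWStreamForBuffer();

  // Open the first PDF for modification, writing the result to outStream.
  const writer = hummus.createWriterToModify(
    new hummus.PDFRStreamForBuffer(firstBuffer),
    outStream
  );

  // Append every page of the second PDF after the first PDF's pages.
  writer.appendPDFPagesFromPDF(new hummus.PDFRStreamForBuffer(secondBuffer));
  writer.end();

  return outStream.buffer;
}
```

With the real module this would be called as `combinePDFBuffers(require('hummus'), originalPDF, indexPDF)`, where both arguments are Buffers of PDF data.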
index.js
Here, renderer is the module from renderer.js in part 1. Just as in part 1, its rendererHTMLtoPDF() method turns the index’s HTML into a buffer of PDF data. Finally, combiner.combinePDFBuffers() produces a new buffer containing the original PDF followed by the freshly rendered index PDF.