Using tags in PDF import

March 5, 2021

While I would prefer another format,* I came across a way to improve PDF import that may be interesting to OakTree. In general, a PDF just indicates where to draw text on a page, and even deciding which letters are part of the same word requires heuristics. 20 years ago, semantic tagging was introduced for PDFs to solve this problem. In a well-tagged PDF, one compliant with ISO 32000-2, section 14.8 "Tagged PDF", or with ISO 14289-1 (PDF/UA-1), a table, for example, is identified as such, with its rows and cells, instead of just being characters arranged in a certain way on a page. Headings are identified by level, so that a table of contents can be constructed.
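
To make the heading point concrete, here is a minimal sketch in Python. It does not read a real PDF: the Elem class and the outline function are hypothetical names of my own, and the tree is built by hand as a stand-in for the structure tree of a tagged PDF. Only the tag names (Document, H1 to H6, P) are standard structure types.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Elem:
    """Hypothetical stand-in for one structure element of a tagged PDF."""
    tag: str                                   # standard structure type, e.g. "H1", "P"
    text: str = ""                             # text content, already extracted
    kids: list = field(default_factory=list)   # child elements in reading order


def outline(root):
    """Return (level, title) pairs for every H1..H6, in document order."""
    found = []

    def visit(elem):
        match = re.fullmatch(r"H([1-6])", elem.tag)
        if match:
            found.append((int(match.group(1)), elem.text))
        for kid in elem.kids:
            visit(kid)

    visit(root)
    return found


# Hand-built example tree (an assumption, not parsed from a file).
doc = Elem("Document", kids=[
    Elem("H1", "Introduction"),
    Elem("P", "Some text."),
    Elem("H2", "Background"),
    Elem("H1", "Method"),
])

# Print an indented table of contents.
for level, title in outline(doc):
    print("  " * (level - 1) + title)
```

No guessing from font sizes is involved: the level comes straight from the tag.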

Tagging has been slow to catch on, but support is growing: MS Word, LibreOffice, Adobe InDesign, FineReader, Chrome, … Unfortunately, not all of these tag by default, but MS Word now does. Some publishers and archives now require a tagged PDF to be submitted.

My discovery today is that in 2019 the PDF Association published detailed instructions for deriving HTML from a tagged PDF. I suspect this would be a useful model for importing a tagged PDF directly into Accordance. It would make import more reliable and would capture the structure of the document without guessing, probably reducing support requests from users, who usually have no idea how complicated it is to import from a PDF.
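
As a rough illustration of why a tag tree makes import easier, here is a small Python sketch that turns a hand-built tree into HTML. It is not the published derivation algorithm, which covers far more (attributes, role mapping, styling, and so on). The Node class, the HTML_FOR table, and the to_html function are hypothetical names of my own, and falling back to div for unknown tags is my assumption.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    """Hypothetical stand-in for one structure element of a tagged PDF."""
    tag: str                                   # standard structure type, e.g. "Table"
    text: str = ""                             # text content, already extracted
    kids: list = field(default_factory=list)   # child elements in reading order


# Simplified mapping from a few standard structure types to HTML elements.
HTML_FOR = {
    "Document": "div", "P": "p",
    "H1": "h1", "H2": "h2", "H3": "h3",
    "Table": "table", "TR": "tr", "TH": "th", "TD": "td",
}


def to_html(node):
    """Serialize a node and its children; unknown tags become <div> (assumption)."""
    name = HTML_FOR.get(node.tag, "div")
    inner = escape(node.text) + "".join(to_html(kid) for kid in node.kids)
    return f"<{name}>{inner}</{name}>"


# Hand-built example: a one-row table with a header cell and a data cell.
table = Node("Table", kids=[
    Node("TR", kids=[Node("TH", "Word"), Node("TD", "a & b")]),
])
print(to_html(table))
```

An importer for Accordance could presumably walk the same kind of tree and emit its own internal markup instead of HTML.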

* My ideal would be an HTML or XML format that could produce a Tool with Unicode, footnotes, fields (page number, section, …), tables, and images. It would be even better if it could produce Reference Tools. If the import works, I would have no need for it to produce a User Tool: I would rather edit my input file than a User Tool.

jlm