One option for verifying PDFs is the PDFlib Text Extraction Toolkit (TET) from PDFlib GmbH. The toolkit appears to be very powerful: it supports extracting text and images, as well as all the objects that make up a PDF file.
The examples below use Squish's API to drive an application built with the Qt framework. Although the script examples here are in Perl, you can substitute the language of your choice, as long as it offers similar facilities to the programmer.
In the following example:
- get_pdf_objs conceptually uses the external application tet.exe (Text Extraction Toolkit) together with Perl's system function to extract all text and images from a PDF file and store them in a text file or image files, respectively.
- parse_pdf is not really a parser. In this example its only role is to produce an array of text strings extracted from the text file that get_pdf_objs produced.
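As a rough illustration of the two helpers described above, here is a minimal sketch. It assumes that tet.exe is on the PATH, that it accepts an `--outfile` option for the extracted text, and that the output is UTF-8. None of this comes from the original script, so check it against the TET manual for your version. The function bodies are illustrative, not the original implementation.

```perl
use strict;
use warnings;

# Sketch of get_pdf_objs: shells out to the TET command-line tool.
# ASSUMPTIONS: the executable is reachable as "tet.exe" and accepts
# "--outfile <file>" for the extracted text; verify this in the TET manual.
sub get_pdf_objs {
    my ($pdf_file, $txt_file) = @_;
    my @cmd = ('tet.exe', '--outfile', $txt_file, $pdf_file);
    # List form of system(): arguments are passed as separate items
    # rather than as a single shell string.
    my $status = system(@cmd);
    die "TET failed (status $status) for $pdf_file\n" if $status != 0;
    return $txt_file;
}

# Sketch of parse_pdf: not a real parser, it only turns the extracted
# text file into an array of non-empty, trimmed strings.
# ASSUMPTION: the text file is UTF-8 encoded.
sub parse_pdf {
    my ($txt_file) = @_;
    open(my $fh, '<:encoding(UTF-8)', $txt_file)
        or die "Cannot open $txt_file: $!\n";
    my @strings;
    while (my $line = <$fh>) {
        $line =~ s/^\x{FEFF}//;      # drop a leading byte-order mark, if any
        $line =~ s/^\s+|\s+$//g;     # trim whitespace, including CR/LF
        push @strings, $line if length $line;
    }
    close($fh);
    return @strings;
}
```

A test script would typically call `get_pdf_objs('report.pdf', 'report.txt')` and then `parse_pdf('report.txt')`; both file names are placeholders.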
Using Perl or any of the other supported scripting languages, we can easily script the handling of PDF files with an external tool like PDFlib TET and a custom utility function developed for this purpose. The above example illustrates this concept; the process is outlined below:
- Start the application
- Navigate the UI until a report is generated
- Print that report to a PDF file
- Pass the file to a PDFlib TET function to extract text, images, etc.
- Parse the extracted text, images, etc. and compare them to known good values
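The final comparison step could look something like the sketch below. The helper name compare_pdf_text and its position-by-position comparison are assumptions made for illustration. A real test might need a looser match, for example to ignore dates or page numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: compare extracted strings against known good values,
# position by position, and return a list of readable mismatch messages.
# An empty list means the extracted text matched exactly.
sub compare_pdf_text {
    my ($actual, $expected) = @_;    # both are array references
    my @mismatches;
    my $max = @$actual > @$expected ? scalar @$actual : scalar @$expected;
    for my $i (0 .. $max - 1) {
        my $got  = $i < @$actual   ? $actual->[$i]   : undef;
        my $want = $i < @$expected ? $expected->[$i] : undef;
        next if defined $got && defined $want && $got eq $want;
        push @mismatches, sprintf("string %d: expected %s, got %s", $i,
            defined $want ? "'$want'" : 'nothing',
            defined $got  ? "'$got'"  : 'nothing');
    }
    return @mismatches;
}
```

Inside a Squish test script, the result could then be reported through Squish's verification functions, for instance by passing `scalar(@mismatches)` and `0` to `test::compare` and logging each message when the check fails.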
I'd like to know your thoughts via comments.