Question

Extracting information from content of PDF file (and possibly also other formats, such as Microsoft Word)

0

Hello, I’m trying to extract information from some PDF files which generally follow a set format. I have taken a look at https://appstore.home.mendix.com/link/app/DocumentParser, but it appears that the Document Parser only extracts metadata from a PDF, not actual information from the contents of the file (on a side note, the SOAP service for the DocumentParser no longer exists). Does anyone have an existing solution for extracting information from a PDF file, or will I need to write a custom Java action for this task?

The PDF files that I work with contain information in the text of the document, as well as structured into tables. For future compatibility, being able to extract information from Microsoft Word documents is also relevant. Thanks in advance!

asked 2019-04-11

Andreas Marc Debess

3 answers

1

Hi Erwin,

I have tried PDFBox, which works for getting text out of PDF files, as you mentioned. Extracting data from tables in PDFs is more tedious, however, since the PDF format represents a table as nothing more than line strokes drawn around text. For this, I’ve taken a look at https://github.com/tabulapdf/tabula-java, which seems to achieve this goal when I try the demo command-line app using tabula-1.0.2-jar-with-dependencies.jar.

~~Do you have any tips for making a temporary Java File from a Mendix FileDocument, as well as Mendix not finding the required classes in the jar?~~ While the question about using a temporary File object is still relevant to me, I have solved the task of extracting information from tables now.
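For anyone landing here with the same temporary-file question, here is a minimal sketch using only the Java standard library. It assumes you can already get the document's content as an `InputStream`; in a Mendix Java action that stream would presumably come from `Core.getFileDocumentContent(context, fileDocument)`, but check that against the runtime API of your Mendix version. The class name `TempFileSketch`, the method name `copyToTempFile` and the `filedocument-` prefix are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Mendix or any PDF library.
final class TempFileSketch {

    // Copies the stream into a newly created temporary file and returns its path.
    // The caller is responsible for deleting the file afterwards.
    static Path copyToTempFile(InputStream content, String suffix) {
        try {
            Path tmp = Files.createTempFile("filedocument-", suffix);
            // createTempFile already created an empty file, so it must be replaced.
            Files.copy(content, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in a Java action this would be the FileDocument's content stream.
        byte[] dummy = "%PDF-1.4 dummy".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new ByteArrayInputStream(dummy)) {
            Path tmp = copyToTempFile(in, ".pdf");
            try {
                System.out.println(tmp + " holds " + Files.size(tmp) + " bytes");
            } finally {
                Files.deleteIfExists(tmp); // clean up once the file is no longer needed
            }
        }
    }
}
```

`tmp.toFile()` gives a `java.io.File` for libraries that insist on one. If the library accepts an `InputStream` directly, the temporary file may not be needed at all.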

answered 2019-04-12

Andreas Marc Debess

1

Hi Andreas,

I published a module for reading content, and reading and setting metadata on PDF files. Might be useful for future reference or other people with the same issue.

You can find the module here: https://appstore.home.mendix.com/link/app/109922/

answered 2019-06-18

Ward Brink

Erwin 't Hoen · Accepted Answer · 2019-04-11

Andreas,

AFAIK there currently is no module available for this. I have created a function in the past to find a QR code inside a PDF and this was quite straightforward in Java. In order to extract text from a PDF I would go for Apache PDFBox.

See https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox for a simple example of a standalone Java class that extracts the text from a PDF. With a few alterations you should be able to use this in a Java action.

I’ve been using PDFBox to generate and update information in PDFs and this works like a charm.
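Once the text is out of the PDF (for example as the string returned by PDFBox's `PDFTextStripper`), files that follow a set format can often be picked apart with plain string handling. Below is a rough sketch that assumes the fields appear as `Label: value` lines. That layout, the sample labels, and the names `FieldExtractionSketch` and `extractFields` are illustrative assumptions, not something taken from the question's actual documents.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: works on already-extracted text, it does not read PDFs itself.
final class FieldExtractionSketch {

    // One "Label: value" pair per line; the label is everything before the first colon.
    private static final Pattern LABELLED_LINE = Pattern.compile(
            "^[ \\t]*([^:\\r\\n]+?)[ \\t]*:[ \\t]*(.+?)[ \\t]*$", Pattern.MULTILINE);

    // Returns the labelled fields in document order; lines without a colon are ignored
    // and the first occurrence of a repeated label wins.
    static Map<String, String> extractFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LABELLED_LINE.matcher(text);
        while (m.find()) {
            fields.putIfAbsent(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for text extracted from a PDF.
        String text = "Invoice number: 2019-0042\n"
                + "Customer: Example Corp\n"
                + "Some free text without a label\n"
                + "Total: 125.00\n";
        System.out.println(extractFields(text));
    }
}
```

Real documents will likely need patterns tuned to their own layout, and data inside tables is usually better handled by a table-aware tool than by regular expressions on the flat text.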