scraping PDFs

Jeff · Aug 12, 2011

Hello,

I have a stack of PDFs (created electronically, thankfully) that I need to parse a bit of text from. I've been looking through the forum and PlanetPDF for solutions, but most posts are either about working with Distiller the other way 'round or are outdated.

My current solution, which 'works' in a grim fashion, is to duct-tape the handy pdftohtml (http://pdftohtml.sourceforge.net/) to a VBA call, then parse one of the resulting HTML frames.
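
For anyone who wants to see the shape of that pipeline outside VBA, here is a rough Python sketch of the same two steps: shell out to pdftohtml, then pull the text out of the HTML it produces. It is only an illustration under assumptions: it assumes pdftohtml is on the PATH and accepts the -stdout, -noframes and -i options (the poppler build does; I have not checked every older build), and the names TextCollector, html_to_lines and pdf_to_html are made up for this example.

```python
import subprocess
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text fragments, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.fragments.append(text)


def html_to_lines(html):
    """Return the non-empty text fragments found in an HTML string, in order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.fragments


def pdf_to_html(pdf_path):
    """Run pdftohtml on one file and return the HTML it writes to stdout.

    Assumed options: -stdout (write to stdout), -noframes (one HTML
    document instead of a frameset), -i (ignore images).
    """
    result = subprocess.run(
        ["pdftohtml", "-stdout", "-noframes", "-i", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise if pdftohtml exits with an error
    )
    return result.stdout
```

Chaining the two, as in html_to_lines(pdf_to_html("report.pdf")) with a hypothetical file name, gives a list of text fragments to search for the bit of text that is needed. Asking for a single document with -noframes would also avoid having to pick one of the generated frames.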

It ain't pretty, so I wondered how others might've approached this?

Thanks for your insights.
