When/Why do some PDF files have text you can read in Notepad and others don't?

I receive PDF files from many different hospitals using EPIC. Some of the PDF files I can open in Notepad and read the actual content. In some I can only read the beginning and ending sections of the objects. Still others are complete gibberish.

Is this a function of print driver settings? If so, what settings do I want used so that I can read all the content?

Why? Glad you asked. (I am willing to admit there might be a better way)
I need to 'weed out' information the hospitals are sending to us. If I can read the content programmatically, then I can build my own condensed html page that is formatted the way we need to see the data, in the order we need to see the data.

Hope someone can shed some light on this for me.

Thanks in advance.


James Mikel


4 Answers

Voted Best Answer

You can use a 3rd party tool like PDFtk to extract the metadata using the dump_data operation.

pdftk mydoc.pdf dump_data output mydoc.data.txt

Metadata will be represented as key/value pairs, like so:

InfoKey: Creator
InfoValue: Acrobat PDFMaker 6.0 for Word
InfoKey: Title
InfoValue: Brian Eno: His Music and the Vertical Color of Sound
InfoKey: Author
InfoValue: Eric Tamm
InfoKey: Producer
InfoValue: Acrobat Distiller 6.0.1 (Windows)
InfoKey: ModDate
InfoValue: D:20040420234132-07'00'
InfoKey: CreationDate
InfoValue: D:20040420234045-07'00'

This may provide an easier and more consistent format for parsing programmatically.
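If the dump_data output looks like the sample above, a few lines of Python can turn it into a dictionary. This is only a minimal sketch: the function name parse_pdftk_info is made up for illustration, and it reads just the InfoKey/InfoValue pairs, which describe the document's metadata rather than its page content.

```python
def parse_pdftk_info(dump_text):
    """Collect InfoKey/InfoValue pairs from pdftk dump_data text into a dict."""
    info = {}
    current_key = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        if line.startswith("InfoKey:"):
            # Remember the key until its matching value line shows up.
            current_key = line[len("InfoKey:"):].strip()
        elif line.startswith("InfoValue:") and current_key is not None:
            info[current_key] = line[len("InfoValue:"):].strip()
            current_key = None
        # Any other line (blank lines, other record types) is ignored.
    return info


# Sample text copied from the output shown above.
sample = """\
InfoKey: Creator
InfoValue: Acrobat PDFMaker 6.0 for Word
InfoKey: Author
InfoValue: Eric Tamm
"""
metadata = parse_pdftk_info(sample)
print(metadata["Author"])
```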


By George Kaiser   

PDF is NOT a text format. PDF is a binary format, and simply opening it in a text editor does not give you sufficient information for parsing.

To parse a PDF programmatically, you will need an appropriate library; there are some Open Source libraries floating around, and there are commercial libraries available as well. In any case, it will be a considerable programming project.

However, if the documents you are receiving are filled-out forms, you can extract the data, even with an Action in Acrobat Pro. You have several export formats to choose from: the standard format is FDF, which you can then interpret, or, to make parsing easier, you can use XFDF, the XML representation of FDF.
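To illustrate why XFDF is the easier one to parse, here is a minimal Python sketch that reads field names and values from an XFDF export using only the standard library. The field names PatientName and VisitDate and the function name read_xfdf_fields are hypothetical, and the sketch assumes flat (non-nested) fields with a single value each.

```python
import xml.etree.ElementTree as ET

# XFDF elements live in Adobe's XFDF namespace.
XFDF_NS = {"x": "http://ns.adobe.com/xfdf/"}


def read_xfdf_fields(xfdf_text):
    """Return {field name: value} for the top-level fields of an XFDF document."""
    root = ET.fromstring(xfdf_text)
    values = {}
    for field in root.findall("./x:fields/x:field", XFDF_NS):
        value = field.find("x:value", XFDF_NS)
        # A field without a <value> child is stored as None.
        values[field.get("name")] = value.text if value is not None else None
    return values


# Hypothetical export; real field names depend on the form.
sample = (
    '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">'
    "<fields>"
    '<field name="PatientName"><value>Jane Doe</value></field>'
    '<field name="VisitDate"><value>2013-03-04</value></field>'
    "</fields>"
    "</xfdf>"
)
fields = read_xfdf_fields(sample)
print(fields)
```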

Hope this can help.

Max Wyss.


Max Wyss   

PDFs can contain a lot of content: not only the items to be displayed, but also the information needed to display the text in fonts not available on a user's system. The PDF content to be displayed can consist of text, images, or multimedia items. There are also form fields, annotations, and JavaScript code used in form calculations. All of this is explained in ISO 32000. O'Reilly Media also has technical material about PDF construction. Because a PDF can hold a huge amount of data, the PDF specification allows the use of compression for text, image, and multimedia data items to facilitate the transfer of large files. A PDF can also have a password applied to it, so that only users with the password can open the PDF or so that the creator can control how the data of the PDF can be modified.

With Acrobat or 3rd party tools you should be able to convert non-password-restricted PDFs to text or DOC files, and even convert a compressed PDF to a plain text PDF.


George Kaiser   

I have a parser that worked when I received PDF files that, when opened in Notepad, had text like this (current example at bottom):

%PDF-1.4
%âãÏÓ
1 0 obj
<<
/Author()/Title()/Subject()/Producer()/Keywords()/CreationDate(03/04/13 08:51:41)/ModDate()/Creator(Epic Systems Corporation)
>>
endobj
4 0 obj
<</Length 107028>>
stream
q 0.750000 0 0 0.750000 0.000000 792.000000 cm
q q q 0.066667 -0.000000 -0.000000 0.066667 0.000000 -0.000000 cm
BT /F0 200 Tf 0 g 864 -15085 Td(Viewed &/or Printed on 3/4/2013 8:51 AM) Tj ET Q
q 0.066667 -0.000000 -0.000000 0.066667 0.000000 -0.000000 cm
BT /F0 200 Tf 0 g 11030 -15085 Td(Page ) Tj ET Q
q 0.066667 -0.000000 -0.000000 0.066667 0.000000 -0.000000 cm
BT /F0 200 Tf 0 g 11552 -15085 Td(1) Tj ET Q
q 0.066667 -0.000000 -0.000000 0.066667 0.000000
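For an uncompressed content stream like the one above, the visible text sits in parentheses in front of each Tj operator, so a regular expression is enough to pull it out. The following is a rough Python sketch, not a full parser: it assumes literal (parenthesized) strings and a simple font encoding, it does not handle hex strings, TJ arrays or nested parentheses, and the function name is hypothetical.

```python
import re

# A literal string "( ... )" followed by the Tj (show text) operator.
# Backslash escapes inside the string are allowed; nested parentheses are not.
TJ_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_tj_strings(content_stream):
    """Return the text operands of all Tj operators, in stream order."""
    texts = []
    for match in TJ_RE.finditer(content_stream):
        # Undo the escapes for "(", ")" and "\".
        texts.append(re.sub(r"\\([()\\])", r"\1", match.group(1)))
    return texts


# Lines copied from the uncompressed sample above.
content = (
    "BT /F0 200 Tf 0 g 864 -15085 Td(Viewed &/or Printed on 3/4/2013 8:51 AM) Tj ET Q\n"
    "BT /F0 200 Tf 0 g 11030 -15085 Td(Page ) Tj ET Q\n"
    "BT /F0 200 Tf 0 g 11552 -15085 Td(1) Tj ET Q\n"
)
page_text = extract_tj_strings(content)
print(page_text)
```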

Now I am getting files that look like this...

%PDF-1.4
%âãÏÓ
1 0 obj
<<
/Author()/Title()/Subject()/Producer()/Keywords()/CreationDate(04/25/13 08:55:50)/ModDate()/Creator(Epic Systems Corporation)
>>
endobj
4 0 obj
<</Filter /FlateDecode /Length 6189>>
stream
xœímo9’ÇßpßÀ{·‹Øn6É~wP$ÙÖŒ%y$%“ ûÆ;ñ$¾íÝ8s;ûíOMå–¥ 骆;™Á@œ$æ¯ÿU¬²»Å˜û
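The difference between the two samples is the /Filter /FlateDecode entry in the stream dictionary: the stream is zlib-compressed, which is why Notepad shows gibberish. Below is a minimal Python sketch of inflating such streams with the standard zlib module. It assumes an unencrypted file, no filters other than FlateDecode, and objects simple enough to split with a regular expression; a real PDF library is more robust. The function name decoded_streams and the sample bytes are made up for illustration.

```python
import re
import zlib

# "N G obj ... endobj" blocks; good enough for a sketch, not a real PDF parser.
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)\bendobj", re.DOTALL)
STREAM_START_RE = re.compile(rb"stream\r?\n")


def decoded_streams(pdf_bytes):
    """Return the stream data of each object, inflating FlateDecode streams."""
    streams = []
    for obj in OBJ_RE.finditer(pdf_bytes):
        body = obj.group(1)
        start = STREAM_START_RE.search(body)
        end = body.rfind(b"endstream")
        if start is None or end == -1:
            continue  # object has no stream
        dictionary = body[:start.start()]
        data = body[start.end():end]
        if b"/FlateDecode" in dictionary:
            try:
                # decompressobj tolerates the end-of-line before "endstream".
                data = zlib.decompressobj().decompress(data)
            except zlib.error:
                continue  # not plain zlib data (other filters, encryption, ...)
        streams.append(data)
    return streams


# Build a tiny stand-in for a compressed object to try the function on.
plain = b"BT /F0 200 Tf 0 g 11030 -15085 Td(Page ) Tj ET"
packed = zlib.compress(plain)
sample_pdf = (
    b"%PDF-1.4\n4 0 obj\n<</Filter /FlateDecode /Length "
    + str(len(packed)).encode("ascii")
    + b">>\nstream\n" + packed + b"\nendstream\nendobj\n"
)
streams = decoded_streams(sample_pdf)
print(streams[0].decode("latin-1"))
```

The inflated bytes have the same form as the readable sample, so a parser written for the uncompressed files can be run on them afterwards.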


James Mikel   

