Extracting metadata from an MS-Word (or other) document

peter

Is there a supported way to get metadata out of a word document? At the
moment what I'd like is a script that told me how many pages there are
in a word document - it is more useful to me than how many KB there
are. I know that I can work it out by opening it, but that is a hassle.

Alternatively, is there an unsupported script that can extract
metadata?

Alternatively, does anybody know the layout of metadata in an MS-Word
2004 document so that I can extract it?

I understood that, at some stage, MS was going to use XML to store Word
documents - has this occurred yet? If it has, or once it has,
presumably it will be extremely easy to do this.
 
John McGhie [MVP - Word and Word Macintosh]

Hi Peter:

OK, let me give you a simple answer, then tell you why it won't work :)
Fellow Brit Jonathan West explains how to do it here:
http://www.word.mvps.org/faqs/macrosvba/DSOFile.htm

Coupla "considerations"... A Word Document does not have any "pages"
internally. Pages are an "output" concept. Word does not generate pages in
a document until it sends it to an output device (screen or printer). So
the property you are looking for may not be stored in the file.

You will read a property with DSOFile that will indicate the page count
assigned by the last version of Word to repaginate and save the document.
This could be quite at variance with what the file now contains.

The number of pages in a document depends on which printer Word connects to
and what paper sizes are available in that printer and on which paper size
the document is set to and on which fonts the printer contains (in other
words, the answer is massively dynamic and the only way to know accurately
is to open the file in Word, force a repagination, then look).

However, if you want this for estimating purposes, chances are you do not
need to be that accurate. What I would do is return the size of each .doc
file in bytes, subtract 20,000, then divide the remainder by 7,000. (The
internal structure of a Word document, i.e. fonts, styles, headers, footers
etc., occupies between 15,000 and 25,000 bytes before you insert the first
text character.)

Your answer will be within 20 per cent of correct. To get closer than that,
there are literally hundreds of considerations you would have to handle
(e.g. how many tables and graphics are in the document, and how big they are).
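
For anyone who wants to script the size-based estimate above, here is a minimal Python sketch. The 20,000-byte overhead and 7,000 bytes per page are the figures from this post; the function names, and the choice to report at least one page for very small files, are assumptions made for illustration only.

```python
import os

# Figures taken from the rule of thumb in the post.
OVERHEAD_BYTES = 20_000   # approximate size of an empty .doc file
BYTES_PER_PAGE = 7_000    # approximate bytes consumed per page


def estimate_pages(size_bytes: int) -> int:
    """Rough page estimate from a .doc file size in bytes.

    Assumption (not from the post): never report fewer than one page.
    """
    content_bytes = max(size_bytes - OVERHEAD_BYTES, 0)
    return max(1, round(content_bytes / BYTES_PER_PAGE))


def estimate_pages_for_file(path: str) -> int:
    """Apply the same estimate to a file on disk."""
    return estimate_pages(os.path.getsize(path))
```

Something like estimate_pages_for_file("report.doc") (a hypothetical file name) gives a ballpark figure only; documents heavy with graphics or tables will throw it off, as noted above.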

Even the calculation is really variable: Internally, Word stores the Text
component of a document in Unicode, potentially encrypted, with deleted and
revised text retained, and compressed.

The only way to get it accurate is to open each file in Word and look.

Is Microsoft going to convert Office products to storing in XML? Yes. Has
it happened yet? Yes: Enterprise versions (i.e. the versions available to
companies purchasing on a volume licence) of Office 2003 System have this
now. Other versions of Office 2003 have the ability to read and write XML.

In Mac Office, if you save as "Web Page" you actually get XML. How much XML
you get depends on the version of Mac Word you use and the filter setting
you use: "Save Entire File" gives you rich XML output including most of the
metadata.

The next versions of Microsoft Office, "Office 12", on both PC and Mac, will
use XML as the native file format.
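
Once documents are stored as XML, reading the saved page count becomes a short script. The sketch below assumes the file has been saved in the zip-based .docx packaging (Office Open XML), where extended properties live in a part named docProps/app.xml with a Pages element. The function name is hypothetical, and the value returned is only whatever Word wrote at the last save, so the caveat about stale page counts still applies.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part in an Office Open XML package.
EXT_PROPS_NS = (
    "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
)


def stored_page_count(docx_file):
    """Return the page count saved in docProps/app.xml, or None if absent.

    docx_file may be a path or a binary file-like object.
    The number reflects the last repagination and save, not a fresh layout.
    """
    with zipfile.ZipFile(docx_file) as package:
        try:
            raw = package.read("docProps/app.xml")
        except KeyError:
            return None  # package has no extended-properties part

    root = ET.fromstring(raw)
    pages = root.find(f"{{{EXT_PROPS_NS}}}Pages")
    if pages is None or not (pages.text or "").strip():
        return None
    return int(pages.text)
```

This only reads metadata from the package; it does not open Word or repaginate, so treat the result as an estimate.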

Cheers

--

Please reply to the newsgroup to maintain the thread. Please do not email
me unless I ask you to.

John McGhie
Microsoft MVP, Word and Word for Macintosh. Consultant Technical Writer
Sydney, Australia +61 4 1209 1410
 
