XML? Sounds easy, but XML is a general-purpose markup standard and a
family of facilities that work with it. By itself, XML does nothing, so
it's important for you to know what you are trying to achieve.
Word 2003 and later (and Mac Word 2008) have a number of facilities for
working with XML. In essence,
a. you can save and open documents in the Word 2003 and Word 2007
WordML/WordProcessingML formats
b. you can work with what Microsoft calls "Custom XML", in essence
enabling you to define an XML schema and use Word as an editor for
capturing data that conforms to that schema
c. some other versions of Word on both Windows and Mac can open/save
in some of the formats listed in (a), using converters that you can
download from Microsoft sites.
Roughly speaking, (a) and (b) have very little to do with each other,
and from what you say, I'd say that (a) is likely to be more useful to
you than (b). Perhaps you are looking for some way to tag your Word text
according to an existing XML standard/schema such as DocBook (see
http://www.docbook.org ).
The good news is that that should be possible because in theory you can
transform any piece of XML into any other (within reason) using XSL
transforms.
The bad news is that
a. XSL usually has quite a steep learning curve (unless you just "get it")
b. WordML has a lot of stuff in it that you may find particularly
tricky to handle using XSL. WordML has to be able to encode everything
that Word can do, which means it has to encode a mass of style and
formatting information, and it has to do it in a way that does not break
XML syntax rules. So Word puts quite a lot of stuff into its XML files
that does not correspond to anything you see on screen or even in the
Word Object model. So the XML that represents what you may think of as a
simple bulleted paragraph may actually be quite complex, and may also be
in several parts of the document.
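
To give a feel for point (b), here is a minimal Python sketch (standard
library only). The SAMPLE fragment is hand-written for illustration and is
not real Word output; it assumes the Word 2007 WordprocessingML namespace,
and the helper names paragraph_texts and is_list_item are made up for this
example. Note that the "bullet" appears only as a numbering reference
(w:numPr); the bullet character itself is defined in a separate part of
the document.

```python
import xml.etree.ElementTree as ET

# Word 2007 WordprocessingML namespace (Word 2003 WordML uses a different one).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample of one "simple" bulleted paragraph; an assumption for
# illustration, real files carry considerably more markup than this.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="ListParagraph"/>
        <w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
      </w:pPr>
      <w:r><w:t xml:space="preserve">Buy </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>milk</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def is_list_item(paragraph):
    """True if the paragraph carries a numbering/bullet reference (w:numPr)."""
    return paragraph.find(f"{{{W}}}pPr/{{{W}}}numPr") is not None


def paragraph_texts(xml_text):
    """Return the visible text of each w:p by joining its w:t fragments."""
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        for p in root.iter(f"{{{W}}}p")
    ]


if __name__ == "__main__":
    for text in paragraph_texts(SAMPLE):
        print(text)
```

Even in this cut-down sample the visible text is split across two runs
because part of it is bold, which is the sort of thing an XSL transform
would also have to cope with.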
So getting what /you/ want out of WordML may be non-trivial. There may
well be sites with lots of useful XSL code that will help: I don't know.
But if that's what you want to do, it's worth bearing in mind that Word
(2003/2007 anyway) can save in several different XML formats:
a. .docx/.docm. These are actually .zip format files with a number of
WordML .xml files zipped up inside them, so before you can do any XSL
processing on them you have to locate and extract the relevant .xml
files
b. .xml - the Word 2007 format .xml files actually contain most, but
not all, of the same WordML stuff that .docx files contain, but in a
single file format that you could in theory feed straight to an XSL
processor.
c. .odt files. These use the OpenDocument XML standard that is also
used by OpenOffice, not WordML. Word 2007 SP2 can save in this format
natively. Off the top of my head, I don't know whether the external
converter package that other versions of Word use can do it. These files
are also .zip files with .xml inside, and they do not necessarily encode
everything that .docx can encode, but they may actually be more useful
as a way of getting what you want, because
- they are simpler
- I suspect there are more tools out there capable of transforming .odt
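
As a rough illustration of the "locate and extract" step for the zipped
formats in (a) and (c), here is a minimal Python sketch using only the
standard library. It assumes the conventional part names
(word/document.xml for .docx/.docm, content.xml for .odt); strictly
speaking a .docx names its main part through its relationship files, so
treat the MAIN_PART table as an assumption. The function names
read_main_xml and make_demo_package are invented for this example.

```python
import io
import zipfile

# Conventional locations of the main XML part. These are assumptions for
# illustration; a robust tool would read the package's own manifest.
MAIN_PART = {
    ".docx": "word/document.xml",
    ".docm": "word/document.xml",
    ".odt": "content.xml",
}


def read_main_xml(source, extension):
    """Return the raw bytes of the main XML part of a zipped document.

    source may be a file path or a binary file-like object.
    """
    part_name = MAIN_PART[extension.lower()]
    with zipfile.ZipFile(source) as archive:
        return archive.read(part_name)


def make_demo_package(part_name, xml_bytes):
    """Build a tiny in-memory zip for trying the function without Word."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(part_name, xml_bytes)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = make_demo_package("content.xml", b"<office:document-content/>")
    print(read_main_xml(demo, ".odt"))
```

The bytes that come back are what you would then hand to an XSL processor
or an XML parser.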
Peter Jamieson
http://tips.pjmsn.me.uk