You're probably going to have to do some reading in the links previously provided, and some testing with Word-created documents
that represent your environment <g>. If you don't want to use one of Word's two built-in 'web document' save formats as they
come, how you parse the file depends on what you're looking to extract. Parsing on a 'look for X and ignore the rest' basis
may be easier than trying to generate your own HTML and still have it look like the original document.
There isn't a simple 'always' answer for the HTML document any more than there is for 'what's in a regular Word document' as far as
content, style and formatting go (the Word 2007 spec on this runs to 1,000+ pages <g>).
For example, Word creates CSS, but a CSS template can also be attached to a document and applied.
Assuming the HTML was generated using the 'Word Web Document' save format, rather than the 'Word Web Document-Filtered' file type
choice, then there is usually a <div class...> for each section of the document, and there can be many of them, since you can add
numerous sections in Word with 'section breaks' of various types. Word will also generate a <p class...> for each paragraph whose
style is listed in the <style> block, among others.
The <style> block in the Web document reflects all the styles in use in the regular Word document. These can include the default
ones built into Word (which vary by version), any created by the user or by Word on the fly, and direct formatting, where the
user paints text in the document with formatting that isn't part of a given style.
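On the Font-Size / Font-Family question, one way to see what Word actually did in your own test documents is to pull the font
declarations out of the <style> block and compare them with any inline style= attributes. This is only a rough Python sketch: the
STYLE_BLOCK excerpt is made up for illustration (modeled on the general shape of Word's output, not copied from a real file),
font_rules is a hypothetical helper name, and the regex is a stand-in for a real CSS parser.

```python
import re

# Hypothetical excerpt for illustration only; a real Word <style> block is far longer.
STYLE_BLOCK = """
<style>
<!--
 /* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    margin:0in;
    font-size:12.0pt;
    font-family:"Times New Roman";}
h2
    {mso-style-next:Normal;
    font-size:14.0pt;
    font-family:Arial;}
div.Section1
    {page:Section1;}
-->
</style>
"""

# "selectors { declarations }" with no nested braces (an assumption about the input).
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def font_rules(html):
    """Map each selector to the font-size / font-family it declares, if any."""
    match = re.search(r"<style[^>]*>(.*?)</style>", html, re.S | re.I)
    if not match:
        return {}
    css = re.sub(r"/\*.*?\*/", "", match.group(1), flags=re.S)  # drop CSS comments
    css = css.replace("<!--", "").replace("-->", "")  # drop the HTML comment wrapper
    found = {}
    for selectors, body in RULE.findall(css):
        props = {}
        for decl in body.split(";"):
            name, sep, value = decl.partition(":")
            if sep and name.strip().lower() in ("font-size", "font-family"):
                props[name.strip().lower()] = value.strip()
        if props:
            for selector in selectors.split(","):
                found[selector.strip()] = props
    return found


if __name__ == "__main__":
    for selector, props in font_rules(STYLE_BLOCK).items():
        print(selector, props)
```

This doesn't tell you when Word chooses the <style> block over an inline style. It only gives you something to run against
documents saved from your own environment, so you can see which case you're dealing with.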
For example, if this text was typed into Word
This is text sample paragraph 1.
This is text sample paragraph 2.
and it was entered with the default, out-of-the-box "Normal" style in a Word 2003 document (that style becomes
MsoNormal in the web document), then the Word-generated 'HTML' would be
===========typed in sample ==========
<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>This is text sample paragraph 1.</p>
<p class=MsoNormal>This is text sample paragraph 2.</p>
</div>
</body>
</html>
==============end typed in sample
but if the same text is pasted into the document, it could be
===========pasted in================
<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>This is text sample paragraph 1.<o:p></o:p></p>
<p class=MsoNormal>This is text sample paragraph 2.<o:p></o:p></p>
</div>
</body>
============end pasted in sample =========================
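If the only difference between two samples is Word's empty <o:p></o:p> elements (the Office-namespace filler that sits just before
</p> in the pasted version), a small normalization pass makes them compare equal. A minimal Python sketch, assuming the empty
elements carry nothing you need; strip_office_filler is a hypothetical helper name, and the two strings are one paragraph each
from the samples above.

```python
import re

# Matches an Office-namespace <o:p> element that holds nothing but optional whitespace.
EMPTY_O_P = re.compile(r"<o:p>\s*</o:p>", re.IGNORECASE)


def strip_office_filler(html: str) -> str:
    """Remove empty <o:p></o:p> pairs; everything else is left untouched."""
    return EMPTY_O_P.sub("", html)


TYPED = "<p class=MsoNormal>This is text sample paragraph 1.</p>"
PASTED = "<p class=MsoNormal>This is text sample paragraph 1.<o:p></o:p></p>"

if __name__ == "__main__":
    print(strip_office_filler(PASTED) == TYPED)
```

An <o:p> element that does contain something is deliberately left alone here, so you can decide separately what to do with it.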
If the word 'sample' was painted with italics in the first paragraph and with the yellow highlighter tool in the 2nd paragraph, then
the typed text becomes the following (Word generally adds <span...> tags for direct formatting applied over text that already has
a style).
=============== italics and highlighter tool sample ============
<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>This is text <i style='mso-bidi-font-style:normal'>sample</i> paragraph 1.</p>
<p class=MsoNormal>This is text <span style='background:yellow;mso-highlight:yellow'>sample</span> paragraph 2.</p>
</div>
</body>
=============
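To find this kind of direct formatting when you interrogate the file, you can collect the inline elements inside each paragraph
together with whatever their style= attribute declares. A rough Python sketch using the standard library's html.parser, which
copes with Word's unquoted attributes; the class and function names are hypothetical, and the list of inline tags is an assumption
you would extend after testing your own documents.

```python
from html.parser import HTMLParser


def parse_style(value):
    """Split an inline style attribute into a {property: value} dict."""
    props = {}
    for decl in value.split(";"):
        name, sep, val = decl.partition(":")
        if sep:
            props[name.strip().lower()] = val.strip()
    return props


class DirectFormattingFinder(HTMLParser):
    """Record (tag, style properties, text) for text inside inline formatting tags."""

    INLINE_TAGS = {"span", "i", "b", "u", "em", "strong"}  # assumed set, extend as needed

    def __init__(self):
        super().__init__()
        self.runs = []
        self._stack = []  # currently open inline tags, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE_TAGS:
            style = dict(attrs).get("style") or ""
            self._stack.append((tag, parse_style(style)))

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and data.strip():
            tag, props = self._stack[-1]
            self.runs.append((tag, props, data))


def find_direct_formatting(html):
    finder = DirectFormattingFinder()
    finder.feed(html)
    finder.close()
    return finder.runs


SAMPLE = (
    "<body lang=EN-US style='tab-interval:.5in'><div class=Section1>"
    "<p class=MsoNormal>This is text <i style='mso-bidi-font-style:normal'>sample</i> paragraph 1.</p>"
    "<p class=MsoNormal>This is text <span style='background:yellow;mso-highlight:yellow'>sample</span>"
    " paragraph 2.</p></div></body>"
)

if __name__ == "__main__":
    for run in find_direct_formatting(SAMPLE):
        print(run)
```

Only the innermost open inline tag is reported for each piece of text, so nested formatting (bold inside a highlighted span, say)
would need the whole stack recorded instead.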
If someone applied the default Heading 2 style to the 2nd paragraph, through either promote/demote in outline view or by style
selection, you'd find that for that particular style Word would not use a class from the <style> block, but would use the HTML <h2>
element as shown here, trying to use a W3C 'standard' (HTML) formatting as first choice.
========== Word built in HTML style ===========
<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>This is text sample paragraph 1.</p>
<h2>This is text sample paragraph 2.</h2>
</div>
</body>
=========
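Since a paragraph can come out either as <p class=...> or as a plain <h1>..<h6>, anything that walks the body has to accept both
forms. A minimal Python sketch of that idea, again with html.parser; OutlineBuilder and outline are hypothetical names, and the
assumption is that block elements are not nested inside one another.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"h([1-6])$")


class OutlineBuilder(HTMLParser):
    """Collect (kind, text) per block: 'heading N' for <hN>, else the <p> class name."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._kind = None  # kind of the block currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        heading = HEADING.match(tag)
        if heading:
            self._kind = "heading " + heading.group(1)
        elif tag == "p":
            self._kind = dict(attrs).get("class") or "p"
        else:
            return
        self._text = []

    def handle_endtag(self, tag):
        if self._kind is not None and (tag == "p" or HEADING.match(tag)):
            self.blocks.append((self._kind, "".join(self._text).strip()))
            self._kind = None

    def handle_data(self, data):
        if self._kind is not None:
            self._text.append(data)


def outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.blocks


SAMPLE = (
    "<body lang=EN-US style='tab-interval:.5in'><div class=Section1>"
    "<p class=MsoNormal>This is text sample paragraph 1.</p>"
    "<h2>This is text sample paragraph 2.</h2>"
    "</div></body>"
)

if __name__ == "__main__":
    for kind, text in outline(SAMPLE):
        print(kind, "|", text)
```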
To keep from starting from scratch <g>, you may find the Word2HTML.XSL style sheet tool helpful in parsing the Word [web]
documents. It's part of the WMLView.exe download linked from
http://blogs.msdn.com/brian_jones/archive/2005/09/30/475794.aspx
=============
Hi,
Thanks for the response. Because I need to interrogate the HTML document
generated by Word I thought I could save some effort if I could make
assumptions about the HTML generated by Word, like when does it generate a
new class and what name does it use, when does it put a style in the <style>
block, when does it use an inline style, when does it use the <em> tag, ..etc.
I guess I'm really after the algorithm/logic that Word uses to generate the
HTML document.
Failing that, the specific information I'm really after is: does Word always
put Font-Size and Font-Family information in the <style> tag? If not, when
does it make these inline styles?
====================
--
Bob Buckland ?
MS Office System Products MVP
*Courtesy is not expensive and can pay big dividends*