Parsing document structure


Cameron

Hi - I'm using VB.NET and a beta of Office 2003. I want my VB.NET application to be able to read any Word document and determine its structure. Specifically, I'm after the information that normally goes in the table of contents: any lines with one of a specified list of style names (e.g. Header 1, Header 2, Header 3 etc.). If you can visualise the structure of a Word document as a treeview:

Header 1
    Header 2
    Header 2
Header 1
    Header 2
        Header 3

... I want to be able to construct that hierarchy within my own code. Determining the list of styles that actually exist in the document is fairly straightforward, as is working out which ones are actually in use. I then ran into trouble when looking for the header lines. I started like this:

Dim sentences As Microsoft.Office.Interop.Word.Sentences
Dim range As Microsoft.Office.Interop.Word.Range
Dim style As Microsoft.Office.Interop.Word.Style

sentences = OpenDoc.Sentences          ' Get all sentences
For Each range In sentences            ' For each sentence
    style = range.Style                ' Find out which style it has
    Debug.WriteLine("'" & range.Text & "' has style '" & style.NameLocal & "'")
    For Each StyleName In StyleTexts   ' Check all wanted styles (StyleTexts is set up elsewhere)
        If style.NameLocal = StyleName Then   ' if sentence is wanted
            Dim Text As String = range.Text & " (" & style.NameLocal & ")"
            StylesForm.StylesListBox.Items.Add(Text, True)
            Exit For
        End If
    Next
Next

What I found was that I got 'object not set to an instance' errors in this code - on the Dim Text As String line I think - when I put arbitrary large documents through it. Small test documents worked fine. I then wondered if I should try using the XML support built into Word 2003, so I tried the following code:

Dim DocNodes As Microsoft.Office.Interop.Word.XMLNodes
Dim DocNode As Microsoft.Office.Interop.Word.XMLNode

DocNodes = OpenDoc.XMLNodes
Debug.WriteLine("XML Nodes : " & DocNodes.Count)
If DocNodes.Count > 0 Then
    For Each DocNode In DocNodes
        Debug.WriteLine("XML Node: " & DocNode.BaseName)
    Next
End If

What I found was that I never got any XML Nodes back from the document. Even if I went into Word 2003 and saved the document as XML, and then ran the code on the XML document, I never got any nodes back. I noticed that in Visual Studio, if I typed in ... = OpenDoc.Nodes ( ... then I was prompted for an index - an integer. Putting 0 got me no nodes. Putting 1 got an error.

So my questions are two-fold. Firstly, for the purposes of parsing the structure of a Word document that has probably been created by a version of Word prior to 2003, should I use XML or not? Secondly, any ideas what might possibly be going wrong with the XML or non-XML version of the code?

Many thanks,

Cameron
 

Word Heretic

G'day Cameron,

some tips:

use paragraphs instead of sentences
check the outline level of the paragraph
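
A minimal sketch of those two tips, not a tested solution: it walks the document's Paragraphs collection rather than Sentences and reads each paragraph's OutlineLevel instead of comparing style names. The module name OutlineSketch, the routine DumpOutline and its doc parameter (standing in for OpenDoc from the question) are made up for illustration, and it assumes the project already references the Word interop assembly.

```vbnet
Imports System.Diagnostics
Imports Word = Microsoft.Office.Interop.Word

Module OutlineSketch
    ' Hypothetical helper: writes every heading paragraph of an already-open
    ' document to the debug output, indented by its outline level.
    Sub DumpOutline(ByVal doc As Word.Document)
        Dim para As Word.Paragraph
        For Each para In doc.Paragraphs
            Dim level As Word.WdOutlineLevel = para.OutlineLevel

            ' Levels 1 to 9 are headings; body text reports wdOutlineLevelBodyText.
            If level <> Word.WdOutlineLevel.wdOutlineLevelBodyText Then
                ' Range.Text includes the trailing paragraph mark (and a cell
                ' marker inside tables), so trim those off before printing.
                Dim text As String = para.Range.Text.TrimEnd(ControlChars.Cr, ChrW(7))

                If text.Length > 0 Then
                    ' Two spaces of indent per level below the top one.
                    Debug.WriteLine(New String(" "c, (CInt(level) - 1) * 2) & text)
                End If
            End If
        Next
    End Sub
End Module
```

Calling DumpOutline(OpenDoc) should print one line per heading; the same level number could drive where each node is attached in a treeview. Paragraphs whose style carries no outline level come back as body text, so custom heading styles would only show up if they have a level assigned.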

Steve Hudson
Word Heretic Sydney Australia
Tricky stuff with Word or words

Email: WordHeretic at tpg.com.au


