Whew, that's more than I'd care to tackle in Word -- the performance would
be horrible.
You can load a two-dimensional array with the text of each word in the
document plus the start and end of its range, using something like this
(but with several caveats explained below).
Sub demo()
    Dim WordList() As Variant   ' (0, n) = text, (1, n) = start, (2, n) = end
    Dim ListLen As Long
    Dim idx As Long
    Dim ThisWord As Range
    Dim StartTime As Single

    ListLen = 10000
    ReDim WordList(2, ListLen)
    StartTime = Timer           ' seconds since midnight

    For idx = 1 To ActiveDocument.Words.Count
        ' Grow in blocks of 10000. ReDim Preserve can resize only the
        ' last dimension, so the words run along the second dimension.
        If idx > UBound(WordList, 2) Then
            ListLen = ListLen + 10000
            ReDim Preserve WordList(2, ListLen)
        End If
        Set ThisWord = ActiveDocument.Words(idx)
        WordList(0, idx - 1) = Trim(ThisWord.Text)
        WordList(1, idx - 1) = ThisWord.Start
        WordList(2, idx - 1) = ThisWord.End
    Next idx

    ' idx is one past the last word when the loop exits
    MsgBox "loaded " & idx - 1 & " words in " & _
        Format(Timer - StartTime, "0.0") & " sec"
End Sub
When the For loop finishes, each column in the array WordList (that is,
WordList(0, n), WordList(1, n), and WordList(2, n) for a given n) contains
the text, start, and end of the corresponding object in the document. A
couple of things you should know about these objects:
- As explained in the VBA Help topic on the Words collection, the members of
that collection are Range objects; there is no "Word" object that would be
analogous to a Paragraph object.
- Each member of the Words collection includes any trailing space characters
(normally one character after each word except the last in a sentence or
paragraph, but more than one if they exist). In the code above, the Trim
function removes any trailing spaces from the stored text only; the stored
End value still includes them.
- Each punctuation mark, paragraph mark, inline graphic, and some other
things are also members of the Words collection. Thus the following
paragraph in a document:
Two words.¶
appears in the collection as four "words", including one for the period and
one for the paragraph mark.
- The code shown above performs in _quadratic_ time -- that is, the run time
is a function of the square of the number of words in the document. The main
culprit is the Set ThisWord statement, which causes the VBA engine to count
from 1 to the current value of idx for each word in the collection. There
may be an algorithm that performs more nearly linearly, but I don't know
what it is. On my computer, running the sample code on 8000 words took
almost 6 minutes. This is likely to be a show-stopper.
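One candidate worth timing against the indexed loop is a For Each loop over
the Words collection, which hands back each member in turn instead of looking
up Words(idx) on every pass. The following is only a sketch under that
assumption, not a measured result: the name DemoForEach and the counter n are
made up for illustration, and the array layout is the same as in demo().

```vba
Sub DemoForEach()
    ' Hypothetical variant of demo(): same array layout, but walks the
    ' Words collection with For Each instead of indexing Words(idx).
    Dim WordList() As Variant   ' (0, n) = text, (1, n) = start, (2, n) = end
    Dim ListLen As Long
    Dim n As Long               ' number of words stored so far
    Dim ThisWord As Range
    Dim StartTime As Single

    ListLen = 10000
    ReDim WordList(2, ListLen)
    StartTime = Timer

    For Each ThisWord In ActiveDocument.Words
        ' Grow the array in blocks of 10000 when it is full
        If n > UBound(WordList, 2) Then
            ListLen = ListLen + 10000
            ReDim Preserve WordList(2, ListLen)
        End If
        WordList(0, n) = Trim(ThisWord.Text)
        WordList(1, n) = ThisWord.Start
        WordList(2, n) = ThisWord.End
        n = n + 1
    Next ThisWord

    MsgBox "loaded " & n & " words in " & _
        Format(Timer - StartTime, "0.0") & " sec"
End Sub
```

Either way, a stored pair can be turned back into a range later with
ActiveDocument.Range(WordList(1, n), WordList(2, n)), provided the document
has not been edited in the meantime (edits shift the character positions).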
Yes, Jay, those are a few of the things our document comparison
application can do. We can also compare Excel worksheets or
tables in Word documents. Our comparison algorithm is quite
sophisticated (a lot more so than the Document Comparison
feature in Microsoft Word 2007: for example, we can detect when a
paragraph or table has been swapped, split, or merged with another
paragraph or table), so we have to go through a document word by word
(in a very scalable way, of course) and compare it with another document
word by word, regardless of whether they are regular words or "stop" words.
This is why we have to go through the entire document and remember the
position of each word so that we can gray it out later, if that is possible.
Otherwise, do you have any suggestions on how we should do it? Again,
many thanks in advance.
--
Regards,
Jay Freedman
Microsoft Word MVP
Email cannot be acknowledged; please post all follow-ups to the newsgroup so
all may benefit.