Search all the characters then get range for some of them

leverw

How do I programmatically search all the characters in a Word document, then
get a Range for some of them so I can change the font, etc.? I would rather not
call Document.Characters because it will take forever. But if I call
Document.Range(ref 1, ref missing) to get all the text, how do I know
the positions of some of the text, since the document may contain
non-printable characters? For instance, I want to search for "Hello world"
in a document and set the text font color to red. Any help will be greatly
appreciated.
 
Jay Freedman

Do not attempt to get a range that way. Use the .Find method of a Range object
that is initialized to the document's range. If all you're doing is changing
characters and/or their formatting, you can do that by specifying the text or
formatting of the .Replacement property and executing with the wdReplaceAll
parameter. If you need to run more sophisticated logic on the found text, you
can use the Range object itself: When the .Find.Execute succeeds in finding the
search term, the Range object is automatically redefined to cover only the found
text.

As an example of the first scheme, start with a document in which there are
several occurrences of "Hello world" scattered about. Then run this macro to
change the color of all of the occurrences to red:

Sub demo1()
    Dim oRg As Range
    Set oRg = ActiveDocument.Range
    With oRg.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "hello world"
        .Replacement.Text = "^&" ' same as found
        .Replacement.Font.Color = wdColorRed

        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWildcards = False

        .Execute Replace:=wdReplaceAll
    End With
End Sub

As an example of the second scheme, start with a document in which there are
several occurrences of red text scattered about. Then run this macro to change
the color of that text to blue only if the red text includes the word "blue":

Sub demo2()
    Dim oRg As Range
    Set oRg = ActiveDocument.Range
    With oRg.Find
        .ClearFormatting
        .Text = "" ' search on formatting only
        .Font.Color = wdColorRed

        .Forward = True
        .Wrap = wdFindStop
        .Format = True
        .MatchCase = False
        .MatchWildcards = False

        ' Each successful Execute redefines oRg to cover the found text.
        Do While .Execute
            If InStr(LCase(oRg.Text), "blue") > 0 Then
                oRg.Font.Color = wdColorBlue
            End If
            ' Collapse to the end of the found text so the next
            ' Execute continues from there.
            oRg.Collapse wdCollapseEnd
        Loop
    End With
End Sub
 
leverw

Thanks for your answer. But in our case, we have many words (>> 1000) that
we need to search and replace (actually just gray out the text). Doing it
one at a time is not very scalable, right? I thought I could get the entire
text and look for them myself. If some are positioned consecutively, I can
gray out several words at a time.

Thanks again.
 
Doug Robbins - Word MVP

It's a simple matter to load the words into an array and then use code
similar to Jay's to iterate through the array and process each word in
turn.

You probably would not have time to get a cup of coffee while it was doing
it.

--
Hope this helps.

Please reply to the newsgroup unless you wish to avail yourself of my
services on a paid consulting basis.

Doug Robbins - Word MVP
 
leverw

I have only been doing this VSTO programming for the last 4 weeks. So how do
I "... load the words into an array" quickly, as you describe? This is
exactly what I need to do. But when I call Document.Range, it gives me all
the word ranges in the document range, and I have to go through each
Word.Range call to get the actual word, which doesn't seem efficient, right?

Many thanks in advance again!
 
Jay Freedman

What Doug was suggesting was _not_ loading the range of the document into an
array, but loading an array with the list of words that you need to search
for. The "array" can be just a single Variant type variable. There are
several ways to load it. For a small amount of data you could use the Split
function to make an array from a string:

Dim WordsToFind As Variant
WordsToFind = Split("one,two,three,four", ",")

Once you have this array, you can rewrite the macro like this:

Sub demo3() ' renamed so it does not clash with the earlier demo2
    Dim WordsToFind As Variant
    Dim idx As Long
    Dim oRg As Range

    WordsToFind = Split("one,two,three,four", ",")

    For idx = 0 To UBound(WordsToFind) ' <===
        Set oRg = ActiveDocument.Range
        With oRg.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = WordsToFind(idx) ' <===
            .Replacement.Text = "^&" ' same as found
            .Replacement.Font.Color = wdColorGray35

            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
            .MatchCase = False
            .MatchWildcards = False

            .Execute Replace:=wdReplaceAll
        End With
    Next ' <===
End Sub

For the > 1000 words that you mentioned, I assume that you have a list
somewhere, maybe in a Word document or a text file. If you explain where the
list is and what separates the words from each other, we can suggest a good
way of getting the list into the array.
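
As a rough sketch of one possibility, assuming the list lives in a plain-text
file with one search term per line (the thread does not say where the list
actually is), the array could be filled like this. The function name
LoadWordList and the file layout are assumptions made for illustration only.

```vba
' Sketch only: assumes an ANSI text file with one search term per line.
' LoadWordList is a hypothetical helper, not part of the Word object model.
Function LoadWordList(ByVal FilePath As String) As Variant
    Dim FileNum As Integer
    Dim OneLine As String
    Dim Buffer As String

    FileNum = FreeFile
    Open FilePath For Input As #FileNum
    Do While Not EOF(FileNum)
        Line Input #FileNum, OneLine
        OneLine = Trim(OneLine)
        ' Skip blank lines so they never become empty search terms.
        If Len(OneLine) > 0 Then
            Buffer = Buffer & OneLine & vbLf
        End If
    Loop
    Close #FileNum

    ' Drop the trailing separator, then split into a zero-based array.
    If Len(Buffer) > 0 Then Buffer = Left$(Buffer, Len(Buffer) - 1)
    LoadWordList = Split(Buffer, vbLf)
End Function
```

The result can be assigned to WordsToFind in place of the Split call on the
literal string, passing whatever path the list is really stored at. An empty
file yields an empty array, so a For idx = 0 To UBound(...) loop simply does
not run.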

--
Regards,
Jay Freedman
Microsoft Word MVP
Email cannot be acknowledged; please post all follow-ups to the newsgroup so
all may benefit.
 
leverw

OK, Jay and Doug, I misunderstood your solutions. BUT, our application (a
document comparison between Office documents) requires we know all the words
in a document, so we can compare them against another document. We also need
to know the positions of each word in a document so we can gray it out later
if there is a matching phrase/paragraph in another document.

So to re-phrase my question better: how do I load all the words/characters
and their positions from a document, so later I can gray them out?

Again, millions of thanks in advance.
 
Jay Freedman

Ah, the description of the problem is now significantly different from the
original post. And I have a strong feeling that we're still not looking at the
"real" description.

What are you trying to accomplish, stated in plain English without reference to
Word and its features? Are you trying to determine (a) how one document was
edited to make another document, (b) whether one document might have been
plagiarized from another document, (c) what similarities exist in two possibly
unrelated documents, or something else?

How will you know what phrases or paragraphs to compare? I'm sure you don't want
to go word by word (and even less do you want to go letter by letter); you'd get
tons of hits on "the", "and" and similar common words.

The question of how you store words internally in the macro is completely
irrelevant until you know what you want to do with them.
 
leverw

Yes, Jay, those are a few of the things our document comparison application
can do. We can also compare Excel worksheets or tables in Word documents.
Our comparison algorithm is quite sophisticated (a lot more sophisticated
than the Document Comparison feature in Word 2007 from Microsoft: e.g. we can
detect whether any paragraph/table has been swapped, split, or merged with
another paragraph/table), so we have to go word by word in a document (in a
very scalable way of course) and compare it against another document word by
word, regardless of whether they are regular words or "stop" words.

This is why we have to go through the entire document and remember the
position of each word to gray out later, if we can do that. Otherwise, do
you have any suggestion on how we should do it? Again many thanks in advance.
 
Jay Freedman

Whew, that's more than I'd care to tackle in Word -- the performance would
be horrible.

You can load a two-dimensional array with each word in the document plus its
range's start and end with something like this (but with several caveats
explained below).

Sub demo()
    Dim WordList() As Variant
    Dim ListLen As Long
    Dim idx As Long
    Dim ThisWord As Range
    Dim StartTime As Date

    ListLen = 10000
    ReDim WordList(2, ListLen)
    StartTime = Now

    For idx = 1 To ActiveDocument.Words.Count
        ' Grow the array in chunks; ReDim Preserve can resize only
        ' the last dimension.
        If idx > UBound(WordList, 2) Then
            ListLen = ListLen + 10000
            ReDim Preserve WordList(2, ListLen)
        End If

        Set ThisWord = ActiveDocument.Words(idx)
        WordList(0, idx - 1) = Trim(ThisWord.Text)
        WordList(1, idx - 1) = ThisWord.Start
        WordList(2, idx - 1) = ThisWord.End
    Next

    ' Elapsed time is reported as minutes:seconds.
    MsgBox "loaded " & idx - 1 & " words in " & _
        Format(Now - StartTime, "n:ss") & " (min:sec)"
End Sub

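When the For loop finishes, WordList(0, i), WordList(1, i), and WordList(2, i)
hold the text, start, and end of the corresponding object in the document.
(The three fields sit in the first dimension because ReDim Preserve can resize
only the last one.) A couple of things you should know about these objects: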

- As explained in the VBA Help topic on the Words collection, the members of
that collection are Range objects; there is no "Word" object that would be
analogous to a Paragraph object.

- Each member of the Words collection includes any trailing space characters
(normally one character after each word except the last in a sentence or
paragraph, but more than one if they exist). In the code above, the Trim
function removes any trailing spaces.

- Each punctuation mark, paragraph mark, inline graphic, and some other
things are also members of the Words collection. Thus the following
paragraph in a document:
Two words.¶
appears in the collection as four "words", including one for the period and
one for the paragraph mark.

- The code shown above performs in _quadratic_ time -- that is, the run time
is a function of the square of the number of words in the document. The main
culprit is the Set ThisWord statement, which causes the VBA engine to count
from 1 to the current value of idx for each word in the collection. There
may be an algorithm that performs more nearly linearly, but I don't know
what it is. On my computer, running the sample code on 8000 words took
almost 6 minutes. This is likely to be a show-stopper.
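
One commonly suggested way to avoid the repeated index lookup is to walk the
collection with For Each instead of calling Words(idx). The following is a
sketch under that assumption, not something measured in this thread; the Sub
name demoForEach and the counter n are made up for illustration. The last few
lines also show how a stored start and end can be turned back into a Range for
graying out.

```vba
' Sketch only: same data layout as the macro above, but iterating with
' For Each so that no word is fetched by its index.
Sub demoForEach()
    Dim WordList() As Variant
    Dim ListLen As Long
    Dim n As Long            ' number of entries stored so far
    Dim ThisWord As Range

    ListLen = 10000
    ReDim WordList(2, ListLen - 1)

    For Each ThisWord In ActiveDocument.Words
        ' Grow in chunks; only the last dimension can be resized.
        If n > UBound(WordList, 2) Then
            ListLen = ListLen + 10000
            ReDim Preserve WordList(2, ListLen - 1)
        End If

        WordList(0, n) = Trim(ThisWord.Text)
        WordList(1, n) = ThisWord.Start
        WordList(2, n) = ThisWord.End   ' includes any trailing spaces
        n = n + 1
    Next ThisWord

    ' Illustration: rebuild a Range from stored positions and gray it out.
    ' Here only the first stored entry is touched.
    If n > 0 Then
        ActiveDocument.Range(WordList(1, 0), WordList(2, 0)).Font.Color = _
            wdColorGray35
    End If
End Sub
```

The stored positions stay valid only as long as the document text is not
edited in between; changing formatting alone does not shift them.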

--
Regards,
Jay Freedman
Microsoft Word MVP
Email cannot be acknowledged; please post all follow-ups to the newsgroup so
all may benefit.
 
Ryan McGhee

I have implemented a similar solution in C# to the one shown here by Jay Freedman. This has been the only method I have seen that will allow me to extract text from a document, via the Word libraries.

As I need to perform this functionality in a batch process, and it takes a very long time to pull text from a single file, this will not be a feasible solution for my needs.

Outside of cracking open the OLE file format, does anyone know a faster method of extracting text via Word?
 
Jonathan West

Hi Ryan

Since your message hasn't been posted as a response to Jay's, and you
haven't stated which specific message of his you are referring to, it
is a bit hard for others to know what you are talking about and
therefore to help you. Could you be a bit more specific?
 
Cindy M.

Hi Ryan,
Unfortunately, you don't provide a link to Jay's suggestion, or a quote... That makes it impossible to offer an opinion.

Which version of Word is involved, here?

Cindy Meister
INTER-Solutions, Switzerland
http://homepage.swissonline.ch/cindymeister (last update Jun 17 2005)
http://www.word.mvps.org

This reply is posted in the Newsgroup; please post any follow question or reply in the newsgroup and not by e-mail :)
 