Is there a way to speed up FindText?

murphy

Hi

I am writing a VB macro to scan through a story and highlight words that are
inappropriate for readers of certain ability levels. Appropriate words are
listed in a file and there are over 60,000 of them for the highest level.

Basically I have an outer loop that selects each word in the story in order
and then I call my InWordList function to see if it is contained in the
wordlist. (If it isn't, I highlight the word, because it's inappropriate.)
For a wordlist of 60,000 words, this scan is taking over a second on average
for each word.

I have thought of sorting the wordlist by order of word frequency, and
several other ways I could massage the wordlist itself, but before I try
that, I'd like to make sure there isn't a way to speed up the search function
itself. Perhaps there is another command?

Thanks for any advice
 
murphy

And here is the function that scans the whole story, word by word, and the
function that checks the wordlist (below). Sorry, I should have included
these above.

Public Sub AnalyzeWords()
    Dim found, sampleWord, myRange, aWord
    Dim curWord As Integer
    Dim cleanedWord
    curWord = 1
    Set myRange = activeSample.Content
    For Each aWord In myRange.Words
        myRange.Words(curWord).Select
        cleanedWord = Trim(Selection.Text)
        If (InWordList(cleanedWord) <> True) Then
            Selection.Font.Color = wdColorRed
            missingPatternWords = missingPatternWords + 1
        End If
        curWord = curWord + 1
    Next aWord
End Sub

Public Function InWordList(aWord) As Boolean
    Dim myRange As Range
    Set myRange = activeWordList.Content
    myRange.Find.Execute FindText:=aWord, Forward:=True, MatchCase:=False
    InWordList = myRange.Find.Found
End Function
 
Jezebel

It seems to me that there are some problems with your current method --

1. The Words collection includes a lot of things that you don't want to
bother with, like punctuation, spaces, and numbers.
2. You'll be spending a lot of time processing common, repeated words (like
'the' and 'and').
3. Unless your checklist has been expanded to include them, you'll be
missing inflected forms of words (like plurals and past participles).

You'll need to research it, but it would likely be quicker to work with
plain text than with the original document (and you *definitely* don't want
to be messing with the Selection object). Maybe along these lines --

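Dim pDoc As String
Dim pIndex As Long
Dim pWordList() As String

' ... read your wordlist document into pWordList()

pDoc = ActiveDocument.Content.Text

' Use the array's own bounds rather than assuming it starts at 1
For pIndex = LBound(pWordList) To UBound(pWordList)
    ' vbTextCompare makes the search case-insensitive
    If InStr(1, pDoc, pWordList(pIndex), vbTextCompare) > 0 Then
        ' ... find the word in the document and highlight it
    End If
Next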
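
One way the "find the word in the document and highlight it" step might look is sketched below. It uses Word's Range.Find instead of the Selection object. The procedure name HighlightWholeWord and its parameters are made up for illustration, and the red font color is simply borrowed from the macro earlier in the thread.

```vba
' Hypothetical helper: colors every whole-word occurrence of target in doc.
Public Sub HighlightWholeWord(ByVal doc As Document, ByVal target As String)
    Dim rng As Range
    Set rng = doc.Content
    With rng.Find
        .ClearFormatting
        .Text = target
        .Forward = True
        .Wrap = wdFindStop          ' stop at the end instead of wrapping around
        .MatchCase = False
        .MatchWholeWord = True      ' so "the" does not match inside "other"
        .MatchWildcards = False
        Do While .Execute
            ' After a successful Execute, rng covers the match itself
            rng.Font.Color = wdColorRed
            ' Continue searching from just after this match
            rng.Collapse wdCollapseEnd
        Loop
    End With
End Sub
```

Because the range is collapsed to the end of each match, the next Execute carries on from there, so nothing is ever selected on screen.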
 
murphy

Hey, Jezebel, you're awesome! Thanks for the speedy reply, and the nifty
solution. The search is easily a hundred times faster just from transferring
the wordlist to an array and then searching on that... But, now, I've got
another problem because loading the array is super slow. I know there must
be a nice way to cut a string up and assign it to an array, but I can't find
it in the help docs. Is there anything analogous to Split() in VB which is
also cross-platform? On my Mac, VB doesn't seem to understand Split().
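
If Split() really is unavailable in a given VBA version, a hand-rolled replacement built on InStr and Mid$ is one option. The sketch below is only that: SplitInto is a made-up name, and it fills a caller-supplied dynamic array instead of returning one, on the assumption that older VBA versions may not allow functions to return typed arrays.

```vba
' Hypothetical helper: splits source on delim and stores the pieces in parts().
Public Sub SplitInto(ByVal source As String, ByVal delim As String, _
                     ByRef parts() As String)
    Dim count As Long
    Dim startPos As Long
    Dim hit As Long

    ' An empty delimiter cannot split anything: return the text as one piece
    If Len(delim) = 0 Then
        ReDim parts(0 To 0)
        parts(0) = source
        Exit Sub
    End If

    ReDim parts(0 To 63)
    startPos = 1
    Do
        hit = InStr(startPos, source, delim)
        ' Grow by doubling so ReDim Preserve runs only a handful of times
        If count > UBound(parts) Then ReDim Preserve parts(0 To 2 * count - 1)
        If hit = 0 Then
            parts(count) = Mid$(source, startPos)                  ' last piece
        Else
            parts(count) = Mid$(source, startPos, hit - startPos)
        End If
        count = count + 1
        startPos = hit + Len(delim)
    Loop While hit > 0

    ' Trim the array down to the pieces actually found
    ReDim Preserve parts(0 To count - 1)
End Sub
```

Assuming the wordlist document holds one word per paragraph, a call such as SplitInto activeWordList.Content.Text, vbCr, pWordList would fill pWordList in a single pass. Since a Word document's text ends with a paragraph mark, the last element would then be an empty string.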

You are right that it's ugly to have to weed through the punctuation and the
spacing when checking for words. Not knowing any better, I wrote a routine
to check each word and to ignore it if it's punctuation. It's messy (and
slow!), but it works. Is there a standard way to ignore punctuation and
spacing? In terms of inflectional endings and such, I combed the internet
for a while and found a cool site with a pretty good set of American English
dictionaries. The one I'm using had just about all the inflectional endings
and past participles I could think of -- and many more. Here is a link in
case anyone else is looking to do something with wordlists in the future...

http://wordlist.sourceforge.net/

Thanks for your help!
--Murphy
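
For the punctuation question, one simple approach is to trim non-letters from both ends of each candidate word and skip anything that ends up empty. The sketch below assumes plain A-Z letters only (accented characters would be trimmed too), and CleanWord is a made-up name.

```vba
' Hypothetical helper: strips leading and trailing non-letters from raw.
' Returns "" for pure punctuation, whitespace or numbers.
Public Function CleanWord(ByVal raw As String) As String
    Dim s As String
    s = Trim$(raw)

    ' Drop characters from the front until a letter is reached
    Do While Len(s) > 0
        If Left$(s, 1) Like "[A-Za-z]" Then Exit Do
        s = Mid$(s, 2)
    Loop

    ' Drop characters from the back until a letter is reached
    Do While Len(s) > 0
        If Right$(s, 1) Like "[A-Za-z]" Then Exit Do
        s = Left$(s, Len(s) - 1)
    Loop

    CleanWord = s
End Function
```

Only the ends are trimmed, so an apostrophe inside a word such as "don't" is left alone. A caller can skip any item for which CleanWord returns an empty string.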
 
Jezebel

One approach to managing your wordlist is to store it in Excel rather than
Word. Then you can read the Excel vector directly into an array --

' Requires a reference to the Microsoft Excel object library
Dim pxlApp As Excel.Application
Dim pxlBook As Excel.Workbook
Dim pxlSheet As Excel.Worksheet
Dim pData() As Variant

Set pxlApp = New Excel.Application
Set pxlBook = pxlApp.Workbooks.Open(FileName:="c:\...\Book1.xls")
Set pxlSheet = pxlBook.Worksheets("Sheet1")

' Read the whole column into the array in one call
pData = pxlSheet.Range("A1:A15000").Value

Note that pData always comes back as a two-dimensional array, even if one of
the dimensions has a range of one (eg in this case, pData(1, 1) to
pData(15000, 1)).
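
Once the words are in pData, one possibility (not something proposed earlier in the thread) is to copy them into a keyed Collection so that each membership test is a key lookup rather than a scan of the array. Collection is used here on the assumption that it is available on both Windows and Mac. BuildLookup and InLookup are made-up names.

```vba
' Hypothetical helper: loads column 1 of a 2-D array into a keyed Collection.
Public Function BuildLookup(ByRef data As Variant) As Collection
    Dim lookup As New Collection
    Dim r As Long
    Dim key As String
    Dim col As Long

    col = LBound(data, 2)
    On Error Resume Next            ' a duplicate key raises an error; skip it
    For r = LBound(data, 1) To UBound(data, 1)
        key = ""
        key = LCase$(Trim$(CStr(data(r, col))))
        If Len(key) > 0 Then lookup.Add True, key
    Next r
    On Error GoTo 0

    Set BuildLookup = lookup
End Function

' Hypothetical helper: True if candidate is one of the keys in lookup.
Public Function InLookup(ByVal lookup As Collection, _
                         ByVal candidate As String) As Boolean
    Dim dummy As Variant
    On Error Resume Next
    dummy = lookup.Item(LCase$(Trim$(candidate)))   ' fails if the key is absent
    InLookup = (Err.Number = 0)
    Err.Clear
    On Error GoTo 0
End Function
```

With this in place, a check like InLookup(myLookup, cleanedWord) could stand in for the Find-based InWordList, where myLookup is built once before the scan starts.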


As for the inflected word forms: you could side-step a lot of the problems
if, instead of taking the words in your document and checking if they are in
your wordlist, you do it the other way round: iterate your wordlist, and
check if they are in the document. Then, in many (but obviously not all)
cases, you need only look for word stems.

Eg, if the various forms of 'redundant' are proscribed, searching for
'redund' will get a match on 'redundant', 'redundancy', 'redundancies', etc.
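
A stem check against the plain document text might be sketched as follows. StemAppears is a made-up name, and the rule that a stem only counts when it starts a word (so that 'redund' would not match inside some longer unrelated word) is an added assumption, not something stated above.

```vba
' Hypothetical helper: True if stem occurs at the start of any word in docText.
Public Function StemAppears(ByVal docText As String, _
                            ByVal stem As String) As Boolean
    Dim pos As Long

    If Len(stem) = 0 Then Exit Function

    pos = InStr(1, docText, stem, vbTextCompare)    ' case-insensitive search
    Do While pos > 0
        If pos = 1 Then
            StemAppears = True                      ' match at the very start
            Exit Function
        ElseIf Not (Mid$(docText, pos - 1, 1) Like "[A-Za-z]") Then
            StemAppears = True                      ' preceded by a non-letter
            Exit Function
        End If
        ' Mid-word match: keep looking further along
        pos = InStr(pos + 1, docText, stem, vbTextCompare)
    Loop
End Function
```

For example, StemAppears(pDoc, "redund") would be True for a document containing "Redundancy", since the comparison ignores case.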

It might also be worth looking at some of the work that's been done on text
tagging and text mining (eg see
http://itre.cis.upenn.edu/~myl/languagelog/archives/003753.html and the
links therein). Academics are often very generous with their software; so
you might find that this task has already been dealt with.
 
